This week’s readings either used text mining as part of their research methodology or offered theories on its best practices and uses. What struck me most about text mining, and Ted Underwood’s treatment of it, was its connection to complex mathematical algorithms. These techniques are at once incredibly complicated and seemingly simple to apply. Word frequencies can tell us much more about a given text than we might intuitively imagine, even its genre and author. Yet when we search, our results are simply lumped into a list by “relevance,” which has been determined by a complicated process that is opaque to the researcher. The problem of opaque research methods was echoed in last week’s readings as well, but the authors who used these techniques in their own work described their workflows much more clearly.
Ted Underwood described text mining of documents as usually accomplishing one of two tasks: (1) finding patterns in the documents that become the literary or historical argument in and of themselves, or (2) serving as an exploratory tool that highlights clues where additional work using traditional analysis would prove useful. The articles by Gibbs and Cohen, Blevins, and Kaufman generally fall into the first category. I felt that Blevins provided the strongest argument based on text mining. He did a great job of answering the “so what” question by grounding his work in theories of “producing space” and “imagined geographies” in Houston. Blevins also did an admirable job of explaining his research process in great detail, so that readers can adequately question his methods and research decisions.
Gibbs and Cohen’s article was particularly detailed in explaining its methodologies, which is apt since it was part of a journal issue revolving around new digital techniques and “distant reading.” Their article even describes how, for various reasons, certain terms failed to show statistically significant changes over time. In a traditionally researched narrative, the author wouldn’t describe letters that weren’t relevant to their search, but it’s important that digital historians describe not only their successes but also where a technique failed them in a surprising way. One example is their search for instances of scientific “fields,” meant to show when the Victorians came to consider them organized bodies of knowledge. This didn’t happen until very late in the 19th century, so there weren’t really enough entries to find a pattern, though the nearly 400 examples could still provide new knowledge if given a traditional reading.
Micki Kaufman’s website is a very interesting example of a digital history project. She basically applies the whole tool kit of text mining techniques to the body of Kissinger-related memos and summaries of telephone conversations. Some of these techniques were purely exploratory, such as creating the 40 categories into which the documents might fit. (As a side note: is the default number of MALLET categories 40? Is there any reason for this coincidence in Kaufman’s and Nelson’s work?) While Kaufman had some of the most stunning visual elements, her work suffered from the lack of a unifying theme. It was ultimately more exploratory than argumentative.
Rob Nelson’s “Mining the Dispatch” offers methods for surveying what appeared in the Richmond Dispatch during the Civil War, and it generally falls into the latter category of exploratory text mining projects. By assigning a variety of categories to the articles, he can see what sort of information Richmonders were reading and how the frequency of those types of articles changed over time.
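To make the idea of tracking article categories over time concrete, here is a minimal Python sketch. It assumes each article has already been given a single category label (in practice a topic model assigns a mix of topics to each document), and the years and category names are invented for illustration. They are not data from Nelson's project, and the function name `category_shares_by_year` is hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (year, category) labels, one per article.
# These are made-up examples, not real Dispatch data.
articles = [
    (1861, "military news"),
    (1861, "advertisements"),
    (1861, "military news"),
    (1863, "military news"),
    (1863, "death notices"),
    (1863, "death notices"),
    (1863, "advertisements"),
]

def category_shares_by_year(labeled_articles):
    """Return {year: {category: share of that year's articles}}."""
    counts = defaultdict(Counter)
    for year, category in labeled_articles:
        counts[year][category] += 1
    shares = {}
    for year, counter in counts.items():
        total = sum(counter.values())
        shares[year] = {cat: n / total for cat, n in counter.items()}
    return shares

shares = category_shares_by_year(articles)
for year in sorted(shares):
    for category, share in sorted(shares[year].items()):
        print(year, category, f"{share:.2f}")
```

Using shares rather than raw counts matters here, since a newspaper might print more or fewer articles in a given year, and a raw count would confuse that change with a change in what readers were seeing.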
I would be curious to see Cameron Blevins’ techniques applied to the Dispatch to produce an “imagined geography” of the City of Richmond. I wonder if there would be much more familiarity with places south of the Mason-Dixon line than with places immediately north of it, even if the northern places were geographically much closer (Baltimore, Maryland vs. Montgomery, Alabama). As the digital history field develops, hopefully we can apply successful methods, like Blevins’, to different bodies of documents. This could lead to more standard digital workflows (as opposed to standard tools), which I think could benefit the field as a whole.
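As a rough illustration of what a first step toward that comparison might look like, the Python sketch below counts how often a short list of place names appears in a piece of text. This is a deliberate simplification and not necessarily Blevins’ actual workflow. The place list, the sample text, and the function name `count_place_mentions` are all assumptions made up for the example.

```python
import re
from collections import Counter

# Hypothetical gazetteer and sample text, for illustration only.
PLACES = ["Baltimore", "Montgomery", "Richmond", "New Orleans"]

def count_place_mentions(text, places):
    """Count whole-word, case-sensitive mentions of each place name."""
    counts = Counter()
    for place in places:
        pattern = r"\b" + re.escape(place) + r"\b"
        counts[place] = len(re.findall(pattern, text))
    return counts

sample = (
    "News from Montgomery arrived in Richmond today. "
    "Travelers from New Orleans and Montgomery report quiet roads, "
    "while little has been heard from Baltimore."
)

mentions = count_place_mentions(sample, PLACES)
for place, n in mentions.most_common():
    print(place, n)
```

A real project would need to handle much messier problems, such as OCR errors, ambiguous names (there is more than one Richmond), and the difference between a place that is mentioned in passing and one that is discussed at length.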