
The Scale of Digital

The question of scale is one that all historians must consider in their work. Is the project a microhistory that explains larger movements through representative (or unique) examples? Or does it explain the grand movements of history across time and space from a distant perspective? All of the projects in this week’s readings engage the question of scale by allowing users to navigate freely between near and distant readings of their subject. Ed Ayers and Scott Nesbit explain how digital history projects can make the leap between the micro and the macro with their work on Visualizing Emancipation.

Ayers and Nesbit successfully move from the overarching legal and political complexities influencing abolition on a national scale to how emancipation played out among individual slaves in the Shenandoah Valley. Their article was very clear and powerful in unpacking the web project. The map was also interesting, though less powerful in its explanation of space and time. So many individual events occur during the time lapse (movements of the Union Army without associated emancipation events, for example) that it was difficult to follow the argument made so successfully in their article. The map did allow searching the metadata of events, which was useful for viewing material related to particular towns or people.

The other projects, ORBIS and Digital Harlem, also displayed the power of scale. ORBIS is essentially an argument for scale, literally modeling the time it took for information or goods to move around the Roman Empire. By including the farthest cities when analyzing a network, the scale of the empire is made plain. Yet beyond the empire’s scale, few other insights are available. It could be useful for a scholar of military history to know the exact time a letter took between London and Rome when examining decisions in a military campaign, or to compare the “networks” of different cities in the hinterlands of the empire. But I suspect the accuracy such a scholar would require isn’t served by the best guesses of ORBIS’s creators. I do admit that the map functions look very attractive and would be useful in educational contexts with students at a variety of levels.

I thought Digital Harlem also navigated the issues of scale with elegance. By showing clusters of interaction, users can see larger historical patterns in Harlem along with the individual interactions themselves. By including the other uses of each address, certain aspects of the neighborhood were quite successfully recreated. Both the individual scale and larger time scales can be explored.

The question of scale has also been at the forefront of my mind as David Armitage and Jo Guldi advocate for longer timescales and deeper narratives so that historians can reclaim their place in public intellectual discussion. They are correct that historians no longer have the ear of policy makers as we once did. Columnists for major papers include economists like Paul Krugman, but few to no historians. In The History Manifesto, Armitage and Guldi see longer scales of time and space as a way to influence policy discussions. By examining the long term, we can avoid “short-termism” (nice Google Ngram usage in the introduction). Their argument links well to the strengths of digital history, which allows shifting between large timescales without obscuring the individual, as economists are prone to do.

Networked Judges

We are attempting to employ network tools this week. My “voices” on the Santa Barbara Oil Spill did not seem to connect in useful ways, so I turned to a different source: the Biographical Directory of Federal Judges. This database is extremely useful, containing extensive biographical information on more than 3,000 federal judges dating back to the colonial era. Unfortunately, it is also inordinately messy, containing over 200 columns (“cleaning” the data was an exercise in Clio 3). With my R data cleaning skills and Google’s OpenRefine, I was able to arrive at the columns I would need for some simple network analysis of the judges.
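For anyone curious what that cleaning step looks like in code, here is a minimal sketch in Python with pandas (my actual cleaning used R and OpenRefine, and the column names here are hypothetical stand-ins for the Directory’s real headers), including the school-name normalization described below:

    import pandas as pd

    # Load the full Directory export (200+ columns in the real file).
    judges = pd.read_csv("federal_judges.csv")

    # Keep only the columns needed for a simple school network.
    # These column names are hypothetical stand-ins for the real headers.
    judges = judges[["name", "race_or_ethnicity", "school_1", "school_2"]]
    judges = judges[judges["race_or_ethnicity"] != "White"]

    # Normalize school names so that, e.g., Harvard Law and
    # Harvard College both count as Harvard University.
    harvard = {"Harvard Law School": "Harvard University",
               "Harvard College": "Harvard University"}
    for col in ["school_1", "school_2"]:
        judges[col] = judges[col].replace(harvard)

    # Reshape to a judge-school edge list that Palladio can ingest.
    edges = judges.melt(id_vars="name", value_vars=["school_1", "school_2"],
                        value_name="school").dropna(subset=["school"])
    edges[["name", "school"]].to_csv("judge_school_edges.csv", index=False)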

I chose to examine only non-white judges, who make up about 10% of all federal judges. I expected that, due to legacies of segregated schooling, most non-white judges would come from a limited number of traditionally African American colleges and law schools. These networks of judges would then provide their school peers with clerkships and other opportunities, as was common within African American church and business communities. However, this did not appear to be the case at all for federal judges. Among the 368 non-white judges, over 200 schools were represented (my data). Due to the nature of the Directory data, a school might represent a judge’s undergraduate degree or law degree, though attending Harvard Law or Harvard College was normalized into Harvard University. Even with this normalization, there were few school “networks” according to Palladio:

The schools network in RAW was similarly attractive, yet fragmented:

Ultimately, I had major problems with all of the networking applications. They had strict data requirements, were buggy, and some aspects did not work on my computer. Gephi’s data requirements made it unusable for this particular project, though it clearly has some power if one is able to scale its steep learning curve. RAW provided code to embed in blog posts, but WordPress stripped all HTML tags, leaving what appeared to be just a list of colleges (this might be a WordPress problem rather than a RAW problem). Fortunately, I don’t expect to need much network analysis in my work on the Santa Barbara Oil Spill.

I revisited some of my Santa Barbara Voyant visualizations from last week. I had only begun to play with the categories and word frequency visualizations, so I thought I would embed a few more to showcase this tool more fully. Many of these visualizations will make more sense after reading the background to last week’s practicum.

Last week I checked “oil”, “spill” and some other similar words in various word frequency counters. Voyant also mapped these frequencies:

It was also interesting, in light of this week’s network analysis, to see networks between the words in the letters to the editor:


While the network tools evaluate the actual connections between two nodes, Voyant’s collocates tool simply measures how often two words appear near each other. While these visualizations look similar, they rely on very different statistical analyses “inside the box.”
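To make the difference concrete, here is a minimal sketch of what a collocation count does under the hood, assuming the letters live in a single plain-text file (a hypothetical filename): it simply tallies which words fall within a window around a target term, with no real graph structure involved.

    from collections import Counter

    def collocates(text, target, window=5):
        """Count words appearing within `window` tokens of `target`."""
        words = text.lower().split()
        counts = Counter()
        for i, w in enumerate(words):
            if w == target:
                lo, hi = max(0, i - window), min(len(words), i + window + 1)
                counts.update(words[lo:i] + words[i + 1:hi])
        return counts

    text = open("letters_to_the_editor.txt").read()  # hypothetical filename
    print(collocates(text, "oil").most_common(10))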

Connecting the Disconnected

Our reading assignments have generally been grouped into a few themes each week: web presence vs. internet history, old vs. new definitions of digital history, and practical vs. theoretical digitization. This week offered a wider variety of material, though each piece falls along the axis of case study, practical tutorial, or theoretical essay. Lauren Klein’s article on James Hemings and his presence/silence in the letters of Thomas Jefferson encompasses all three categories.

The article provides, via network analysis, a history of someone who was virtually off the network. The history of slavery is filled with similar stories of reclaiming a voice silenced by that institution’s systemic repression, but this is a stark case emanating from the digital world. James Hemings’s story was not the type of existence history usually ignores: he held a skilled position with which he won his freedom, was literate, and made his way into the historical record. Yet he was almost invisible within the Papers of Thomas Jefferson. Klein finds his story nonetheless. As a work of digital history theory, the important part of Klein’s method is that it shifts the unit of the digital archive from the piece of correspondence, the letter or document, down to the word, in this case the name James Hemings. We have gone from dismantling the archive in the first weeks to dismantling the document. While dismantling the archive with keyword searching can be dangerous given false negatives and poor OCR, Klein uses the individual names in letters to provide a view of Jefferson’s slave/free black network in a rigorous, reproducible way.

The article provides a case study because finding the invisible in the archives, even digital archives, is a prototypical task of the historian, especially when the “archive” is the papers of the author of the Declaration of Independence, a classic American archive. A common historical task is bringing a silent historical actor to the fore: a slave, an oppressed political actor, or a common worker. Traditionally, historians find these stories in the liminal spaces of the archives. Klein finds Hemings’s story in those same liminal spaces, but by a digital method. The article veers into a tutorial because she provides the exact steps she took (search for Hemings papers > named entity recognition on 51 JH letters > network analysis). Presumably, the Python code could easily be made available to make this a reproducible case study.
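Klein’s actual code isn’t published with the article, but the pipeline is simple enough to sketch. A rough approximation in Python using the spaCy library, with a hypothetical directory of letter transcriptions:

    import itertools
    from collections import Counter
    from pathlib import Path

    import spacy  # pip install spacy; python -m spacy download en_core_web_sm

    nlp = spacy.load("en_core_web_sm")
    edges = Counter()

    # "letters/" is a hypothetical directory of plain-text transcriptions
    # of the 51 letters that mention James Hemings.
    for letter in Path("letters").glob("*.txt"):
        doc = nlp(letter.read_text())
        people = sorted({ent.text for ent in doc.ents if ent.label_ == "PERSON"})
        # Every pair of names co-occurring in a letter becomes a weighted edge.
        for pair in itertools.combinations(people, 2):
            edges[pair] += 1

    # The weighted edge list can then feed a network-analysis tool.
    for (a, b), weight in edges.most_common(20):
        print(a, "--", b, weight)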

I also appreciated that Klein’s network visualizations were simpler than Kaufman’s analysis of the Kissinger memcons. It takes maturity to map only what is necessary for your argument and no more. According to Scott Weingart, if the visualization is not advancing your argument, it is not appropriate. And here is where Klein succeeds. Her arc diagram does not include the entire Jefferson papers, but the subsets that promote her argument: either the elites of Virginia or those who mention James Hemings. These effectively show the interaction in Jefferson’s world between different classes, interactions that would be found more or less easily using traditional historical methods.

The other article that particularly interested me was Johanna Drucker’s essay on graphical display in the humanities. It resonated with me because it provides such an interesting antidote to the traditional view of digital data, arguing instead that the digital humanities embrace capta. Drucker feels that the often ambiguous nature of the humanities, problematizing by design, should be represented similarly in our graphical displays. A similar argument has been made against quantitative historians, so it is one we should bear in mind. Historians and digital humanists should never forget that one of our important disciplinary strengths is to keep the complexity of history at the forefront. The History Manifesto has encouraged me to compare our work to that of the economists who are so present in public discourse. An important distinction when considering the complexity of life: economists work with data; humanists work with capta.

Word Counters

This week we have transitioned to the tools portion of Clio 1 and are examining “ngram viewers” and text mining. These have a simple process: count words -> visualize. While the tool is simple to grasp, its power comes from the questions asked of it plus the additional tweaks each program allows. To start, I needed some words that might be important to my project so I could chart how they changed over time. To get them, I loaded my Santa Barbara Oil Spill “voices” (letters to the editor) into Voyant and output every DHer’s first visualization: the word cloud.

The obvious word that sticks out is “oil.” So I started here.
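Under the hood, a word cloud is just that count-words step. A minimal sketch, assuming the letters are saved in a single plain-text file (hypothetical filename) and using a toy stop-word list (Voyant’s is far more complete):

    import re
    from collections import Counter

    text = open("sb_letters.txt").read().lower()  # hypothetical filename
    words = re.findall(r"[a-z']+", text)

    # A tiny stop-word list; Voyant applies a much fuller one.
    stops = {"the", "a", "an", "and", "of", "to", "in", "that", "is", "it"}
    counts = Counter(w for w in words if w not in stops)

    print(counts.most_common(10))  # "oil" unsurprisingly tops my letters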

Bookworm’s Chronicling America (ChronAm) and Open Library (OL) start in the 19th century but are limited by copyright and end in 1921. New York Times Chronicle searches from 1851 to 2014. The NYT was clearly most useful, as it was one of the only journalism sources to include my time period.

NYT Chronicle Ngram: Oil

I’m dubious that there were really more discussions of oil in NYT articles of the 1860s. The likely reasons for the spike are mistaken OCR (I noticed some mistakenly placed articles), more discussions of kerosene, and the relatively small number of articles written at the time, which inflates relative frequencies.

Yet in reviewing the Chronicling America series of newspapers, a larger collection of mostly regional papers, the spike appears again, this time in 1840.

Bookworm ChronAm: Oil

Also notable is that there are fewer spikes compared to the NYT, perhaps because of the larger number of papers and the more scattered nature of regional news sources. Ultimately, this source ends in 1921, which is before the period I’m most interested in exploring.

Google Books provides one of the most widely used ngram viewers and also happens to include my time period. It provides further proof that the initial NYT ngram of oil (and the ChronAm one, to a degree) were outliers:

It also allows several words to be plotted at once on the same graph. Though “spill” was very important in my word cloud, it was less so in this visualization because it was swallowed up by the scale of the other words. However, the NYT Chronicle ngram provided a different perspective:

NYT Chronicle Ngram: Oil Spill

What’s interesting to me about adding “spill” is how strongly it correlates with “oil.” This suggests that most articles including “spill” also include “oil,” and it only starts to happen with any frequency around 1969. Of course, there are also major spikes around the Exxon Valdez, Ixtoc I, and Deepwater Horizon spills, the three other spills to receive major media attention (though Santa Barbara was orders of magnitude smaller in tonnes of oil spilled).
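That inference about co-occurrence is easy to check directly on a corpus one controls. A sketch against my own letters, assuming they sit in a hypothetical letters/ directory of plain-text files:

    from pathlib import Path

    docs = [p.read_text().lower() for p in Path("letters").glob("*.txt")]
    with_spill = [d for d in docs if "spill" in d]
    both = [d for d in with_spill if "oil" in d]

    # If most "spill" documents also contain "oil", the two frequency
    # series should rise and fall together, as in the NYT Chronicle ngram.
    print(f"{len(both)} of {len(with_spill)} 'spill' letters also mention 'oil'")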

Also interesting was that most of the other ngrams did not echo the correlation. Ben Schmidt, who helped develop the Bookworm Ngram, applied it to a corpus of movie and TV scripts to develop Bookworm Movies.

Bookworm Movies:

Ben Schmidt’s Movies Ngram: Oil/Spill

This ngram provides a look into the popular culture of a time period, and unlike many of the public-domain-based ngrams, it covers an almost complementary range of years, starting in 1931 and running, more or less, to the present. In this case, it appeared to have no correlation with the relative peaks and valleys of “oil.” Searching only documentaries for oil did provide some of the correlation I expected to find, though I don’t know whether there was enough data, nor why the most “documentaries” about oil date from the 1940s.

It’s possible that “spill” is less likely to be used in a book than in reporting. There were lots of books on oil, but not many on oil spills, per the previous Google Books ngram. Fortunately, Google allows additional techniques not found in the other ngram viewers. For example, the user can see the most common modifier for a word using the wildcard (*) character. In this example: “* spill”:

The result shows a strong rise in the use of “oil” with “spill” starting in 1967 and rapidly expanding in 1969. There is clearly much to unpack and caveat with these ngram viewers, but they can be an interesting tool in digital history research.
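The wildcard query is, in effect, a bigram count with one slot fixed. A sketch of the same “* spill” search over a local corpus (the filename is hypothetical):

    import re
    from collections import Counter

    text = open("corpus.txt").read().lower()  # hypothetical corpus file
    words = re.findall(r"[a-z']+", text)

    # Count the word immediately preceding each occurrence of "spill",
    # mirroring Google's "* spill" wildcard query.
    modifiers = Counter(words[i - 1] for i, w in enumerate(words)
                        if w == "spill" and i > 0)
    print(modifiers.most_common(5))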

So far I’ve discussed the programs where the user determines the word to be counted. The other type of ngram-esque text mining program is Voyant. It is much more powerful in the variety of visualizations one can reach, and it does not rely on the user for words to count: it counts every one. This provides value in situations where a researcher is unfamiliar with the material and needs to get a quick handle on it.

I downloaded VoyantServer, which allows users to run offline instances on their local machine. It came with some great instructions:

Mac: double-click

Windows: blah blah…. but if you do this thing wrong, it might break.

Linux: You don’t really need our help for this, right?

Clearly a project with a sense of humor. I loaded my 14 letters to the editor and explored the tools. Voyant is extremely complex in comparison to the ngram viewers, providing many different visualizations and ways to count the text. Here’s an example of the words around “oil” using keywords in context:


All of the expected terms are listed: oil spillage, oil drilling, oil company, etc. With the added complexity, I wasn’t able to arrive at the same simple argument I found with the NYT Chronicle and Google Books ngrams, but it is a tool worth the effort to explore in order to derive more skillful analysis.
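Keywords in context is also simple to approximate outside Voyant. A minimal sketch that prints a few words of context around every hit, again assuming the letters are in one hypothetical text file:

    import re

    def kwic(text, target, width=4):
        """Print `width` words of context around each occurrence of `target`."""
        words = re.findall(r"[\w']+", text.lower())
        for i, w in enumerate(words):
            if w == target:
                left = " ".join(words[max(0, i - width):i])
                right = " ".join(words[i + 1:i + 1 + width])
                print(f"{left:>35} [{w}] {right}")

    kwic(open("sb_letters.txt").read(), "oil")  # hypothetical filename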

Voyant is also probably the most user-friendly in its output options. The ngram viewers gave few or no ways to export the visualizations created by your word counting.

My Voyant text can be found here:

http://voyant-tools.org/?corpus=1412316423358.5283&stopList=stop.en.taporware.txt


Week 5: Theorizing Text Mining

This week’s readings either used text mining as part of their research methodology or provided theories on its best practices and uses. What most struck me about text mining, and Ted Underwood’s treatment of it, was its connection to complex mathematical algorithms. These algorithms are incredibly complicated (word frequencies can tell us more about a given text than we might intuitively imagine, even its genre and author), yet they are applied in a seemingly simple manner: our searches are lumped into a list by a “relevance” determined in ways opaque to the researcher. This opacity of research methods was echoed in last week’s readings as well, though those using the techniques in their own work described their workflows much more clearly.

Ted Underwood described text mining of documents as usually accomplishing one of two tasks: 1. finding patterns in the documents that become the literary/historical argument in and of themselves, or 2. using the technique as an exploratory tool to highlight clues where additional work using traditional analysis would prove useful. Gibbs and Cohen, Blevins, and Kaufman’s articles generally fall into the first category. I felt that Blevins provided the strongest argument based on text mining. He answered the “so what” question by grounding his work in theories of “producing space” and “imagined geographies” in Houston, and he did an admirable job of explaining his research process in great detail so that readers can adequately question his methods and research decisions.

Gibbs and Cohen’s article was particularly detailed in explaining its methodologies, apt since it was part of a journal issue revolving around new digital techniques and “distant reading.” Their article even describes the failure of certain terms to be statistically significant in their changes over time, for various reasons. In a traditionally researched narrative, the author wouldn’t describe letters that weren’t relevant to their search, but it’s important that digital historians describe not only their successes but also where a technique failed them in a surprising way. For example, they searched for instances of scientific “fields” and when the Victorians began considering them organized bodies of knowledge. This didn’t happen until very late in the 19th century, so there weren’t really enough entries to find a pattern, though the nearly 400 examples could provide new knowledge if given a traditional reading.

Micki Kaufman’s website is a very interesting example of a digital history project. She applies essentially the whole toolkit of text mining techniques to the body of Kissinger-related memos and summaries of telephone conversations. Some of these techniques were purely exploratory, for example creating the 40 categories into which the documents might fit. (As a sidenote: is the default number of MALLET categories 40? Is there any reason for this coincidence in Kaufman’s and Nelson’s work?) While Kaufman had some of the most stunning visual elements, her work suffered from the lack of a uniting theme. It was ultimately more exploratory than argumentative.

Rob Nelson’s “Mining the Dispatch” provides methodologies for reviewing what appeared in the Richmond Dispatch during the Civil War and generally falls into the latter, exploratory category. By assigning a variety of categories to the articles, he can see what sort of information Richmonders were reading and how the frequency of those types of articles changed over time.

I would be curious to see Cameron Blevins’s techniques used on the Dispatch to provide an “imagined geography” of the city of Richmond. I wonder if there would be much more familiarity with places south of the Mason-Dixon line than immediately north of it, even when those northern areas were geographically much closer (Baltimore, Maryland vs. Montgomery, Alabama). As the digital history field develops, hopefully we can apply successful methods, like Blevins’s, to different bodies of documents. This could lead to more standard digital workflows (as opposed to standard tools), which I think would benefit the field as a whole.

404 Error: Database Not Found

For over 20 years, students, scholars, and the general public have enjoyed the benefits of full-text search of formerly paper documents. Many workflows might look something like this: Google -> Wikipedia -> JSTOR -> Chronicling America/ProQuest historic newspapers -> Digital Dissertations -> digitized archives -> physical archives -> fill in gaps from other digital sources -> add something “transnational” from that digitized French archive you could never visit otherwise. Collect everything relevant and write it up.

Yet in a review of three years of Environmental History, the primary journal of global environmental history scholarship, articles contained very few explicit mentions of this now-common workflow. There were a handful of references to online material: a blog post, an article in a digital encyclopedia, or even a reference to a particular longitude/latitude on Google Earth. But unsurprisingly, most citations were to books (likely read in physical form, but what about pre-1923 Google Books, Open Library, etc.?), physical articles (probably read via JSTOR/ProQuest/etc.), or archives (very likely still physical, but possibly online). That authors didn’t cite what they actually consulted is a problem, as a historian’s methodology is central to their authority as a scholar.

But this post is less about silence toward the database and more about explicit references to databases or online sources in the articles of the last three years. Two references were particularly interesting in representing the new scholarship: John McNeill’s presidential address, “Toynbee as Environmental Historian,” and Daniel Simberloff’s “Integrity, Stability, and Beauty: Aldo Leopold’s Evolving View of Nonnative Species.”

McNeill’s address concerned Arnold Toynbee, the (in)famous public intellectual and historian, as a proto-environmental historian, and what “big” histories could contribute to the field as a whole. In preparing the address, McNeill “read only 2% of [Toynbee’s] output” (10-15 million words). He laments:

When I chose the subject for this address, I ignorantly assumed that Toynbee’s works would be available via Google books and I could instantly locate all the passages that use words such as environment or ecology. By the time I learned that I would have to work from the printed pages, it was too late to change my plans. But happily Veronica Boulter Toynbee, who worked as hard as her husband and a bit more carefully, prepared the indexes for all his major books.

Full-text search is so ingrained in our methods that even the president of the American Society for Environmental History assumes it in his workflow. Had the books been digitized, I doubt McNeill would have described so completely how he arrived at the requisite references. He might have given us search terms and insight into the relevant passages, as he does in his address, but he would not likely have informed us of the OCR accuracy or of any false positives or negatives.

The other interesting example comes from Simberloff’s article on Aldo Leopold. I don’t mean to pick on Simberloff, who has written an interesting article exploring one of the conservation movement’s important figures and his views on diversity vs. stability in academic ecology. But Simberloff appears to derive much of his archival material, beyond Leopold’s books, exclusively from the University of Wisconsin’s digitized Aldo Leopold collection.

The UW-Madison collection offers a variety of methods for approaching its digital works. It helpfully provides some guidelines:

Most users with a scholarly or general interest in Aldo Leopold will find that the Detailed Contents List provides the best access to the collection. It describes each file series and, within each series, each box and folder in the collection, and there are links from the description of each folder directly to the digitized material in those folders.

Alternatively, readers can use the full-text search capability (the site doesn’t list the OCR accuracy rate or any additional processing). This method is best for those “who are primarily interested in whether Leopold had any connection with a particular person, place, or topic,” though it does not cover his handwritten correspondence. Simberloff does not describe which method he used in researching Leopold, and this is a problem for the reader. Did Simberloff ignore the handwritten material while performing a full-text search for “nonnative” or “ecology”? Or did he review the archive front to back?

These are serious questions that our discipline needs to answer as the archive may literally look different each time we approach it.

Reproducible Research

This week’s readings concerned databases and the elegant, reproducible methods of getting historical knowledge out of them. Many of the themes were familiar, echoing a graduate seminar on local archives I’d taken in the past. The questions of authority, how to extract knowledge, and the role and actual work of the historian are at play in archives both digital and physical.

Tim Hancock’s chapter covered a number of interesting topics, but I fixated on his casting of the historian as an archives “expert.” The historian would toil at the archive, expertly pulling substantive examples from a huge collection of non-relevant material to support their work. The work would then be published, launching others to build upon the nuggets of information mined from the opaque archive. In the digital archive, anyone can search and unearth their own information, democratizing the historical process (which most agree is one benefit of the digital).

But I wondered where this leaves the professional historian. If the public doesn’t require someone with the peculiar expertise to determine the “correct” evidence, what is the point of our discipline and its requisite training? This was answered by Lara Putnam in “The Transnational and the Text-Searchable.” Putnam shows that just as historians once had to search the liminal spaces for their research topics, reading “against the grain” of the “historia patria” at the heart of many national archives’ missions, they must still do so in their digital searches. Searching online can provide the documents, but it often doesn’t provide the context or the local knowledge of a physical archive.

The hazards of new digital methods were also echoed in McDaniel, Mussell, Nicholson, and Spedding. This is where the professional historian can surpass the amateur. By gleaning the meaning hidden in the cracks of the databases, understanding the technologies underlying the documents (the limits of OCR, for example), and providing reproducible research methods, a professional historian can still provide value in the pursuit of the past.

One of McDaniel’s points about reproducible research methods was to cite the electronic version of a document if you actually used the electronic version. In her 2014 article, Lara Putnam does so, but still has broken links. One link is broken because she mistyped the link to this page, and the reader might not be able to find the object without conducting their own search. This speaks to one of the major problems with citing electronic objects: they might not exist in that spot in perpetuity the way a physical book with an LC call number does. This is another place where professional historians will need to evolve our methods to provide truly reproducible research that ensures our peers or the public can follow us through the evidence we find.
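Catching rotted citations like this is at least partly automatable. A sketch using the requests library to flag dead links in a hypothetical list of cited URLs:

    import requests  # pip install requests

    cited_urls = [  # hypothetical list pulled from a bibliography
        "http://voyant-tools.org/",
        "http://chroniclingamerica.loc.gov/",
    ]

    for url in cited_urls:
        try:
            status = requests.head(url, allow_redirects=True, timeout=10).status_code
        except requests.RequestException:
            status = None
        if status != 200:
            print(f"BROKEN ({status}): {url}")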

But the readings don’t quite address a reproducible workflow from start to finish, a set of instructions: 1. go to X database, 2. use Y search term, 3. organize the data in Z way (sketched below). One of the strengths of computers is that they run a set of instructions the same way given the same parameters and inputs, so digital history should leverage this ability. Such a workflow would look more like William Turkel’s vision of history and be barely recognizable to those who rely exclusively on physical documents in archives for their research. Yet the databases *are* different, and we need to recognize that fact and leverage it where it’s useful: to search across traditional collections, as in Nicholson’s media culture history, while bewaring the pitfalls of divorcing documents from their sense of place, as Putnam points out. Manovich is correct that databases employ a sense of narrative distinct from traditional history, so the discipline can’t continue to pretend we work the same way we always have.
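As a gesture toward such a workflow, here is a sketch of a parameterized search against Chronicling America’s public JSON API; the endpoint is documented by the Library of Congress, though the result fields I pull out are my assumption about its response shape:

    import requests

    def search_chronam(term, pages=1):
        """Step 1: go to the database. Step 2: use the search term.
        Step 3: return the data in a fixed, reproducible shape."""
        url = "https://chroniclingamerica.loc.gov/search/pages/results/"
        results = []
        for page in range(1, pages + 1):
            resp = requests.get(url, params={"andtext": term, "format": "json",
                                             "page": page}, timeout=30)
            resp.raise_for_status()
            for item in resp.json()["items"]:
                results.append((item["date"], item["title"], item["id"]))
        return results

    for date, title, item_id in search_chronam("oil spill")[:5]:
        print(date, title, item_id)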

Optical Character (mis)Recognition

The process of using optical character recognition to make a series of texts machine-readable is one I’m learning about for my final project in Clio 3. In theory, successfully generating the text for one page should be simple and straightforward. The process has indeed been relatively simple, but it has been difficult to achieve great results.

My first attempt at OCR was on Professor Robertson’s research sample, the first page of a hearing that he photographed at the Pinkerton Detective archive. The photos are in color with decent resolution, but they are rotated 90 degrees to fit the landscape orientation of the camera. Rotating them was the first change I made, as OCR requires correctly oriented lines of text. I used Google Drive to OCR the image, which was very simple once I had my settings correct. It did take me a rather long time to realize that Google’s OCR output is a Google Doc with the image first and the recognized text below it. There were clearly major problems with this output (P7150001-fixed.JPG).

First, the text on the subsequent pages partly bled through due to the very thin nature of the paper (many archivists aptly call that type of paper “onion skin”). Another source of errors was that the photos were in color, so the contrast between the brown/tan paper and the dark brown text was lower than it would be in black and white. Color can help the human eye decipher difficult words, but it hinders the computer’s. So I transformed the photo into black and white with higher contrast: P7150001-bw.jpg.

This increased the contrast between the text and the page and reduced the bleed-through from the thin paper. Yet my OCR actually became worse! Another problem was non-characters in the document, like dashed lines in the heading and shadows at the crumpled edges of pages, which contributed to OCR errors when the machine tried to read these “suspicious characters.” I couldn’t eliminate the border around the title, but I could crop the image to eliminate the edges: P7150001-crop.jpg. This reduced errors and produced a readable sentence, even though significant parts of the text were still missed. The improvements did not achieve perfectly formatted text, but they did arguably improve the results.
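All three fixes (rotate, boost contrast, crop) are scriptable. A sketch with the Pillow and pytesseract libraries, using a hypothetical filename and crop box:

    from PIL import Image, ImageOps
    import pytesseract  # pip install pytesseract; requires tesseract-ocr

    img = Image.open("P7150001.JPG")                # hypothetical filename
    img = img.rotate(-90, expand=True)              # fix the landscape orientation
    img = ImageOps.grayscale(img)                   # drop the brown-on-tan color
    img = ImageOps.autocontrast(img)                # sharpen text/page contrast
    img = img.crop((100, 100, img.width - 100, img.height - 100))  # trim edges

    print(pytesseract.image_to_string(img))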

I could have applied the techniques the Pinkerton text required (align, increase contrast, and crop) to the hundreds of legal documents, but that would take a serious amount of time. So I harnessed the power of my computer and the code of others.

A little bit about my process for OCRing: I used wget to download all of the Indian Claims Commission cases from the Oklahoma State University website. These exist as un-OCR’d PDFs of the Decisions, which range from the late 1940s to the late 1970s, seemingly perfect fodder for a good OCR process. I employed Civil-Procedure-Codes, a package created by Kellen Funk and modified by Lincoln Mullen, to OCR the PDFs. The package uses tesseract-ocr, considered one of the best open source options, and it automatically performs some of the processes I applied to the single page: it creates bitonal TIFF files out of the PDFs, because tesseract does better with that file type, and then outputs the OCR as structured text files.
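The package itself is a makefile-driven pipeline, but its core steps can be approximated in a few lines of Python, assuming the pdf2image and pytesseract libraries (this is my sketch, not the package’s actual code):

    from pathlib import Path

    import pytesseract  # pip install pytesseract; requires tesseract-ocr
    from pdf2image import convert_from_path  # pip install pdf2image; needs poppler

    for pdf in Path("icc_decisions").glob("*.pdf"):  # the wget-downloaded PDFs
        pages = convert_from_path(str(pdf), dpi=300)
        # Binarize each page first, since tesseract does better with
        # high-contrast bitonal images (the package uses bitonal TIFFs).
        text = "\n".join(pytesseract.image_to_string(p.convert("1")) for p in pages)
        pdf.with_suffix(".txt").write_text(text)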

With such a sophisticated program, I expected even better results. Yet when I counted the first 100 words, only 65 were correct, an error rate of 35%. Of the correctly read words, 29 were articles, pronouns, prepositions, or single-letter initials, all words I would not likely search for in scholarly historical research. The software did get significantly better at correctly reading the text (or perhaps the later pages were clearer), so the overall error rate was lower than the 35% of the first 100 words.
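Hand-counting works for one page, but the same check can be scripted against a hand-corrected sample. A sketch, assuming a corrected transcription of the same page exists (both filenames hypothetical):

    ocr = open("decision_ocr.txt").read().split()
    truth = open("decision_corrected.txt").read().split()

    # Position-by-position comparison of the first 100 words; a real
    # evaluation would align the texts first to absorb insertions/deletions.
    sample = list(zip(ocr, truth))[:100]
    correct = sum(1 for o, t in sample if o == t)
    print(f"accuracy: {correct}/{len(sample)} = {correct / len(sample):.0%}")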

The other exercise for the week was to evaluate the Chronicling America newspaper OCR at the Library of Congress. The LOC helpfully allows you to see the raw OCR (they make no attempt to post-process their output text); many sites keep the OCR text layer hidden, making it difficult to actually evaluate a site’s data quality. I first viewed the front page article “100 years ago today.” The Daily Missoulian’s OCR is here. Right off the bat, it misses the title of the front page article: “Montana Progressives Indorse Compensation Act” reads as “Montana P. R ESSIVE DOR Compensation Act.” This gets many words, but misses the one many might use as a search term: “Progressives.”

I also wanted to test the OCR on one of the older newspapers. I searched for the oldest example of “swamp” in any Washington, DC newspaper and found the Daily Evening Star from 1852. The OCR had only 4 errors in the first 40 words I checked around “swamp,” a very good accuracy rate. It’s possible this was not digitized from microfilm, as the image quality was fantastic. I also wanted to check a much more modern example, the 1922 Washington Evening Star. This example of “swamp” was also extremely well OCR’d, with a mistake rate equal to the 1852 paper’s. Perhaps the Washington Evening Star is an outlier among Chronicling America papers, but the image quality and resulting OCR text were at a very high level. After my difficulties transforming some of my own material with what I thought was a state-of-the-art package, I assumed the quality here would be similarly low. Yet the Washington papers were a pleasant surprise.

When I was signing up for an account recently, I was given a captcha so incredibly complicated, with lines, twists, blurs, and distortions, that it took me dozens of attempts to succeed. Clearly historians need to enlist the creators of spam bots, who are constantly innovating in character recognition and defeating simple captchas, in creating better OCR that can unlock the scanned documents of the past.

Under the Hood: Digitization

The building blocks of historical scholarship are documents. This is no different for digital historians, with the distinction that these documents must be transformed into a machine-readable format (if not “born digital”). Week Three’s readings provided additional context on how this happens and what historians can gain or lose in the transformation.

Cohen and Rosenzweig’s chapter and Tanner’s article detail the real costs of digitization projects, both in time and money, as well as the different techniques available to turn a physical page into one on a computer screen. These readings are both more than eight years old, yet Tanner is still cited by Ian Milligan’s 2013 article (about the effects of optical character recognition on the practice of Canadian history). Clearly, the state of digitization has stagnated to a certain extent compared to other technologies (consider where mobile phones were in 2004 versus today).

When reading Tanner, I was immediately suspicious of the authority and quality of advice that a professional consulting service might offer. Yet Tanner did not hawk his services or insist that the subject was too difficult for the layperson and better left to KDCS. His report was a clear and concise description of OCR, and this is likely why it is still being cited. As a sidenote, the KDCS site doesn’t seem to list Tanner’s report, but instead links to an open source work by Cornell University as the best introduction to digitization. I find this a little surprising given that they’re supposed to be the “expert” consultants, but it is certainly refreshing compared to the products of many consultants.

The advice on digitization projects is important, as there are many different levels of digitization: scanned image, image+OCR, structured OCR text, and OCR text+XML markup. Both Tanner and Cohen/Rosenzweig do a great job of pointing out the expenses each of these methods entails. It is fascinating, as Cohen and Rosenzweig report, that a re-keyed document might be both more correct and less expensive than manually fixing poor OCR.

On the creation side of digitization, there are many considerations for the would-be scanner to bear in mind. There are just as many considerations for the users of digitized materials. Ian Milligan does a fantastic job of explaining how Canadian historians have changed the topics they pursue and the sources they use since the digitization of several Canadian newspapers in 2005. This change was foretold by Ayers, Cohen, and Rosenzweig in their writings on the nature of digital history, but it is interesting to see how it has actually affected the scholarship. Marlene Manoff is right that “the medium is the message”: historians are increasingly finding the paths of least resistance, for better or worse.

While many of Milligan’s arguments concern how the accessibility of online newspapers affects dissertation citations, Bob Nicholson has used digitized newspapers to perform an entirely new type of cultural history research. By searching across all articles (Text -> Title/Date/Newspaper) instead of the typical hierarchical search (Newspaper -> Date -> Title -> Text), Nicholson can see how his research topic, American phrases in common British usage, changed over time. This method of searching OCR’d images for specific terms was precisely what O’Malley and Takats discussed as a potential way to replicate their own cultural history research, done the hard way in French archives or by reading dozens of American newspapers. While O’Malley, Takats, and Nicholson ended with similar products, Nicholson was much more efficient in his use of OCR’d newspapers. Thus it was, in my opinion, a true digital history project.

Both creators and users of digitized material need to keep in mind the materiality of digital objects. Conway points out the many decisions made in reproducing photographs. When viewing a digital photo, or a photo of an object in a museum, one is actually viewing a representation created under particular viewing conditions, lighting, and cropping. The creator or scanner must attempt to recreate the object with the author’s intent in mind, but the user must remember that the object has gone through more than one level of processing, and that changes to the message occur at each level.

As a user and future creator of digital objects, I better appreciate the many decisions that influence a digitized product. With this insight, I will need to use a variety of search techniques to minimize the effect of “suspicious characters” (OCR mistakes) on my web searches and textual analysis. I won’t immediately reach for structured XML text as my OCR technique of choice now that I know the significant hurdles that remain once the uncorrected OCR text is created (though I now appreciate the XML work done by the Walt Whitman Archive and others). After going “under the hood” of OCR and digitization, I see the serious level of complexity and thoughtfulness required to create the building blocks of future digital history.

Environmental History Online

This week we reviewed what already exists online for our particular subjects and potential Clio 1 project topics. This will help us understand how our projects might use these resources, and where our work will fit into existing digital scholarship.

I began by reviewing the general online collections related to environmental history in the 20th century. This led me to a variety of interesting sources, stories, reference sites, and general information, but few digital history projects or corpora of documents. One interesting project was the “Land Use History of the Colorado Plateau,” which explored the Colorado Plateau from a variety of disciplines (it is also interesting as an example of early academic web projects, published in 2002).

Another multidisciplinary project was the 1969 Santa Barbara Oil Spill site published by the University of California, Santa Barbara Geography Department, which includes historical photographs, material from the Marine Protected Areas project, articles putting the spill in context, and other resources.

What struck me about this project was the relative lack of contributions from historians. As such, I thought this might make for an interesting project for which little historical scholarship exists online. I searched for more resources and found Darren Hardy’s “1969 Santa Barbara Oil Spill” project, created while he earned his Master’s in Environmental Science and Management. Though much of the website reuses articles also linked at the UCSB Geography Department’s site, he transcribed quite a few letters to the editor about the oil spill in 1969. Even more useful, XML versions of these letters exist on his website and might be suitable for easy text analysis.

Following the Google search, I examined other major websites of historical documents not necessarily indexed by Google. The Library of Congress has an enormous amount of material, including collections on the American West, historic newspapers, and American Memory. Unfortunately, the American West collection is almost entirely pre-1920, and there is little on the Santa Barbara Oil Spill of 1969.

I also found some interesting newsreel footage of the oil spill from the Prelinger Archives (indexed by the Internet Archive). The footage is quite long and silent, so it is not particularly interesting in its current form. However, the Prelinger Collection carries the most permissive copyright license, a Creative Commons public domain license, which allows the footage to be used and reused in any way desired. While I don’t envision much video in my final project, it could be a useful part of a website if remixed in an interesting way (perhaps a Clio 2 project).

The Center for the American West at CU-Boulder had quite a few publications on oil and energy topics, but focused on oil produced by hydraulic fracturing. The Digital Public Library of America (or its partners) contained few recent sources, but did have the text of a Senate hearing on the Santa Barbara Oil Spill, which could be interesting to compare to the letters to the editor from Darren Hardy’s web project.

In this search I’ve learned much about what environmental history resources exist online. While I’m excited about the possibilities of exploring the Santa Barbara oil spill of 1969, there are far fewer general resources than I expected. Relevant to post-WWII environmental history, there are none of the Google Books public domain works, little in the historic newspaper collections, and few of the large-scale digitization projects that have transformed legal history or literature. I expect that one of the strengths of digital environmental history might be its mappability and the large amounts of publicly available data provided by government agencies like the USGS, FSA, and NPS. But that is for a future project. In the meantime, I will focus on Santa Barbara and the GIS shapefiles from the California Marine Protected Areas Database.