Under the Hood: Digitization

The building blocks of historical scholarship are documents. This is no different for digital historians, with the distinction that their documents must first be transformed into a machine-readable format (unless they are “born digital”). Week Three’s readings provided context on how this transformation happens and what historians gain or lose in the process.

Cohen and Rosenzweig’s chapter and Tanner’s article detail the real costs of digitization projects, both in time and money, as well as the different techniques available to turn a physical page into one on a computer screen. Both readings are more than eight years old, yet Tanner is still cited in Ian Milligan’s 2013 article on the effects of Optical Character Recognition (OCR) on the practice of Canadian history. Clearly, the state of digitization has stagnated to some extent compared to other technologies (consider where mobile phones were in 2004 versus today).

When reading Tanner, I was immediately suspicious of the authority and quality of advice that a professional consulting service might offer. Yet Tanner did not hawk his services or insist that the subject was too difficult for the layperson and better left to KDCS. His report is a clear and concise description of OCR, which is likely why it is still being cited. As a side note, the KDCS site doesn’t seem to list Tanner’s report, but instead links to an open-source work by Cornell University as the best introduction to digitization. That is a little surprising given that they are supposed to be the “expert” consultants, but certainly refreshing compared to the products of many consultants.

The advice on digitization projects matters because there are many levels of digitization, from a scanned image to image plus OCR, structured OCR text, and OCR text with XML markup. Both Tanner and Cohen/Rosenzweig do a great job of laying out the expenses each of these methods entails. It is fascinating, as Cohen and Rosenzweig report, that re-keying a document can be both more accurate and less expensive than manually correcting poor OCR.

On the creation side of digitization, there are many considerations for the would-be scanner to bear in mind, and just as many for the users of digitized materials. Ian Milligan does a fantastic job of explaining how Canadian historians have changed the topics they pursue and the sources they use since the digitization of several Canadian newspapers in 2005. This change was foretold by Ayers, Cohen, and Rosenzweig in their writings on the nature of digital history, but it is interesting to see how it has actually affected the scholarship. Marlene Manoff is right that “the medium is the message”: historians are increasingly finding the paths of least resistance, for better or worse.

While many of Milligan’s arguments concern how the accessibility of online newspapers affects dissertation citations, Bob Nicholson has used digitized newspapers to perform an entirely new type of cultural history research. By searching across all articles (Text → Title/Date/Newspaper) instead of the typical hierarchical search (Newspaper → Date → Title → Text), Nicholson can see how his research topic, American phrases in common British usage, changed over time. This method of searching OCR’d images for specific terms is precisely what O’Malley and Takats discussed as a potential way to replicate their own cultural history research, done the hard way in French archives or by reading dozens of American newspapers. While O’Malley, Takats, and Nicholson ended with similar products, Nicholson was much more efficient in his use of OCR’d newspapers. Thus it was, in my opinion, a true digital history project.
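Nicholson’s “text-first” approach can be sketched in a few lines of code. This is purely an illustrative toy, not his actual tooling; the articles, dates, and the `search` function are invented for the example:

```python
# Illustrative sketch of text-first searching: query the OCR'd full text,
# then read off the title/date/newspaper of each hit. All records invented.
from dataclasses import dataclass

@dataclass
class Article:
    newspaper: str
    date: str       # ISO date string, e.g. "1887-03-12"
    title: str
    ocr_text: str   # uncorrected OCR output

corpus = [
    Article("The Era", "1887-03-12", "Stage Notes", "a real live yankee phrase ..."),
    Article("Punch", "1890-07-04", "Americanisms", "the phrase 'OK' is creeping in ..."),
    Article("The Times", "1889-01-20", "Leader", "no transatlantic slang here ..."),
]

def search(corpus, term):
    """Text -> Title/Date/Newspaper: return metadata for every article
    whose OCR text contains the term, in chronological order."""
    hits = [a for a in corpus if term.lower() in a.ocr_text.lower()]
    return [(a.date, a.newspaper, a.title) for a in sorted(hits, key=lambda a: a.date)]

print(search(corpus, "phrase"))
```

Charting how often such hits occur per decade is then a matter of counting, which is exactly the kind of question the old hierarchical search made impractical.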

Both creators of digitized material and their users need to keep in mind the materiality of digital objects. Conway points out the many decisions made in reproducing photographs. When viewing a digital photo, or a photo of an object in a museum, one is actually viewing a representation created under particular viewing conditions, lighting, and cropping. The creator or scanner must attempt to reproduce the object with the author’s intent in mind, but the user must also remember that the object has passed through more than one layer of processing, and that the message changes at each layer.

As a user and future creator of digital objects, I better appreciate the many decisions that shape a digitized product. With this insight, I will need to use a variety of search techniques to minimize the effect of “suspicious characters” (OCR mistakes) on my web searches and textual analysis. I won’t immediately reach for structured XML text as my OCR technique of choice now that I know the significant hurdles involved once the uncorrected OCR text is created (though I now appreciate the XML work done by the Walt Whitman Archive and others). After going “under the hood” of OCR and digitization, it is clear that a serious level of complexity and thoughtfulness is required to create the building blocks of future digital history.
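One concrete technique for working around “suspicious characters” is fuzzy matching rather than exact keyword search. A minimal sketch using Python’s standard `difflib`; the OCR line, the window approach, and the similarity threshold are my own invented illustration, not a recommendation from the readings:

```python
# Fuzzy search to catch OCR misreads (e.g. 'm' rendered as 'rn'),
# using difflib from the Python standard library.
from difflib import SequenceMatcher

def fuzzy_contains(text, term, threshold=0.8):
    """Slide a term-sized window over the text and report a hit if any
    window is similar enough to the term. Exact search would miss
    near-matches like 'governrnent' for 'government'."""
    text, term = text.lower(), term.lower()
    n = len(term)
    for i in range(len(text) - n + 1):
        if SequenceMatcher(None, text[i:i + n], term).ratio() >= threshold:
            return True
    return False

ocr_line = "the governrnent announced new tariffs"  # 'm' misread as 'rn'
print("government" in ocr_line)                 # exact search misses it
print(fuzzy_contains(ocr_line, "government"))   # fuzzy search finds it
```

Sliding a window over every position is slow on a large corpus, but it makes the idea plain: tolerate small character-level differences instead of demanding an exact string match.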
