Optical Character (mis)Recognition

For my final project in Clio 3, I'm learning how to use optical character recognition (OCR) to make a series of texts machine-readable. In theory, successfully generating the text for a single page should be simple and straightforward. In practice, the process has been easy enough to run, but it has been difficult to achieve great results.

My first attempt at OCR was on Professor Robertson's research sample: the first page of a hearing that he photographed at the Pinkerton Detective archive. The photos are in color with decent resolution, but rotated 90 degrees to fit the landscape orientation of the camera. Rotating the image back upright was the first change I made, since OCR requires correctly oriented lines of text. I then used Google Drive to OCR the image, which was very simple once I had my settings correct. It sadly took me a rather long time to realize that Google's OCR output is a Google Doc with your image first and the transcribed text below it. There were clearly major problems with the result (P7150001-fixed.JPG).
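Rotating photos one at a time in an image editor gets old fast. With the Pillow imaging library (my choice for these sketches, not a tool mentioned above), the rotation step is a one-liner:

```python
from PIL import Image

def rotate_for_ocr(img: Image.Image) -> Image.Image:
    """Turn a landscape photo upright so lines of text run horizontally."""
    # rotate(90) turns counterclockwise; use -90 if the camera tilted the other way.
    # expand=True enlarges the canvas so no corner of the page is clipped.
    return img.rotate(90, expand=True)

# e.g. rotate_for_ocr(Image.open("P7150001.JPG")).save("P7150001-fixed.JPG")
```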

First, text from the subsequent pages was partly bleeding through because the paper is so thin (many archivists aptly call that type of paper "onion skin"). Another source of errors was that the photos were in color, so the contrast between the brown/tan paper and the dark brown text was lower than it would be in black and white. That color contrast can help the human eye figure out difficult-to-read words, but it hinders the computer's. So I transformed the photo into black and white with higher contrast (P7150001-bw.jpg).
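Scripted with the same Pillow library (again my assumption, not the tool I actually clicked through), that conversion might look like:

```python
from PIL import Image, ImageOps

def to_high_contrast(img: Image.Image) -> Image.Image:
    """Drop color, then stretch the histogram so faint brown ink reads as dark text."""
    gray = img.convert("L")             # single 0-255 grayscale channel
    return ImageOps.autocontrast(gray)  # remap: darkest pixel -> 0, lightest -> 255
```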

This increased the contrast between the text and the page and reduced the bleed-through from the thin paper. Yet my OCR actually became worse! Another problem was non-characters in the document, like the dashed lines in the heading and the shadows at the crumpled edges of pages around the outside of the document, which produced errors when the machine tried to read these "suspicious characters." I couldn't eliminate the border around the title, but I could crop the image to eliminate the edges (P7150001-crop.jpg). This reduced errors and produced a readable sentence, even though the OCR still missed significant parts of the text. The preprocessing never yielded perfectly formatted text, but it did improve the results.
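Cropping a fixed margin is also easy to script. A Pillow sketch (the margin value is hypothetical; the right amount depends on each photo):

```python
from PIL import Image

def crop_margins(img: Image.Image, margin: int) -> Image.Image:
    """Trim a uniform border to drop the shadowed, crumpled page edges."""
    w, h = img.size
    return img.crop((margin, margin, w - margin, h - margin))

# e.g. crop_margins(img, 40) removes a 40-pixel band from every side
```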

I could have applied the techniques the Pinkerton text required (aligning, increasing contrast, and cropping) to the hundreds of legal documents, but that would take a serious amount of time. So I harnessed the power of my computer, and the code of others, for my purposes.

A little bit about my process for OCRing: I used wget to download all of the Indian Claims Commission cases from the Oklahoma State University website. These exist as un-OCR'd PDFs of the commission's decisions, which range from the late 1940s to the late 1970s: seemingly perfect fodder for a good OCR process. To OCR the PDFs, I employed Civil-Procedure-Codes, a package created by Kellen Funk and modified by Lincoln Mullen. The package uses tesseract-ocr, which is considered one of the best open-source options, and it automatically performs some of the processes I applied to the single page: it creates bitonal TIFF files from the PDFs, because tesseract does better with that file type, and then outputs the OCR as structured text files.
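As I understand it, that pipeline boils down to two command-line steps per PDF: rasterize to bitonal TIFFs, then run tesseract. This sketch builds those commands in Python; it is my reconstruction of the workflow, not the package's actual code, and the pdftoppm flags and output names are assumptions:

```python
from pathlib import Path

def ocr_commands(pdf: str, dpi: int = 300) -> list[list[str]]:
    """Commands to turn one PDF into bitonal TIFFs and then OCR the first page."""
    stem = Path(pdf).stem
    return [
        # -mono writes 1-bit (bitonal) TIFFs, the format tesseract handles best
        ["pdftoppm", "-tiff", "-mono", "-r", str(dpi), pdf, stem],
        # pdftoppm numbers its output pages, e.g. <stem>-1.tif for page one
        ["tesseract", f"{stem}-1.tif", stem],
    ]
```

Each inner list could be handed to subprocess.run; a real batch job would loop over every page image pdftoppm produced.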

With such a sophisticated program, I expected even better results. Yet when I counted the first 100 words, only 65 were correct, an error rate of 35%. Of the 35 errors, 29 were articles, pronouns, prepositions, or single-letter initials, all words that I would be unlikely to search for in scholarly historical research. The software also got significantly better at correctly reading the text as it went (or perhaps the later pages were clearer), so the error rate over the whole document was lower than the 35% of the first 100 words.
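For the record, the arithmetic, under my assumption that the 29 throwaway words were among the errors rather than among the correct words:

```python
def error_rates(total: int, correct: int, trivial_errors: int) -> tuple[float, float]:
    """Raw error rate, and the rate counting only errors on searchable words."""
    errors = total - correct
    return errors / total, (errors - trivial_errors) / total

# 100 words, 65 correct, 29 errors on unsearchable words:
raw, substantive = error_rates(100, 65, 29)  # 0.35 and 0.06
```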

The other exercise for the week was to evaluate the Chronicling America newspaper OCR at the Library of Congress. The LOC helpfully allows you to see the raw OCR (they make no attempt to clean up the output text), whereas many sites keep the OCR text layer hidden, making it difficult to evaluate a site's data quality. I first viewed the front-page article "100 years ago today" in The Daily Missoulian's OCR. Right off the bat, it misses the title of the front-page article: "Montana Progressives Indorse Compensation Act" reads as "Montana P. R ESSIVE DOR Compensation Act." The OCR gets many words, but misses the one many might use as a search term: "Progressives."

I also wanted to test the OCR on one of the older newspapers. I searched for the oldest example of "swamp" in any Washington, DC newspaper and found the Daily Evening Star from 1852. The OCR had only 4 errors in the first 40 words I checked around "swamp," a very good accuracy rate. It's possible this was not digitized from microfilm, as the image quality was fantastic. I also wanted to check against a much more modern example, the 1922 Washington Evening Star. This example of "swamp" was also extremely well OCR'd, with a mistake rate equal to the 1852 paper's. Perhaps the Washington Evening Star is an outlier among Chronicling America papers, but both the image quality and the resulting OCR text were at a very high level. After my difficulties transforming my own material with what I thought was a state-of-the-art package, I had assumed the newspaper OCR would be of similarly low quality. Yet the Washington papers were a pleasant surprise.
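Chronicling America exposes that raw OCR at a predictable URL for every page, which makes this kind of spot-checking scriptable. A sketch (the LCCN below is a placeholder, not the Evening Star's actual identifier):

```python
def ocr_text_url(lccn: str, date: str, edition: int, sequence: int) -> str:
    """URL of the raw OCR text for one newspaper page on Chronicling America."""
    return (f"https://chroniclingamerica.loc.gov/lccn/{lccn}/"
            f"{date}/ed-{edition}/seq-{sequence}/ocr.txt")

# e.g. ocr_text_url("sn00000000", "1852-12-01", 1, 1)  # placeholder LCCN
```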

When I was signing up for an account recently, it required a captcha so incredibly complicated with lines, twists, blurs, and distortions that it took me dozens of attempts to succeed. Clearly historians need to enlist the creators of spam bots, who are constantly innovating in character recognition and defeating simple captchas, in creating better OCR that can unlock the scanned documents of the past.
