It seems simple: upload a half dozen objects to an online exhibit. Yet there are so many more decisions to consider. How to categorize the object, navigate the online copyright landscape, and properly present objects each required much thoughtful decision-making.
I had initially planned to upload transcribed letters to the editor of the Santa Barbara News-Press. These “voices” and my analysis of them will be my final Clio 1 project. But who owns a transcription of a letter published in a public newspaper? The original letter writer? The newspaper? The transcriber? I have requested permission from the transcriber, Darren Hardy, with the thought that he “must have” navigated this complex legal environment to post his versions online. In reality, he likely has not, because neither the News-Press nor the letter writers (or their heirs) have bothered to protect their rights in court. But copyright law is outside my area of expertise and the scope of this post.
Having decided that, with Darren’s permission, I would post away, I attempted to upload each XML file in Omeka. Unfortunately, XML files are not a permitted type of file in Omeka. What the CMS giveth, the CMS taketh away. After attempting to transform the XML files to something more useful, I decided to upload some of my own personal photographs, though I would keep the “Santa Barbara Oil Spill” name for future work on that project. The photos came from a tour of Monticello that I chaperoned back in 2008 and represent a sort of prototypical public history experience.
Uploading my own material allowed for easier metadata creation. I didn’t need to place uploader (me), letter writers, newspaper editors, and transcribers into “creators”, “publishers”, “contributors”, or “source” categories. Yet there were still difficulties. Does a photograph of an object get a “language”? (I decided no.) A photo with English text? (Yes.) Dates needed properly formatted descriptions (YYYY-MM-DD), and consistency across the data was key.
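For anyone facing the same date cleanup, here is a minimal Python sketch of how mixed date strings could be normalized to YYYY-MM-DD before entering them as metadata. The list of input formats and the function name are my own assumptions for illustration; Omeka itself does not provide this, and the month-name formats assume an English locale.

```python
from datetime import datetime

# Hypothetical input formats; extend this tuple to match your own records.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y", "%d %B %Y")


def to_iso_date(raw: str) -> str:
    """Return the date in YYYY-MM-DD form, trying each known format in turn."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # Not this format; try the next one.
    raise ValueError(f"Unrecognized date format: {raw!r}")
```

Raising an error on an unrecognized date, rather than guessing, keeps inconsistent values from slipping silently into the item records.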
Once the items were described and thankfully uploaded, I needed to place them into an exhibit. This was also slightly trickier to navigate in Omeka than I expected. Fortunately, all the work of adding images and metadata made the exhibition work much simpler once I knew which buttons created pages and their association with objects. I organized the pages into a vaguely narrative format, though the user could navigate in any order.
Last week, I worked to create a network of non-white judges. I theorized that Federal Judges would come from the same networks of schools, often Historically Black Colleges and Universities, due to segregated schooling in the US South. I was incorrect: there were few networks, and over 200 different schools were represented among the 368 judges.
This week, I decided to map the birthplaces and schools of Federal non-white judges to determine if additional patterns might arise.
For best results, I’d recommend zooming into the map or maximizing it in a new window and viewing the United States or particular US regions (though internationally born judges do provide some interesting data points). I simply added the birth place columns (City, State) and School. I gave these a spectrum of colors (5 ranges) to indicate change over time by birth year. Colleges received red colors and birth places yellow shades to distinguish them from one another.
I expect the higher proportion of African American judges born in the South would mirror population concentrations and likewise for Hispanic Judges born in Texas and California. One interesting fact exposed by the map is the great distance that many judges seemed to travel between birth place and school. It would be interesting to see if the same held for white Judges.
Unfortunately, Google doesn’t provide this type of functionality beyond manually calculating distance for each judge.
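The distance itself is not hard to calculate outside of Google once each place has a latitude and longitude. Below is a rough sketch using the haversine (great-circle) formula. It assumes the coordinates have already been geocoded elsewhere, and the function name is my own.

```python
import math

EARTH_RADIUS_KM = 6371.0  # Mean Earth radius; fine for rough comparisons.


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

Applied to every row of the judges spreadsheet (birthplace coordinates against school coordinates), this would give one distance per judge that could then be averaged or compared across groups.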
Mapping the judges was incredibly easy, because I’d done the work last week to clean up the data first. I simply loaded the CSV into Google and played with the settings for a few minutes, chose the light political basemap to highlight the data: Presto! A decent map.
But what makes this easy, the lack of options, also creates some mapping problems. The college networks are further de-emphasized because each college receives only one “pin”, even though Howard, for example, has at least 12 Federal Judge alumni. Birth places suffered from the same issue (e.g. New York, NY babies). Using a method to “weight” the number of alumni or hometowns would provide a more accurate representation of Judge patterns.
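One way to get at that weighting is to count how many judges share each school or birthplace before mapping, so the count can drive the size of the pin in a tool that supports it. This is a small sketch of the idea in Python; the function name and the sample values are placeholders, not my actual data.

```python
from collections import Counter


def pin_weights(places):
    """Count how many judges share each place, ignoring blanks and stray spaces."""
    cleaned = (p.strip() for p in places if p and p.strip())
    return Counter(cleaned)


# Placeholder values standing in for the School column of the spreadsheet.
schools = ["Howard University", " Howard University", "Yale University", "", "Howard University"]
weights = pin_weights(schools)
for school, count in weights.most_common():
    print(f"{school}: {count}")
```

The resulting counts could be saved as an extra column alongside one row per school, which is the shape most mapping tools expect for sized or graduated symbols.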
The schools data was also more suspect than the birth place data. I’m unsure why Brown University, for example, is located over the University of Virginia. Why Google placed Jersey State College over State College, Pennsylvania is more obvious, but difficult to fix without manually changing the data.
Ultimately, Google provides an easy and effective method when the data exists and is in the proper format. A project for Clio 3 also illustrates the benefits of previously clean data and building on the work of others.
I’m attempting to analyze and possibly map Decisions of the Indian Claims Commission. Fortunately, there are some fantastic geographers who have come before me and geo-rectified Charles Royce’s maps (1899). Charles Royce created maps of Indian land title ceded to the Federal government. Nicole Smith first created the georectified maps, sharing them as shapefiles. Matthew McCarthy, a George Mason University Geography Dept graduate student, expanded upon her work, adding significant geographic analysis. His maps also include some helpful features like present-day reservations. My map is currently far more primitive (and inaccurate in some ways), though I hope it will provide the bones on which future textual analysis might rest:
Using more powerful tools (open source QGIS), I’ve created a custom map that is not as “pretty” as Google’s, but will provide a more robust platform for future work. And with good data (provided by kind souls), it’s already effective in showing patterns of Western migration and Indian land cession.
We are attempting to employ network tools this week. My “voices” on the Santa Barbara Oil Spill did not seem to connect in useful ways, so I turned to a different source: the Biographical Directory of Federal Judges. This database contains extensive biographical information on more than 3,000 Federal Judges dating back to 1789, so it is extremely useful. Unfortunately, it is also inordinately messy, containing over 200 columns (“cleaning” the data was an exercise in Clio 3). With my R data cleaning skills and OpenRefine (formerly Google Refine), I was able to arrive at the columns I would need for some simple network analysis of the judges.
I chose to examine only non-white judges. They made up about 10% of the entire Federal Judgeship. I expected that due to legacies of segregated schooling most of the non-white judges would come from a limited number of traditionally African American colleges and law schools. These networks of judges would provide their school peers with clerkships and other opportunities. This would be quite common within African American church and business communities. However, this did not appear to be the case at all for Federal judges. Of the 368 non-white judges, over 200 schools were represented (my data). Due to the nature of the Directory data, the schools might represent a judge’s undergraduate degree or their law degree, though attending Harvard Law or Harvard College was normalized into Harvard University. Even with this normalization, there were few school “networks” according to Palladio:
The schools network in RAW was similarly attractive, yet fragmented:
Ultimately, I had major problems with all of the networking applications. They had strict data requirements, were buggy, and some aspects did not work on my computer. Gephi’s data requirements made it unusable for this particular project, though it clearly has some power if one is able to scale its steep learning curve. RAW provided code to embed in blog posts, but WordPress stripped all HTML tags and it appeared to be just a list of colleges (this might be a WP problem instead of a RAW problem). Fortunately, I don’t expect to need much network analysis in my work on the Santa Barbara Oil spill.
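For readers who want to check the fragmentation without any of these tools, the core of the schools network can be approximated in a few lines of Python: normalize the school names, group judges by school, and keep only the schools shared by more than one judge. This is only a sketch. The alias table covers just the Harvard case I described, and the judge names below are placeholders rather than rows from the Directory.

```python
from collections import defaultdict

# Hypothetical alias table; only the Harvard normalization comes from my own cleanup.
ALIASES = {
    "Harvard Law School": "Harvard University",
    "Harvard College": "Harvard University",
}


def normalize_school(name):
    """Collapse extra whitespace and fold known aliases into one school name."""
    cleaned = " ".join(name.split())
    return ALIASES.get(cleaned, cleaned)


def shared_schools(records):
    """Map each school attended by two or more judges to a sorted list of those judges."""
    by_school = defaultdict(set)
    for judge, school in records:
        by_school[normalize_school(school)].add(judge)
    return {s: sorted(j) for s, j in by_school.items() if len(j) > 1}


# Placeholder (judge, school) pairs.
records = [
    ("Judge A", "Harvard Law School"),
    ("Judge B", "Harvard College"),
    ("Judge C", "Howard University"),
]
print(shared_schools(records))
```

Every school that drops out of the result is an isolated node with a single judge, which is exactly the fragmentation the visualizations showed.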
I revisited some of my Santa Barbara Voyant visualizations from last week. I had only begun to play with the categories and word frequency visualizations, so I thought I would embed a few more to showcase this tool more fully. Many of these visualizations will make more sense after reading the background to last week’s practicum.
Last week I checked “oil”, “spill” and some other similar words in various word frequency counters. Voyant also mapped these frequencies:
It was also interesting, in light of this week’s network analysis, to see networks between the words in the letters to the editor:
While the network tools evaluate the actual connections between two nodes, Voyant’s collocate tool simply evaluates the network between two word frequencies. While these visualizations look similar, they have very different statistical analysis “inside the box.”
This week we have transitioned to the tools portion of Clio 1 and are examining “ngram viewers” and text mining. These have a simple process: count words -> visualize. While the tool is simple to grasp, its power comes from the questions asked of it plus additional tweaks that can be performed by each program. To start I needed to get some words that might be important to my project and chart how they changed over time. To get some of these words, I loaded my Santa Barbara Oil Spill’s “voices” (or letters to the editor) into Voyant and output every DHer’s first visualization: the word cloud.
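The “count words -> visualize” process is simple enough to sketch by hand. The Python below is not how Voyant works internally; it is a bare-bones illustration with a tiny stopword list I made up and a text bar chart standing in for the word cloud.

```python
import re
from collections import Counter

# A deliberately tiny, made-up stopword list; real tools ship much longer ones.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that", "on", "for"}


def word_counts(text):
    """Lowercase the text, split it into words, and count the non-stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def text_bar_chart(counts, top=5):
    """Render the most frequent words as one row of '#' marks per word."""
    rows = []
    for word, n in counts.most_common(top):
        rows.append(f"{word:<10} {'#' * n} ({n})")
    return "\n".join(rows)


sample = "The oil spill. Oil on the beach, oil in the water."
print(text_bar_chart(word_counts(sample)))
```

Everything interesting in the real tools (stopword choices, stemming, scaling by corpus size) is a tweak on these two steps.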
The obvious word that sticks out is “oil.” So I started here.
Bookworm’s Chronicling America (ChronAm) and Open Library (OL) start in the 19th century, but are limited by copyright and end in 1921. New York Times Chronicle searches from 1851 to 2014. The NYT was clearly most useful as it was one of the only journalism sources to include my time period.
I’m dubious that there were more discussions of oil in NYT articles from the 1860s. The likely reasons for this are mistaken OCR (I noticed some mistakenly placed articles), more discussions of kerosene, and a relative lack in the number of articles written at the time.
Yet in reviewing the Chronicling America series of newspapers, a larger collection of mostly regional papers, the spike appears again, this time in 1840.
Also notable is that there are fewer spikes compared to the NY Times, perhaps because of the larger number of papers, and the more scattered nature of regional news sources. Ultimately, this source ends in 1921 which is outside of the period I’m most interested in exploring.
Google Books provides one of the most used examples of Ngrams and also happens to include my time period. It also provides more proof that the initial NYT ngram of oil (and the ChronAm, to a degree) were outliers:
It also allows several words to be plotted at once on the same graph. Though “spill” was very important in my word cloud, it was less so in this visualization because it was swallowed up by the scale of the other words. However, the NYT Chronicle ngram provided a different perspective:
What’s interesting to me about adding “spill” is that it is so correlated with oil. This indicates that in most articles including “spill”, the word “oil” is also in that article. This only starts to happen in any frequency around 1969. Of course there’s also a major spike around the Exxon Valdez spill, the Ixtoc I spill, and the Deepwater Horizon spill, the 3 other spills to receive major media attention (though Santa Barbara was orders of magnitude lower in terms of tonnes of oil spilled).
Also interesting was that most of the other ngrams did not echo the correlation. Ben Schmidt, who helped develop the Bookworm Ngram, applied it to a corpus of movie and TV scripts to develop Bookworm Movies.
This Ngram can provide a look into popular culture of the time period and unlike many of the “public domain” based ngrams, this has a seemingly opposite year range, starting in 1931 and running to, more or less, the present. In this case, it appeared to have no correlation to the relative peaks and valleys of “oil.” Searching only documentaries for oil did provide some correlation with what I expected to find, though I don’t know whether there was enough data, nor why most “documentaries” about oil date from the 1940s.
It’s possible that “spill” is less likely to be used in a book than reporting. There were lots of books on oil, but not many on oil spills per the previous Google Books ngram. Fortunately, Google allows additional techniques not found in the other Ngrams. For example, the user can see what the most common modifier for a word is using the wildcard (*) character. In this example: “* spill”:
Clearly a strong rise in the use of “oil” with “spill” starting in 1967 and rapidly expanding in 1969. There is clearly much to unpack and caveat with these ngram viewers, but they can provide an interesting tool in digital history research.
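The “* spill” wildcard query amounts to asking which word most often comes immediately before “spill.” On a small corpus of my own, the same question can be answered with a short Python sketch like this one. The function name and sample sentence are mine, and this is not how Google computes its ngrams.

```python
import re
from collections import Counter


def words_before(text, target):
    """Count the words that appear immediately before `target` in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pairs = zip(tokens, tokens[1:])  # Each word paired with the word after it.
    return Counter(prev for prev, cur in pairs if cur == target)


sample = "The oil spill was large. Another oil spill followed a chemical spill."
print(words_before(sample, "spill").most_common(3))
```

Running this per year of a dated corpus would give the same kind of rising “oil spill” line, just without Google’s scale.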
So far I’ve discussed the programs where the user determines the word to be counted. The other type of ngram-esque text mining program is Voyant. This program is much more powerful in the variety of visualizations that one can reach, but also does not rely on the user for words to be counted: it counts every one. This provides value in situations where a researcher is unfamiliar with the material and needs to get a quick handle on it.
I downloaded the Voyant-Server, which allows users to run offline instances from their local server/machine. This had some great instructions:
Windows: blah blah…. but if you do this thing wrong, it might break.
Linux: You don’t really need our help for this, right?
Clearly a project with a sense of humor. I loaded my 14 Letters to the Editor and explored the tools. Voyant is extremely complex in comparison to the ngram viewers, providing many different visualizations and abilities to count the text. Here’s an example of the words around “oil” using Keywords in context:
All of the expected terms are listed: oil spillage, oil drilling, oil company, etc. With the added complexity I wasn’t able to arrive at the same simple argument that I found with the NYT Chronicle and Google Books ngrams, but it is a tool worth putting in effort to explore in order to derive more skillful analysis.
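Keywords in context is also one of the easier Voyant views to imitate, which helps demystify it. The sketch below is my own rough approximation in Python, not Voyant’s code: it finds each occurrence of a keyword and returns a few words on either side.

```python
import re


def keywords_in_context(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence of the keyword."""
    tokens = re.findall(r"\w+", text)
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


sample = "The oil company denied that oil drilling caused the spill"
for left, word, right in keywords_in_context(sample, "oil", width=2):
    print(f"{left:>15} | {word} | {right}")
```

Sorting the rows by the right-hand context is what surfaces pairings like “oil company” and “oil drilling” next to one another.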
Voyant is also probably the most user friendly in terms of providing output options. The ngram viewers gave either few or no abilities to output the visualizations created by your word counting.
For over 20 years, students, scholars, and the general public have been enjoying the benefits of full-text search of formerly paper documents. Many workflows might look something like this: Google->Wikipedia->JSTOR->Chronicling America/Proquest historic newspapers->Digital Dissertations->digitized archives->physical archives->fill in gaps from other digital sources->add something “transnational” from that digitized French archives that you could never visit otherwise. Collect everything relevant and write it up.
Yet, in a review of three years of Environmental History, the primary journal of global Environmental History scholarship, articles contained very few explicit mentions of this now common workflow. There were a handful of references to online material: a blog post, an article in a digital encyclopedia, or even a reference to a particular longitude/latitude on Google Earth. But unsurprisingly, most citations were to books (likely read in physical form, but what about pre-1923 Google books/Open books/etc), physical articles (probably read via JSTOR/Proquest/etc), or archival citations (very likely to still be physical archives, but could be online). That they didn’t cite what was actually consulted is a problem, as a historian’s methodology is very important to their authority as a scholar.
McNeil’s address concerned Arnold Toynbee, (in)famous public intellectual and historian, as a proto-environmental historian, and what “big” histories could contribute to the field as a whole. In preparing the address/article, McNeil “read only 2%” of [Toynbee’s] output (10-15 million words). He laments:
When I chose the subject for this address, I ignorantly assumed that Toynbee’s works would be available via Google books and I could instantly locate all the passages that use words such as environment or ecology. By the time I learned that I would have to work from the printed pages, it was too late to change my plans. But happily Veronica Boulter Toynbee, who worked as hard as her husband and a bit more carefully, prepared the indexes for all his major books.
The full text search is so ingrained in our methods that even the President of the American Society for Environmental History assumes this in his workflow. Had the books been digitized, I doubt McNeil would have completely described how he arrived at the requisite references. Perhaps giving us search terms and insight into the relevant passages as he does in his address, but not likely informing us of the OCR accuracy, any false positives, or false negatives.
The other interesting example comes from Simberhoff’s article on Aldo Leopold. I don’t mean to pick on Simberhoff, who has written an interesting article exploring one of the conservation movement’s important figures and his views on diversity vs stability in academic ecology. But Simberhoff appears to derive his archival material, beyond Leopold’s books, exclusively from the University of Wisconsin’s digitized Aldo Leopold collection.
The UW-M collection has a variety of methods to approach its digital works. It helpfully provides some guidelines:
Most users with a scholarly or general interest in Aldo Leopold will find that the Detailed Contents List provides the best access to the collection. It describes each file series and, within each series, each box and folder in the collection, and there are links from the description of each folder directly to the digitized material in those folders.
Alternatively, readers can use full search capability (it doesn’t list the OCR accuracy rate or additional processing). This method is best for those “who are primarily interested in whether Leopold had any connection with a particular person, place, or topic.” Though this does not include his handwritten correspondence. Simberhoff does not describe which method he took in researching Leopold and this is a problem for the reader. Did Simberhoff ignore the handwritten material while performing a full text search for “nonnative” or “ecology”? Or did he review the archive front to back?
These are serious questions that our discipline needs to answer as the archive may literally look different each time we approach it.
The process of using optical character recognition to make a series of texts machine-readable is one that I’m learning about for my final project in Clio 3. In principle, successfully generating the text for one page should be simple and straightforward. While the process has been relatively simple, it has been difficult to achieve great results.
My first attempt at OCR was on Professor Robertson’s research sample. This was the first page of a hearing that he photographed at the Pinkerton Detective archive. The photos are in color with decent resolution. They are rotated 90 degrees so they fit the landscape orientation of the camera. Rotating 90 degrees was the first change I made to the photograph as OCR requires correctly oriented lines of text. I used Google Drive to OCR this image, which was very simple once I had my settings correct. It sadly took me a rather long time to realize that Google’s OCR output is a Google Doc with your image first, and the text below it. I finally realized that the text below the image was supposed to represent what was on the image. There were clearly major problems with this P7150001-fixed.JPG.
First, the text on the subsequent pages was partly bleeding through due to the very thin nature of the paper (many archivists aptly call that type of paper “onion skin”). Another source of errors was that the photos were in color, so the contrast between the brown/tan paper and the dark brown text was less than if it were black and white. This can aid the human eye to figure out difficult to read words, but hinders the computer eye. So I transformed the photo into black and white with higher contrast- P7150001-bw.jpg.
This helped increase the contrast between the text and the page and reduced the bleed through effect of the thin paper. Yet my OCR actually became worse! Another problem was non-characters in the document, like dashed lines in the heading and shadows at the crumpled edges of pages around the outside of the document. These contributed to OCR errors when the machine tried to read the “suspicious characters.” I couldn’t eliminate the border around the title, but I could crop the image to eliminate the edges- P7150001-crop.jpg. This reduced errors and made a readable sentence, even though it missed significant parts of the text. The improvements did not help achieve perfectly formatted text, but did possibly improve the results.
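I made these changes by hand in an image editor, but the three steps (rotate, convert to high-contrast black and white, crop the edges) are simple enough to describe in code. The Python below is a toy sketch that works on a small grid of (red, green, blue) pixel values rather than a real photograph. The function names and the threshold of 128 are my own assumptions, and a real workflow would use an imaging library instead.

```python
def rotate_clockwise(pixels):
    """Rotate a grid (a list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*pixels[::-1])]


def to_bitonal(pixels, threshold=128):
    """Turn each (r, g, b) pixel into 0 (ink) or 255 (paper) based on its brightness."""
    result = []
    for row in pixels:
        new_row = []
        for r, g, b in row:
            # Standard luminance weights for converting color to grayscale.
            luminance = 0.299 * r + 0.587 * g + 0.114 * b
            new_row.append(255 if luminance >= threshold else 0)
        result.append(new_row)
    return result


def crop(pixels, margin):
    """Trim `margin` pixels from every edge of the grid."""
    return [row[margin:len(row) - margin] for row in pixels[margin:len(pixels) - margin]]


PAPER = (210, 180, 140)  # Made-up tan paper color.
INK = (60, 40, 30)       # Made-up dark brown ink color.
page = [[PAPER, PAPER, PAPER], [PAPER, INK, PAPER], [PAPER, PAPER, PAPER]]
print(crop(to_bitonal(rotate_clockwise(page)), margin=1))
```

The threshold is the fragile part: set it too high and faint bleed-through from the next page turns into “ink,” which may be part of why my black and white version read worse.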
I could have used the techniques required of the Pinkerton text: align, increase contrast, and crop, on the 100s of legal documents, but this would take a serious amount of time. So I harnessed the power of my computer and the code of others for my purposes.
A little bit about my process for OCRing. I used wget to download all of the Indian Claims Commission cases from the Oklahoma State University website. These exist as un-OCR’d PDFs of the Decisions which range from the late 1940s to late 1970s, seemingly perfect fodder for a good OCR process. I employed Civil-Procedure-Codes, a package created by Kellen Funk and modified by Lincoln Mullen, to OCR the PDFs. This package uses tesseract-ocr, which is considered one of the best open source options. The program also automatically did some of the processes I used on the single page. It created bitonal tiff files out of the PDFs because tesseract does better with this file type. It then output the OCR as structured text files.
With such a sophisticated program, I should get even better results. Yet, I counted the first 100 words and of these, 65 were correct for an error rate of 35%. Of these, 29 were articles, pronouns, prepositions, or single letter initials. These are all words that I would not likely be searching for in a scholarly historical search. The software did get significantly better at correctly reading the text (or perhaps the later pages were more clear). So the overall error rate was likely lower than the 35% I measured in the first 100 words.
The other exercise for the week was to evaluate the Chronicling America newspaper OCR at the Library of Congress. The LOC helpfully allows you to see the raw OCR (they make no attempts to process their output text). Many sites keep the OCR text layer hidden so it is difficult to actually evaluate the site’s data quality. I first viewed the front page article- “100 years ago today.” The Daily Missoulian’s OCR is here. Right off the bat, it misses the title of the front page article- “Montana Progressives Indorse Compensation Act” reads as “Montana P. R ESSIVE DOR Compensation Act.” This gets many words, but misses the word which many might use as a search term- “Progressives.”
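Counting correct words by hand, as I did for my first 100 words, gets tedious quickly. When a correct transcription is available, the comparison can be approximated in Python with the standard library’s difflib. This sketch is my own, not part of any OCR package: it aligns the two word sequences and reports the share of the true words that the OCR reproduced exactly.

```python
from difflib import SequenceMatcher


def word_accuracy(truth, ocr):
    """Fraction of the words in `truth` that appear, in order, in the OCR output."""
    truth_words = truth.split()
    ocr_words = ocr.split()
    if not truth_words:
        return 0.0
    matcher = SequenceMatcher(a=truth_words, b=ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(truth_words)


headline = "Montana Progressives Indorse Compensation Act"
ocr_text = "Montana P. R ESSIVE DOR Compensation Act"
print(f"{word_accuracy(headline, ocr_text):.0%} of headline words read correctly")
```

A stricter measure would also weight the words by how likely they are to be search terms, since missing “Progressives” matters far more than missing “the.”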
I also wanted to test the OCR on one of the older newspapers. I searched for the oldest example of “swamp” in any Washington DC newspapers and found the Daily Evening Star from 1852. The OCR only had 4 errors out of the first 40 words I checked around “swamp”, a very good accuracy rate. It’s possible this was not digitized from microfilm as the image quality was fantastic. I also wanted to check against a very modern example, the 1922 Washington Evening Star. This example of “swamp“ was also extremely well OCR’d, with the same error rate as the 1852 paper. Perhaps the Washington Evening Star is an outlier compared to other Chronicling America papers, but the image quality and resulting OCR text were at a very high level. After my difficulties with transforming some of my own material with what I thought was a state of the art package, I assumed this would be at a similarly low level. Yet, the Washington papers were a pleasant surprise.
When I was signing up for an account recently, it required a captcha that was so incredibly complicated, with lines, twists, blurs and distortions, that it took me dozens of attempts to succeed. Clearly historians need to enlist the creators of spam bots, who are constantly innovating in character recognition and defeating simple captchas, in creating better OCR that can unlock the scanned documents of the past.
This week we reviewed what already exists online for our particular subjects and potential Clio 1 project topics. This will help us understand how our projects might use these resources, and where our work will fit into existing digital scholarship.
I began by reviewing the general online collections related to Environmental History in the 20th century. This led me to a variety of interesting sources, stories, reference sites, and general information, but few digital history projects or corpora of documents. One interesting project was the “Land Use History of the Colorado Plateau”, which explored the Colorado Plateau from a variety of disciplines (it was also interesting as an example of early academic projects- published 2002).
Another multidisciplinary project was the 1969 Santa Barbara Oil Spill published by the University of California- Santa Barbara Geography Department, but including historical photographs, material from the Marine Protected Areas project, articles putting the spill in context, and other resources.
What struck me about this project was the relative lack of contributions from historians. As such, I thought this might make for an interesting project for which there is little historical scholarship online. I searched for more resources and found Darren Hardy’s “1969 Santa Barbara Oil Spill” project which was created while he received his Masters in Environmental Science and Management. Though much of the website reuses articles also linked at the UCSB Geography Department’s website, he transcribed quite a few letters to the editor about the oil spill in 1969. Even more useful is that XML versions of these letters exist on his website and might be suitable for easy text analysis.
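To give a sense of why XML is convenient for text analysis, here is a small Python sketch that pulls the plain text out of one element of a letter. I have not documented the structure of Darren Hardy’s files here, so the sample markup, the tag names, and the function name below are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample; the real letters may use entirely different tag names.
SAMPLE = """<letter>
  <author>Placeholder Writer</author>
  <body><p>First placeholder paragraph.</p><p>Second placeholder paragraph.</p></body>
</letter>"""


def element_text(xml_string, tag):
    """Return the text inside every `tag` element, with the markup stripped out."""
    root = ET.fromstring(xml_string)
    pieces = []
    for element in root.iter(tag):
        pieces.extend(element.itertext())
    # Collapse line breaks and indentation into single spaces.
    return " ".join(" ".join(pieces).split())


print(element_text(SAMPLE, "body"))
```

Because the author and the body live in separate elements, the letter text can be analyzed without the metadata leaking into the word counts.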
Following the Google search, I examined other major websites of historical documents not necessarily indexed by Google. The Library of Congress has an enormous amount of material, including collections on the American West, historic newspapers, and American Memory. Unfortunately, the American West collection is almost entirely pre-1920 and there is little on the Santa Barbara Oil Spill of 1969.
I also found some interesting newsreel footage of the oil spill from the Prelinger Archives (indexed by the Internet Archive). The footage is quite long and silent, so it is not particularly interesting in its current form. However, the Prelinger Collection has the most permissive copyright license- a Creative Commons Public Domain license. This allows the footage to be used and reused in any way desired. While I don’t envision much video in my final project, it could be a useful part of a website if remixed in an interesting way (perhaps this is a Clio 2 project).
The Center for the American West at the University of Colorado Boulder had quite a few publications on oil/energy topics, but focused on oil produced by hydraulic fracturing. The Digital Public Library of America (or its partners) contained few recent sources, but did have the text of a Senate hearing on the Santa Barbara Oil Spill which could be interesting to compare to Letters to the Editor from Darren Hardy’s web project.
In this search I’ve learned much about which environmental history resources exist online. While I’m excited about the possibilities of exploring the Santa Barbara oil spill of 1969, there are far fewer general resources than I expected. Relevant to post-WWII environmental history there are none of the Google Books public domain works, little in historic newspaper collections, and fewer of the large scale digitization projects which have affected legal history or literature. I expect that one of the strengths of digital environmental history might be its ability to be mapped and the large amounts of publicly available data provided by government agencies like the USGS, FSA, NPS, etc. But that is for a future project. In the meantime I will focus on Santa Barbara and GIS shape files from the California Marine Protected Areas Database.