This week we transitioned to the tools portion of Clio 1 and are examining “ngram viewers” and text mining. These tools follow a simple process: count words -> visualize. While the tools are simple to grasp, their power comes from the questions asked of them and the additional tweaks each program allows. To start, I needed some words that might be important to my project so I could chart how their use changed over time. To find them, I loaded my Santa Barbara Oil Spill “voices” (letters to the editor) into Voyant and produced every DHer’s first visualization: the word cloud.
The obvious word that sticks out is “oil.” So I started here.
Bookworm’s Chronicling America (ChronAm) and Open Library (OL) start in the 19th century but, limited by copyright, end in 1921. New York Times Chronicle searches from 1851 to 2014. The NYT was clearly the most useful, as it was one of the only journalism sources to include my time period.
I’m dubious that there really were more discussions of oil in NYT articles in the 1860s. The likely reasons for the spike are mistaken OCR (I noticed some misplaced articles), more discussion of kerosene, and the relatively small number of articles written at the time.
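The last of those reasons is easy to illustrate. Here is a minimal Python sketch, assuming the viewer plots a word's share of all words in a given year (my assumption, not something I verified for Chronicle). The yearly totals are invented for illustration and are not NYT data.

```python
def relative_frequency(word_count, total_words):
    """Return the share of all words in a year that are the target word."""
    return word_count / total_words

# Hypothetical yearly totals, invented purely for illustration.
yearly = {
    1865: {"oil": 30, "total": 10_000},       # sparse year: few articles
    1969: {"oil": 900, "total": 1_000_000},   # dense year: many articles
}

shares = {
    year: relative_frequency(counts["oil"], counts["total"])
    for year, counts in yearly.items()
}

for year, share in shares.items():
    print(year, f"{share:.4%}")
```

With these made-up numbers, the sparse year shows a higher share of “oil” even though the raw count is far lower, which is the kind of artifact a thin early corpus could produce.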
Yet when I reviewed the Chronicling America newspapers, a larger collection of mostly regional papers, the spike appeared again, this time in 1840.
Also notable is that there are fewer spikes than in the NY Times, perhaps because of the larger number of papers and the more scattered nature of regional news sources. Ultimately, this source ends in 1921, which is outside the period I’m most interested in exploring.
Google Books provides one of the most used examples of ngrams and also happens to include my time period. It also provides more evidence that the initial NYT ngram of oil (and the ChronAm one, to a degree) was an outlier:
It also allows several words to be plotted at once on the same graph. Though “spill” was very important in my word cloud, it was less so in this visualization because it was swallowed up by the scale of the other words. However, the NYT Chronicle ngram provided a different perspective:
What’s interesting to me about adding “spill” is how closely it tracks “oil.” This suggests that most articles that include “spill” also include “oil,” a pattern that only starts to appear with any frequency around 1969. Of course there are also major spikes around the Exxon Valdez spill, the Ixtoc I spill, and the Deepwater Horizon spill, the three other spills to receive major media attention (though Santa Barbara was orders of magnitude smaller in terms of tonnes of oil spilled).
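The ngram chart only hints at that co-occurrence. With the article texts in hand, it could be checked directly. The Python below is a minimal sketch of that check, not anything Chronicle exposes; the function names and the four sample headlines are invented for illustration.

```python
import re

def tokens(text):
    """Lowercase a text and return its set of distinct words."""
    return set(re.findall(r"[a-z']+", text.lower()))

def cooccurrence_share(docs, word, other):
    """Of the documents containing `word`, return the fraction that also contain `other`."""
    with_word = [d for d in docs if word in tokens(d)]
    if not with_word:
        return 0.0
    with_both = [d for d in with_word if other in tokens(d)]
    return len(with_both) / len(with_word)

# Hypothetical headlines, invented for illustration.
articles = [
    "Oil spill fouls the Santa Barbara coast",
    "Crews work to contain the oil spill",
    "A spill of milk closes the highway",
    "Oil prices rise again",
]

share = cooccurrence_share(articles, "spill", "oil")
print(f"{share:.0%} of 'spill' articles also mention 'oil'")
```

Running this per year over a real corpus would show whether the two curves move together because the words actually share articles.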
Also interesting was that most of the other ngrams did not echo the correlation. Ben Schmidt, who helped develop the Bookworm Ngram, applied it to a corpus of movie and TV scripts to develop Bookworm Movies.
This ngram can provide a look into the popular culture of the time period and, unlike many of the “public domain” based ngrams, it has a seemingly opposite year range, starting in 1931 and running to, more or less, the present. In this case, it appeared to have no correlation to the relative peaks and valleys of “oil.” Searching only documentaries for oil did provide some correlation with what I expected to find, though I don’t know whether there was enough data, or why most of the “documentaries” about oil date from the 1940s.
It’s possible that “spill” is less likely to be used in a book than in reporting. There were lots of books on oil, but not many on oil spills, per the previous Google Books ngram. Fortunately, Google allows additional techniques not found in the other ngram viewers. For example, the user can see the most common modifiers for a word by using the wildcard (*) character. In this example: “* spill”:
There is clearly a strong rise in the use of “oil” with “spill” starting in 1967 and rapidly expanding in 1969. There is much to unpack and caveat with these ngram viewers, but they can be a useful tool in digital history research.
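The wildcard idea can be approximated on any text you have locally. This is a minimal Python sketch, assuming simple lowercase tokenization. It is my own illustration, not how Google computes its results, and the sample sentence is invented.

```python
import re
from collections import Counter

def modifiers_before(text, target):
    """Count the words that appear immediately before `target`."""
    words = re.findall(r"[a-z']+", text.lower())
    # Pair each word with its successor; note this ignores sentence boundaries.
    return Counter(prev for prev, cur in zip(words, words[1:]) if cur == target)

# Hypothetical sample text, invented for illustration.
sample = "The oil spill spread. Another oil spill followed a chemical spill."

counts = modifiers_before(sample, "spill")
for word, n in counts.most_common():
    print(f"{word} spill: {n}")
```

Grouping these counts by publication year would give a rough local version of the “* spill” chart.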
So far I’ve discussed the programs where the user determines the word to be counted. The other type of ngram-esque text mining program is Voyant. This program is much more powerful in the variety of visualizations it can produce, and it does not rely on the user to choose the words to be counted: it counts every one. This provides value in situations where a researcher is unfamiliar with the material and needs to get a quick handle on it.
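Counting every word is simple enough to sketch. The Python below is my own minimal illustration of the “count words” step, not Voyant's code; the tiny stopword list and the sample letter are made up for the example.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; real tools ship much longer ones.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on"}

def word_counts(text, stopwords=STOPWORDS):
    """Count every word in `text`, skipping common function words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)

# Hypothetical letter text, invented for illustration.
letter = "The oil is on the beach and the oil is in the harbor."

counts = word_counts(letter)
for word, n in counts.most_common(3):
    print(word, n)
```

A word cloud is essentially these counts drawn with font size proportional to frequency.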
I downloaded the Voyant-Server, which lets users run offline instances on their local server or machine. It came with some great instructions:
Windows: blah blah…. but if you do this thing wrong, it might break.
Linux: You don’t really need our help for this, right?
Clearly a project with a sense of humor. I loaded my 14 Letters to the Editor and explored the tools. Voyant is extremely complex in comparison to the ngram viewers, providing many different visualizations and ways to count the text. Here’s an example of the words around “oil” using Keywords in Context:
All of the expected terms are listed: oil spillage, oil drilling, oil company, etc. With the added complexity, I wasn’t able to arrive at the same simple argument that I found with the NYT Chronicle and Google Books ngrams, but it is a tool worth the effort to explore in order to produce more skillful analysis.
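The idea behind a keywords-in-context view is to show each occurrence of a word with a few words on either side. This is a minimal Python sketch of that idea, not Voyant's implementation; the `kwic` function, its `width` parameter, and the sample sentence are my own inventions for illustration.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    words = re.findall(r"[\w']+", text)
    rows = []
    for i, word in enumerate(words):
        if word.lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            rows.append((left, word, right))
    return rows

# Hypothetical letter text, invented for illustration.
letter = "The oil company said the oil spillage was small."

rows = kwic(letter, "oil", width=2)
for left, word, right in rows:
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>20} | {word} | {right}")
```

Lining the keyword up in a column makes recurring neighbors such as “company” and “spillage” easy to spot.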
Voyant is also probably the most user-friendly in terms of output options. The ngram viewers offered few or no ways to export the visualizations created by your word counting.
My Voyant text can be found here: