The project came to me while I was searching for existing digital projects in the field of Environmental History for Clio 1. I found Darren Hardy’s website, created for a class project at UC-Santa Barbara’s Bren School of Environmental Science. The letters to the editor struck me as an interesting source that reflected the event at hand. While they are self-selected by participants who can and choose to write for the public, and further limited to those letters chosen by newspaper editors, they provide a slice of the public conversation that is more representative than many sources. That they almost always include the location from which the writer is writing makes them even more useful to analyze geographically. Thanks to Darren Hardy for transcribing a few of the hundreds of originals that are still at the Santa Barbara Central Public Library.
My first task was to collect the letters from Darren’s website. With some practice in Lincoln Mullen’s Clio 3 class and the Programming Historian tutorial, I was able to gather them programmatically by running wget in the terminal with the command:
wget -r --no-parent --limit-rate=80k http://www2.bren.ucsb.edu/~dhardy/1969_Santa_Barbara_Oil_Spill/Voices/Entries/1969/2/
This downloaded all of the letters to the editor. More specifically, this command told my computer to
“get everything at this website and anything linked deeper within the structure (recursive), but not the entire parent website or other websites that don’t include that link in their URL. And get them at a speed of 80k/second to keep the server happy.”
Wget downloaded the html files along with the pictures, Apple system files, etc. needed to run them as websites, but also a base xml file, which is where I pulled the text from. These letters are good in that their accuracy is excellent (I noticed no mistakes on an initial search), but bad in that there are only 14.
I augmented these letters with letters from national and regional newspapers, initially collecting 6 from the New York Times and 6 from the Los Angeles Times using Proquest. Proquest helpfully has a “letters to the editor” filter. I further filtered for just these two papers and the months of January to March, 1969. My keyword search was “oil”, which mostly returned letters about the Santa Barbara spill, though there were a few false hits on foreign oil-related events or an investor grateful for a successful olive oil company. Still, a scrape of all these would have been rather successful. Micki Kaufman described her success in scraping pdfs from Proquest at the CHNM20 conference, and if I had access to more letters from this era, this approach could earn me many more letters.
I OCR’d the pdfs using a script I got from Lincoln Mullen that works on the command line. Google’s Tesseract software provides the base OCR program. Most of these letters were in multiple columns, frequently with the beginning of the letter starting below the second column of the pdf. Because of this, the OCR would place the content of the letters haphazardly, with the final paragraph sometimes occurring prior to the title. Between the normal clean up of “noise” generated by the OCR and the disjointed structure, I felt like I was transcribing the letters by hand. Certainly a “transformative” processing of the letter images.
I later decided to download an additional 11 letters to the editor. These were mostly from the Los Angeles Times, with 1 each from the Baltimore Sun and Christian Science Monitor. I transcribed these by hand, which did take several hours, though OCR plus manual fixing probably would not have saved much time. If I had access to hundreds of letters, the quality would become much less important, as OCR errors average out. As I briefly mentioned in my main article, newspaper coverage for this period is spotty on Proquest, doesn’t exist on Chronicling America, and is becoming harder to access via microfilm. It’s a sort of public domain gray area that could make my work difficult in the future. I really wanted letters from the Houston Chronicle (conservative viewpoint) and the San Francisco Chronicle/Examiner (similar to Santa Barbara politically, and the area also has platform drilling offshore). Unfortunately, I couldn’t access these papers online.
With my raw text data acquired, I decided to put everything into a spreadsheet with different columns representing different aspects of the letters: titles, text, name of writer, newspaper source, and address/location. These could be mapped as is, to see where people were writing letters to the editor from. But I wanted some additional analysis of what the text of the letters contained to inform my map.
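As a rough illustration of that spreadsheet layout, here is a minimal Python sketch that writes one row per letter to CSV. The column names are my assumptions based on the fields listed above, and the sample row is a placeholder, not a real letter.

```python
import csv
import io

# Column names are assumptions based on the fields described in the text.
FIELDS = ["title", "text", "writer", "newspaper", "location"]


def letters_to_csv(letters):
    """Serialize a list of letter dicts to CSV text, one row per letter."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for letter in letters:
        # Missing fields become empty cells instead of raising an error.
        writer.writerow({field: letter.get(field, "") for field in FIELDS})
    return buffer.getvalue()


# Placeholder row for illustration only.
sample = [{
    "title": "Example title",
    "text": "Example text.",
    "writer": "A. Writer",
    "newspaper": "Example Paper",
    "location": "Santa Barbara, CA",
}]
csv_text = letters_to_csv(sample)
```

Keeping the location in its own column is what makes the later mapping step possible, since a mapping tool can geocode that column directly.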
So I decided to text mine them with Mallet. The ideal number of topics to be modeled by Mallet should be directed by the research questions and number of documents. Pre-processing (adding thoughtfully to your stop list and ensuring “documents” are in chunks appropriately sized for your research) is critical in achieving high quality topic lists. In my case, I mostly took out words that were common in the letters but not particularly useful to my analysis: “editor”, newspaper names, “Santa”, “Barbara”, and a few others. My stopword list is here. The topics are here and the topic proportions in each of my letters are here.
After analyzing the topics in light of the existing historiography (I looked at various articles, mostly written by scholars in disciplines outside of history like Geography or Sociology), I tried to make sense of how they fit together. Choosing “topic 0” as described in my main essay, I added these figures to my master spreadsheet. I then needed to map what I’d found.
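To show what removing stopwords does to a letter, here is a small Python sketch. It is not how Mallet works internally (Mallet handles stopwords itself when it imports documents), and the stopword set below is a hypothetical subset built from the examples above, not my actual list.

```python
import re

# Hypothetical subset of extra stopwords, taken from the examples in the text.
EXTRA_STOPWORDS = {"editor", "santa", "barbara"}


def remove_stopwords(text, stopwords=EXTRA_STOPWORDS):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in stopwords]


tokens = remove_stopwords(
    "To the Editor: Santa Barbara beaches are covered in oil."
)
# tokens -> ['to', 'the', 'beaches', 'are', 'covered', 'in', 'oil']
```

Words like “to” and “the” survive here only because this sketch applies the extra list alone; a standard English stop list would remove them as well.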
I briefly looked at using QGIS as the primary mapping tool. I had used it to review shapefiles of oil platforms in the Pacific Ocean. The shapefiles were provided by the Bureau of Ocean Energy Management and included a data field for when each platform was established. I was able to filter the data for oil platforms existing in 1969.
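I did this filtering inside QGIS, but the logic is simple enough to sketch in plain Python. The field name install_year and the platform records below are hypothetical placeholders, not the actual attribute names or data in the Bureau of Ocean Energy Management files.

```python
def platforms_existing_in(platforms, year, field="install_year"):
    """Keep only platforms established in or before the given year.

    `field` is a hypothetical attribute name; the real shapefile may use
    a different one. Records with no year recorded are skipped.
    """
    return [
        platform
        for platform in platforms
        if platform.get(field) is not None and platform[field] <= year
    ]


# Placeholder records for illustration only.
sample_platforms = [
    {"name": "Platform X", "install_year": 1968},
    {"name": "Platform Y", "install_year": 1976},
    {"name": "Platform Z", "install_year": None},
]
existing = platforms_existing_in(sample_platforms, 1969)
```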
However, QGIS was a little too labor intensive as far as turning my spreadsheet into places on a map. Google Maps was also a possible choice, as the ease of georeferencing data and its plug and play nature were strong positives. Google Maps could have accomplished many of the tasks I wanted, but it limited me in other ways, like adding the oil platform GIS data and displaying my “protest” topic data. I stumbled onto CartoDB and it was a good fit. It allowed multiple layers, georeferenced my addresses, and displayed my data how I wanted it. The pop-up info windows were especially cool in that I could add photos and display any variety of data I wanted, as long as it was in my master spreadsheet.
I did have more difficulty in georeferencing than I did with Google Maps, even though I think it draws from a similar georeferencing service/API. For some reason it had difficulty with Palo Alto and Santa Barbara, CA (full addresses in Santa Barbara worked fine). For these errors, I manually searched Google Maps and got the latitude and longitude of the areas by right-clicking the cities. I slightly changed the coordinates of the multiple Santa Barbara references so that the bubbles would not all line up on top of each other.
I could have displayed the text of each letter to the editor, but this probably would have led me afoul of fair use and violated the copyrights of the Santa Barbara News-Press or Proquest. But a word cloud from Voyant would be “transformative” and still display the main concepts behind each letter. To show each word cloud on the map, I needed to create one for each document. I used Voyant and saved them to my Reclaim Hosting space. Then I added the word cloud .png link for each letter to my master database. When I clicked a “letter”, CartoDB pulled the image from my Reclaim Hosting space and displayed it in the info window. Cool!
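I nudged the overlapping coordinates by hand, but the same idea can be sketched in Python by spacing the points evenly on a small circle around the city's coordinates. The function name, the radius, and the rounded Santa Barbara coordinates are all assumptions for illustration.

```python
import math


def spread_points(lat, lon, count, radius_deg=0.01):
    """Return `count` points evenly spaced on a small circle around (lat, lon).

    A single point is left exactly where it is. The radius is in degrees
    and is an arbitrary choice, small enough to stay within the city.
    """
    if count == 1:
        return [(lat, lon)]
    points = []
    for i in range(count):
        angle = 2 * math.pi * i / count
        points.append((
            lat + radius_deg * math.sin(angle),
            lon + radius_deg * math.cos(angle),
        ))
    return points


# Approximate coordinates for Santa Barbara, CA, shared by four letters.
points = spread_points(34.42, -119.70, 4)
```

Unlike random jitter, this keeps the offsets reproducible, so the bubbles land in the same place every time the map is rebuilt.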
This project has taught me much about what works and doesn’t work for digital scholarship. While the topics derived from Mallet seemed to match Molotch’s article on types of response to power, they appeared much muddier on the map. I know that many more letters would likely be needed to derive meaning from them. I’ve also learned how much time digital projects take to fix data and put it into the format needed for processing. I expect that I’ll use some of the tools as I progress towards my doctorate, but I have also learned which tools are very much in beta and which I don’t particularly need right now. This information may prove more valuable than what I’ve currently learned from my “voices” of the Santa Barbara oil spill.