This project was much more difficult than I had imagined at the beginning of the semester, largely because it was so complex for a first project. Having said that, I now feel able to tackle similar projects going forward with fewer false starts.
My first (easy) hurdle was to follow Ian Milligan’s Programming Historian lesson on wget to recursively download all of the ICC decisions. Even so, downloading the ~2 GB of PDF files took a shocking amount of time with the limiters on wget.
But it was nothing compared to OCRing the PDFs. This took over a week, even when running multiple jobs in parallel. Disappointingly, the OCR was of very poor quality. Fortunately, the volume of material kept most of the OCR errors out of the topics.
In creating the makefile, I leveled up in my ability to grep | sed | cp files on the command line, and in make itself. These recipes will come in handy in making my projects repeatable for others and in tracking what I’ve done. When using my makefile, remember to run the recipes individually and to run the OCR in parallel.
With my data collected, I began to analyze it with Mallet. When I first used Mallet, it was easiest to run it from the command line. But once I had my scripts working in Rmallet, I wasn’t ever going back to command-line Mallet again. It was so much easier to go directly to visualizations and to deal with fewer .csv and .txt files.
My R scripts were built on Matt Jockers’ text mining books, Macroanalysis and Text Analysis with R for Students of Literature. I quickly realized that these works were designed around the questions that literature students bring to distant reading: genre, author attribution, and comparing texts against each other. Still, they got me in the door of text analysis. Working through some of Shawn Graham’s blog posts brought me over some hurdles in maneuvering my data into the proper formats for Rmallet. Though word clouds are not an ideal data visualization tool, I found them very useful in understanding the topics. They were even clearer than some of the other topic modeling browsers and visualizations out there, if not as mathematically rigorous.
The other major difficulty I had was trying to maneuver the metadata for my decisions from ancient, non-standard HTML tables into a workable format. Regex allowed me to generally isolate the plaintiff tribe names, but the funkiness of the 10-year-old HTML tables was too much, particularly since they changed formats 3 or 4 times.
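To make the regex step concrete, here is a minimal sketch in Python (not the R and command-line tools actually used for this project). The sample rows, the tribe names in them, and the "X v. United States" caption pattern are all hypothetical stand-ins, since the real ICC tables are not reproduced here and changed format several times.

```python
import re
from html import unescape

# Hypothetical sample rows; the names and layout are invented for illustration.
SAMPLE_HTML = """
<TABLE>
<tr><td>Docket 18</td><TD><a href="v01p001.pdf">Opinion</a></TD><td>Example Tribe &amp; Band v. United States</td></tr>
<tr><td>Docket 22</td><td>The Sample Nation vs. The United States</td></tr>
</TABLE>
"""

# Match table cells regardless of tag case or attributes.
CELL_RE = re.compile(r"<td[^>]*>(.*?)</td>", re.IGNORECASE | re.DOTALL)
# Strip any tags nested inside a cell, such as links.
TAG_RE = re.compile(r"<[^>]+>")
# Assumed caption pattern: "<plaintiff> v. United States" with small variations.
PLAINTIFF_RE = re.compile(
    r"^(?:the\s+)?(.+?)\s+v(?:s)?\.?\s+(?:the\s+)?united states",
    re.IGNORECASE,
)


def plaintiff_names(html):
    """Return the plaintiff part of every cell that looks like a case caption."""
    names = []
    for cell in CELL_RE.findall(html):
        text = unescape(TAG_RE.sub("", cell))
        text = " ".join(text.split())  # collapse whitespace and newlines
        match = PLAINTIFF_RE.match(text)
        if match:
            names.append(match.group(1))
    return names
```

Calling `plaintiff_names(SAMPLE_HTML)` returns only the cells that match the caption pattern. A pattern like this is exactly what breaks when the table layout changes, which is why every new format needs its own round of checking.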
The final surprise for me was how much pre-processing it took for the topics to begin to “look right” for how I wanted to analyze them. Varying the number of topics, the number of training iterations, and especially the stop word list each made a major difference.
I initially took the standard English stop word list from Voyant. I then hand-coded several tribes’ names into it. This was the “character problem” that Jockers discusses when analyzing literature. In my case, the characters of the Decisions were the tribes, lawyers, and judges mentioned in the decisions. I finally wrote an R script to pull every word from the HTML tables and add only the unique words to the stop word list. This worked too well, removing many words which might hold meaning in the decisions, such as “land”, “appraisal”, “expert”, etc. So I went back through and removed these words from the list. It’s possible that only pulling from the “plaintiff tribes” as opposed to the orders and decisions would create a suitable stop word list, but it’s a manual enough process that simply using my default stop word list will work fine.
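The stop word workflow described above can be sketched roughly as follows. This is a Python illustration, not the R script used for the project. The function name, the tiny base list, and the sample metadata string are all hypothetical; only the three protected words (“land”, “appraisal”, “expert”) come from the experience described here.

```python
import re

# Hypothetical stand-ins for the real inputs.
BASE_STOPWORDS = ["the", "of", "and"]
SAMPLE_METADATA = (
    "Example Tribe v. United States: order on land appraisal and expert testimony"
)
# Words that carry meaning in the decisions and should stay out of the list.
KEEP_WORDS = ["land", "appraisal", "expert"]


def build_stopwords(base_stopwords, metadata_text, keep_words):
    """Add every unique metadata word to the base list, except protected words."""
    tokens = re.findall(r"[a-z]+", metadata_text.lower())
    protected = {word.lower() for word in keep_words}
    candidates = set(tokens) - protected
    base = {word.lower() for word in base_stopwords}
    # A set union keeps each word once; sorting makes the output reproducible.
    return sorted(base | candidates)


stopwords = build_stopwords(BASE_STOPWORDS, SAMPLE_METADATA, KEEP_WORDS)
```

Keeping the protected words in their own list means the stop word file can be regenerated whenever the metadata changes, instead of being pruned by hand each time.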
My visualizations were a mix of topics over time and word clouds to view how the individual topics behaved. I used a cluster dendrogram to view how the topics interacted with one another.
I began to play with word similarity and term-document matrices, but wasn’t able to glean any useful analysis from the charts I created. This just reinforced how much work remains once the topic model has been created, and how difficult I found “distant” analysis.
I still have additional work and visualizations intended for this project. It’s come a long way from 2 GB of images to where it is today, but there are still insights left to be gleaned from the Decisions of the Indian Claims Commission.