Macroanalysis Methodology

The basic process for mining the Decisions involved scraping them with a downloading program called wget, running OCR on them with Google's Tesseract, and then processing them with MALLET, a program that implements Latent Dirichlet Allocation (LDA). For most of this work I used the statistical programming language R, with the integrated development environment RStudio, to perform the analysis and visualize the results. LDA, a form of topic modeling, requires much trial and error in creating stoplists, the lists of words the program ignores. In my case I started with a traditional English stoplist of words like a, the, and it, then added tribal and geographical names. Had I not removed the tribal names, my topics would mostly have been tribal groups like “Pueblos” or “Sioux.” Literary practitioners of “distant reading” call this the “character problem.”5 I removed common geographical terms, especially states and cities, because they weren't at the heart of my interest. What remained were the topics the cases were actually about. Topic modeling also requires the user to settle on a number of topics, which I likewise did by trial and error. All of my code is on my GitHub page.
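The stoplist step can be illustrated with a minimal sketch. It is written in Python rather than the R and MALLET workflow described above, and it is not the project's actual code. The word lists are tiny invented examples, and the function names (`build_stoplist`, `tokenize`, `remove_stopwords`) are hypothetical.

```python
import re

# Illustrative entries only; real stoplists are built by trial and error.
ENGLISH_STOPWORDS = {"a", "the", "it", "of", "and", "to", "in"}
TRIBAL_NAMES = {"pueblos", "sioux"}
PLACE_NAMES = {"nebraska", "oklahoma"}  # hypothetical examples


def build_stoplist(*word_sets):
    """Merge several word sets into one lowercase stoplist."""
    stoplist = set()
    for words in word_sets:
        stoplist.update(word.lower() for word in words)
    return stoplist


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(text, stoplist):
    """Return the tokens of `text` that are not in the stoplist."""
    return [token for token in tokenize(text) if token not in stoplist]


STOPLIST = build_stoplist(ENGLISH_STOPWORDS, TRIBAL_NAMES, PLACE_NAMES)
sample = "The Sioux ceded title to the land in Nebraska"
filtered = remove_stopwords(sample, STOPLIST)
print(filtered)
```

With the tribal and place names removed, only the words describing what the case is about reach the topic model. This is the effect the stoplist is meant to have.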

Typical ICC Land Case

The images below show a word cluster for each topic, in which larger words are more heavily weighted, and a graph of the topic's relative weight in each decision over time. These topics follow what Rosenthal describes as a prototypical ICC case.6

| Phase | Description | Topic Word Cluster | Topic Over Time |
| --- | --- | --- | --- |
| ONE | In the first phase, the tribe would have to prove title to their land and that no other tribe had a similar claim to it. Anthropologists frequently weighed in on the tribe's land boundaries in relation to other tribes. | Topic 21 - land title word cluster | Topic 21 - land title topic over time |
| TWO | In the second phase, the government and tribe would determine valuation and liability for the land. This would often involve experts for both sides using historic documents describing land sales to white settlers to arrive at the price per acre of the land when it was ceded to the government. | Topic 34 - land value word cluster | Topic 34 - land value topic over time |
| THREE | In the final phase, the Commission calculated allowable offsets, interest rates, and attorney fees. Essentially, the government was allowed to subtract any funds or goods it had provided to the tribe beyond what its treaties obligated. | Topic 13 - government offsets word cluster | Topic 13 - government offsets topic over time |

Expected Behaviors

The three phases match up with their topics, and the topics peak at roughly the points in the ICC's lifespan where we would expect them. So the model passes a basic logic test and behaves as expected. This baseline provides some confidence that our other findings are meaningful and not randomly generated.
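One way to make this kind of check concrete is to average each topic's weight by year and confirm that the three phase topics peak in the expected order. The sketch below is in Python and is only an illustration. The years and proportions in `doc_topics` are invented for the example and are not results from the Decisions, and the helper names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (year of decision, {topic id: proportion of the decision}).
# These numbers are made up for illustration, not taken from the model output.
doc_topics = [
    (1950, {21: 0.60, 34: 0.10, 13: 0.05}),
    (1952, {21: 0.50, 34: 0.20, 13: 0.05}),
    (1960, {21: 0.20, 34: 0.55, 13: 0.10}),
    (1962, {21: 0.10, 34: 0.45, 13: 0.20}),
    (1970, {21: 0.05, 34: 0.20, 13: 0.50}),
    (1972, {21: 0.05, 34: 0.10, 13: 0.60}),
]


def mean_weight_by_year(rows, topic_id):
    """Average a topic's proportion across all decisions in each year."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for year, weights in rows:
        totals[year] += weights.get(topic_id, 0.0)
        counts[year] += 1
    return {year: totals[year] / counts[year] for year in sorted(totals)}


def peak_year(rows, topic_id):
    """Return the year in which the topic's mean weight is highest."""
    series = mean_weight_by_year(rows, topic_id)
    return max(series, key=series.get)


# Topics 21 (title), 34 (value), 13 (offsets) should peak in that order.
peaks = [peak_year(doc_topics, topic) for topic in (21, 34, 13)]
in_expected_order = peaks == sorted(peaks)
print(peaks, in_expected_order)
```

A check like this only confirms ordering. It says nothing about whether the topic labels themselves are right, which still depends on reading the word clusters.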

A topic built around expert witnesses behaved much as the prevailing historiography describes. The topic covering payments to lawyers also became more prominent as the proceedings went on (unsurprisingly).

Topic 23 - expert witness word cluster
Topic 23 over time
Topic 27 - lawyer money word cluster
Topic 27 over time

Looking at the Decisions from a Distance

Many of the topics line up with what is expected based on the historiography. But, as Ben Schmidt has noted, topics that stand on their own and are analyzed over time should be given additional scrutiny. A cluster dendrogram is an excellent way to review how the topics relate to one another: the algorithm determines which topics are most like each other. To employ the oft-used puzzle analogy, the topics represent the decisions the way puzzle pieces represent a whole picture. If I’d sliced them into twice as many topics, they would look very different than they do now, but would still line up to form the same whole. If I reduced the number of topics, aspects of each would fold into the others, and they would likely fold in where the dendrogram indicates.

Cluster Dendrogram

So “lands approximately early found purchase” should fold into “treaty lands land ceded consideration” (lower-right of the dendrogram). This certainly makes intuitive sense, but do all the topics fit together logically?

They more or less do. There are a few odd connections. The nearest neighbor to the previously mentioned “expert witness” topic is “proposed settlement” (left third of dendrogram). Both are legally driven topics, but they would usually come at different phases of a case. If these topics were correlated, that might imply some interesting legal strategies: use an expert witness, then offer to settle. Or it could be that cases which relied on expert witnesses were more likely to settle because the underlying facts were clearly established.
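The sketch below shows how a dendrogram arrives at "nearest neighbors" like these. It is a small Python illustration of agglomerative clustering and is not the code behind the figure. It assumes single linkage and cosine distance, which may differ from the settings used for the dendrogram. The topic-word weights are invented, and the function names are hypothetical.

```python
import math
from itertools import combinations

# Hypothetical topic-word weights; real values would come from the model output.
topics = {
    "land title": {"land": 5, "title": 4, "treaty": 3},
    "land value": {"land": 4, "value": 5, "acre": 3},
    "expert witness": {"witness": 5, "testimony": 4, "expert": 4},
    "proposed settlement": {"settlement": 5, "proposed": 4, "testimony": 1},
}


def cosine_distance(a, b):
    """1 minus cosine similarity of two sparse word-weight vectors."""
    dot = sum(a[word] * b[word] for word in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


def agglomerate(vectors):
    """Single-linkage clustering; returns the merges in the order they happen."""
    clusters = [(name,) for name in vectors]
    merges = []
    while len(clusters) > 1:
        best = None
        for c1, c2 in combinations(clusters, 2):
            # Single linkage: distance between the closest pair of members.
            d = min(cosine_distance(vectors[x], vectors[y])
                    for x in c1 for y in c2)
            if best is None or d < best[0]:
                best = (d, c1, c2)
        d, c1, c2 = best
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append((c1, c2, round(d, 3)))
    return merges


merges = agglomerate(topics)
for left, right, distance in merges:
    print(left, "+", right, "at distance", distance)
```

The earliest merges correspond to the lowest joins in a dendrogram. In this toy data the two land topics join first because they share a heavily weighted word, while "expert witness" and "proposed settlement" join only through a weakly shared one. That is a reminder that a nearest neighbor need not be a close neighbor.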

The dendrogram proved useful for exploring possible connections that my reading and the prevailing historiography haven’t picked up on. There is still work to be done to analyze how the decisions work in conversation with the proposed topics. You can review the dendrogram at the right to see all of the topics and how they interact.

But one topic caught my interest because of when it peaked during the Commission's term.


5. Matthew Jockers, “Secret Recipe for Topic Modeling Themes,” for-topic-modeling-themes/, 4/12/2013.

6. Rosenthal, Their Day in Court, 161-164.