Text mining 19th century novels with the Stanford Literature Lab
Yesterday, I attended a group meeting with the Literature Lab at Stanford University’s English Department, where they presented some very cool new results on mining 19th Century British and American novels. The lab, fresh on its feet, is headed by Matt Jockers and Franco Moretti, and consists of about eight graduate students (some shown below) including Cameron Blevins (who runs history-ing), and Kathryn VanAerendonk. In this post, I’m going to describe what they’re doing, and summarize some of their results so far.
The Literature Lab is tracking changes in literary style through 19th Century novels, focusing on how the frequencies of words that share a particular theme change over time. The idea is to see whether there are any themes that have “interesting” behavior over the course of the century — where “interesting” might mean increasing prevalence, decreasing prevalence, or some kind of peak.
They have a novel (if statistically disastrous) way of defining themes. A theme (or, in their words, a “semantic field”) is a group of words that satisfies two requirements:
- The words have semantic or functional similarity.
- The words behave in the same way over time.
One theme they found started with the seed word “integrity”. They created this theme in two steps. First, using an in-house program called Correlator, they found all the words whose frequencies in novels over time were highly correlated with “integrity”, i.e. rose and fell in the same way as “integrity” over the century. Then they manually removed the words they judged unrelated to “integrity” (e.g. “bosom”). They named this theme “abstract values”.
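Correlator itself is in-house and unpublished, so the following is only a guess at the mechanism described above: compute each word’s yearly relative frequency across the corpus, then keep the words whose frequency curve has a high Pearson correlation with the seed word’s curve. The function names and the corpus format (a dict mapping year to a list of novel texts) are my own inventions for the sketch.

```python
from collections import Counter

def frequency_series(novels_by_year):
    """Per-year relative frequency of every word, pooled over that year's novels.
    `novels_by_year` maps year -> list of novel texts (plain strings)."""
    years = sorted(novels_by_year)
    series = {}
    for year in years:
        counts = Counter(w for text in novels_by_year[year]
                           for w in text.lower().split())
        total = sum(counts.values())
        series[year] = {w: c / total for w, c in counts.items()}
    return years, series

def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is flat)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    denom = (vx * vy) ** 0.5
    return cov / denom if denom else 0.0

def correlated_words(novels_by_year, seed, threshold=0.7):
    """Words whose yearly frequency curve tracks the seed word's curve."""
    years, series = frequency_series(novels_by_year)
    seed_curve = [series[y].get(seed, 0.0) for y in years]
    vocab = {w for y in years for w in series[y]}
    hits = []
    for w in vocab:
        if w == seed:
            continue
        curve = [series[y].get(w, 0.0) for y in years]
        r = pearson(seed_curve, curve)
        if r >= threshold:
            hits.append((w, r))
    return sorted(hits, key=lambda t: -t[1])
```

The manual pruning step (dropping “bosom” and the like) would then happen by hand on the returned list, which is exactly where the statistical trouble starts: the candidate words were selected for tracking the seed’s trend in the first place.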
Different seed words create themes with different trends over time, and of course, not all trends are interesting. Nevertheless, the trend of the “abstract values” theme is interesting. The frequency of the “abstract values” theme in British novels decreased from about 0.8% of all words to about 0.2% of all words between 1791 and 1903, and the frequency of the “abstract values” theme in American novels decreased from about 0.6% to about 0.2% between 1789 and 1874.
Interestingly, they found another theme, starting with the seed word “hard”, which has the opposite trend, and contains more concrete words. The “hard” theme in British novels increased in frequency from about 1% to about 3% over the 19th Century, and in American novels increased from about 1% to about 2.5%.
To the literary scholars in the group, these serendipitous results suggest
“a more fundamental shift in the style of narration from abstraction to concreteness, from telling to showing. No longer talking about abstract values but embodying them in actions.”
Nevertheless, the work is still at an early stage. These are initial experiments, not yet statistically sound enough to show that the findings reflect real trends in the data rather than artifacts of the trend-reinforcing way in which the theme words were chosen.
To address this, they need a more complete analysis. I can think of three additions. The first is to mine for themes on a held-out subset of the novels, and then test for trends on the remaining set. The second is to use a thesaurus or dictionary as a source of related vocabulary, and to see whether the trends remain. The third (potentially embarrassing) experiment is to randomly assign novels to years and see if any “interesting” or “thought-provoking” patterns still emerge.
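The random-reassignment suggestion is essentially a permutation test, which can be sketched as follows (hypothetical names, nothing here comes from the Lab’s actual pipeline): shuffle the year labels of an observed frequency series many times, and count how often chance alone produces a trend at least as steep as the real one.

```python
import random

def slope(years, values):
    """Least-squares slope of values regressed against years."""
    n = len(years)
    my, mv = sum(years) / n, sum(values) / n
    num = sum((y - my) * (v - mv) for y, v in zip(years, values))
    den = sum((y - my) ** 2 for y in years)
    return num / den

def permutation_p_value(years, freqs, n_perm=10000, seed=0):
    """How often does randomly reassigning frequencies to years produce a
    trend at least as steep as the observed one? Small values suggest the
    observed trend is unlikely to be a chance artifact."""
    rng = random.Random(seed)
    observed = abs(slope(years, freqs))
    shuffled = list(freqs)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(slope(years, shuffled)) >= observed:
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (extreme + 1) / (n_perm + 1)
```

A steadily declining series like the “abstract values” frequencies should yield a very small p-value, while a flat or randomly jittered series should not.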
If trends like this are real, they would be fascinating, but they do need to actually exist. This is why digital humanities researchers, like all other scientists, need to talk to statisticians. Right now, they are new to this methodology and know just enough probability to be dangerous. Experimental rigor in the form of held-out data, cross-validation, and hypothesis testing is still foreign to them.
Caveats about scientific rigor aside, I’d like to draw attention to how text mining was used in combination with data visualization to uncover patterns that would otherwise have been extremely difficult to spot, and to spark a whole set of interesting hypotheses.
In the past, I’ve seen humanities scholars treat text mining as a curious novelty, used to confirm something they already know or to quantify an existing academic intuition, but not entirely to be trusted. But yesterday, I saw text mining used as more than that: as a way to provoke investigation, find interesting hypotheses, and ask questions that didn’t exist before.
Update, May 19th, 2010: The graphs and example themes that accompanied this post have been removed because the Literature Lab informed me that they are for internal use only.