Tools for Exploring Text: Natural Language Processing
Natural language processing (NLP), also known as computational linguistics, is a set of models and techniques for analyzing text computationally. In the context of the digital humanities, it can help take a question that a literary scholar or historian might ask of a body of text, and help turn it into a quantitative hypothesis. In a previous post, I talked about how visualization can be used to get a sense of text; this is the next in the series.
Throughout this post, we’ll try to answer a hypothetical question a scholar in the humanities, perhaps a literary scholar or historian, might be interested in:
“How is the character Mary talked about in this novel or historical text?”
It’s fairly open-ended – what does “talked about” mean? How do we translate this into computational terms? In this post, I’ll describe some tools that natural language processing (NLP) has to offer, and show how each can be used to tackle this question, along with pointers to software and tutorials.
The goal of NLP is to model the workings of natural language as we speak, read, and write it, so all the tools here are motivated by some kind of language model.
N-grams
N-grams are strings of consecutive words within a sentence. Take the sentence,
Mary was born on a cold March morning.
Single words such as morning are 1-grams; cold March and born on are examples of 2-grams. This might seem like a crude way of modeling a language, but n-grams capture a lot of information because we speak grammatically. We can use them to get a sense of how Mary is talked about, for example, by asking what 3-grams and 4-grams we can find that match patterns like:
- Mary is ____.
- Mary is a ____.
- Mary is an ____.
Toolkits like NLTK and openNLP come with tutorials that explain how to get started on such analyses. We may find that Mary is “a darling”, but of course we may also end up with a fragment like “Mary is getting”, from the sentence “Mary is getting sleepy”, which isn’t what we’re looking for.
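As a minimal sketch of this kind of search, the following uses only the Python standard library rather than NLTK or openNLP. The rough regular-expression tokenizer and the helper names ngrams, tokenize, and completions are assumptions made for illustration; the sentences are the ones quoted in this post.

```python
import re

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tokenize(sentence):
    """Very rough tokenizer: keep alphabetic words, drop punctuation."""
    return re.findall(r"[A-Za-z']+", sentence)

def completions(sentences, prefix):
    """Words that fill the blank in n-grams of the form '<prefix> ____'."""
    prefix = tuple(prefix.split())
    found = []
    for sentence in sentences:
        for gram in ngrams(tokenize(sentence), len(prefix) + 1):
            if gram[:-1] == prefix:
                found.append(gram[-1])
    return found

sentences = [
    "Mary was born on a cold March morning.",
    "Mary is a darling.",
    "Mary is getting sleepy.",
]

print(ngrams(tokenize(sentences[0]), 2)[:3])
print(completions(sentences, "Mary is"))    # also picks up the fragment "getting"
print(completions(sentences, "Mary is a"))
```

Matching on a fixed prefix is what lets unwanted fragments through: the search knows nothing about what kind of word fills the blank.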
N-grams capture information about grammar through frequency of use: the less frequent an n-gram is, the less likely it is to be grammatical. But as early as 1957, Noam Chomsky argued that there is much more to modeling grammar than this in his famous sentence,
“Colorless green ideas sleep furiously.”
which is perfectly grammatical even though it contains no frequent n-grams.
Parts of Speech
Parts of speech (POS) are a more detailed way of modeling how we use words: verbs refer to actions, adjectives describe properties of things, nouns refer to entities, and so on. NLP algorithms have long been capable of assigning part-of-speech labels to words in sentences with high accuracy. This task is called POS-tagging, and we can use it to refine our analysis of how Mary is talked about by asking, “What are the adjectives that occur within five words of ‘Mary’?” From a fragment like:
It was a cloudy day, which young Mary found fortunate. “Are we close yet?” her companion asked. “No, actually, it’s quite far,” Mary replied.
we would get adjectives such as cloudy, young, fortunate, close, and far. While this is more precise than an n-gram analysis in that we only see adjectives now, it’s still not perfect because only young refers to Mary. This is because “within five words” is still an approximation for what we really want: adjectives that refer to Mary. We could try this analysis again with a smaller window of 1 or 2 words, but then we might miss many adjectives.
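The window search can be sketched in a few lines of plain Python. The part-of-speech tags below are assigned by hand in Penn Treebank style, as an assumption standing in for the output of a real tagger, and punctuation is left out so that the window counts words only. The function name adjectives_near is hypothetical.

```python
# Hand-tagged tokens for the fragment above (an assumption, not tagger output).
tagged = [
    ("It", "PRP"), ("was", "VBD"), ("a", "DT"), ("cloudy", "JJ"), ("day", "NN"),
    ("which", "WDT"), ("young", "JJ"), ("Mary", "NNP"), ("found", "VBD"),
    ("fortunate", "JJ"), ("Are", "VBP"), ("we", "PRP"), ("close", "JJ"),
    ("yet", "RB"), ("her", "PRP$"), ("companion", "NN"), ("asked", "VBD"),
    ("No", "UH"), ("actually", "RB"), ("it", "PRP"), ("'s", "VBZ"),
    ("quite", "RB"), ("far", "JJ"), ("Mary", "NNP"), ("replied", "VBD"),
]

def adjectives_near(tagged, target, window):
    """Adjectives (tags starting with JJ) within `window` tokens of `target`."""
    found = []
    for i, (word, _) in enumerate(tagged):
        if word != target:
            continue
        start, stop = max(0, i - window), i + window + 1
        for neighbour, tag in tagged[start:stop]:
            if tag.startswith("JJ") and neighbour not in found:
                found.append(neighbour)
    return found

print(adjectives_near(tagged, "Mary", 5))
print(adjectives_near(tagged, "Mary", 1))
```

Shrinking the window trades one kind of error for another: fewer unrelated adjectives come back, but so do fewer of the ones that genuinely describe Mary.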
Parsing Phrase Structure
The structure of natural language extends beyond parts of speech, because words have relationships with each other. For example, in English, we say that every sentence has a main verb, which has a subject and, depending on the verb, an object, and an indirect object as well. These constituent parts can be small units like nouns, or bigger units, phrases which have their own constituents. NLP algorithms called parsers analyze sentences and return their internal structure. The Berkeley Parser, for example, parsed the following sentence:
She thinks Mary is nice to animals.
In the resulting parse tree (Figure 1), the symbols on each branch represent parts of speech and phrase types. For example, ADJP is an adjective phrase, NP a noun phrase, and VP a verb phrase. Here is a description of the standard set of labels.
We can use these concepts to ask very precise questions now. Referring to the tree above, if we’re searching for descriptions like “Mary is ____”, we’re searching for ADJPs (adjective phrases) that are part of a VP (verb phrase) containing the word is, where that VP immediately follows the word Mary.
The easiest parser to use is the Stanford Parser, which parses about 4-5 sentences a second. Using their Tregex software (which is a little harder to use), you can browse the output and search for specific patterns like the one above.
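To make the pattern concrete, here is a rough sketch in plain Python. The tree is written out by hand as nested tuples and is only an assumption about what a parser might return for the example sentence. The search function is a simple stand-in for a tree query, not Tregex syntax, and all the names in it are hypothetical.

```python
# Hand-written tree: (label, child, child, ...); a leaf is (POS tag, word).
TREE = (
    "S",
    ("NP", ("PRP", "She")),
    ("VP",
     ("VBZ", "thinks"),
     ("SBAR",
      ("S",
       ("NP", ("NNP", "Mary")),
       ("VP",
        ("VBZ", "is"),
        ("ADJP",
         ("JJ", "nice"),
         ("PP",
          ("TO", "to"),
          ("NP", ("NNS", "animals")))))))),
)

def is_leaf(node):
    return len(node) == 2 and isinstance(node[1], str)

def words(node):
    """All words under a node, in sentence order."""
    if is_leaf(node):
        return [node[1]]
    return [w for child in node[1:] for w in words(child)]

def descriptions_of(node, name):
    """Yield the text of ADJPs inside a VP that starts with 'is' and
    directly follows an NP consisting of `name`."""
    if is_leaf(node):
        return
    children = node[1:]
    for left, right in zip(children, children[1:]):
        if (left[0] == "NP" and words(left) == [name]
                and right[0] == "VP" and not is_leaf(right)
                and words(right[1]) == ["is"]):
            for part in right[2:]:
                if part[0] == "ADJP":
                    yield " ".join(words(part))
    for child in children:
        yield from descriptions_of(child, name)

print(list(descriptions_of(TREE, "Mary")))
```

Because the match is on structure rather than on word distance, it returns the whole adjective phrase and ignores anything that merely sits near Mary.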
Dependency relations: grammatical structure
The most precise way to ask which adjectives describe Mary is to look directly at grammatical relationships, and ask which adjectives modify Mary. Modern parsers can do this accurately. For example, the Stanford Parser could look at the phrase structure in the sentence above (Figure 1) and return the following relations:
- nsubj(thinks, She)
- ccomp(thinks, is)
- cop(nice, is)
- nsubj(nice, Mary)
- xcomp(nice, animals)
The relation nsubj(nice, Mary) indicates, for example, that nice is what Mary is. Parsers that extract these types of relationships are called dependency parsers; they extract these grammatical relationships from phrase structures like the one in Figure 1. The Stanford Parser is one of many that include this ability, and here [PDF] is a list of all the dependency relationships it can extract.
Using dependency parsers gives us a lot of power: we can ask for all the adjectives that apply to Mary and locate them with high accuracy. We can find the verbs of which Mary is a subject, and those of which she is an object and see if there are any interesting patterns, or we can look at all the conjunctions in which Mary participates. A visualization like the one above, specifically designed for visualizing grammatical relationships (more here), might then make excellent food for thought.
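Once the relations are available, questions like these become simple filters. The sketch below is plain Python over triples transcribed by hand from the list above, written as (relation, governor, dependent). It does not call the Stanford Parser, and the function names are hypothetical.

```python
# (relation, governor, dependent), copied by hand from the example sentence.
dependencies = [
    ("nsubj", "thinks", "She"),
    ("ccomp", "thinks", "is"),
    ("cop", "nice", "is"),
    ("nsubj", "nice", "Mary"),
    ("xcomp", "nice", "animals"),
]

def predicates_of(dependencies, subject):
    """Governors of nsubj relations whose dependent is `subject`."""
    return [gov for rel, gov, dep in dependencies
            if rel == "nsubj" and dep == subject]

def copular_predicates_of(dependencies, subject):
    """Predicates tied to the subject through a copula such as 'is',
    which is how descriptions like 'Mary is nice' show up."""
    with_copula = {gov for rel, gov, _ in dependencies if rel == "cop"}
    return [p for p in predicates_of(dependencies, subject) if p in with_copula]

print(predicates_of(dependencies, "Mary"))
print(copular_predicates_of(dependencies, "Mary"))
print(copular_predicates_of(dependencies, "She"))
```

Filtering on the cop relation separates what Mary is from what Mary does, which a window of nearby words cannot do.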
Topic modeling: a statistical approach
With the availability and relative popularity of topic modeling algorithms in machine learning toolkits like Mallet, it would not be appropriate to leave this class of analysis out of my post. Topic models were originally developed as a way to represent a large collection of documents in a compact way, but are interesting to more people now because the “topics” they produce can sometimes correspond to coherent concepts.
One way of representing a document in a compact way is by representing it as a set of word counts. This bag-of-words contains no information about relative ordering, only information about co-occurrence. Topic modeling is motivated by the idea that there are more words in a language than topics to which they belong, so documents can be represented even more compactly by a set of topics, where a topic itself encodes some distribution of the probability of words. For example, one can imagine that every article in the literature on psychology can be compactly represented by its proportions of a vocabulary of topics such as experiments, personality, drugs, theories, cognition etc.
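Both representations are easy to sketch in plain Python. The example below does not fit a topic model; the two topics, their word probabilities, and the document’s topic proportions are invented purely for illustration, and the function names are hypothetical.

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercase, whitespace-separated words; word order is discarded."""
    return Counter(text.lower().split())

first = bag_of_words("Mary thanked her companion")
second = bag_of_words("her companion thanked Mary")
print(first == second)  # same words in a different order give the same bag

# Hypothetical topics: each maps words to probabilities that sum to 1.
topics = {
    "weather": {"cloudy": 0.5, "cold": 0.25, "morning": 0.25},
    "travel": {"far": 0.4, "close": 0.4, "companion": 0.2},
}

# Hypothetical document, described only by its topic proportions.
document_mix = {"weather": 0.25, "travel": 0.75}

def word_probability(word, mix, topics):
    """Probability of a word under a mixture of topic distributions."""
    return sum(share * topics[name].get(word, 0.0)
               for name, share in mix.items())

print(word_probability("far", document_mix, topics))
print(word_probability("cloudy", document_mix, topics))
```

The compactness comes from the last step: two numbers describe the document, and the word probabilities are shared across every document in the collection.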
Below are the most frequent words in the 9 most frequent automatically-extracted topics in the abstracts of the Psychological Review, extracted using this topic modeling toolbox.
‘similarity bias strategies drug systematic biases conditions’
‘order serial search process parallel elements attention’
‘stimulus response stimuli responses color cs increase’
‘ss s change rate normal underlying practice’
‘self individual situations individuals those others consequences’
‘environment general behaviors constraints internal other external’
‘2 experiments single results experimental high trial’
‘personality variables measures research consistency issues cross’
‘pattern patterns changes critical false food sequences’
This method can be applied to any text, and can give interesting results when paired with humanistic intuition. For an illustrative example from the digital humanities (much better than any I could make up involving Mary), read the work of Cameron Blevins, a history Ph.D. student at Stanford, who has used topic modeling to glean relationships and trends from a text he was studying: Martha Ballard’s diary. Finally, for an excellent and thorough introduction aimed at the digital humanities audience, I can’t think of a better piece than Scott Weingart’s guided tour of topic modeling.
The techniques I’ve talked about here are building blocks. Natural language processing algorithms exist for many more complicated (and potentially more useful) purposes: named entity recognition, semantic similarity calculation, relationship extraction, opinion mining, pronoun resolution, summarization, question answering, translation…
There are many tools, and they’re probably very badly documented, but hopefully I’ve managed to advance the case for considering sophisticated language processing like this part of the natural toolkit of the digital humanities.