December 7, 2010 / silverasm

WordSeer: Exploring Language Use in Slave Narratives

More and more source text in the humanities is digitized every day, making it accessible to large-scale computational analysis. Nevertheless, traditional methods of humanistic analysis are based on detailed arguments built on close readings of individual texts. How will the field adapt? How do we use statistics and text mining to answer humanistic questions?

Zoom in to the field of American literature, and further into the realm of studying the (digitized) narratives of escaped former slaves, published by white abolitionists. There are widespread stylistic and thematic similarities among these narratives. How can text mining help literature scholars here? That’s where WordSeer, my latest project, comes in.

The MONK project at CMU and the Voyeur project at McMaster University share the same cause as WordSeer. But when it comes to text analysis, they are essentially search interfaces that show simple statistics about word order, type, and frequency. The grammatical relationships within the text are neglected.

WordSeer

WordSeer is an evolving project, as all digital humanities projects inevitably are. As my friends in the English department and I learn what we can do for each other, it will become steadily better defined, but right now it's simple: a search interface and a reading interface. The search interface allows queries based on grammatical structure, and the reading interface is for reading narratives, comparing them, and coming up with new queries.

Search

The search screen is shown below. It supports standard keyword-based search, so scholars can look for words or exact matches in the text. More interestingly, there’s grammatical search. Using grammatical relationships extracted through natural language processing, users can ask how things were described, what actions were performed upon them and by them, who possessed certain things, or what was possessed by them.

For example, the figure above (click for a larger image in a new window) shows the query 'give all adjectives that are applied to the words "slave, bondman, negro"'. The system returns not only a list of occurrences in the narratives, but also automatically generated graphs showing the frequencies of the different words. As you can see, "poor" is the most frequent adjective. The results are sortable and filterable: clicking on bars filters the list to show just the results containing those words. Above, I've filtered to show just the instances where "valuable" is applied to "slave".
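
To make the mechanics concrete, here's a minimal sketch of how such a query could be answered once the grammatical relations have been extracted. The triples and the "amod" (adjective modifier) label below are illustrative assumptions, not WordSeer's actual storage schema:

```python
# Illustrative sketch: answering 'adjectives applied to "slave, bondman,
# negro"' against a store of (governor, relation, dependent) triples.
# The triples below are made up for the example.
from collections import Counter

triples = [
    ("slave", "amod", "poor"),
    ("slave", "amod", "valuable"),
    ("negro", "amod", "poor"),
    ("beat", "nsubj", "man"),
    # ... one row per relation extracted during preprocessing
]

targets = {"slave", "bondman", "negro"}
adjective_counts = Counter(
    dependent
    for governor, relation, dependent in triples
    if relation == "amod" and governor in targets
)

# The frequency graph in the figure is essentially this tally:
print(adjective_counts.most_common())  # [('poor', 2), ('valuable', 1)]
```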

Reading

Interviews with our literary scholar friends suggested that a search interface alone would not be enough, so WordSeer supports reading narratives individually.

The reading view is shown below. Scholars can select one (or indeed many) sentences from the search results and be taken to a reading screen, where the narratives are opened to the correct place. Grammatical search doesn't end there, however, because the entire text is interactive.

Highlighting a portion of a sentence and clicking the “examine” button (bottom right corner) shows the text pattern, as well as all the grammatical relationships in the highlighted portion. For example, I clicked on a passage about hospitals, and was presented with the pattern-examiner screen (below).

I can select some patterns, either the original passage or some of its grammatical patterns, and examine them further: I can use them as search queries and be taken back to the original search screen, save them for later, or view their distributions in the text I'm reading.

Being able to compare the distribution of phrases or patterns across texts can give an idea of how similar the texts are, or of how much their subject matter overlaps. For example, if I wanted to know where plantations were mentioned in these texts, I would highlight the word "plantation" and click "See in Text", giving the result below.

The white column represents the length of the entire text, and the green bars indicate where the pattern of interest occurred. If I had selected multiple patterns, I would see bars in different colors. Clicking on any of the little green bars takes me to an occurrence of the pattern, highlighted in the text.
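
Under the hood, that view boils down to computing each match's relative offset in the text. Here's a rough sketch; the helper below is hypothetical, not WordSeer's actual code:

```python
# Rough sketch of the "See in Text" column: map each occurrence of a
# pattern to its relative position (0.0 = start of text, 1.0 = end),
# i.e., where each green bar would be drawn on the white column.
def occurrence_positions(text, pattern):
    positions = []
    start = 0
    while True:
        i = text.find(pattern, start)
        if i == -1:
            break
        positions.append(i / len(text))
        start = i + 1
    return positions

with open("narrative.txt") as f:
    narrative = f.read().lower()

print(occurrence_positions(narrative, "plantation"))
```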

Language Processing

All of this works because I applied language processing to the text beforehand, and stored the information in a database for quick access. I applied part-of-speech tagging, syntactic parsing, and dependency parsing to decompose sentences into their grammatical constituents. For example, the sentence "The cruel man beat us severely" contains the word "cruel", which is an adjective modifier of the noun "man". There is a verb-object relation between "beat" and "us", and a verb-subject relation between "man" and "beat".
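
To see what those relations look like in practice, here is a small sketch using spaCy as a stand-in parser (WordSeer's own pipeline used the Stanford tools, so the relation labels differ slightly):

```python
# Dependency-parse the example sentence and print each word's grammatical
# relation to its governing word. spaCy is a stand-in here, not the
# parser WordSeer actually uses.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cruel man beat us severely.")

for token in doc:
    print(f"{token.text:10} {token.dep_:8} -> {token.head.text}")

# Among the output:
#   cruel    amod   -> man    (adjective modifier)
#   man      nsubj  -> beat   (verb subject)
#   us       dobj   -> beat   (verb object)
```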

If you want to know more about natural language processing, I recently gave a BootCamp about text mining at THATCamp SF; here are the slides [pdf]. I also wrote a blog post introducing the subject for a digital humanities audience.

What next?

Syntactic analysis is just a small part of what natural language processing can do. Right now, I'm working on tracking named entities through a narrative, to see the descriptions applied to them and the actions in which they participate.
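
As a taste of that direction, named-entity recognition is the easy first step; linking later mentions like "the brave man" back to the same person (coreference resolution) is the harder part. A hedged sketch, again with spaCy standing in for the eventual pipeline:

```python
# First step of entity tracking: find PERSON mentions in a passage.
# NER alone finds the names; it does not resolve "The brave man" back
# to Douglass -- that requires coreference resolution.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Frederick Douglass escaped to the North. "
          "The brave man later lectured widely.")

for ent in doc.ents:
    if ent.label_ == "PERSON":
        print(ent.text, ent.start_char, ent.end_char)
```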

7 Comments

  1. Sangeeta Saksena / Dec 11 2010 7:30 am

    Exciting work! I liked the automatically generated graphs. Looks like the little green ideas are awake and working furiously!

    • silverasm / Dec 11 2010 9:39 am

      Aw, thanks mom!

  2. Joan Shaffer / Mar 16 2011 9:08 pm

    Terrific project!

  3. Steve D / Dec 22 2011 12:55 pm

    I notice that the word-list gives words with a part of speech — but is it really the total for just the occurrences of the word-form that were that POS? (You seem to be using the Penn tag-set — via NLTK or GATE or something?) I think you need to be clearer about what is actually being shown. Also, it would be invaluable to be able to directly access the *other* POS entries for the same spelling, and also the other forms of the same lemma (which are often, though certainly not always, what the linguistic and/or literary scholar wants). And how would this scale to a far, far larger collection?

    You should look at the GramCord package, which has provided similar functionality for the Greek NT since the '80s, and you might want to consider providing not merely hits-per-work, but a hit-density display (sliding and non-sliding window approaches) and some other fairly standard measures. There's a substantial literature on this sort of thing, e.g., in ACH/ALLC journals and conferences.

    Nice GUI; needs more functionality to make a real difference. Promising, though, and I hope you take it further.

    -s

    • silverasm / Dec 27 2011 9:12 am

      Thanks for the comments Steve,

      I notice that the word-list gives words with a part of speech — but is it really the total for just the occurrences of the word-form that were that POS? (You seem to be using the Penn tag-set — via NLTK or GATE or something?) I think you need to be clearer about what is actually being shown.

      1. Tags were obtained using the Stanford POS Tagger.
      2. Yes, the totals are for just that POS — and I'm sorry if it wasn't clear; this isn't polished enough for general use yet, as you point out.

      Also, it would be invaluable to be able to directly access the *other* POS entries for the same spelling, and also the other forms of the same lemma (which are often, though certainly not always, what the linguistic and/or literary scholar wants).

      Great point! Thank you!

      And how would this scale to a far, far larger collection?

      It’s an open question right now what size of collection we want to support. Obviously it doesn’t scale to more than a thousand documents in its current form — but does it need to run at a much more massive scale? My feeling is that a maximum of two to three thousand documents is the target for the kind of workflow we’re supporting: close reading combined with exploration around a specific sensemaking task.

      you might want to consider providing not merely hits-per-work, but hit density display (sliding and non-sliding window approaches) and some other fairly standard measures. There’s a substantial literature on this sort of thing, e.g., in ACH/ALLC journals and conferences.

      Not sure what you mean here — TileBars-like stuff?

  4. Matt B. / Sep 12 2012 8:10 am

    Dear Mrs. Muralidharan,

    I’ve just discovered your blog and find the subject matter very exciting. I’m a librarian at a public library, and I’m very interested in these subjects. Currently, the library has a fair-sized collection of local historical materials. Within a year, this will increase dramatically due to the addition of a very large collection of materials from a local historian/genealogist. I would like to ask if you can suggest any sources that you consider useful as a basic introduction to this subject.

    Thank you for your consideration.

    Matt

