A big question for me, as a designer of text analysis tools for the humanities, is: how do the tools I’m building fit in? Sure, you can have fancy word trees and grammatical search histograms. Sure, they’re chock-full of interesting information that you can make an argument about. But where exactly in the humanistic analysis process does a scholar need things like that? I have no idea.
But there’s more. I don’t just build tools, I build environments. And that means support for reading the text, navigating it, searching it, and (most importantly) “working” with it. And I have no idea what that means either. So over the past few weeks I’ve been having hour-long chats with late-stage PhD students from the literature and history departments, and asking them to tell me about how they do research. I asked all kinds of confusing and mundane questions like, “How do you decide what to underline?” and, “Can you define formalism for me?” and, “You mean you actually copy it out by hand?” and, “How do you organize all the quotes you collect?” and, “How do you go about proving that?” and, “So you scanned in everything in those boxes?”
I only did twelve of those interviews, but patterns began to emerge. So I did a survey. A simple one, with six questions about reading habits. This survey’s purpose was to confirm whether some of the patterns I noticed around reading were general. If you just want the charts summarizing the responses, you can find them here (those numbers include around 20 more responses I got while I was writing this post). For a full analysis in which I extract some general patterns in humanities scholars’ reading processes, read on.
A common task in literature study is to find examples of a theme. Until now, literary scholars searching for examples have had to rely on searching for sets of words they think are associated with the theme.
Theme-finding by searching for words poses a problem. Synonymy and the infinite variance of language mean that the same theme might surface in many different forms using many different words. Even for scholars with intimate knowledge of the text, a single set of words is not enough. Depending on their mental context, the words that come to mind might not always be complete and representative.
For example, take the Shakespearean theme of “seeing is believing” — that seeing an event with one’s own eyes is more credible than hearing about it second-hand. A scholar might search for the words “believe”, “speak”, “eyes”, and “see”. That search might be able to capture this example (from The Winter’s Tale 5.2):
Then have you lost a sight, which was to be seen, can not be spoken of.
but not this one (from King Lear 4.6):
I would not take this from report; it is, And my heart breaks at it.
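The gap between those two quotations can be reproduced with a naive keyword match. This is only a sketch: the function name is made up for illustration, and prefix matching (so that “see” also catches “seen”) is an assumption standing in for whatever stemming a real search box might do.

```python
import re

# The four words the hypothetical scholar searches for.
QUERY_WORDS = ["believe", "speak", "eyes", "see"]

WINTERS_TALE = "Then have you lost a sight, which was to be seen, can not be spoken of."
KING_LEAR = "I would not take this from report; it is, And my heart breaks at it."


def matched_words(sentence, query_words):
    """Return the query words that begin at least one word of the sentence.

    Prefix matching is an assumption made for this sketch; it is a crude
    substitute for stemming.
    """
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [q for q in query_words if any(t.startswith(q) for t in tokens)]


print(matched_words(WINTERS_TALE, QUERY_WORDS))  # found via "seen"
print(matched_words(KING_LEAR, QUERY_WORDS))     # nothing matches
```

The first sentence is found only because “seen” happens to start with “see”; “spoken” does not start with “speak”, and the second sentence shares no query word at all.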
As a solution, we at WordSeer propose search-by-example. This technology dates back to the 1980s in the field of information retrieval, and so far, it’s been successful in helping find relevant documents. We think it could work for theme-finding too.
With search-by-example, instead of inferring which words represent a theme, and then searching for those words, a scholar can search for sentences that match a set of examples. A scholar marks a set of examples of a theme, and the system returns a list of sentences it thinks are relevant.
This process is a cycle. When the system returns results, the scholar gives it feedback by labeling sentences “relevant” if they match the theme, and “not-relevant” if they don’t. The system gradually builds a model of what the scholar is interested in, and eventually returns results that are mostly relevant.
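To make the cycle concrete, here is a minimal sketch of one way such a feedback loop could work. It uses a Rocchio-style word profile and cosine similarity, which is a textbook relevance-feedback technique and not necessarily what WordSeer does internally. The stopword list, function names, and weights (beta, gamma) are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {
    "a", "and", "at", "be", "can", "from", "have", "i", "if", "in", "is",
    "it", "me", "my", "not", "now", "of", "or", "should", "that", "the",
    "then", "they", "this", "to", "was", "which", "would", "you",
}


def vectorize(sentence):
    """Bag-of-words counts for a sentence, ignoring stopwords."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def build_profile(relevant, not_relevant=(), beta=1.0, gamma=0.5):
    """Rocchio-style profile: average the relevant sentences, subtract a
    weighted average of the not-relevant ones, and drop non-positive weights."""
    profile = defaultdict(float)
    for sentence in relevant:
        for word, count in vectorize(sentence).items():
            profile[word] += beta * count / len(relevant)
    for sentence in not_relevant:
        for word, count in vectorize(sentence).items():
            profile[word] -= gamma * count / len(not_relevant)
    return {word: weight for word, weight in profile.items() if weight > 0}


def score(sentence, profile):
    """Cosine similarity between a sentence and the current profile."""
    vector = vectorize(sentence)
    dot = sum(count * profile.get(word, 0.0) for word, count in vector.items())
    norm_v = math.sqrt(sum(c * c for c in vector.values()))
    norm_p = math.sqrt(sum(w * w for w in profile.values()))
    if norm_v == 0 or norm_p == 0:
        return 0.0
    return dot / (norm_v * norm_p)


def rank(candidates, profile):
    """Order candidate sentences from most to least similar to the profile."""
    return sorted(candidates, key=lambda s: score(s, profile), reverse=True)


# Seed examples: the two quotations discussed earlier in this post.
seeds = [
    "Then have you lost a sight, which was to be seen, can not be spoken of.",
    "I would not take this from report; it is, And my heart breaks at it.",
]
tempest = "If in Naples I should report this now, would they believe me?"
king_john = "Believe me, I do not believe thee, man"  # fragment of a longer speech
hamlet = "To be, or not to be, that is the question"
candidates = [hamlet, king_john, tempest]

# Round 1: only the seeds are known, so only "report" links to a candidate.
first_round = rank(candidates, build_profile(seeds))

# Round 2: the scholar labels the Tempest line relevant and the Hamlet line
# not-relevant; "believe" now enters the profile.
profile = build_profile(seeds + [tempest], not_relevant=[hamlet])
second_round = rank([hamlet, king_john], profile)
```

In the first round the King John fragment shares no content word with the seeds, so it scores zero. Once the Tempest line is labeled relevant, “believe” joins the profile and the fragment rises above the Hamlet line. That is the gradual model-building described above, in miniature.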
For example, in under five minutes, I was able to use the examples above to come up with seven more candidates:
Gracious my lord, I should report that which I say I saw, But know not how to do’t. (Macbeth 5.5)
Most noble sir, That which I shall report will bear no credit, Were not the proof so nigh. (Winter’s Tale 5.1)
I would not hear your enemy say so, Nor shall you do mine ear that violence, To make it truster of your own report Against yourself: I know you are no truant. (Hamlet 1.2)
If in Naples I should report this now, would they believe me? (The Tempest 3.3)
They call him Doricles; and boasts himself To have a worthy feeding: but I have it Upon his own report and I believe it; He looks like sooth. (Winter’s Tale 4.4)
It is not so; thou hast misspoke, misheard; Be well advised, tell o’er thy tale again: It can not be thou dost but say ’tis so: I trust I may not trust thee; for thy word Is but the vain breath of a common man: Believe me, I do not believe thee, man; I have a king’s oath to the contrary. (King John 3.1)
I do beseech you, either not believe The envious slanders of her false accusers; Or, if she be accused on true report, Bear with her weakness, which, I think, proceeds From wayward sickness, and no grounded malice. (Richard III 1.3)
Of course, this is all theory until it’s been proven to work. And while I’m not a Shakespeare scholar, I did build this particular system, so it might not be surprising that I can get a few results out of it.
So to find out whether search-by-example works, we’ve designed a five-minute study around three Shakespearean themes. There are three systems: one keyword search system and two different example-based ones. Participants are shown an example of a theme and asked to use a system to find as many relevant results as they can in five minutes. The system and theme are randomly assigned.
We’ll find our answer by comparing the quality and quantity of the sentences the participants find on the three systems. Expert scholars will help us judge quality: they will rate the relevance of sentences the different systems produce (without knowing which system produced which sentence). For quantity, there is a time limit — which system produces more high-quality results in five minutes?
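As a rough sketch of how that comparison might be tallied, the snippet below counts, for each five-minute session, how many found sentences the experts rated highly, then averages those counts per system. Everything here is an assumption for illustration: the 1-to-5 rating scale, the threshold of 4, the field names, and the system labels. The sample rows are invented placeholders to exercise the code, not study results.

```python
from collections import defaultdict
from statistics import mean


def mean_high_quality_per_session(found, threshold=4):
    """Average number of highly rated sentences per session, by system.

    `found` is a list of dicts with keys "system", "session", and "rating"
    (hypothetical field names; the 1-5 expert rating scale is an assumption).
    """
    counts = defaultdict(int)   # session -> number of high-quality sentences
    system_of = {}              # session -> system used in that session
    for row in found:
        system_of[row["session"]] = row["system"]
        counts[row["session"]] += 1 if row["rating"] >= threshold else 0
    by_system = defaultdict(list)
    for session, count in counts.items():
        by_system[system_of[session]].append(count)
    return {system: mean(values) for system, values in by_system.items()}


# Invented placeholder rows, NOT real data from the study.
placeholder_rows = [
    {"system": "system_A", "session": "s1", "rating": 5},
    {"system": "system_A", "session": "s1", "rating": 2},
    {"system": "system_A", "session": "s2", "rating": 4},
    {"system": "system_A", "session": "s2", "rating": 4},
    {"system": "system_A", "session": "s2", "rating": 1},
    {"system": "system_B", "session": "s3", "rating": 5},
    {"system": "system_B", "session": "s3", "rating": 5},
    {"system": "system_B", "session": "s3", "rating": 3},
    {"system": "system_B", "session": "s3", "rating": 4},
]

summary = mean_high_quality_per_session(placeholder_rows)
print(summary)
```

Counting per session rather than pooling all sentences keeps the five-minute limit meaningful: a system used in more sessions does not look better simply because more people used it.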
So, does example-based exploration work better than search for theme finding?
If you have five minutes, you can help us find out by participating in the study:
A new version of WordSeer is in the works.
It’s been guided by the advice of our long-suffering literature-scholar collaborators, and by the tales of frustration and trial-and-error from the students of the Hamlet class who tried to use WordSeer to analyze parts of the play. We also thought hard about the text analysis process as a series of steps. “What might Tanya Clement have been thinking and doing at each stage of her computational analysis of repetition in Gertrude Stein’s The Making of Americans?” “What about when we analyzed language use differences in the descriptions of men and women in Shakespeare?” Out of this has come a better (we hope) understanding of the needs of scholars of text in the humanities.
We’ve completely rebuilt WordSeer. Instead of a traditional web application with a different visualization on each page, WordSeer now works more like an environment, almost like a desktop, with windows, menu bars, and persistent, useful objects.
However, as researchers in Human-Computer Interaction, we know that we need to do user studies. First, we need to check whether we’re on the right track. Do our improvements make for a better experience than the old version? More importantly, we need more observations. To understand the humanities text analysis process, we want to observe more humanities text analysis.
Until now, the closest we’ve come to “user studies” is an iterative bouncing-around of ideas with just three scholars. They have been more like guides and expert consultants than “users” and they helped us sketch the first lines, and refine our first ideas into something that was actually useful.
We’ve acted upon the knowledge they helped us accumulate, the result of which is the completely redesigned WordSeer. We’re looking for a bigger set of users now, for a formal study. We’re hoping to find a set of around 15 professional literature scholars who will allow us to observe them as they use WordSeer to explore a problem of genuine professional interest to them.
So what text collection could possibly interest 15 different scholars in the digital humanities community enough to want to do a computationally-assisted analysis of it? And allow us to observe them at it?
In a rare moment of epiphany, we realized we could just ask you. So here’s a poll. It’s populated with some examples, but we encourage you to respond in the “other” field. Tell us: what collection, if set up with text analysis and visualization tools, would make you interested?
In previous posts, I’ve shown how WordSeer can be used to explore small, well-defined questions: what word did Shakespeare use for ‘beautiful’? Is the occurrence of the word ‘love’ the same in the comedies and tragedies? This post is different. WordSeer has now developed enough to support a simple, but complete, exploratory analysis.
The question we’ll think about is this:
“How does the portrayal of men and women in Shakespeare’s plays change under different circumstances?”
As one answer, we’ll see how WordSeer suggests that when love is a major plot point, the language referring to women changes to become more physical, and the language referring to men becomes more sentimental. You can watch a screencast here, or just read this post.
When scholars try to make sense out of large collections of text, they frequently do two things: compare, and collect. They collect samples of “interesting” things, and compare them with each other along various relevant dimensions.
In this post, I demonstrate the collection and comparison features of WordSeer by using it to compare the usage of the word “love” in Shakespeare’s comedies and tragedies. You can watch the screencast, or simply read on.
A common problem in search and exploration interfaces is the vocabulary problem. This refers to the great variety of words that different people can use to describe the same concept. For people exploring a text collection, this makes search difficult. There are only a limited number of different queries they can think of to describe that concept, so they may miss many other instances that use different words. This is an important issue for humanities scholars. Often, the very first step of a literature analysis is to comb through text, trying to find thought-provoking examples to study later.
In this post, I give an example of how our project WordSeer, a text analysis environment for humanities scholars, can be used to overcome this problem. In this example, I’ll be using an instance of WordSeer running on the complete works of Shakespeare from the Internet Shakespeare Editions. It’s live, so you can follow along with this example on the web at wordseer.berkeley.edu/shakespeare.
You can read the post after the jump, or just watch this video.
On Tuesday, Feb. 1, I’ll be presenting my latest project, WordSeer, at the Farsight 2011 conference on the future of search. This event will be streamed live from TechCrunch, the tech world’s favorite blog about new technology and startup news, and will be attended by high-profile techies from Bing, Google, Blekko, and the like. Please tune in at 10am PST Tuesday, follow along with #futuresearch on Twitter, and let’s get the digital humanities some high-tech exposure that day!