Tools for Exploring Text: Visualization
In light of my new tool to help navigate the New York Times, I’ve been reading about previous approaches to the problem of making sense of large collections of text. As far as I can tell, the research comes from three different communities, which answer three slightly different questions:
- From the visualization community: how can we display aspects of text collections to give users a sense of what they contain?
- From the UI community: what kinds of interactions and information do users find useful when exploring text collections?
- From the NLP community: what can we extract from text collections that might give people a sense of their contents?
In this post, I’m going to summarize what seemed like the high points of the visualization sub-field, targeted towards the digital humanities community.
Visualization: how can we display aspects of text collections to give users a sense of what they contain?
The big problem that text collections (e.g. digital libraries) pose to visualization is that they consist of unstructured text (as contrasted with structured text, as in a Wikipedia infobox). Unstructured text is difficult to visualize for two reasons. First, it is not what we usually think of as “data”: it has no inherent order and no clear hierarchy or relationship structure. Second, it’s just unwieldy: it takes up a lot of space, doesn’t lend itself to compact symbolic representation, and is rarely pre-attentive (easy to take in without really paying attention).
Nevertheless, Martin Wattenberg and Fernanda Viegas (IBM Research, ManyEyes, Flowing Media) have come up with great solutions for the problem of getting a “visual perspective” — a sense of similarity, importance, relevance, and relationship in a text — and they seem to have had all the ideas lately:
• Phrase nets
• Word trees
• Two-word clouds (touched on in this magazine article [pdf])
Wattenberg and Viegas use tried-and-tested methods for extracting syntactic structure — parts of speech, phrase structures, grammatical relationships (e.g. “iPad prices” is the subject of “fell” in “iPad prices fell by 70% this morning”). The natural language processing needed to do this has been around for more than 20 years and is actually very reliable. The downside is that it produces only very rough statistics: frequencies and co-occurrence counts of different kinds of language patterns. Their focus is on the quality of the visualization: responsiveness, legibility, and the understandability of the resulting graphics.
a. Phrase Nets
Phrase nets are a way of visualizing relationships between words or phrases. You select a pattern, such as “X at Y” (above), or a grammatical relationship such as “X is-the-subject-of Y”, and the algorithm creates a visualization of the most frequent occurrences of the pattern, with larger font sizes indicating higher frequency and darker values indicating that a word occurs in the “X” position more often than in the “Y” position.
This is the best, easiest way I’ve seen to do an in-depth exploration of relationships in a text. The visualization is designed to re-draw very quickly, so you can query different relationships as they strike you.
You can explore phrase nets built from your own text at the IBM Many Eyes project website. The research paper [Google scholar] has more visualizations.
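Many Eyes’ actual implementation isn’t shown in the paper excerpted here, but the core of a textual pattern like “X at Y” is just matching and counting. A minimal sketch in Python (the sample text is invented for illustration):

```python
import re
from collections import Counter

text = (
    "We met at noon. They arrived at dawn. She worked at night. "
    "He stayed at home. We met at noon again."
)

# Match the "X at Y" pattern: a word, the connector "at", another word.
pattern = re.compile(r"\b(\w+) at (\w+)\b", re.IGNORECASE)

# Count each (X, Y) pair; these counts would drive edge thickness / font size.
pairs = Counter((x.lower(), y.lower()) for x, y in pattern.findall(text))

# How often each word fills the X slot vs. the Y slot; the imbalance
# could drive the darker-value encoding described above.
x_counts = Counter(x for x, _ in pairs.elements())
y_counts = Counter(y for _, y in pairs.elements())

for (x, y), n in pairs.most_common():
    print(f"{x} -at-> {y}: {n}")
```

A grammatical pattern like “X is-the-subject-of Y” would need a real parser rather than a regex, but the counting and display logic stays the same.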
b. Word Trees
Word trees are a way of visualizing the sequences of words in a text. In the figure above, there’s a line between two words if the second follows the first. These are great for exploring the contexts in which words appear, and for revealing which continuations are more frequent than others. IBM Many Eyes lets you create these out of your own text.
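The branching structure behind a word tree comes from counting what follows a chosen root phrase. A rough sketch, assuming simple whitespace tokenization and an invented set of sentences:

```python
from collections import Counter

sentences = [
    "if you build it they will come",
    "if you build it we will come",
    "if you ask nicely they will come",
]

def continuations(sentences, root):
    """Count the words that immediately follow the root phrase."""
    r = root.split()
    counts = Counter()
    for s in sentences:
        w = s.split()
        for i in range(len(w) - len(r)):
            if w[i:i + len(r)] == r:
                counts[w[i + len(r)]] += 1
    return counts

# Each level of the tree is one call; recursing on "root + word"
# yields the branches, with counts deciding branch size.
print(continuations(sentences, "if you"))        # "build" twice, "ask" once
print(continuations(sentences, "if you build"))  # "it" twice
```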
c. Two-word word clouds
We’re all familiar with tag clouds: displays of words that vary in size with their frequency. While tag clouds are great for exploring a collection of tags, which are meaningful by themselves, they are less useful for words. Words are less informative on their own, which is why the tag-cloud method can give unintuitive results when applied to them. Context is an important source of meaning for words, so two-word tag clouds are a way to include some of it. I think they give a better sense of the contents of a text than the corresponding single-word cloud.