Investigating thousands (or millions) of documents by visualizing clusters
From the post:
This is a recording of my talk at the NICAR (National Institute of Computer-Assisted Reporting) conference last week, where I discuss some of our recent work at the AP with the Iraq and Afghanistan war logs.
Very good presentation which includes:
- “A full-text visualization of the Iraq war logs”, a detailed writeup of the technique used to generate the first set of maps presented in the talk.
- The Glimmer high-performance, parallel multi-dimensional scaling algorithm, which is the software I presented in the live demo portion. It will be the basis of our clustering work going forward. (We are also working on other large-scale visualizations which may be more appropriate for e.g. email dumps.)
- “Quantitative Discovery from Qualitative Information: A General-Purpose Document Clustering Methodology.” Justin Grimmer, Gary King, 2009. A paper that everyone working in document clustering needs to read. It clearly makes the point that there is no “best” clustering, just different algorithms that correspond to different pre-conceived frames on the story — and gives a method to compare clusterings (though I don’t think it will scale well to millions of docs.)
- Wikipedia pages for bag of words model, tf-idf, and cosine similarity, the basic text processing techniques we’re using.
- Gephi, a free graph visualization system, which we used for the one-month Iraq map. It will work up to a few tens of thousands of nodes.
- Knight News Challenge application for “Overview,” the open-source system we’d like to build for doing this and other kinds of visual explorations of large document sets. If you like our work, why not leave a comment on our proposal?
The war logs are not “yesterday’s news.” People of all nationalities are still dying. And those responsible go unpunished.
Public newspapers and military publications will list prior postings of military personnel. Indexing those against the locations in the war logs using a topic map could produce a first round of people to ask about events and decisions reported in the war logs. (Of course, our guardians have cleansed all the versions I know of and pronounced them safe for our reading. Unaltered versions would work better. I can only imagine what would have happened to Shakespeare as a war document.)