Rough Consensus, Running Standards: The Restatement Project by Jason Boehmig, Tim Hwang, and Paul Sawaya.
From part 3:
Supported by a grant from the Knight Foundation Prototype Fund, Restatement is a simple, rough-and-ready system which automatically parses legal text into a basic machine-readable JSON format. It has also been released under the permissive terms of the MIT License, to encourage active experimentation and implementation.
The concept is to develop an easily-extensible system which parses through legal text and looks for some common features to render into a standard format. Our general design principle in developing the parser was to begin with only the most simple features common to nearly all legal documents. This includes the parsing of headers, section information, and “blanks” for inputs in legal documents like contracts. As a demonstration of the potential application of Restatement, we’re also designing a viewer that takes documents rendered in the Restatement format and displays them in a simple, beautiful, web-readable version.
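To make the quoted description concrete, here is a minimal sketch of that kind of parser. It is not Restatement's actual code or output schema: the regular expressions, the `parse_legal_text` function, and the `headers` / `sections` / `blanks` keys are all assumptions made up for illustration, covering only the three features the excerpt names.

```python
import json
import re

# Hypothetical patterns; the excerpt does not describe Restatement's real rules.
HEADER_RE = re.compile(r"^[A-Z][A-Z .,'&-]+$")            # an all-caps line is treated as a header
SECTION_RE = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(.*)$")   # "1. Text" or "2.1 Text"
BLANK_RE = re.compile(r"_{3,}|\[[A-Z ]+\]")               # "____" or "[NAME]" style blanks


def parse_legal_text(text):
    """Parse plain legal text into a JSON-serializable dict (illustrative schema)."""
    doc = {"headers": [], "sections": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = SECTION_RE.match(line)
        if match:
            body = match.group(2)
            doc["sections"].append({
                "number": match.group(1),
                "text": body,
                "blanks": BLANK_RE.findall(body),
            })
        elif HEADER_RE.match(line):
            doc["headers"].append(line)
        elif doc["sections"]:
            # Unnumbered line: treat it as a continuation of the previous section.
            section = doc["sections"][-1]
            section["text"] += " " + line
            section["blanks"] += BLANK_RE.findall(line)
    return doc


# Made-up sample input, not taken from any real contract.
SAMPLE = """NON-DISCLOSURE AGREEMENT
1. Parties. This agreement is between [COMPANY] and ________.
2. Term. The term is ____ years.
"""

if __name__ == "__main__":
    print(json.dumps(parse_legal_text(SAMPLE), indent=2))
```

Even a toy like this shows where the hard questions are: which lines count as headers, how nested section numbers relate to one another, and what a "blank" is allowed to look like.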
I skipped the sections justifying the project because, in my circles, the need for text mining is presumed; the interesting questions are about the text and the techniques for mining it.
As you might suspect, I have my doubts about using JSON for legal texts, but for a first cut, let’s hope the project is successful. There is always time to convert to a more robust format at some later point, in response to a particular need.
Definitely a project to watch or assist if you are considering creating a domain-specific conversion editor.