Rule-based Information Extraction is Dead! Long Live Rule-based Information Extraction Systems! by Laura Chiticariu, Yunyao Li, and Frederick R. Reiss.
Abstract:
The rise of “Big Data” analytics over unstructured text has led to renewed interest in information extraction (IE). We surveyed the landscape of IE technologies and identified a major disconnect between industry and academia: while rule-based IE dominates the commercial world, it is widely regarded as dead-end technology by the academia. We believe the disconnect stems from the way in which the two communities measure the benefits and costs of IE, as well as academia’s perception that rule-based IE is devoid of research challenges. We make a case for the importance of rule-based IE to industry practitioners. We then lay out a research agenda in advancing the state-of-the-art in rule-based IE systems which we believe has the potential to bridge the gap between academic research and industry practice.
After demonstrating the disconnect between industry (rule-based) and academia (ML) approaches to information extraction, the authors propose:
Define standard IE rule language and data model.
If research on rule-based IE is to move forward in a principled way, the community needs a standard way to express rules. We believe that the NLP community can replicate the success of the SQL language in connecting data management research and practice. SQL has been successful largely due to: (1) expressivity: the language provides all primitives required for performing basic manipulation of structured data, (2) extensibility: the language can be extended with new features without fundamental changes to the language, (3) declarativity: the language allows the specification of computation logic without describing its control flow, thus allowing developers to code what the program should accomplish, rather than how to accomplish it.
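To make the declarativity point concrete, here is a minimal Python sketch (mine, not the paper's). The names Rule, RULES, and extract are hypothetical, and the regular expressions are deliberately simplistic. The rules only state what to match; a generic engine owns the control flow.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    label: str    # name of the extracted field
    pattern: str  # regular expression stating WHAT to match


# The rule set is plain data: no loops or branching appear here.
RULES = [
    Rule("email", r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    Rule("year", r"\b(?:19|20)\d{2}\b"),
]


def extract(text, rules):
    """Generic engine: owns the control flow (the HOW) for any rule set."""
    results = []
    for rule in rules:
        for match in re.finditer(rule.pattern, text):
            results.append((rule.label, match.group(0), match.start()))
    # Report matches in document order, regardless of rule order.
    return sorted(results, key=lambda item: item[2])


if __name__ == "__main__":
    sample = "Contact jane@example.org about the 2013 survey."
    for label, value, offset in extract(sample, RULES):
        print(label, value, offset)
```

Adding a new kind of extraction means adding a Rule, not touching extract. That separation is the property the authors credit for much of SQL's success.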
I disagree with the call for a single standard language. Both industry and academia would be better served by domain specific declarative languages (DSDLs).
I say “domain specific” because each domain has its own terms, and its own semantics embedded in those terms. If we don’t want to repeat the chaos of owl:sameAs, we had better enable users to define and document the semantics they attach to terms, either as operators or as data.
A host of research problems open up when semantic domains are enabled to document the semantics of their data structures and data. How do semantic understandings evolve over time within a community? Rather difficult to answer if those semantics are never documented. What are the best ways to map between the documented semantics of different communities? Again, difficult to answer without pools of documented semantics from different communities.
Not to mention the development of IE and mapping languages, which share a common core value of documenting semantics and extracting information but have specific features for particular domains. There is no reason to expect or hope that a language designed for genomic research will have all the features needed for monetary arbitrage analysis.
Rather than seeking an “Ur” language for documenting semantics/extracting data, industry can demonstrate ROI, and academia can demonstrate progress, with targeted, declarative languages that are familiar to members of individual domains.
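As a rough illustration of what documented semantics could buy us, here is a small Python sketch. Everything in it is an assumption for illustration: the Term record, the homonyms helper, and the sample definitions are hypothetical and not drawn from the paper or from any real vocabulary. Each domain records what it means by a term, so a clash between domains becomes something a program can detect rather than silently merge.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    domain: str      # community that owns the term
    name: str        # the term as members of that community write it
    definition: str  # the semantics that community attaches to it


def homonyms(terms):
    """Return the names that carry more than one documented definition."""
    by_name = defaultdict(list)
    for term in terms:
        by_name[term.name].append(term)
    return {
        name: group
        for name, group in by_name.items()
        if len({t.definition for t in group}) > 1
    }


# Hypothetical entries, invented for this example.
TERMS = [
    Term("genomics", "position", "coordinate on a reference sequence"),
    Term("arbitrage", "position", "quantity of an instrument currently held"),
    Term("arbitrage", "spread", "price difference between two instruments"),
]

if __name__ == "__main__":
    for name, group in homonyms(TERMS).items():
        for term in group:
            print(name, "|", term.domain, "|", term.definition)
```

Without the definition field, the two uses of "position" would look identical, which is exactly how a careless owl:sameAs gets asserted.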
I first saw this in a tweet by Kyle Wade Grove.