Indexing and Querying Linguistic Metadata and Document Content
Authors: Niraj Aswani, Valentin Tablan, Kalina Bontcheva, Hamish Cunningham
Abstract:
The need for efficient corpus indexing and querying arises frequently both in machine learning-based and human-engineered natural language processing systems. This paper presents the ANNIC system, which can index documents not only by content, but also by their linguistic annotations and features. It also enables users to formulate versatile queries mixing keywords and linguistic information. The result consists of the matching texts in the corpus, displayed within the context of linguistic annotations (not just text, as is customary for KWIC systems). The data is displayed in a graphical user interface, which facilitates its exploration and the discovery of new patterns, which can in turn be tested by launching new ANNIC queries.
Lucene is modified so that the index entry for a term also carries linguistic information (annotations and their features) alongside the term itself.
Recognizes that search components don't have to be entirely hand-crafted or exclusively machine-learned. (Is there a lesson for topic maps here?)
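To make the idea concrete, here is a minimal Python sketch of an index keyed on both token text and annotation features. It is only an illustration under my own assumptions: it is not ANNIC's code, it does not use Lucene, and the class name `AnnotatedIndex`, the feature names (`string`, `category`, `type`) and the sample data are all hypothetical.

```python
from collections import defaultdict


class AnnotatedIndex:
    """Toy inverted index over tokens that carry annotation features.

    Hypothetical sketch: each token is a dict such as
    {"string": "London", "type": "Location"}.
    """

    def __init__(self):
        # (feature, value) -> set of (doc_id, token_position)
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, tokens):
        """Index every feature of every token at its position."""
        self.docs[doc_id] = tokens
        for pos, tok in enumerate(tokens):
            for feature, value in tok.items():
                self.postings[(feature, value)].add((doc_id, pos))

    def _matches(self, constraint):
        """Positions whose token satisfies all feature/value pairs."""
        sets = [self.postings.get(item, set()) for item in constraint.items()]
        return set.intersection(*sets) if sets else set()

    def search(self, pattern):
        """Find consecutive tokens matching a list of constraints.

        Returns a sorted list of (doc_id, start_position) pairs.
        """
        if not pattern:
            return []
        per_slot = [self._matches(c) for c in pattern]
        hits = []
        for doc_id, start in sorted(per_slot[0]):
            if all((doc_id, start + i) in slot
                   for i, slot in enumerate(per_slot)):
                hits.append((doc_id, start))
        return hits

    def context(self, doc_id, start, length, window=2):
        """Return the match with surrounding tokens, annotations included."""
        tokens = self.docs[doc_id]
        lo = max(0, start - window)
        hi = min(len(tokens), start + length + window)
        return tokens[lo:hi]


index = AnnotatedIndex()
index.add("d1", [
    {"string": "flights", "category": "NNS"},
    {"string": "to", "category": "TO"},
    {"string": "London", "category": "NNP", "type": "Location"},
])
index.add("d2", [
    {"string": "to", "category": "TO"},
    {"string": "go", "category": "VB"},
])

# Keyword "to" followed by any token annotated as a Location.
print(index.search([{"string": "to"}, {"type": "Location"}]))
```

The query mixes a plain keyword with an annotation constraint, and `context` returns the surrounding tokens with their features rather than bare text, which is the difference from a conventional KWIC display that the abstract points to.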
ANNIC (ANNotations-In-Context), part of GATE (General Architecture for Text Engineering), is a corpus exploration tool.
As topic maps address more sophisticated text bases, such tools will become increasingly important.