Named Entity Tutorial (LingPipe)
While looking for something else I ran across this named entity tutorial at LingPipe.
Other named entity tutorials that I should collect?
Preliminary evaluation of the CellFinder literature curation pipeline for gene expression in kidney cells and anatomical parts by Mariana Neves, Alexander Damaschun, Nancy Mah, Fritz Lekschas, Stefanie Seltmann, Harald Stachelscheid, Jean-Fred Fontaine, Andreas Kurtz, and Ulf Leser. (Database (2013) 2013 : bat020 doi: 10.1093/database/bat020)
Abstract:
Biomedical literature curation is the process of automatically and/or manually deriving knowledge from scientific publications and recording it into specialized databases for structured delivery to users. It is a slow, error-prone, complex, costly and, yet, highly important task. Previous experiences have proven that text mining can assist in its many phases, especially, in triage of relevant documents and extraction of named entities and biological events. Here, we present the curation pipeline of the CellFinder database, a repository of cell research, which includes data derived from literature curation and microarrays to identify cell types, cell lines, organs and so forth, and especially patterns in gene expression. The curation pipeline is based on freely available tools in all text mining steps, as well as the manual validation of extracted data. Preliminary results are presented for a data set of 2376 full texts from which >4500 gene expression events in cell or anatomical part have been extracted. Validation of half of this data resulted in a precision of ∼50% of the extracted data, which indicates that we are on the right track with our pipeline for the proposed task. However, evaluation of the methods shows that there is still room for improvement in the named-entity recognition and that a larger and more robust corpus is needed to achieve a better performance for event extraction.
Database URL: http://www.cellfinder.org/.
Another extremely useful data curation project.
Do you get the impression that curation projects will continue to be outrun by data production?
And that will be the case, even with machine assistance?
Is there an alternative to falling further and further behind?
Such as abandoning some content (CNN?) to go forever uncurated? Or doing the same for government documents and reports?
I am sure we all have different suggestions for what data to dump alongside the road to make room for the “important” stuff.
Suggestions on solutions other than simply dumping data?
Hilary Mason (live, data scientist) writes about Google confusing her with Hilary Mason (deceased, actress) in Et tu, Google?
To be fair, Hilary Mason (live, data scientist), notes Bing has made the same mistake in the past.
Hilary Mason (live, data scientist) goes on to say:
I know that entity disambiguation is a hard problem. I’ve worked on it, though never with the kind of resources that I imagine Google can bring to it. And yet, this is absurd!
Is entity disambiguation a hard problem?
Or is entity disambiguation a hard problem after the act of authorship?
Authors (in general) know what entities they meant.
The hard part is inferring what entity they meant when they forgot to disambiguate between possible entities.
Rather than focusing on mining low grade ore (content where entities are not disambiguated), wouldn’t a better solution be authoring with automatic entity disambiguation?
We have auto-correction in word processing software now, why not auto-entity software that tags entities in content?
Presenting the author of content with disambiguated entities for them to accept, reject or change.
Won’t solve the problem of prior content with undistinguished entities but can keep the problem from worsening.
Learning from Big Data: 40 Million Entities in Context by Dave Orr, Amar Subramanya, and Fernando Pereira, Google Research,
A fuller explanation of the Wikilinks Corpus from Google:
When someone mentions Mercury, are they talking about the planet, the god, the car, the element, Freddie, or one of some 89 other possibilities? This problem is called disambiguation (a word that is itself ambiguous), and while it’s necessary for communication, and humans are amazingly good at it (when was the last time you confused a fruit with a giant tech company?), computers need help.
To provide that help, we are releasing the Wikilinks Corpus: 40 million total disambiguated mentions within over 10 million web pages — over 100 times bigger than the next largest corpus (about 100,000 documents, see the table below for mention and entity counts). The mentions are found by looking for links to Wikipedia pages where the anchor text of the link closely matches the title of the target Wikipedia page. If we think of each page on Wikipedia as an entity (an idea we’ve discussed before), then the anchor text can be thought of as a mention of the corresponding entity.
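The mention-finding rule described above is simple enough to sketch. What follows is my own rough illustration, not Google's pipeline: the `closely_matches` rule (case-insensitive equality, ignoring a trailing parenthetical in the page title) is an assumption, since the post does not say how "closely matches" was defined, and all the names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, unquote


class WikiLinkCollector(HTMLParser):
    """Collect (anchor text, Wikipedia page title) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._title = None  # title of the Wikipedia link currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        parsed = urlparse(href)
        if parsed.netloc.endswith("wikipedia.org") and parsed.path.startswith("/wiki/"):
            # "/wiki/Mercury_(planet)" becomes "Mercury (planet)"
            self._title = unquote(parsed.path[len("/wiki/"):]).replace("_", " ")
            self._text = []

    def handle_data(self, data):
        if self._title is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._title is not None:
            self.pairs.append(("".join(self._text).strip(), self._title))
            self._title = None


def normalize(text):
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def closely_matches(anchor, title):
    # Assumed rule: equal after normalization, with or without the
    # trailing "(...)" disambiguator that Wikipedia titles often carry.
    base = title.split(" (")[0]
    return normalize(anchor) in (normalize(title), normalize(base))


def extract_mentions(html):
    """Return the (mention, entity) pairs whose anchor text matches the title."""
    collector = WikiLinkCollector()
    collector.feed(html)
    return [(a, t) for a, t in collector.pairs if closely_matches(a, t)]
```

A link reading "Mercury" that points at the "Mercury (planet)" page would be kept as a mention of that entity; a link reading "click here" would be dropped.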
Suggestions for using the data? The authors have those as well:
What might you do with this data? Well, we’ve already written one ACL paper on cross-document co-reference (and received lots of requests for the underlying data, which partly motivates this release). And really, we look forward to seeing what you are going to do with it! But here are a few ideas:
- Look into coreference — when different mentions mention the same entity — or entity resolution — matching a mention to the underlying entity
- Work on the bigger problem of cross-document coreference, which is how to find out if different web pages are talking about the same person or other entity
- Learn things about entities by aggregating information across all the documents they’re mentioned in
- Type tagging tries to assign types (they could be broad, like person, location, or specific, like amusement park ride) to entities. To the extent that the Wikipedia pages contain the type information you’re interested in, it would be easy to construct a training set that annotates the Wikilinks entities with types from Wikipedia.
- Work on any of the above, or more, on subsets of the data. With existing datasets, it wasn’t possible to work on just musicians or chefs or train stations, because the sample sizes would be too small. But with 10 million Web pages, you can find a decent sampling of almost anything.
Those all sound like topic map tasks to me, especially if you capture your coreference results for merging with other coreference results.
Google Research Releases Wikilinks Corpus With 40M Mentions And 3M Entities by Frederic Lardinois.
From the post:
Google Research just launched its Wikilinks corpus, a massive new data set for developers and researchers that could make it easier to add smart disambiguation and cross-referencing to their applications. The data could, for example, make it easier to find out if two web sites are talking about the same person or concept, Google says. In total, the corpus features 40 million disambiguated mentions found within 10 million web pages. This, Google notes, makes it “over 100 times bigger than the next largest corpus,” which features fewer than 100,000 mentions.
For Google, of course, disambiguation is something that is a core feature of the Knowledge Graph project, which allows you to tell Google whether you are looking for links related to the planet, car or chemical element when you search for ‘mercury,’ for example. It takes a large corpus like this one and the ability to understand what each web page is really about to make this happen.
Details follow on how to create this data set.
Very cool!
The only caution is that your entities, those specific to your enterprise, are unlikely to appear, even in 40M mentions.
But the Wikilinks Corpus + your entities, now that is something with immediate ROI for your enterprise.
From the project page:
Duke is a fast and flexible deduplication (or entity resolution, or record linkage) engine written in Java on top of Lucene. The latest version is 1.0 (see ReleaseNotes).
Features
- High performance.
- Highly configurable.
- Support for CSV, JDBC, SPARQL, and NTriples DataSources.
- Many built-in comparators.
- Plug in your own data sources, comparators, and cleaners.
- Command-line client for getting started.
- API for embedding into any kind of application.
- Support for batch processing and continuous processing.
- Can maintain database of links found via JNDI/JDBC.
- Can run in multiple threads.
The GettingStarted page explains how to get started and has links to further documentation. This blog post describes the basic approach taken to match records. It does not deal with the Lucene-based lookup, but describes an early, slow O(n^2) prototype. This early presentation describes the ideas behind the engine and the intended architecture; a later and more up to date presentation has more practical detail and examples. There's also the ExamplesOfUse page, which lists real examples of using Duke, complete with data and configurations.
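To make the "slow O(n^2) prototype" remark concrete, here is a minimal sketch of all-pairs record comparison. It is not Duke's code or its API: the `similarity` function (an average of `difflib` ratios over shared fields) and the 0.85 threshold are my own stand-ins for whatever comparators you would configure.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Average string similarity over the fields both records share."""
    keys = set(a) & set(b)
    if not keys:
        return 0.0
    scores = [
        SequenceMatcher(None, str(a[k]).lower(), str(b[k]).lower()).ratio()
        for k in keys
    ]
    return sum(scores) / len(scores)


def find_duplicates(records, threshold=0.85):
    """Compare every record with every other record: n*(n-1)/2 comparisons."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        if similarity(a, b) >= threshold:
            pairs.append((i, j))
    return pairs


records = [
    {"name": "Acme Corp", "city": "Oslo"},
    {"name": "ACME Corp", "city": "Oslo"},
    {"name": "Zenith Ltd", "city": "Bergen"},
]
print(find_duplicates(records))  # [(0, 1)]
```

The quadratic loop is exactly what an index-based lookup is meant to avoid: instead of comparing everything with everything, you first retrieve a small set of plausible candidates for each record.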
Excellent news on the data deduplication front!
And for topic map authors as well (see the examples).
Kudos to Lars Marius Garshol!
Developing CODE for a Research Database by Ian Armas Foster.
From the post:
The fact that there are a plethora of scientific papers readily available online would seem helpful to researchers. Unfortunately, the truth is that the volume of these articles has grown such that determining which information is relevant to a specific project is becoming increasingly difficult.
Austrian and German researchers are thus developing CODE, or Commercially Empowered Linked Open Data Ecosystems in Research, to properly aggregate research data from its various forms, such as PDFs of academic papers and data tables upon which those papers are based, into a single system. The project is in a prototype stage, with the goal being to integrate all forms into one platform by the project’s second year.
The researchers from the University of Passau in Germany and the Know-Center in Graz, Austria explored the challenges to CODE and how the team intends to deal with those challenges in this paper. The goal is to meliorate the research process by making it easier to not only search for both text and numerical data in the same query but also to use both varieties in concert. The basic architecture for the project is shown below.
Stop me if you have heard this one before: “There was this project that was going to disambiguate entities and create linked data….”
I would be the first one to cheer if such a project were successful. But, a few paragraphs in a paper, given the long history of entity resolution and its difficulties, isn’t enough to raise my hopes.
You?
A Consumer Electronics Named Entity Recognizer using NLTK by Sujit Pal.
From the post:
Some time back, I came across a question someone asked about possible approaches to building a Named Entity Recognizer (NER) for the Consumer Electronics (CE) industry on LinkedIn’s Natural Language Processing People group. I had just finished reading the NLTK Book and had some ideas, but I wanted to test my understanding, so I decided to build one. This post describes this effort.
The approach is actually quite portable and not tied to NLTK and Python, you could, for example, build a Java/Scala based NER using components from OpenNLP and Weka using this approach. But NLTK provides all the components you need in one single package, and I wanted to get familiar with it, so I ended up using NLTK and Python.
The idea is that you take some Consumer Electronics text, mark the chunks (words/phrases) you think should be Named Entities, then train a (binary) classifier on it. Each word in the training set, along with some features such as its Part of Speech (POS), Shape, etc is a training input to the classifier. If the word is part of a CE Named Entity (NE) chunk, then its trained class is True otherwise it is False. You then use this classifier to predict the class (CE NE or not) of words in (previously unseen) text from the Consumer Electronics domain.
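A rough sketch of the training-input step described above, in plain Python rather than Sujit's actual code. The feature names, the `word_shape` encoding and the hand-supplied POS tags are all my assumptions; the resulting (features, label) pairs are in the shape that a classifier such as NLTK's `NaiveBayesClassifier.train` accepts.

```python
import re


def word_shape(word):
    """Collapse a token to a coarse shape, e.g. 'iPhone5' -> 'xXxd'."""
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    return re.sub(r"(.)\1+", r"\1", shape)  # squeeze runs: 'Xxxx' -> 'Xx'


def token_features(tagged_sentence, i):
    """Features for token i of a [(word, pos), ...] sentence."""
    word, pos = tagged_sentence[i]
    prev_pos = tagged_sentence[i - 1][1] if i > 0 else "<START>"
    return {
        "word": word.lower(),
        "pos": pos,
        "shape": word_shape(word),
        "prev_pos": prev_pos,
    }


def training_pairs(tagged_sentence, ne_spans):
    """Label each token True if it falls inside a marked NE chunk.

    ne_spans is a list of (start, end) token index ranges, end exclusive,
    marked by hand as Consumer Electronics named entities.
    """
    inside = set()
    for start, end in ne_spans:
        inside.update(range(start, end))
    return [
        (token_features(tagged_sentence, i), i in inside)
        for i in range(len(tagged_sentence))
    ]


# POS tags here are supplied by hand; in practice a tagger would produce them.
sentence = [("The", "DT"), ("Sony", "NNP"), ("Bravia", "NNP"),
            ("is", "VBZ"), ("thin", "JJ")]
pairs = training_pairs(sentence, [(1, 3)])
print([label for _, label in pairs])  # [False, True, True, False, False]
```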
Should help with mining data for “entities” (read “subjects” in the topic map sense) for addition to your topic map.
I did puzzle over the suggestion for improvement that reads:
Another idea is to not do reference resolution during tagging, but instead postponing this to a second stage following entity recognition. That way, the references will be localized to the text under analysis, thus reducing false positives.
Post-authoring reference resolution might benefit from that approach.
But, if references were resolved by authors during the creation of a text, such as the insertion of Wikipedia references for entities, a different result would be obtained.
In those cases, assuming the author of a text is identified, they can be associated with a particular set of reference resolutions.
Entity disambiguation using semantic networks by Jorge H. Román, Kevin J. Hulin, Linn M. Collins and James E. Powell. Journal of the American Society for Information Science and Technology, published 29 August 2012.
Abstract:
A major stumbling block preventing machines from understanding text is the problem of entity disambiguation. While humans find it easy to determine that a person named in one story is the same person referenced in a second story, machines rely heavily on crude heuristics such as string matching and stemming to make guesses as to whether nouns are coreferent. A key advantage that humans have over machines is the ability to mentally make connections between ideas and, based on these connections, reason how likely two entities are to be the same. Mirroring this natural thought process, we have created a prototype framework for disambiguating entities that is based on connectedness. In this article, we demonstrate it in the practical application of disambiguating authors across a large set of bibliographic records. By representing knowledge from the records as edges in a graph between a subject and an object, we believe that the problem of disambiguating entities reduces to the problem of discovering the most strongly connected nodes in a graph. The knowledge from the records comes in many different forms, such as names of people, date of publication, and themes extracted from the text of the abstract. These different types of knowledge are fused to create the graph required for disambiguation. Furthermore, the resulting graph and framework can be used for more complex operations.
To give you a sense of the authors’ approach:
A semantic network is the underlying information representation chosen for the approach. The framework uses several algorithms to generate subgraphs in various dimensions. For example: a person’s name is mapped into a phonetic dimension, the abstract is mapped into a conceptual dimension, and the rest are mapped into other dimensions. To map a name into its phonetic representation, an algorithm translates the name of a person into a sequence of phonemes. Therefore, two names that are written differently but pronounced the same are considered to be the same in this dimension. The “same” qualification in one of these dimensions is then used to identify potential coreferent entities. Similarly, an algorithm for generating potential alternate spellings of a name has been used to find entities for comparison with similarly spelled names by computing word distance.
…
The hypothesis underlying our approach is that coreferent entities are strongly connected on a well-constructed graph.
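A toy illustration of the "strongly connected" hypothesis, under heavy assumptions: I use Jaccard overlap of graph neighbours as the connectedness score, which is my own substitute and not the measure from the paper, and the node labels are invented.

```python
from collections import defaultdict


def build_graph(edges):
    """Undirected adjacency sets from (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def connectedness(graph, a, b):
    """Jaccard overlap of the neighbours of a and b, from 0.0 to 1.0."""
    na, nb = graph[a] - {b}, graph[b] - {a}
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0


# Hypothetical bibliographic records linked to co-authors and themes.
edges = [
    ("rec1:J. Roman", "coauthor:Hulin"),
    ("rec1:J. Roman", "theme:semantic networks"),
    ("rec2:Jorge Roman", "coauthor:Hulin"),
    ("rec2:Jorge Roman", "theme:semantic networks"),
    ("rec3:J. Roman", "theme:marine biology"),
]
graph = build_graph(edges)
print(connectedness(graph, "rec1:J. Roman", "rec2:Jorge Roman"))  # 1.0
print(connectedness(graph, "rec1:J. Roman", "rec3:J. Roman"))     # 0.0
```

Note that the score treats "coauthor:Hulin" as a single, unambiguous node, which is the assumption the question below pokes at.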
Question: What if the nodes to which the coreferent entities are strongly connected are themselves ambiguous?
Swoosh: a generic approach to entity resolution by Benjelloun, Omar and Garcia-Molina, Hector and Menestrina, David and Su, Qi and Whang, Steven Euijong and Widom, Jennifer (2008) Swoosh: a generic approach to entity resolution. The VLDB Journal.
Do you remember Swoosh?
I saw it today in Five Short Links by Pete Warden.
Abstract:
We consider the Entity Resolution (ER) problem (also known as deduplication, or merge-purge), in which records determined to represent the same real-world entity are successively located and merged. We formalize the generic ER problem, treating the functions for comparing and merging records as black-boxes, which permits expressive and extensible ER solutions. We identify four important properties that, if satisfied by the match and merge functions, enable much more efficient ER algorithms. We develop three efficient ER algorithms: G-Swoosh for the case where the four properties do not hold, and R-Swoosh and F-Swoosh that exploit the 4 properties. F-Swoosh in addition assumes knowledge of the “features” ( e.g., attributes) used by the match function. We experimentally evaluate the algorithms using comparison shopping data from Yahoo! Shopping and hotel information data from Yahoo! Travel. We also show that R-Swoosh (and F-Swoosh) can be used even when the four match and merge properties do not hold, if an “approximate” result is acceptable.
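For a feel of the black-box match/merge idea, here is a minimal sketch in the spirit of R-Swoosh as I understand it. It is not a transcription of the paper's pseudocode, and the record representation plus the `match` and `merge` functions are invented for the example.

```python
def r_swoosh(records, match, merge):
    """Resolve records using black-box match and merge functions.

    Each record is compared against the already-resolved set. On a match,
    the two are merged and the merged record goes back into the queue so
    it can be compared again; otherwise the record joins the resolved set.
    """
    pending = list(records)
    resolved = []
    while pending:
        current = pending.pop()
        buddy = next((r for r in resolved if match(current, r)), None)
        if buddy is None:
            resolved.append(current)
        else:
            resolved.remove(buddy)
            pending.append(merge(current, buddy))
    return resolved


# Hypothetical representation: a record is a frozenset of (attribute, value).
KEYS = {"email", "phone"}


def match(r1, r2):
    """Records match if they share a value for any identifying attribute."""
    return any(attr in KEYS for attr, _ in r1 & r2)


def merge(r1, r2):
    """Merging keeps every attribute value seen in either record."""
    return r1 | r2


a = frozenset({("name", "J. Doe"), ("email", "jd@example.com")})
b = frozenset({("name", "John Doe"), ("email", "jd@example.com"),
               ("phone", "555-0100")})
c = frozenset({("name", "Johnny"), ("phone", "555-0100")})
d = frozenset({("name", "Jane Roe"), ("email", "jr@example.com")})

result = r_swoosh([a, b, c, d], match, merge)
print(len(result))  # 2
```

Records a and c share nothing directly, yet they end up in one merged record because b matches both. Re-queuing the merged record is what lets that happen.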
It sounds familiar.
Running some bibliographic searches, it looks like 100 references since 2011. That’s going to take a while! But it all looks like good stuff.
Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection by Peter Christen.
In the Foreword, William E. Winkler (U. S. Census Bureau and dean of record linkage), writes:
Within this framework of historical ideas and needed future work, Peter Christen’s monograph serves as an excellent compendium of the best existing work by computer scientists and others. Individuals can use the monograph as a basic reference to which they can gain insight into the most pertinent record linkage ideas. Interested researchers can use the methods and observations as building blocks in their own work. What I found very appealing was the high quality of the overall organization of the text, the clarity of the writing, and the extensive bibliography of pertinent papers. The numerous examples are quite helpful because they give real insight into a specific set of methods. The examples, in particular, prevent the researcher from going down some research directions that would often turn out to be dead ends.
I saw the alert for this volume today so haven’t had time to acquire and read it.
Given the high praise from Winkler, I expect it to be a pleasure to read and use.
Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision by David Nadeau (PhD Thesis, University of Ottawa, 2007).
Abstract:
Named Entity Recognition (NER) aims to extract and to classify rigid designators in text such as proper names, biological species, and temporal expressions. There has been growing interest in this field of research since the early 1990s. In this thesis, we document a trend moving away from handcrafted rules, and towards machine learning approaches. Still, recent machine learning approaches have a problem with annotated data availability, which is a serious shortcoming in building and maintaining large-scale NER systems. In this thesis, we present an NER system built with very little supervision. Human supervision is indeed limited to listing a few examples of each named entity (NE) type. First, we introduce a proof-of-concept semi-supervised system that can recognize four NE types. Then, we expand its capacities by improving key technologies, and we apply the system to an entire hierarchy comprised of 100 NE types. Our work makes the following contributions: the creation of a proof-of-concept semi-supervised NER system; the demonstration of an innovative noise filtering technique for generating NE lists; the validation of a strategy for learning disambiguation rules using automatically identified, unambiguous NEs; and finally, the development of an acronym detection algorithm, thus solving a rare but very difficult problem in alias resolution. We believe semi-supervised learning techniques are about to break new ground in the machine learning community. In this thesis, we show that limited supervision can build complete NER systems. On standard evaluation corpora, we report performances that compare to baseline supervised systems in the task of annotating NEs in texts.
Nadeau demonstrates the successful construction of a Named Entity Recognition (NER) system using a few supplied examples for each entity.
But what explains the lack of annotation where the entities are well known? The King James Bible? Search for “Joseph.” We know not all of the occurrences of “Joseph” represent the same entity.
Looking at the client list for Infoglutton, is there a lack of interest in named entity recognition?
Have we focused on techniques and issues that interest us, and then, as an afterthought, tried to market the results to consumers?
A Survey of Named Entity Recognition and Classification by David Nadeau, Satoshi Sekine (Journal of Linguisticae Investigationes 30:1 ; 2007)
Abstract:
The term “Named Entity”, now widely used in Natural Language Processing, was coined for the Sixth Message Understanding Conference (MUC-6) (R. Grishman & Sundheim 1996). At that time, MUC was focusing on Information Extraction (IE) tasks where structured information of company activities and defense related activities is extracted from unstructured text, such as newspaper articles. In defining the task, people noticed that it is essential to recognize information units like names, including person, organization and location names, and numeric expressions including time, date, money and percent expressions. Identifying references to these entities in text was recognized as one of the important sub-tasks of IE and was called “Named Entity Recognition and Classification (NERC)”. We present here a survey of fifteen years of research in the NERC field, from 1991 to 2006. While early systems were making use of handcrafted rule-based algorithms, modern systems most often resort to machine learning techniques. We survey these techniques as well as other critical aspects of NERC such as features and evaluation methods. It was indeed concluded in a recent conference that the choice of features is at least as important as the choice of technique for obtaining a good NERC system (E. Tjong Kim Sang & De Meulder 2003). Moreover, the way NERC systems are evaluated and compared is essential to progress in the field. To the best of our knowledge, NERC features, techniques, and evaluation methods have not been surveyed extensively yet. The first section of this survey presents some observations on published work from the point of view of activity per year, supported languages, preferred textual genre and domain, and supported entity types. It was collected from the review of a hundred English language papers sampled from the major conferences and journals. 
We do not claim this review to be exhaustive or representative of all the research in all languages, but we believe it gives a good feel for the breadth and depth of previous work. Section 2 covers the algorithmic techniques that were proposed for addressing the NERC task. Most techniques are borrowed from the Machine Learning (ML) field. Instead of elaborating on techniques themselves, the third section lists and classifies the proposed features, i.e., descriptions and characteristic of words for algorithmic consumption. Section 4 presents some of the evaluation paradigms that were proposed throughout the major forums. Finally, we present our conclusions.
A bit dated now (2007) but a good starting point for named entity recognition research. The bibliography runs a little over four (4) pages and running those citations forward should capture most of the current research.
Unsupervised Named-Entity Recognition: Generating Gazetteers and Resolving Ambiguity by David Nadeau, Peter D. Turney and Stan Matwin.
Abstract:
In this paper, we propose a named-entity recognition (NER) system that addresses two major limitations frequently discussed in the field. First, the system requires no human intervention such as manually labeling training data or creating gazetteers. Second, the system can handle more than the three classical named-entity types (person, location, and organization). We describe the system’s architecture and compare its performance with a supervised system. We experimentally evaluate the system on a standard corpus, with the three classical named-entity types, and also on a new corpus, with a new named-entity type (car brands).
The authors report successful application of their techniques to more than 50 named-entity types.
They also recite heuristics that they apply to texts during the mining process.
Is there a common repository of observations or heuristics for mining texts? Just curious.
Source code for the project: http://balie.sourceforge.net.
Answer to the question I just posed?
A Resource-Based Method for Named Entity Extraction and Classification by Pablo Gamallo and Marcos Garcia. (Lecture Notes in Computer Science, vol. 7026, Springer-Verlag, 610-623. ISSN: 0302-9743).
Abstract:
We propose a resource-based Named Entity Classification (NEC) system, which combines named entity extraction with simple language-independent heuristics. Large lists (gazetteers) of named entities are automatically extracted making use of semi-structured information from the Wikipedia, namely infoboxes and category trees. Language independent heuristics are used to disambiguate and classify entities that have been already identified (or recognized) in text. We compare the performance of our resource-based system with that of a supervised NEC module implemented for the FreeLing suite, which was the winner system in CoNLL-2002 competition. Experiments were performed over Portuguese text corpora taking into account several domains and genres.
Of particular interest if you are interested in adding NEC resources to the FreeLing project.
The introduction starts off:
Named Entity Recognition and Classification (NERC) is the process of identifying and classifying proper names of people, organizations, locations, and other Named Entities (NEs) within text.
Curious, what happens if you don’t have a “named” entity? That is an entity mentioned in the text but that doesn’t (yet) have a proper name?
Thinking of legal texts where some provision may apply to all corporations that engage in activity Y and that have a gross annual income in excess of amount X.
I may want to “recognize” that entity so I can then put a name with that entity.
1st Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012
Important Dates:
Located at the 35th ACM SIGIR Conference, Portland, Oregon, USA, August 12–16, 2012.
From the homepage of the workshop:
About the Workshop:
The workshop encompasses various tasks and approaches that go beyond the traditional bag-of-words paradigm and incorporate an explicit representation of the semantics behind information needs and relevant content. This kind of semantic search, based on concepts, entities and relations between them, has attracted attention both from industry and from the research community. The workshop aims to bring people from different communities (IR, SW, DB, NLP, HCI, etc.) and backgrounds (both academics and industry practitioners) together, to identify and discuss emerging trends, tasks and challenges. This joint workshop is a sequel of the Entity-oriented Search and Semantic Search Workshop series held at different conferences in previous years.
Topics
The workshop aims to gather all works that discuss entities along three dimensions: tasks, data and interaction. Tasks include entity search (search for entities or documents representing entities), relation search (search entities related to an entity), as well as more complex tasks (involving multiple entities, spatio-temporal relations inclusive, involving multiple queries). In the data dimension, we consider (web/enterprise) documents (possibly annotated with entities/relations), Linked Open Data (LOD), as well as user generated content. The interaction dimension gives room for research into user interaction with entities, also considering how to display results, as well as whether to aggregate over multiple entities to construct entity profiles. The workshop especially encourages submissions on the interface of IR and other disciplines, such as the Semantic Web, Databases, Computational Linguistics, Data Mining, Machine Learning, or Human Computer Interaction. Examples of topic of interest include (but are not limited to):
- Data acquisition and processing (crawling, storage, and indexing)
- Dealing with noisy, vague and incomplete data
- Integration of data from multiple sources
- Identification, resolution, and representation of entities (in documents and in queries)
- Retrieval and ranking
- Semantic query modeling (detecting, modeling, and understanding search intents)
- Novel entity-oriented information access tasks
- Interaction paradigms (natural language, keyword-based, and hybrid interfaces) and result representation
- Test collections and evaluation methodology
- Case studies and applications
We particularly encourage formal evaluation of approaches using previously established evaluation benchmarks: Semantic Search Challenge 2010, Semantic Search Challenge 2011, TREC Entity Search Track.
All workshops are special to someone. This one sounds more special than most. Collocated with the ACM SIGIR 2012 meeting. Perhaps that’s the difference.
GATE Teamware: Collaborative Annotation Factories
From the webpage:
Teamware is a web-based management platform for collaborative annotation & curation. It is a cost-effective environment for annotation and curation projects, enabling you to harness a broadly distributed workforce and monitor progress & results remotely in real time.
It’s also very easy to use. A new project can be up and running in less than five minutes. (As far as we know, there is nothing else like it in this field.)
GATE Teamware delivers a multi-function user interface over the Internet for viewing, adding and editing text annotations. The web-based management interface allows for project set-up, tracking, and management:
- Loading document collections (a “corpus” or “corpora”)
- Creating re-usable project templates
- Initiating projects based on templates
- Assigning project roles to specific users
- Monitoring progress and various project statistics in real time
- Reporting of project status, annotator activity and statistics
- Applying GATE-based processing routines (automatic annotations or post-annotation processing)
I have known about the GATE project in general for years and came to this site after reading: Crowdsourced Legal Case Annotation.
Could be the basis for annotations that are converted into a topic map, but…, I have been a sysadmin before. Maintaining servers, websites, software, etc. Great work, interesting work, but not what I want to be doing now.
Then I read:
Where to get it? The easiest way to get started is to buy a ready-to-run Teamware virtual server from GATECloud.net.
Not saying it will or won’t meet your particular needs, but, certainly is worth a “look see.”
Let me know if you take the plunge!
Populating the Semantic Web – Combining Text and Relational Databases as RDF Graphs by Kate Byrne.
I ran across this while looking for RDF graph material today. Delighted to find someone interested in the problem of what to do with existing data, even if new data arrives in some semantic web format.
Abstract:
The Semantic Web promises a way of linking distributed information at a granular level by interconnecting compact data items instead of complete HTML pages. New data is gradually being added to the Semantic Web but there is a need to incorporate existing knowledge. This thesis explores ways to convert a coherent body of information from various structured and unstructured formats into the necessary graph form. The transformation work crosses several currently active disciplines, and there are further research questions that can be addressed once the graph has been built.
Hybrid databases, such as the cultural heritage one used here, consist of structured relational tables associated with free text documents. Access to the data is hampered by complex schemas, confusing terminology and difficulties in searching the text effectively. This thesis describes how hybrid data can be unified by assembly into a graph. A major component task is the conversion of relational database content to RDF. This is an active research field, to which this work contributes by examining weaknesses in some existing methods and proposing alternatives.
The next significant element of the work is an attempt to extract structure automatically from English text using natural language processing methods. The first claim made is that the semantic content of the text documents can be adequately captured as a set of binary relations forming a directed graph. It is shown that the data can then be grounded using existing domain thesauri, by building an upper ontology structure from these. A schema for cultural heritage data is proposed, intended to be generic for that domain and as compact as possible.
Another hypothesis is that use of a graph will assist retrieval. The structure is uniform and very simple, and the graph can be queried even if the predicates (or edge labels) are unknown. Additional benefits of the graph structure are examined, such as using path length between nodes as a measure of relatedness (unavailable in a relational database where there is no equivalent concept of locality), and building information summaries by grouping the attributes of nodes that share predicates.
These claims are tested by comparing queries across the original and the new data structures. The graph must be able to answer correctly queries that the original database dealt with, and should also demonstrate valid answers to queries that could not previously be answered or where the results were incomplete.
This will take some time to read but it looks quite enjoyable.
Entity Matching for Semistructured Data in the Cloud by Marcus Paradies.
From the slides:
Main Idea
…
- Use MapReduce and ChuQL to process semistructured data
- Use a search-based blocking to generate candidate pairs
- Apply similarity functions to candidate pairs within a block
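The last two steps on that slide can be sketched on a single machine. This is only an illustration, not the approach from the slides: it swaps in a simple key-based blocking for the search-based blocking, leaves out MapReduce and ChuQL entirely, and the names blocking_key, jaccard and match, the record layout and the threshold are all my own assumptions.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: first three letters of the lowercased title.
    return record["title"].lower()[:3]


def jaccard(a, b):
    # Token-set Jaccard similarity, one of many possible similarity functions.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def match(records, threshold=0.5):
    # Group records into blocks so only records sharing a key are compared.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)

    matches = []
    for block in blocks.values():
        # Candidate pairs come only from within a block.
        for left, right in combinations(block, 2):
            score = jaccard(left["title"], right["title"])
            if score >= threshold:
                matches.append((left["id"], right["id"], score))
    return matches


# Made-up sample records, purely for illustration.
records = [
    {"id": 1, "title": "Entity Matching in the Cloud"},
    {"id": 2, "title": "Entity matching in the cloud computing era"},
    {"id": 3, "title": "Record Linkage Basics"},
]
print(match(records))
```

Records 1 and 2 share a block and are compared; record 3 lands in its own block and is never compared with the others, which is the whole point of blocking.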
Uses two of my favorite sources, CiteSeer and Wikipedia.
Looks like the start of an authoring stage of a topic map workflow to me. You?
Workshop on Entity-Oriented Search (EOS) – Beijing – Proceedings (PDF file)
There you will find:
Session 1:
Session 2
Session 3
A good start on what promises to be a strong conference series on entity-oriented search.
New release of deduplication software written in Java on top of Lucene by Lars Marius Garshol.
From the release notes:
This version of Duke introduces:
- Added JNDI data source for connecting to databases via JNDI (thanks to FMitzlaff).
- In-memory data source added (thanks to FMitzlaff).
- Record linkage mode now more flexible: can implement different strategies for choosing optimal links (with FMitzlaff).
- Record linkage API refactored slightly to be more flexible (with FMitzlaff).
- Added utilities for building equivalence classes from Duke output.
- Made the XML config loader more robust.
- Added a special cleaner for English person names.
- Fixed bug in NumericComparator (issue 66).
- Uses own Lucene query parser to avoid issues with search strings.
- Upgraded to Lucene 3.5.0.
- Added many more tests.
- Many small bug fixes to core, NTriples reader, etc.
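On the equivalence classes item: I have not looked at how Duke does it, so take this as a generic sketch rather than Duke's code. Assuming the output has been reduced to pairs of record IDs judged to be the same, a union-find pass groups them into classes. The function name and the pair format are my assumptions.

```python
def equivalence_classes(pairs):
    # parent maps each record ID to its current representative.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Merge the classes of every matched pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect members under their final representative.
    classes = {}
    for item in list(parent):
        classes.setdefault(find(item), set()).add(item)
    return sorted(sorted(members) for members in classes.values())


# Hypothetical match pairs: a=b and b=c imply a=c by transitivity.
print(equivalence_classes([("a", "b"), ("b", "c"), ("d", "e")]))
```

The transitive closure is what makes this more than a list of pairs: a and c end up in the same class even though they were never matched directly.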
BTW, the documentation is online only: http://code.google.com/p/duke/wiki/GettingStarted.
Some of you may recall my comments on Oyster: A Configurable ER Engine, a configurable entity resolution engine.
Software wasn’t available when that post was written but it is now, along with work on a GUI for the software.
Oyster Entity Resolution (SourceForge).
BTW, the “complete” download does not include the GUI.
It is important to also download the GUI for two reasons:
1) It is the only documentation for the project, and
2) The GUI generates the XML files needed to use the Oyster software.
There is no documentation of the XML format, such as a schema (I asked).
Contributing a schema to the project would be a nice thing to do.
Surrogate Learning – From Feature Independence to Semi-Supervised Classification by Sriharsha Veeramachaneni and Ravi Kumar Kondadadi.
Abstract:
We consider the task of learning a classifier from the feature space $X$ to the set of classes $Y = \{0, 1\}$, when the features can be partitioned into class-conditionally independent feature sets $X_1$ and $X_2$. We show that the class-conditional independence can be used to represent the original learning task in terms of 1) learning a classifier from $X_2$ to $X_1$ (in the sense of estimating the probability $P(x_1|x_2)$) and 2) learning the class-conditional distribution of the feature set $X_1$. This fact can be exploited for semi-supervised learning because the former task can be accomplished purely from unlabeled samples. We present experimental evaluation of the idea in two real world applications.
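To see why the independence assumption in the abstract is enough, take the simplest case (my assumption, for illustration) where the surrogate feature x1 is binary. Class-conditional independence gives P(x1=1|x2) = P(x1=1|y=0) P(y=0|x2) + P(x1=1|y=1) P(y=1|x2), and solving for P(y=1|x2) gives the function below. The function name and the clamping to [0, 1] are mine, and this is a sketch of the identity, not the authors' code.

```python
def posterior_from_surrogate(p_x1_given_x2, p_x1_given_y0, p_x1_given_y1):
    """Recover P(y=1 | x2) from a surrogate predictor, assuming binary x1.

    p_x1_given_x2: P(x1=1 | x2), learnable from unlabeled data alone.
    p_x1_given_y0: P(x1=1 | y=0), the class-conditional rate for class 0.
    p_x1_given_y1: P(x1=1 | y=1), the class-conditional rate for class 1.
    """
    denom = p_x1_given_y1 - p_x1_given_y0
    if denom == 0:
        # x1 is equally likely under both classes, so it says nothing about y.
        raise ValueError("x1 carries no information about y")
    p = (p_x1_given_x2 - p_x1_given_y0) / denom
    # Estimated probabilities can push the ratio slightly outside [0, 1].
    return min(1.0, max(0.0, p))


# Made-up numbers: x1 fires for 20% of class 0 and 80% of class 1.
print(posterior_from_surrogate(0.74, 0.2, 0.8))
```

Only the two class-conditional rates need labeled data (or prior knowledge); the predictor of x1 from x2 can be trained on as much unlabeled data as you can find.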
The two “real world” applications are ones you are likely to encounter:
First:
Our problem consisted of merging each of ≈ 20000 physician records, which we call the update database, to the record of the same physician in a master database of ≈ 10^6 records.
Our old friends record linkage and entity resolution. The solution depends upon a clever choice of features for application of the technique. (The thought occurs to me that a repository of data analysis snippets for particular techniques would be as valuable, if not more so, than the techniques themselves. Techniques come and go. Data analysis and the skills it requires go on and on.)
Second:
Sentence classification is often a preprocessing step for event or relation extraction from text. One of the challenges posed by sentence classification is the diversity in the language for expressing the same event or relationship. We present a surrogate learning approach to generating paraphrases for expressing the merger-acquisition (MA) event between two organizations in financial news. Our goal is to find paraphrase sentences for the MA event from an unlabeled corpus of news articles, that might eventually be used to train a sentence classifier that discriminates between MA and non-MA sentences. (Emphasis added. This is one of the issues in the legal track at TREC.)
This test was against 700000 financial news records.
Both tests were quite successful.
Surrogate learning looks interesting for a range of NLP applications.
Template-Based Information Extraction without the Templates by Nathanael Chambers and Dan Jurafsky.
Abstract:
Standard algorithms for template-based information extraction (IE) require predefined template schemas, and often labeled data, to learn to extract their slot fillers (e.g., an embassy is the Target of a Bombing template). This paper describes an approach to template-based IE that removes this requirement and performs extraction without knowing the template structure in advance. Our algorithm instead learns the template structure automatically from raw text, inducing template schemas as sets of linked events (e.g., bombings include detonate, set off, and destroy events) associated with semantic roles. We also solve the standard IE task, using the induced syntactic patterns to extract role fillers from specific documents. We evaluate on the MUC-4 terrorism dataset and show that we induce template structure very similar to hand-created gold structure, and we extract role fillers with an F1 score of .40, approaching the performance of algorithms that require full knowledge of the templates.
Can you say association?
Definitely points towards a pipeline approach to topic map authoring. To abuse the term, perhaps a “dashboard” that allows selection of data sources followed by the construction of workflows with preliminary analysis being displayed at “breakpoints” in the processing. No particular reason why stages have to be wired together other than tradition.
Just looking a little bit into the future, imagine that some entities weren’t being recognized at a high enough rate. So you shift that part of the data to several thousand human entity processors, take the average of their results (higher than what you were getting) and feed that back into the system. Could have knowledge workers who work full time but shift from job to job performing tasks too difficult to program effectively.
Concord: A Tool That Automates the Construction of Record Linkage Systems by Christopher Dozier, Hugo Molina Salgado, Merine Thomas, Sriharsha Veeramachaneni, 2010.
From the webpage:
Concord is a system provided by Thomson Reuters R&D to enable the rapid creation of record resolution systems (RRS). Concord allows software developers to interactively configure a RRS by specifying match feature functions, master record retrieval blocking functions, and unsupervised machine learning methods tuned to a specific resolution problem. Based on a developer’s configuration process, the Concord system creates a Java based RRS that generates training data, learns a matching model and resolves record information contained in files of the same types used for training and configuration.
A nice way to start off the week! Deeply interesting paper and a new name for record linkage.
Several features of Concord that merit your attention (among many):
A choice of basic comparison operations with the ability to extend seems like a good design to me. No sense overwhelming users with all the general comparison operators, to say nothing of the domain specific ones.
The blocking functions, which operate just as you suspect by narrowing down the potential set of records for matching, are also appealing. Sometimes you may be better at saying what doesn’t match than what does. This gives you two bites at a successful match.
Surrogate learning. I have located the paper cited on this subject and will be covering it in another post.
I have written to Thomson Reuters inquiring about the availability of Concord and its ability to interchange mapping settings between instances of Concord or beyond. Will update when I hear back from them.
Introducing fise, the Open Source RESTful Semantic Engine
From the post:
fise is now known as the Stanbol Enhancer component of the Apache Stanbol incubating project.
As a member of the IKS European project, Nuxeo contributes to the development of an Open Source software project named fise whose goal is to help bring new and trendy semantic features to CMS by giving developers a stack of reusable HTTP semantic services to build upon.
Presenting the software in Q/A form:
What is a Semantic Engine?
A semantic engine is a software component that extracts the meaning of an electronic document to organize it as partially structured knowledge and not just as a piece of unstructured text content.
Current semantic engines can typically:
- categorize documents (is this document written in English, Spanish, Chinese? is this an article that should be filed under the Business, Lifestyle, Technology categories? …);
- suggest meaningful tags from a controlled taxonomy and assert their relative importance with respect to the text content of the document;
- find related documents in the local database or on the web;
- extract and recognize mentions of known entities such as famous people, organizations, places, books, movies, genes, … and link the document to their knowledge base entries (like a biography for a famous person);
- detect yet unknown entities of the same aforementioned types to enrich the knowledge base;
- extract knowledge assertions that are present in the text to fill up a knowledge base along with a reference to trace the origin of the assertion. Examples of such assertions could be the fact that a company is buying another along with the amount of the transaction, the release date of a movie, the new club of a football player…
During the last couple of years, many such engines have been made available through web-based API such as Open Calais, Zemanta and Evri just to name a few. However to our knowledge there aren't many such engines distributed under an Open Source license to be used offline, on your private IT infrastructure with your sensitive data.
Impressive work that I found through a later post on using this software on Wikipedia. See Mining Wikipedia with Hadoop and Pig for Natural Language Processing.
Improving Entity Resolution with Global Constraints by Jim Gemmell, Benjamin I. P. Rubinstein, and Ashok K. Chandra.
Abstract:
Some of the greatest advances in web search have come from leveraging socio-economic properties of online user behavior. Past advances include PageRank, anchor text, hubs-authorities, and TF-IDF. In this paper, we investigate another socio-economic property that, to our knowledge, has not yet been exploited: sites that create lists of entities, such as IMDB and Netflix, have an incentive to avoid gratuitous duplicates. We leverage this property to resolve entities across the different web sites, and find that we can obtain substantial improvements in resolution accuracy. This improvement in accuracy also translates into robustness, which often reduces the amount of training data that must be labeled for comparing entities across many sites. Furthermore, the technique provides robustness when resolving sites that have some duplicates, even without first removing these duplicates. We present algorithms with very strong precision and recall, and show that max weight matching, while appearing to be a natural choice turns out to have poor performance in some situations. The presented techniques are now being used in the back-end entity resolution system at a major Internet search engine.
Relies on entity resolution that has been performed in another context. I rather like that, as opposed to starting at ground zero.
I was amused that “adult titles” were excluded from the data set. I don’t have the numbers offhand, but “adult titles” account for a large percentage of movie income. Not unlike using stock market data but excluding all finance industry stocks. Seems incomplete.
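The constraint in the abstract (a site avoids gratuitous duplicates, so an entity on one site should match at most one entity on the other) can be sketched with a greedy pass over scored pairs. To be clear, this is a generic heuristic of my own choosing for illustration, not one of the algorithms from the paper, and the IDs, scores and threshold are made up.

```python
def greedy_one_to_one(scores, threshold=0.5):
    # scores: dict mapping (left_id, right_id) -> similarity score.
    used_left, used_right = set(), set()
    matches = []
    # Visit candidate pairs from the highest score to the lowest.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for (left, right), score in ranked:
        if score < threshold:
            break  # everything after this is below the threshold too
        if left in used_left or right in used_right:
            continue  # one side is already matched, so skip this pair
        used_left.add(left)
        used_right.add(right)
        matches.append((left, right, score))
    return matches


# Hypothetical pairwise scores between two sites' movie lists.
scores = {
    ("imdb:1", "nf:a"): 0.9,
    ("imdb:1", "nf:b"): 0.8,
    ("imdb:2", "nf:b"): 0.7,
    ("imdb:2", "nf:a"): 0.6,
}
print(greedy_one_to_one(scores))
```

A plain threshold of 0.5 would accept all four pairs here. With the one-to-one constraint only two survive, which is where the gain in precision comes from.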
The First International Workshop on Entity-Oriented Search (EOS)
Important Dates
Submissions due: June 10, 2011
Notification of acceptance: June 25, 2011
Camera-ready submission: July 1, 2011 (provisional, awaiting confirmation)
Workshop date: July 28, 2011
From the website:
Workshop Theme
Many user information needs concern entities: people, organizations, locations, products, etc. These are better answered by returning specific objects instead of just any type of documents. Both commercial systems and the research community are displaying an increased interest in returning “objects”, “entities”, or their properties in response to a user’s query. While major search engines are capable of recognizing specific types of objects (e.g., locations, events, celebrities), true entity search still has a long way to go.
Entity retrieval is challenging as “objects”, unlike documents, are not directly represented and need to be identified and recognized in the mixed space of structured and unstructured Web data. While standard document retrieval methods applied to textual representations of entities do seem to provide reasonable performance, a big open question remains how much influence the entity type should have on the ranking algorithms developed.
Avoiding repeated document searching by successive users will require identification as suggested here. Sub-document addressing and retrieval of portions of documents is another aspect to the entity issue.
Lars Marius Garshol on Duke 0.1:
Duke is a fast and flexible deduplication (or entity resolution, or record linkage) engine written in Java on top of Lucene. At the moment (2011-04-07) it can process 1,000,000 records in 11 minutes on a standard laptop in a single thread.
Version 0.1 has been released, consisting of a command-line tool which can read CSV, JDBC, SPARQL, and NTriples data. There is also an API for programming incremental processing and storing the result of processing in a relational database.
The GettingStarted page explains how to get started and has links to further documentation. This blog post describes the basic approach taken to match records. It does not deal with the Lucene-based lookup, but describes an early, slow O(n^2) prototype. This presentation describes the ideas behind the engine and the intended architecture.
If you have questions, please contact the developer, Lars Marius Garshol, larsga at garshol.priv.no.
I will look around for sample data files.
Revealing the true challenges in fighting bank fraud
From the Infoglide blog:
The results of the survey are currently being compiled for general release, but it was extremely interesting to learn that the key challenges of fraud investigations include:
1. the inability to access data due to privacy concerns
2. a lack of a real-time, high-performance data searching engine
3. and an inability to cross-reference and discover relationships between suspicious entities in different databases.
For regular readers of this blog, it comes as no surprise that identity resolution and entity analytics technology provides a solution to those challenges. An identity resolution engine glides across the different data within (or perhaps even external to) a bank’s infrastructure, delivering a view of possible identity matches and non-obvious relationships or hidden links between those identities… despite variations in attributes and/or deliberate attempts to deceive. (emphasis added)
It being an Infoglide blog, guess who they think has an identity resolution engine?
I looked at the data sheet on their Identity Resolution Engine.
I have a question:
If two separate banks using “Identity Resolution Engine” have built up data mappings, on what basis do I merge those mappings, assuming there are name conflicts in the data mappings as well as in the data proper?
In an acquisition, for example, I should be able to leverage existing data mappings.