Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 4, 2013

Duke 1.0 Release!

Filed under: Duke,Entity Resolution,Record Linkage — Patrick Durusau @ 9:52 am

Duke 1.0 Release!

From the project page:

Duke is a fast and flexible deduplication (or entity resolution, or record linkage) engine written in Java on top of Lucene. The latest version is 1.0 (see ReleaseNotes).

Features

  • High performance.
  • Highly configurable.
  • Support for CSV, JDBC, SPARQL, and NTriples DataSources.
  • Many built-in comparators.
  • Plug in your own data sources, comparators, and cleaners.
  • Command-line client for getting started.
  • API for embedding into any kind of application.
  • Support for batch processing and continuous processing.
  • Can maintain database of links found via JNDI/JDBC.
  • Can run in multiple threads.

The GettingStarted page explains how to get started and has links to further documentation. This blog post describes the basic approach taken to match records. It does not deal with the Lucene-based lookup, but describes an early, slow O(n^2) prototype. This early presentation describes the ideas behind the engine and the intended architecture; a later and more up to date presentation has more practical detail and examples. There's also the ExamplesOfUse page, which lists real examples of using Duke, complete with data and configurations.
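
If you want a quick feel for the matching model before reading the docs, here is a minimal Python sketch of the probabilistic, per-property comparison idea described in the blog post linked above. This is not Duke's API; the comparator, property names, and low/high probabilities are my own illustrations of the general approach.

    # Minimal sketch (not Duke's API): combine per-property similarities into a
    # single match probability and accept pairs above a threshold.
    from difflib import SequenceMatcher

    def similarity(a, b):
        # crude string similarity in [0, 1]; Duke ships many real comparators
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def combine(p, q):
        # Bayes-style combination of two probabilities
        return (p * q) / (p * q + (1 - p) * (1 - q))

    def match_probability(rec1, rec2, properties):
        # properties: {name: (low, high)} -- probability assigned when the
        # property disagrees completely (low) or agrees exactly (high)
        prob = 0.5  # start from indifference
        for name, (low, high) in properties.items():
            sim = similarity(rec1[name], rec2[name])
            prob = combine(prob, low + sim * (high - low))
        return prob

    props = {"name": (0.3, 0.9), "address": (0.4, 0.8)}
    r1 = {"name": "J. Smith", "address": "1 Main St"}
    r2 = {"name": "John Smith", "address": "1 Main Street"}
    print(match_probability(r1, r2, props))  # treat as a match above, say, 0.8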

Excellent news on the data deduplication front!

And for topic map authors as well (see the examples).

Kudos to Lars Marius Garshol!

December 11, 2012

Developing CODE for a Research Database

Filed under: Entity Extraction,Entity Resolution,Linked Data — Patrick Durusau @ 8:19 pm

Developing CODE for a Research Database by Ian Armas Foster.

From the post:

The fact that there are a plethora of scientific papers readily available online would seem helpful to researchers. Unfortunately, the truth is that the volume of these articles has grown such that determining which information is relevant to a specific project is becoming increasingly difficult.

Austrian and German researchers are thus developing CODE, or Commercially Empowered Linked Open Data Ecosystems in Research, to properly aggregate research data from its various forms, such as PDFs of academic papers and data tables upon which those papers are based, into a single system. The project is in a prototype stage, with the goal being to integrate all forms into one platform by the project’s second year.

The researchers from the University of Passau in Germany and the Know-Center in Graz, Austria explored the challenges to CODE and how the team intends to deal with those challenges in this paper. The goal is to meliorate the research process by making it easier to not only search for both text and numerical data in the same query but also to use both varieties in concert. The basic architecture for the project is shown below.

Stop me if you have heard this one before: “There was this project that was going to disambiguate entities and create linked data….”

I would be the first one to cheer if such a project were successful. But, a few paragraphs in a paper, given the long history of entity resolution and its difficulties, isn’t enough to raise my hopes.

You?

December 1, 2012

A Consumer Electronics Named Entity Recognizer using NLTK [Post-Authoring ER?]

Filed under: Entity Resolution,Named Entity Mining,NLTK — Patrick Durusau @ 8:34 pm

A Consumer Electronics Named Entity Recognizer using NLTK by Sujit Pal.

From the post:

Some time back, I came across a question someone asked about possible approaches to building a Named Entity Recognizer (NER) for the Consumer Electronics (CE) industry on LinkedIn’s Natural Language Processing People group. I had just finished reading the NLTK Book and had some ideas, but I wanted to test my understanding, so I decided to build one. This post describes this effort.

The approach is actually quite portable and not tied to NLTK and Python, you could, for example, build a Java/Scala based NER using components from OpenNLP and Weka using this approach. But NLTK provides all the components you need in one single package, and I wanted to get familiar with it, so I ended up using NLTK and Python.

The idea is that you take some Consumer Electronics text, mark the chunks (words/phrases) you think should be Named Entities, then train a (binary) classifier on it. Each word in the training set, along with some features such as its Part of Speech (POS), Shape, etc is a training input to the classifier. If the word is part of a CE Named Entity (NE) chunk, then its trained class is True otherwise it is False. You then use this classifier to predict the class (CE NE or not) of words in (previously unseen) text from the Consumer Electronics domain.
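
A compressed sketch of that word-level setup, using NLTK's NaiveBayesClassifier (the toy sentence, chunk indexes, and feature functions below are mine, not Sujit Pal's code):

    # Each word becomes a feature dict; the label is True when the word falls
    # inside a marked Consumer Electronics named-entity chunk.
    # (Assumes the standard NLTK tokenizer and POS tagger models are installed.)
    import nltk

    def shape(word):
        if word.isupper():
            return "ALLCAPS"
        if word[0].isupper():
            return "Capitalized"
        if any(c.isdigit() for c in word):
            return "has-digit"
        return "lower"

    def features(word, pos):
        return {"pos": pos, "shape": shape(word), "suffix": word[-3:].lower()}

    # toy training data: (sentence, indexes of tokens inside a CE NE chunk)
    training = [("I bought a Samsung Galaxy S3 yesterday", {3, 4, 5})]

    train_set = []
    for text, ne_indexes in training:
        tagged = nltk.pos_tag(nltk.word_tokenize(text))
        for i, (word, pos) in enumerate(tagged):
            train_set.append((features(word, pos), i in ne_indexes))

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(classifier.classify(features("iPhone", "NNP")))  # True or False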

Should help with mining data for “entities” (read “subjects” in the topic map sense) for addition to your topic map.

I did puzzle over the suggestion for improvement that reads:

Another idea is to not do reference resolution during tagging, but instead postponing this to a second stage following entity recognition. That way, the references will be localized to the text under analysis, thus reducing false positives.

Post-authoring reference resolution might benefit from that approach.

But, if references were resolved by authors during the creation of a text, such as the insertion of Wikipedia references for entities, a different result would be obtained.

In those cases, assuming the author of a text is identified, they can be associated with a particular set of reference resolutions.

September 2, 2012

Entity disambiguation using semantic networks

Filed under: Entity Resolution,Graphs,Networks,Semantic Graph — Patrick Durusau @ 7:20 pm

Entity disambiguation using semantic networks by Jorge H. Román, Kevin J. Hulin, Linn M. Collins and James E. Powell. Journal of the American Society for Information Science and Technology, published 29 August 2012.

Abstract:

A major stumbling block preventing machines from understanding text is the problem of entity disambiguation. While humans find it easy to determine that a person named in one story is the same person referenced in a second story, machines rely heavily on crude heuristics such as string matching and stemming to make guesses as to whether nouns are coreferent. A key advantage that humans have over machines is the ability to mentally make connections between ideas and, based on these connections, reason how likely two entities are to be the same. Mirroring this natural thought process, we have created a prototype framework for disambiguating entities that is based on connectedness. In this article, we demonstrate it in the practical application of disambiguating authors across a large set of bibliographic records. By representing knowledge from the records as edges in a graph between a subject and an object, we believe that the problem of disambiguating entities reduces to the problem of discovering the most strongly connected nodes in a graph. The knowledge from the records comes in many different forms, such as names of people, date of publication, and themes extracted from the text of the abstract. These different types of knowledge are fused to create the graph required for disambiguation. Furthermore, the resulting graph and framework can be used for more complex operations.

To give you a sense of the authors’ approach:

A semantic network is the underlying information representation chosen for the approach. The framework uses several algorithms to generate subgraphs in various dimensions. For example: a person’s name is mapped into a phonetic dimension, the abstract is mapped into a conceptual dimension, and the rest are mapped into other dimensions. To map a name into its phonetic representation, an algorithm translates the name of a person into a sequence of phonemes. Therefore, two names that are written differently but pronounced the same are considered to be the same in this dimension. The “same” qualification in one of these dimensions is then used to identify potential coreferent entities. Similarly, an algorithm for generating potential alternate spellings of a name has been used to find entities for comparison with similarly spelled names by computing word distance.

The hypothesis underlying our approach is that coreferent entities are strongly connected on a well-constructed graph.
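
The phonetic dimension is the easiest piece to picture. A small sketch of the idea (a simplified Soundex stands in for whatever phoneme encoding the authors actually use): names that are written differently but pronounced alike collapse onto the same key, and therefore onto the same node in that dimension.

    # Simplified Soundex: enough to show the idea, not a faithful standard.
    def simple_soundex(name):
        codes = {}
        for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                               ("L", "4"), ("MN", "5"), ("R", "6")]:
            for ch in letters:
                codes[ch] = digit
        name = "".join(c for c in name.upper() if c.isalpha())
        if not name:
            return ""
        encoded = [codes.get(c, "0") for c in name]  # vowels, H, W, Y -> "0"
        out, prev = [name[0]], encoded[0]
        for digit in encoded[1:]:
            if digit != "0" and digit != prev:
                out.append(digit)
            prev = digit
        return "".join(out)[:4].ljust(4, "0")

    # written differently, pronounced (roughly) the same -> same key, same node
    print(simple_soundex("Meyer"), simple_soundex("Maier"))  # M600 M600
    print(simple_soundex("Smith"), simple_soundex("Smyth"))  # S530 S530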

Question: What if the nodes to which the coreferent entities are strongly connected are themselves ambiguous?

August 1, 2012

Swoosh: a generic approach to entity resolution

Filed under: Deduplication,Entity Resolution — Patrick Durusau @ 7:53 pm

Swoosh: a generic approach to entity resolution by Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, and Jennifer Widom. The VLDB Journal, 2008.

Do you remember Swoosh?

I saw it today in Five Short Links by Pete Warden.

Abstract:

We consider the Entity Resolution (ER) problem (also known as deduplication, or merge-purge), in which records determined to represent the same real-world entity are successively located and merged. We formalize the generic ER problem, treating the functions for comparing and merging records as black-boxes, which permits expressive and extensible ER solutions. We identify four important properties that, if satisfied by the match and merge functions, enable much more efficient ER algorithms. We develop three efficient ER algorithms: G-Swoosh for the case where the four properties do not hold, and R-Swoosh and F-Swoosh that exploit the 4 properties. F-Swoosh in addition assumes knowledge of the “features” ( e.g., attributes) used by the match function. We experimentally evaluate the algorithms using comparison shopping data from Yahoo! Shopping and hotel information data from Yahoo! Travel. We also show that R-Swoosh (and F-Swoosh) can be used even when the four match and merge properties do not hold, if an “approximate” result is acceptable.
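
The black-box match/merge framing is easy to picture in code. Here is a rough Python sketch in the spirit of R-Swoosh (my own toy version, not the paper's implementation): pull a record, compare it against the records already believed distinct, and on a match merge the two and re-queue the result so it can pull in further matches.

    # Rough sketch in the spirit of R-Swoosh: match() and merge() are black
    # boxes; a merged record goes back on the queue for further merging.
    def resolve(records, match, merge):
        pending = list(records)   # records still to process
        resolved = []             # records believed distinct so far
        while pending:
            r = pending.pop()
            partner = next((s for s in resolved if match(r, s)), None)
            if partner is None:
                resolved.append(r)
            else:
                resolved.remove(partner)
                pending.append(merge(r, partner))  # re-queue the merged record
        return resolved

    # toy black boxes over records holding sets of names and emails
    match = lambda a, b: bool(a["names"] & b["names"] or a["emails"] & b["emails"])
    merge = lambda a, b: {"names": a["names"] | b["names"],
                          "emails": a["emails"] | b["emails"]}

    records = [
        {"names": {"J. Smith"}, "emails": {"js@example.org"}},
        {"names": {"John Smith"}, "emails": {"js@example.org"}},
        {"names": {"John Smith"}, "emails": {"john@example.com"}},
    ]
    print(resolve(records, match, merge))  # collapses to a single record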

It sounds familiar.

Running some bibliographic searches, it looks like about 100 references since 2011. That’s going to take a while! But it all looks like good stuff.

July 8, 2012

Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection

Filed under: Duplicates,Entity Resolution,Record Linkage — Patrick Durusau @ 4:59 pm

Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection by Peter Christen.

In the Foreword, William E. Winkler (U. S. Census Bureau and dean of record linkage), writes:

Within this framework of historical ideas and needed future work, Peter Christen’s monograph serves as an excellent compendium of the best existing work by computer scientists and others. Individuals can use the monograph as a basic reference to which they can gain insight into the most pertinent record linkage ideas. Interested researchers can use the methods and observations as building blocks in their own work. What I found very appealing was the high quality of the overall organization of the text, the clarity of the writing, and the extensive bibliography of pertinent papers. The numerous examples are quite helpful because they give real insight into a specific set of methods. The examples, in particular, prevent the researcher from going down some research directions that would often turn out to be dead ends.

I saw the alert for this volume today so haven’t had time to acquire and read it.

Given the high praise from Winkler, I expect it to be a pleasure to read and use.

June 3, 2012

Semi-Supervised Named Entity Recognition:… [Marketing?]

Filed under: Entities,Entity Extraction,Entity Resolution,Marketing — Patrick Durusau @ 3:40 pm

Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision by David Nadeau (PhD Thesis, University of Ottawa, 2007).

Abstract:

Named Entity Recognition (NER) aims to extract and to classify rigid designators in text such as proper names, biological species, and temporal expressions. There has been growing interest in this field of research since the early 1990s. In this thesis, we document a trend moving away from handcrafted rules, and towards machine learning approaches. Still, recent machine learning approaches have a problem with annotated data availability, which is a serious shortcoming in building and maintaining large-scale NER systems. In this thesis, we present an NER system built with very little supervision. Human supervision is indeed limited to listing a few examples of each named entity (NE) type. First, we introduce a proof-of-concept semi-supervised system that can recognize four NE types. Then, we expand its capacities by improving key technologies, and we apply the system to an entire hierarchy comprised of 100 NE types. Our work makes the following contributions: the creation of a proof-of-concept semi-supervised NER system; the demonstration of an innovative noise filtering technique for generating NE lists; the validation of a strategy for learning disambiguation rules using automatically identified, unambiguous NEs; and finally, the development of an acronym detection algorithm, thus solving a rare but very difficult problem in alias resolution. We believe semi-supervised learning techniques are about to break new ground in the machine learning community. In this thesis, we show that limited supervision can build complete NER systems. On standard evaluation corpora, we report performances that compare to baseline supervised systems in the task of annotating NEs in texts.

Nadeau demonstrates the successful construction of a Named Entity Recognition (NER) system using a few supplied examples for each entity type.

But what explains the lack of annotation where the entities are well known? Take the King James Bible and search for “Joseph.” We know that not all of the occurrences of “Joseph” represent the same entity.

Looking at the client list for Infoglutton, is there a lack of interest in named entity recognition?

Have we focused on techniques and issues that interest us, and then, as an afterthought, tried to market the results to consumers?

A Survey of Named Entity Recognition and Classification

Filed under: Entities,Entity Extraction,Entity Resolution — Patrick Durusau @ 3:40 pm

A Survey of Named Entity Recognition and Classification by David Nadeau and Satoshi Sekine (Lingvisticae Investigationes 30:1, 2007).

Abstract:

The term “Named Entity”, now widely used in Natural Language Processing, was coined for the Sixth Message Understanding Conference (MUC-6) (R. Grishman & Sundheim 1996). At that time, MUC was focusing on Information Extraction (IE) tasks where structured information of company activities and defense related activities is extracted from unstructured text, such as newspaper articles. In defining the task, people noticed that it is essential to recognize information units like names, including person, organization and location names, and numeric expressions including time, date, money and percent expressions. Identifying references to these entities in text was recognized as one of the important sub-tasks of IE and was called “Named Entity Recognition and Classification (NERC)”. We present here a survey of fifteen years of research in the NERC field, from 1991 to 2006. While early systems were making use of handcrafted rule-based algorithms, modern systems most often resort to machine learning techniques. We survey these techniques as well as other critical aspects of NERC such as features and evaluation methods. It was indeed concluded in a recent conference that the choice of features is at least as important as the choice of technique for obtaining a good NERC system (E. Tjong Kim Sang & De Meulder 2003). Moreover, the way NERC systems are evaluated and compared is essential to progress in the field. To the best of our knowledge, NERC features, techniques, and evaluation methods have not been surveyed extensively yet. The first section of this survey presents some observations on published work from the point of view of activity per year, supported languages, preferred textual genre and domain, and supported entity types. It was collected from the review of a hundred English language papers sampled from the major conferences and journals. We do not claim this review to be exhaustive or representative of all the research in all languages, but we believe it gives a good feel for the breadth and depth of previous work. Section 2 covers the algorithmic techniques that were proposed for addressing the NERC task. Most techniques are borrowed from the Machine Learning (ML) field. Instead of elaborating on techniques themselves, the third section lists and classifies the proposed features, i.e., descriptions and characteristic of words for algorithmic consumption. Section 4 presents some of the evaluation paradigms that were proposed throughout the major forums. Finally, we present our conclusions.

A bit dated now (2007) but a good starting point for named entity recognition research. The bibliography runs a little over four (4) pages and running those citations forward should capture most of the current research.

Unsupervised Named-Entity Recognition: Generating Gazetteers and Resolving Ambiguity

Filed under: Entity Extraction,Entity Resolution,Named Entity Mining — Patrick Durusau @ 3:38 pm

Unsupervised Named-Entity Recognition: Generating Gazetteers and Resolving Ambiguity by David Nadeau, Peter D. Turney and Stan Matwin.

Abstract:

In this paper, we propose a named-entity recognition (NER) system that addresses two major limitations frequently discussed in the field. First, the system requires no human intervention such as manually labeling training data or creating gazetteers. Second, the system can handle more than the three classical named-entity types (person, location, and organization). We describe the system’s architecture and compare its performance with a supervised system. We experimentally evaluate the system on a standard corpus, with the three classical named-entity types, and also on a new corpus, with a new named-entity type (car brands).

The authors report successful application of their techniques to more than 50 named-entity types.

They also describe the heuristics they apply to texts during the mining process.

Is there a common repository of observations or heuristics for mining texts? Just curious.

Source code for the project: http://balie.sourceforge.net.

Answer to the question I just posed?

A Resource-Based Method for Named Entity Extraction and Classification

Filed under: Entities,Entity Extraction,Entity Resolution,Law,Named Entity Mining — Patrick Durusau @ 3:37 pm

A Resource-Based Method for Named Entity Extraction and Classification by Pablo Gamallo and Marcos Garcia. (Lecture Notes in Computer Science, vol. 7026, Springer-Verlag, pp. 610-623. ISSN: 0302-9743).

Abstract:

We propose a resource-based Named Entity Classification (NEC) system, which combines named entity extraction with simple language-independent heuristics. Large lists (gazetteers) of named entities are automatically extracted making use of semi-structured information from the Wikipedia, namely infoboxes and category trees. Language independent heuristics are used to disambiguate and classify entities that have been already identified (or recognized) in text. We compare the performance of our resource-based system with that of a supervised NEC module implemented for the FreeLing suite, which was the winner system in CoNLL-2002 competition. Experiments were performed over Portuguese text corpora taking into account several domains and genres.

Of particular interest if you want to add NEC resources to the FreeLing project.

The introduction starts off:

Named Entity Recognition and Classification (NERC) is the process of identifying and classifying proper names of people, organizations, locations, and other Named Entities (NEs) within text.

Curious, what happens if you don’t have a “named” entity? That is, an entity mentioned in the text that doesn’t (yet) have a proper name?

Thinking of legal texts where some provision may apply to all corporations that engage in activity Y and that have a gross annual income in excess of amount X.

I may want to “recognize” that entity so I can then put a name with that entity.

May 31, 2012

Joint International Workshop on Entity-oriented and Semantic Search

Filed under: Entity Extraction,Entity Resolution,LOD,Semantic Search,Semantics — Patrick Durusau @ 7:32 am

1st Joint International Workshop on Entity-oriented and Semantic Search (JIWES) 2012

Important Dates:

  • Submissions Due: July 2, 2012
  • Notification of Acceptance: July 23, 2012
  • Camera Ready: August 1, 2012
  • Workshop date: August 16th, 2012

Co-located with the 35th ACM SIGIR Conference, Portland, Oregon, USA, August 12–16, 2012.

From the homepage of the workshop:

About the Workshop:

The workshop encompasses various tasks and approaches that go beyond the traditional bag-of-words paradigm and incorporate an explicit representation of the semantics behind information needs and relevant content. This kind of semantic search, based on concepts, entities and relations between them, has attracted attention both from industry and from the research community. The workshop aims to bring people from different communities (IR, SW, DB, NLP, HCI, etc.) and backgrounds (both academics and industry practitioners) together, to identify and discuss emerging trends, tasks and challenges. This joint workshop is a sequel of the Entity-oriented Search and Semantic Search Workshop series held at different conferences in previous years.

Topics

The workshop aims to gather all works that discuss entities along three dimensions: tasks, data and interaction. Tasks include entity search (search for entities or documents representing entities), relation search (search entities related to an entity), as well as more complex tasks (involving multiple entities, spatio-temporal relations inclusive, involving multiple queries). In the data dimension, we consider (web/enterprise) documents (possibly annotated with entities/relations), Linked Open Data (LOD), as well as user generated content. The interaction dimension gives room for research into user interaction with entities, also considering how to display results, as well as whether to aggregate over multiple entities to construct entity profiles. The workshop especially encourages submissions on the interface of IR and other disciplines, such as the Semantic Web, Databases, Computational Linguistics, Data Mining, Machine Learning, or Human Computer Interaction. Examples of topic of interest include (but are not limited to):

  • Data acquisition and processing (crawling, storage, and indexing)
  • Dealing with noisy, vague and incomplete data
  • Integration of data from multiple sources
  • Identification, resolution, and representation of entities (in documents and in queries)
  • Retrieval and ranking
  • Semantic query modeling (detecting, modeling, and understanding search intents)
  • Novel entity-oriented information access tasks
  • Interaction paradigms (natural language, keyword-based, and hybrid interfaces) and result representation
  • Test collections and evaluation methodology
  • Case studies and applications

We particularly encourage formal evaluation of approaches using previously established evaluation benchmarks: Semantic Search Challenge 2010, Semantic Search Challenge 2011, TREC Entity Search Track.

All workshops are special to someone. This one sounds more special than most: it is co-located with the ACM SIGIR 2012 meeting. Perhaps that’s the difference.

May 9, 2012

GATE Teamware: Collaborative Annotation Factories (HOT!)

GATE Teamware: Collaborative Annotation Factories

From the webpage:

Teamware is a web-based management platform for collaborative annotation & curation. It is a cost-effective environment for annotation and curation projects, enabling you to harness a broadly distributed workforce and monitor progress & results remotely in real time.

It’s also very easy to use. A new project can be up and running in less than five minutes. (As far as we know, there is nothing else like it in this field.)

GATE Teamware delivers a multi-function user interface over the Internet for viewing, adding and editing text annotations. The web-based management interface allows for project set-up, tracking, and management:

  • Loading document collections (a “corpus” or “corpora”)
  • Creating re-usable project templates
  • Initiating projects based on templates
  • Assigning project roles to specific users
  • Monitoring progress and various project statistics in real time
  • Reporting of project status, annotator activity and statistics
  • Applying GATE-based processing routines (automatic annotations or post-annotation processing)

I have known about the GATE project in general for years and came to this site after reading: Crowdsourced Legal Case Annotation.

Could be the basis for annotations that are converted into a topic map, but… I have been a sysadmin before: maintaining servers, websites, software, etc. Great work, interesting work, but not what I want to be doing now.

Then I read:

Where to get it? The easiest way to get started is to buy a ready-to-run Teamware virtual server from GATECloud.net.

Not saying it will or won’t meet your particular needs, but, certainly is worth a “look see.”

Let me know if you take the plunge!

March 3, 2012

Populating the Semantic Web…

Filed under: Data Mining,Entity Extraction,Entity Resolution,RDF,Semantic Web — Patrick Durusau @ 7:28 pm

Populating the Semantic Web – Combining Text and Relational Databases as RDF Graphs by Kate Byrne.

I ran across this while looking for RDF graph material today. Delighted to find someone interested in the problem of what we do with existing data, even if new data arrives in some Semantic Web format.

Abstract:

The Semantic Web promises a way of linking distributed information at a granular level by interconnecting compact data items instead of complete HTML pages. New data is gradually being added to the Semantic Web but there is a need to incorporate existing knowledge. This thesis explores ways to convert a coherent body of information from various structured and unstructured formats into the necessary graph form. The transformation work crosses several currently active disciplines, and there are further research questions that can be addressed once the graph has been built.

Hybrid databases, such as the cultural heritage one used here, consist of structured relational tables associated with free text documents. Access to the data is hampered by complex schemas, confusing terminology and difficulties in searching the text effectively. This thesis describes how hybrid data can be unified by assembly into a graph. A major component task is the conversion of relational database content to RDF. This is an active research field, to which this work contributes by examining weaknesses in some existing methods and proposing alternatives.

The next significant element of the work is an attempt to extract structure automatically from English text using natural language processing methods. The first claim made is that the semantic content of the text documents can be adequately captured as a set of binary relations forming a directed graph. It is shown that the data can then be grounded using existing domain thesauri, by building an upper ontology structure from these. A schema for cultural heritage data is proposed, intended to be generic for that domain and as compact as possible.

Another hypothesis is that use of a graph will assist retrieval. The structure is uniform and very simple, and the graph can be queried even if the predicates (or edge labels) are unknown. Additional benefits of the graph structure are examined, such as using path length between nodes as a measure of relatedness (unavailable in a relational database where there is no equivalent concept of locality), and building information summaries by grouping the attributes of nodes that share predicates.

These claims are tested by comparing queries across the original and the new data structures. The graph must be able to answer correctly queries that the original database dealt with, and should also demonstrate valid answers to queries that could not previously be answered or where the results were incomplete.
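
The point about path length as a relatedness measure is simple to try for yourself. A small sketch using networkx (my choice of tooling, not the thesis's, and the triples are toy cultural-heritage examples):

    # Path length as a crude relatedness measure: the shorter the path between
    # two nodes in the graph, the more related they are taken to be.
    import networkx as nx

    triples = [
        ("Edinburgh Castle", "locatedIn", "Edinburgh"),
        ("Edinburgh", "locatedIn", "Scotland"),
        ("Holyrood Palace", "locatedIn", "Edinburgh"),
        ("Stirling Castle", "locatedIn", "Scotland"),
    ]
    g = nx.Graph()
    for subj, pred, obj in triples:
        g.add_edge(subj, obj, predicate=pred)  # keep the predicate as an edge label

    def relatedness(a, b):
        return 1.0 / (1 + nx.shortest_path_length(g, a, b))

    print(relatedness("Edinburgh Castle", "Holyrood Palace"))  # share a node: 0.33
    print(relatedness("Edinburgh Castle", "Stirling Castle"))  # further apart: 0.25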

This will take some time to read but it looks quite enjoyable.

February 24, 2012

Entity Matching for Semistructured Data in the Cloud

Filed under: Cloud Computing,Entities,Entity Resolution — Patrick Durusau @ 5:03 pm

Entity Matching for Semistructured Data in the Cloud by Marcus Paradies.

From the slides:

Main Idea

  • Use MapReduce and ChuQL to process semistructured data
  • Use a search-based blocking to generate candidate pairs
  • Apply similarity functions to candidate pairs within a block

Uses two of my favorite sources, CiteSeer and Wikipedia.

Looks like the start of an authoring stage of topic map work flow to me. You?

January 16, 2012

Workshop on Entity-Oriented Search (EOS) – Beijing – Proceedings

Filed under: Conferences,Entities,Entity Extraction,Entity Resolution,Search Data,Searching — Patrick Durusau @ 2:32 pm

Workshop on Entity-Oriented Search (EOS) – Beijing – Proceedings (PDF file)

There you will find:

Session 1:

  • High Performance Clustering for Web Person Name Disambiguation Using Topic Capturing by Zhengzhong Liu, Qin Lu, and Jian Xu (The Hong Kong Polytechnic University)
  • Extracting Dish Names from Chinese Blog Reviews Using Suffix Arrays and a Multi-Modal CRF Model by Richard Tzong-Han Tsai (Yuan Ze University, Taiwan)
  • LADS: Rapid Development of a Learning-To-Rank Based Related Entity Finding System using Open Advancement by Bo Lin, Kevin Dela Rosa, Rushin Shah, and Nitin Agarwal (Carnegie Mellon University)
  • Finding Support Documents with a Logistic Regression Approach by Qi Li and Daqing He (University of Pittsburgh)
  • The Sindice-2011 Dataset for Entity-Oriented Search in the Web of Data by Stephane Campinas (National University of Ireland), Diego Ceccarelli (University of Pisa), Thomas E. Perry (National University of Ireland), Renaud Delbru (National University of Ireland), Krisztian Balog (Norwegian University of Science and Technology) and Giovanni Tummarello (National University of Ireland)

Session 2

  • Cross-Domain Bootstrapping for Named Entity Recognition by Ang Sun and Ralph Grishman (New York University)
  • Semi-supervised Statistical Inference for Business Entities Extraction and Business Relations Discovery by Raymond Y.K. Lau and Wenping Zhang (City University of Hong Kong)
  • Unsupervised Related Entity Finding by Olga Vechtomova (University of Waterloo)

Session 3

  • Learning to Rank Homepages For Researcher-Name Queries by Sujatha Das, Prasenjit Mitra, and C. Lee Giles (The Pennsylvania State University)
  • An Evaluation Framework for Aggregated Temporal Information Extraction by Enrique Amigó (UNED University), Javier Artiles (City University of New York), Heng Ji (City University of New York) and Qi Li (City University of New York)
  • Entity Search Evaluation over Structured Web Data by Roi Blanco (Yahoo! Research), Harry Halpin (University of Edinburgh), Daniel M. Herzig (Karlsruhe Institute of Technology), Peter Mika (Yahoo! Research), Jeffrey Pound (University of Waterloo), Henry S. Thompson (University of Edinburgh) and Thanh Tran Duc (Karlsruhe Institute of Technology)

A good start on what promises to be a strong conference series on entity-oriented search.

January 13, 2012

Duke 0.4

Filed under: Deduplication,Entity Resolution,Record Linkage — Patrick Durusau @ 8:17 pm

Duke 0.4

New release of deduplication software written in Java on top of Lucene by Lars Marius Garshol.

From the release notes:

This version of Duke introduces:

  • Added JNDI data source for connecting to databases via JNDI (thanks to FMitzlaff).
  • In-memory data source added (thanks to FMitzlaff).
  • Record linkage mode now more flexible: can implement different strategies for choosing optimal links (with FMitzlaff).
  • Record linkage API refactored slightly to be more flexible (with FMitzlaff).
  • Added utilities for building equivalence classes from Duke output.
  • Made the XML config loader more robust.
  • Added a special cleaner for English person names.
  • Fixed bug in NumericComparator ( issue 66 )
  • Uses own Lucene query parser to avoid issues with search strings.
  • Upgraded to Lucene 3.5.0.
  • Added many more tests.
  • Many small bug fixes to core, NTriples reader, etc.

BTW, the documentation is online only: http://code.google.com/p/duke/wiki/GettingStarted.

November 28, 2011

Oyster Software On Sourceforge!

Filed under: Entity Resolution,Oyster — Patrick Durusau @ 7:11 pm

Some of you may recall my comments on Oyster: A Configurable ER Engine, a configurable entity resolution engine.

Software wasn’t available when that post was written but it is now, along with work on a GUI for the software.

Oyster Entity Resolution (SourceForge).

BTW, the “complete” download does not include the GUI.

It is important to also download the GUI for two reasons:

1) It is the only documentation for the project, and

2) The GUI generates the XML files needed to use the Oyster software.

There is no documentation of the XML format (I asked). As in a schema, etc.

Contributing a schema to the project would be a nice thing to do.

Surrogate Learning

Filed under: Entity Resolution,Linguistics,Record Linkage — Patrick Durusau @ 7:04 pm

Surrogate Learning – From Feature Independence to Semi-Supervised Classification by Sriharsha Veeramachaneni and Ravi Kumar Kondadadi.

Abstract:

We consider the task of learning a classifier from the feature space X to the set of classes Y = {0, 1}, when the features can be partitioned into class-conditionally independent feature sets X1 and X2. We show that the class-conditional independence can be used to represent the original learning task in terms of 1) learning a classifier from X2 to X1 (in the sense of estimating the probability P(x1|x2)) and 2) learning the class-conditional distribution of the feature set X1. This fact can be exploited for semi-supervised learning because the former task can be accomplished purely from unlabeled samples. We present experimental evaluation of the idea in two real world applications.
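
To unpack the decomposition: with class-conditional independence, P(x1|x2) = P(x1|y=1)P(y=1|x2) + P(x1|y=0)P(y=0|x2), so for a binary surrogate feature x1 you can solve for P(y=1|x2) once you have P(x1=1|x2), which is learnable from unlabeled data, and the two class-conditional rates P(x1=1|y). A tiny sketch of that back-solving step (my own illustration of the idea, not the authors' code):

    # Back out P(y=1|x2) from quantities learnable without class labels:
    # P(x1=1|x2) plus the class-conditional rates of the surrogate feature x1.
    def p_y1_given_x2(p_x1_given_x2, p_x1_given_y1, p_x1_given_y0):
        # P(x1=1|x2) = P(x1=1|y=1)*P(y=1|x2) + P(x1=1|y=0)*(1 - P(y=1|x2))
        p = (p_x1_given_x2 - p_x1_given_y0) / (p_x1_given_y1 - p_x1_given_y0)
        return min(1.0, max(0.0, p))  # clip estimation noise into [0, 1]

    # surrogate fires 90% of the time on the positive class, 10% on the negative;
    # the unlabeled-data model says P(x1=1|x2) = 0.5 for this particular x2
    print(p_y1_given_x2(0.5, 0.9, 0.1))  # 0.5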

The two “real world” applications are ones you are likely to encounter:

First:

Our problem consisted of merging each of ≈ 20,000 physician records, which we call the update database, to the record of the same physician in a master database of ≈ 10^6 records.

Our old friends record linkage and entity resolution. The solution depends upon a clever choice of features for application of the technique. (The thought occurs to me that a repository of data analysis snippets for particular techniques would be as valuable, if not more so, than the techniques themselves. Techniques come and go. Data analysis and the skills it requires goes on and on.)

Second:

Sentence classification is often a preprocessing step for event or relation extraction from text. One of the challenges posed by sentence classification is the diversity in the language for expressing the same event or relationship. We present a surrogate learning approach to generating paraphrases for expressing the merger-acquisition (MA) event between two organizations in financial news. Our goal is to find paraphrase sentences for the MA event from an unlabeled corpus of news articles, that might eventually be used to train a sentence classifier that discriminates between MA and non-MA sentences. (Emphasis added. This is one of the issues in the legal track at TREC.)

This test was against 700,000 financial news records.

Both tests were quite successful.

Surrogate learning looks interesting for a range of NLP applications.

Template-Based Information Extraction without the Templates

Template-Based Information Extraction without the Templates by Nathanael Chambers and Dan Jurafsky.

Abstract:

Standard algorithms for template-based information extraction (IE) require predefined template schemas, and often labeled data, to learn to extract their slot fillers (e.g., an embassy is the Target of a Bombing template). This paper describes an approach to template-based IE that removes this requirement and performs extraction without knowing the template structure in advance. Our algorithm instead learns the template structure automatically from raw text, inducing template schemas as sets of linked events (e.g., bombings include detonate, set off, and destroy events) associated with semantic roles. We also solve the standard IE task, using the induced syntactic patterns to extract role fillers from specific documents. We evaluate on the MUC-4 terrorism dataset and show that we induce template structure very similar to hand-created gold structure, and we extract role fillers with an F1 score of .40, approaching the performance of algorithms that require full knowledge of the templates.

Can you say association?

Definitely points towards a pipeline approach to topic map authoring. To abuse the term, perhaps a “dashboard” that allows selection of data sources followed by the construction of workflows with preliminary analysis being displayed at “breakpoints” in the processing. No particular reason why stages have to be wired together other than tradition.

Just looking a little bit into the future, imagine that some entities weren’t being recognized at a high enough rate. So you shift that part of the data to several thousand human entity processors, take the average of their results (higher than what you were getting), and feed that back into the system. You could have knowledge workers who work full time but shift from job to job, performing tasks too difficult to program effectively.

November 27, 2011

Concord: A Tool That Automates the Construction of Record Linkage Systems

Concord: A Tool That Automates the Construction of Record Linkage Systems by Christopher Dozier, Hugo Molina Salgado, Merine Thomas, Sriharsha Veeramachaneni, 2010.

From the webpage:

Concord is a system provided by Thomson Reuters R&D to enable the rapid creation of record resolution systems (RRS). Concord allows software developers to interactively configure a RRS by specifying match feature functions, master record retrieval blocking functions, and unsupervised machine learning methods tuned to a specific resolution problem. Based on a developer’s configuration process, the Concord system creates a Java based RRS that generates training data, learns a matching model and resolves record information contained in files of the same types used for training and configuration.

A nice way to start off the week! Deeply interesting paper and a new name for record linkage.

Several features of Concord that merit your attention (among many):

A choice of basic comparison operations with the ability to extend seems like a good design to me. No sense overwhelming users with all the general comparison operators, to say nothing of the domain specific ones.

The blocking functions, which operate just as you suspect and narrow the potential set of records to match against, are also appealing. Sometimes you may be better at saying what doesn’t match than what does. This gives you two bites at a successful match.
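
Blocking deserves a small illustration (my own sketch, not Concord's code): group records by a cheap key and only run the expensive pairwise comparisons within each block.

    # Candidate pairs are generated only within blocks that share a cheap key.
    from collections import defaultdict
    from itertools import combinations

    def block_key(record):
        # e.g., first three letters of the surname plus the zip code
        return (record["surname"][:3].lower(), record["zip"])

    def candidate_pairs(records):
        blocks = defaultdict(list)
        for r in records:
            blocks[block_key(r)].append(r)
        for members in blocks.values():
            yield from combinations(members, 2)

    records = [
        {"surname": "Smith", "zip": "10001"},
        {"surname": "Smithe", "zip": "10001"},  # same block: will be compared
        {"surname": "Smith", "zip": "94105"},   # different block: never compared
    ]
    print(list(candidate_pairs(records)))  # one candidate pair instead of three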

Surrogate learning is a third; I have located the paper cited on this subject and will be covering it in another post.

I have written to Thomson Reuters inquiring about the availability of Concord and its ability to interchange mapping settings between instances of Concord or beyond. I will update when I hear back from them.

October 22, 2011

Introducing fise, the Open Source RESTful Semantic Engine

Filed under: Entity Extraction,Entity Resolution,Language,Semantics,Taxonomy — Patrick Durusau @ 3:16 pm

Introducing fise, the Open Source RESTful Semantic Engine

From the post:

fise is now known as the Stanbol Enhancer component of the Apache Stanbol incubating project.

As a member of the IKS European project, Nuxeo contributes to the development of an Open Source software project named fise whose goal is to help bring new and trendy semantic features to CMS by giving developers a stack of reusable HTTP semantic services to build upon.

Presenting the software in Q/A form:

What is a Semantic Engine?

A semantic engine is a software component that extracts the meaning of an electronic document to organize it as partially structured knowledge and not just as a piece of unstructured text content.

Current semantic engines can typically:

  • categorize documents (is this document written in English, Spanish, Chinese? is this an article that should be filed under the  Business, Lifestyle, Technology categories? …);
  • suggest meaningful tags from a controlled taxonomy and assert their relative importance with respect to the text content of the document;
  • find related documents in the local database or on the web;
  • extract and recognize mentions of known entities such as famous people, organizations, places, books, movies, genes, … and link the document to their knowledge base entries (like a biography for a famous person);
  • detect yet unknown entities of the same aforementioned types to enrich the knowledge base;
  • extract knowledge assertions that are present in the text to fill up a knowledge base along with a reference to trace the origin of the assertion. Examples of such assertions could be the fact that a company is buying another along with the amount of the transaction, the release date of a movie, the new club of a football player…

During the last couple of years, many such engines have been made available through web-based APIs such as Open Calais, Zemanta and Evri, just to name a few. However, to our knowledge there aren't many such engines distributed under an Open Source license to be used offline, on your private IT infrastructure with your sensitive data.

Impressive work that I found through a later post on using this software on Wikipedia. See Mining Wikipedia with Hadoop and Pig for Natural Language Processing.

September 6, 2011

Improving Entity Resolution with Global Constraints

Filed under: Data Integration,Data Mining,Entity Resolution — Patrick Durusau @ 7:00 pm

Improving Entity Resolution with Global Constraints by Jim Gemmell, Benjamin I. P. Rubinstein, and Ashok K. Chandra.

Abstract:

Some of the greatest advances in web search have come from leveraging socio-economic properties of online user behavior. Past advances include PageRank, anchor text, hubs-authorities, and TF-IDF. In this paper, we investigate another socio-economic property that, to our knowledge, has not yet been exploited: sites that create lists of entities, such as IMDB and Netflix, have an incentive to avoid gratuitous duplicates. We leverage this property to resolve entities across the different web sites, and find that we can obtain substantial improvements in resolution accuracy. This improvement in accuracy also translates into robustness, which often reduces the amount of training data that must be labeled for comparing entities across many sites. Furthermore, the technique provides robustness when resolving sites that have some duplicates, even without first removing these duplicates. We present algorithms with very strong precision and recall, and show that max weight matching, while appearing to be a natural choice turns out to have poor performance in some situations. The presented techniques are now being used in the back-end entity resolution system at a major Internet search engine.
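
The "no gratuitous duplicates" observation amounts to a one-to-one constraint between the two sites' catalogs, which is where max weight matching enters. A small sketch with SciPy's linear_sum_assignment (my choice of solver and made-up similarity scores, not the paper's system):

    # Enforce a one-to-one assignment between two catalogs by solving a
    # weighted bipartite matching over pairwise similarity scores.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    site_a = ["Heat (1995)", "Heat (1986)", "The Heat"]
    site_b = ["Heat", "Heat [1986 TV movie]", "The Heat (2013)"]

    # pretend pairwise similarity scores from some comparison function
    scores = np.array([
        [0.90, 0.40, 0.10],
        [0.60, 0.85, 0.10],
        [0.30, 0.10, 0.80],
    ])

    rows, cols = linear_sum_assignment(-scores)  # negate to maximize total score
    for i, j in zip(rows, cols):
        print(site_a[i], "<->", site_b[j], scores[i, j])
    # the one-to-one constraint stops both "Heat" releases from claiming
    # the same target title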

Relies on entity resolution that has been performed in another context. I rather like that, as opposed to starting at ground zero.

I was amused that “adult titles” were excluded from the data set. I don’t have the numbers right off hand but “adult titles” account for a large percentage of movie income. Not unlike using stock market data but excluding all finance industry stocks. Seems incomplete.

June 1, 2011

Workshop on Entity-Oriented Search (EOS) – Beijing

Filed under: Conferences,Entity Extraction,Entity Resolution,Searching — Patrick Durusau @ 6:51 pm

The First International Workshop on Entity-Oriented Search (EOS)

Important Dates

  • Submissions due: June 10, 2011
  • Notification of acceptance: June 25, 2011
  • Camera-ready submission: July 1, 2011 (provisional, awaiting confirmation)
  • Workshop date: July 28, 2011

From the website:

Workshop Theme

Many user information needs concern entities: people, organizations, locations, products, etc. These are better answered by returning specific objects instead of just any type of documents. Both commercial systems and the research community are displaying an increased interest in returning “objects”, “entities”, or their properties in response to a user’s query. While major search engines are capable of recognizing specific types of objects (e.g., locations, events, celebrities), true entity search still has a long way to go.

Entity retrieval is challenging as “objects” unlike documents, are not directly represented and need to be identified and recognized in the mixed space of structured and unstructured Web data. While standard document retrieval methods applied to textual representations of entities do seem to provide reasonable performance, a big open question remains how much influence the entity type should have on the ranking algorithms developed.

Avoiding repeated document searching by successive users will require entity identification of the kind suggested here. Sub-document addressing and retrieval of portions of documents is another aspect of the entity issue.

May 19, 2011

Duke 0.1 Release

Filed under: Duke,Entity Resolution,Lucene,Record Linkage — Patrick Durusau @ 3:28 pm

Duke 0.1 Release

Lars Marius Garshol on Duke 0.1:

Duke is a fast and flexible deduplication (or entity resolution, or record linkage) engine written in Java on top of Lucene. At the moment (2011-04-07) it can process 1,000,000 records in 11 minutes on a standard laptop in a single thread.

Version 0.1 has been released, consisting of a command-line tool which can read CSV, JDBC, SPARQL, and NTriples data. There is also an API for programming incremental processing and storing the result of processing in a relational database.

The GettingStarted page explains how to get started and has links to further documentation. This blog post describes the basic approach taken to match records. It does not deal with the Lucene-based lookup, but describes an early, slow O(n^2) prototype. This presentation describes the ideas behind the engine and the intended architecture.

If you have questions, please contact the developer, Lars Marius Garshol, larsga at garshol.priv.no.

I will look around for sample data files.

May 6, 2011

Revealing the true challenges in fighting bank fraud

Filed under: Associations,Entity Resolution — Patrick Durusau @ 12:30 pm

Revealing the true challenges in fighting bank fraud

From the Infoglide blog:

The results of the survey are currently being compiled for general release, but it was extremely interesting to learn that the key challenges of fraud investigations include:

1. the inability to access data due to privacy concerns

2. a lack of real-time high performance data searching engine

3. and an inability to cross-reference and discover relationships between suspicious entities in different databases.

For regular readers of this blog, it comes as no surprise that identity resolution and entity analytics technology provides a solution to those challenges. An identity resolution engine glides across the different data within (or perhaps even external to) a bank’s infrastructure, delivering a view of possible identity matches and non-obvious relationships or hidden links between those identities… despite variations in attributes and/or deliberate attempts to deceive. (emphasis added)

It being an Infoglide blog, guess who they think has an identity resolution engine?

I looked at the data sheet on their Identity Resolution Engine.

I have a question:

If two separate banks using the “Identity Resolution Engine” have built up data mappings, on what basis do I merge those mappings, assuming there are name conflicts in the data mappings as well as in the data proper?

In an acquisition, for example, I should be able to leverage existing data mappings.

April 29, 2011

Duolingo: The Next Chapter in Human Communication

Duolingo: The Next Chapter in Human Communication

By one of the co-inventors of CAPTCHA and reCAPTCHA, Luis von Ahn, so his arguments should give us pause.

Luis wants to address the problem of translating the web into multiple languages.

Yes, you heard that right, translate the web into multiple languages.

Whatever you think now, watch the video and decide if you still feel the same way.

My question is how to adapt his techniques to subject identification?

April 28, 2011

Dataset linkage recommendation on the Web of Data

Filed under: Conferences,Entity Resolution,Linked Data,LOD — Patrick Durusau @ 3:18 pm

Dataset linkage recommendation on the Web of Data by Martijn van der Plaat (Master thesis).

Abstract:

We address the problem of, given a particular dataset, which candidate dataset(s) from the Web of Data have the highest chance of holding co-references, in order to increase the efficiency of coreference resolution. Currently, data publishers manually discover and select the right dataset to perform a co-reference resolution. However, in the near future the size of the Web of Data will be such that data publishers can no longer determine which datasets are candidate to map to. A solution for this problem is finding a method to automatically recommend a list of candidate datasets from the Web of Data and present this to the data publisher as an input for the mapping.

We proposed two solutions to perform the dataset linkage recommendation. The general idea behind our solutions is predicting the chance a particular dataset on the Web of Data holds co-references with respect to the dataset from the data publisher. This prediction is done by generating a profile for each dataset from the Web of Data. A profile is meta-data that represents the structure of a dataset, in terms of used vocabularies, class types, and property types. Subsequently, dataset profiles that correspond with the dataset profile from the data publisher, get a specific weight value. Datasets with the highest weight values have the highest chance of holding co-references.
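
A rough sketch of the profile idea (my illustration, not the thesis's implementation): a profile is just the sets of vocabularies, class types, and property types a dataset uses, and candidates are ranked by weighted overlap with the publisher's own profile.

    # Rank candidate datasets by weighted overlap between dataset profiles.
    def jaccard(a, b):
        return len(a & b) / len(a | b) if (a | b) else 0.0

    def profile_weight(publisher, candidate, weights=(0.4, 0.3, 0.3)):
        keys = ("vocabularies", "classes", "properties")
        return sum(w * jaccard(publisher[k], candidate[k])
                   for w, k in zip(weights, keys))

    publisher = {"vocabularies": {"foaf", "dcterms"},
                 "classes": {"foaf:Person"},
                 "properties": {"foaf:name", "dcterms:creator"}}

    candidates = {
        "dbpedia": {"vocabularies": {"foaf", "dbo"},
                    "classes": {"foaf:Person", "dbo:Place"},
                    "properties": {"foaf:name"}},
        "geonames": {"vocabularies": {"gn", "wgs84"},
                     "classes": {"gn:Feature"},
                     "properties": {"gn:name"}},
    }

    ranked = sorted(candidates,
                    key=lambda c: profile_weight(publisher, candidates[c]),
                    reverse=True)
    print(ranked)  # datasets most likely to hold co-references come first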

A useful exercise but what happens when data sets have inconsistent profiles from different sources?

And for all the drum banging, only a very tiny portion of all available datasets are part of Linked Data.

How do we evaluate the scalability of such a profiling technique?

February 23, 2011

Big Oil and Big Data

Filed under: Entity Resolution,Marketing,Topic Maps — Patrick Durusau @ 11:47 am

In Big Oil and Big Data, Mike Betron, Marketing Director of Infoglide, says that it is becoming feasible to mine “big data” and to exploit “entity resolution.”

Those who want to exploit the availability of big data have another powerful tool at their disposal – entity resolution. The ability to search across multiple databases with disparate forms residing in different locations can tame large amounts of data very quickly, efficiently resolving multiple entities into one and finding hidden connections without human intervention in many application areas, including detecting financial fraud.

By exploiting advancing technologies like entity resolution, systems can give organizations a distinct competitive advantage over those who lag in technology adoption.

I have to quibble about the “without human intervention” part, although I am quite happy with augmented human supervision.

Well, that and the implication that entity resolution is a new technology. In various guises, entity resolution has been in use for decades in medical epidemiology, for example.

Preventing subject identifications from languishing in reports, summaries, and the other information debris of a modern organization, so that documented and accessible organizational memories prosper and grow, now that would be something different. (It could also be called a topic map.)

February 9, 2011

Oyster: A Configurable ER Engine

Filed under: Entity Resolution,Record Linkage,Semantic Web,Subject Identity — Patrick Durusau @ 4:55 pm

Oyster: A Configurable ER Engine

John Talburt writes a very enticing overview of an entity resolution engine he calls Oyster.

From the post:

OYSTER will be unique among freely available systems in that it supports identity management and identity capture. This allows the user to configure OYSTER to not only run as a typical merge-purge/record linking system, but also as an identity capture and identity resolution system. (Emphasis added)

Yes, record linking we have had since the late 1950s, in a variety of guises and under more than twenty (20) different names that I know of.

Adding identity management and identity capture (FYI, the Semantic Web uses universal identifier assignment) will be something truly different.

As in topic map different.

Will be keeping a close watch on this project and suggest that you do the same.

December 24, 2010

idMesh: Graph-Based Disambiguation of Linked Data

Filed under: Entity Resolution,Linked Data,Topic Maps — Patrick Durusau @ 9:21 am

idMesh: Graph-Based Disambiguation of Linked Data by Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, and Hermann de Meer.

Abstract:

We tackle the problem of disambiguating entities on the Web. We propose a user-driven scheme where graphs of entities – represented by globally identifiable declarative artifacts – self-organize in a dynamic and probabilistic manner. Our solution has the following two desirable properties: i) it lets end-users freely define associations between arbitrary entities and ii) it probabilistically infers entity relationships based on uncertain links using constraint-satisfaction mechanisms. We outline the interface between our scheme and the current data Web, and show how higher-layer applications can take advantage of our approach to enhance search and update of information relating to online entities. We describe a decentralized infrastructure supporting efficient and scalable entity disambiguation and demonstrate the practicability of our approach in a deployment over several hundreds of machines.

Interesting paper, but disappointing in that linking identifiers is the only option offered for indicating equivalence of entities.

While that is true for Linked Data and the Semantic Web in general (see owl:sameAs), topic maps have long supported a more robust, declarative approach.

The Topic Maps Data Model (TMDM) defines default merging for topics, but leaves open the specification of additional bases for merging.

The Topic Maps Reference Model (TMRM) does not define any merging rules at all but enables legends to make their own choices with regard to bases for merging.

The problem engendered by indicating equivalence by use of IRIs is just that: all you have is an indication of equivalence.

There is no indication of why, on what basis, etc., two (or more) IRIs are thought to indicate the same subject.

Which means there is no basis on which to compare them with other representatives for the same subject.

As well as no basis for perhaps re-using that declaration of equivalence.
