Archive for the ‘Record Linkage’ Category

Open data quality – Subject Identity By Another Name

Thursday, June 8th, 2017

Open data quality – the next shift in open data? by Danny Lämmerhirt and Mor Rubinstein.

From the post:

Some years ago, open data was heralded to unlock information to the public that would otherwise remain closed. In the pre-digital age, information was locked away, and an array of mechanisms was necessary to bridge the knowledge gap between institutions and people. So when the open data movement demanded “Openness By Default”, many data publishers followed the call by releasing vast amounts of data in its existing form to bridge that gap.

To date, it seems that opening this data has not reduced but rather shifted and multiplied the barriers to the use of data, as Open Knowledge International’s research around the Global Open Data Index (GODI) 2016/17 shows. Together with data experts and a network of volunteers, our team searched, accessed, and verified more than 1400 government datasets around the world.

We found that data is often stored in many different places on the web, sometimes split across documents, or hidden many pages deep on a website. Often data comes in various access modalities. It can be presented in various forms and file formats, sometimes using uncommon signs or codes that are in the worst case only understandable to their producer.

As the Open Data Handbook states, these emerging open data infrastructures resemble the myth of the ‘Tower of Babel’: more information is produced, but it is encoded in different languages and forms, preventing data publishers and their publics from communicating with one another. What makes data usable under these circumstances? How can we close the information chain loop? The short answer: by providing ‘good quality’ open data.

Congratulations to Open Knowledge International on re-discovering the ‘Tower of Babel’ problem that prevents easy re-use of data.

Contrary to Lämmerhirt and Rubinstein’s claim, barriers have not “…shifted and multiplied….” More accurate to say Lämmerhirt and Rubinstein have experienced what so many other researchers have found for decades:

We found that data is often stored in many different places on the web, sometimes split across documents, or hidden many pages deep on a website. Often data comes in various access modalities. It can be presented in various forms and file formats, sometimes using uncommon signs or codes that are in the worst case only understandable to their producer.

The record linkage community, think medical epidemiology, has been working on aspects of this problem since the 1950’s at least (under that name). It has a rich and deep history, focused in part on mapping diverse data sets to a common representation and then performing analysis upon the resulting set.

A common omission in record linkage is the failure to capture, in a discoverable format, the basis for mapping the diverse records to a common format. That is, the subjects represented by “…uncommon signs or codes that are in the worst case only understandable to their producer,” that Lämmerhirt and Rubinstein complain of, although signs and codes need not be “uncommon” to be misunderstood by others.

To its credit, unlike RDF and the topic maps default, record linkage has long recognized that identification consists of multiple parts and not single strings.

Topic maps, at least at their inception, were unaware of record linkage and the vast body of research done under that moniker. Topic maps were bitten by the very problem they were seeking to solve: a subject could be identified in many different ways, and information discovered by others about that subject could be nearby but undiscoverable/unknown.

Rather than building on the experience with record linkage, topic maps, at least in the XML version, defaulted to relying on URLs to identify the location of subjects (resources) and/or identifying subjects (identifiers). Avoiding the Philosophy 101 mistakes of RDF, confusing locators and identifiers + refusing to correct the confusion, wasn’t enough for topic maps to become widespread. One suspects in part because topic maps were premised on creating more identifiers for subjects which already had them.

Imagine that your company has 1,000 employees and in order to use a new system, say topic maps, everyone must get a new name. Can’t use the old one. Do you see a problem? Now multiply that by every subject anyone in your company wants to talk about. We won’t run out of identifiers but your staff will certainly run out of patience.

Robust solutions to the open data ‘Tower of Babel’ issue will include the use of multi-part identifications extant in data stores, dynamic creation of multi-part identifications when necessary (note, no change to existing data store), discoverable documentation of multi-part identifications and their mappings, where syntax and data models are up to the user of data.

That sounds like a job for XQuery to me.
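As a rough illustration of what multi-part identification could look like (written in Python rather than XQuery, purely for brevity), the sketch below builds an identification on the fly from records as they already exist and reports which parts a match relied on. The field names, the sample records and the "at least two parts must agree" rule are all assumptions made for illustration.

```python
# Hypothetical records from two data stores; neither store is modified.
hr_record = {"surname": "Okafor", "given": "Ada", "dept": "Research", "badge": "A-1041"}
payroll_record = {"last_name": "okafor ", "first_name": "Ada", "unit": "Research", "emp_no": "77310"}

# Discoverable documentation of the mapping: which field in each store
# expresses which part of the identification (names are invented).
FIELD_MAP = {
    "hr": {"surname": "family_name", "given": "given_name", "dept": "department"},
    "payroll": {"last_name": "family_name", "first_name": "given_name", "unit": "department"},
}


def identification(record, source):
    """Build a multi-part identification from an unmodified source record."""
    mapping = FIELD_MAP[source]
    return {part: record[field].strip().lower()
            for field, part in mapping.items() if field in record}


def same_subject(id_a, id_b, required=2):
    """Return (matched, agreeing parts) so the basis of the match is recorded."""
    agreed = sorted(p for p in id_a if p in id_b and id_a[p] == id_b[p])
    return len(agreed) >= required, agreed


match, basis = same_subject(identification(hr_record, "hr"),
                            identification(payroll_record, "payroll"))
```

Here `basis` is the part worth keeping: it is the discoverable record of why two records were treated as the same subject.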


ACHE Focused Crawler

Tuesday, August 9th, 2016

ACHE Focused Crawler

From the webpage:

ACHE is an implementation of a focused crawler. A focused crawler is a web crawler that collects Web pages that satisfy some specific property. ACHE differs from other crawlers in the sense that it includes page classifiers that allows it to distinguish between relevant and irrelevant pages in a given domain. The page classifier can be from a simple regular expression (that matches every page that contains a specific word, for example), to a sophisticated machine-learned classification model. ACHE also includes link classifiers, which allows it decide the best order in which the links should be downloaded in order to find the relevant content on the web as fast as possible, at the same time it doesn’t waste resources downloading irrelevant content.
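To make the "simple regular expression" end of that range concrete, here is a minimal sketch of a regex-based page classifier. It is generic Python, not ACHE's own API or configuration format; the function names, the pattern and the example URLs are invented for illustration.

```python
import re


def make_regex_classifier(pattern):
    """Return a function that calls a page relevant when the pattern matches its text."""
    compiled = re.compile(pattern, re.IGNORECASE)

    def is_relevant(page_text):
        return compiled.search(page_text) is not None

    return is_relevant


# Hypothetical crawl results: URL -> extracted page text.
pages = {
    "http://example.org/a": "An overview of Record Linkage in epidemiology.",
    "http://example.org/b": "Recipes for sourdough bread.",
}

is_relevant = make_regex_classifier(r"\brecord linkage\b")
relevant = [url for url, text in pages.items() if is_relevant(text)]
```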


The inclusion of machine learning (Weka) and robust indexing (ElasticSearch) means this will take more than a day or two to explore.

Certainly well suited to exploring all the web accessible resources on narrow enough topics.

I was thinking about doing a “9 Million Pages of Donald Trump,” (think Nine Billion Names of God) but a quick sanity check showed there are already more than 230 million such pages.

Perhaps by the election I could produce “9 Million Pages With Favorable Comments About Donald Trump.” Perhaps if I don’t dedupe the pages found by searching it would go that high.

Other topics for comprehensive web searching come to mind?

PS: The many names of record linkage come to mind. I think I have thirty (30) or so.

Record Linkage (Think Topic Maps) In War Crimes Investigations

Thursday, June 9th, 2016

Machine learning for human rights advocacy: Big benefits, serious consequences by Megan Price.

Megan is the executive director of the Human Rights Data Analysis Group (HRDAG), an organization that applies data science techniques to documenting violence and potential human rights abuses.

I watched the video expecting extended discussion of machine learning, only to find that our old friend, record linkage, was mentioned repeatedly during the presentation. Along with some description of the difficulty of reconciling lists of identified casualties in war zones.

Not to mention the task of estimating casualties that will never appear by any type of reporting.

When Megan mentioned record linkage I was hooked and stayed for the full presentation. If you follow the link to Human Rights Data Analysis Group (HRDAG), you will find a number of publications concerning the scientific side of their work.

Oh, record linkage is a technique used originally in epidemiology to “merge*” records from different authorities in order to study the transmission of disease. It dates from the late 1950’s and has been actively developed since then.

Including two complete and independent mathematical models, which arose because terminology differences prevented the second one from discovering the first. There’s a topic map example for you!

Certainly an area where the multiple facets (non-topic map sense) of subject identity would come into play. Not to mention making the merging of lists auditable. (They may already have that capability and I am unaware of it.)

It’s an interesting video and the website even more so.


* One difference between record linkage and topic maps is that the usual record linkage technique maps diverse data into a single representation for processing. That technique loses the semantics associated with the terminology in the original records. Preservation of those semantics may not be your use case, but be aware you are losing data in such a process.

Visualizing Data Loss From Search

Thursday, April 14th, 2016

I used searches for “duplicate detection” (3,854) and “coreference resolution” (3,290) in “Ironically, Entity Resolution has many duplicate names” [Data Loss] to illustrate potential data loss in searches.

Here is a rough visualization of the information loss if you use only one of those terms:


If you search for “duplicate detection,” you miss all the articles shaded in blue.

If you search for “coreference resolution,” you miss all the articles shaded in yellow.
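The numbers behind such a diagram come down to set arithmetic on result identifiers. A minimal sketch, with made-up article IDs standing in for real search results:

```python
def search_loss(results_a, results_b):
    """Compare two result lists and report what each search misses."""
    a, b = set(results_a), set(results_b)
    return {
        "found_by_both": a & b,
        "missed_if_only_a": b - a,  # results only the second search returns
        "missed_if_only_b": a - b,  # results only the first search returns
    }


# Hypothetical result identifiers for the two search terms.
duplicate_detection = ["art1", "art2", "art3", "art4"]
coreference_resolution = ["art3", "art4", "art5"]

loss = search_loss(duplicate_detection, coreference_resolution)
```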

Suggestions for improving this visualization?

It is a visualization that could be performed on a client’s data, using their search engine/database.

In order to identify the data loss they are suffering now from search across departments.

With the caveat that not all data loss is bad and/or worth avoiding.

Imaginary example (so far): What if you could demonstrate no overlapping of terminology for two vendors for the United States Army and the Air Force? That is, no query terms for one returned useful results for the other.

That is a starting point for evaluating the use of topic maps.

While the divergence in terminologies is a given, the next question is: What is the downside to that divergence? What capability is lost due to that divergence?

Assuming you can identify such a capability, the next question is to evaluate the cost of reducing and/or eliminating that divergence versus the claimed benefit.

I assume the most relevant terms are going to be those internal to customers and/or potential customers.

Interest in working this up into a client prospecting/topic map marketing tool?

Separately I want to note my discovery (you probably already knew about it) of VennDIS: a JavaFX-based Venn and Euler diagram software to generate publication quality figures. Download here. (Apologies, the publication itself is firewalled.)

The export defaults to 800 x 800 resolution. If you need something smaller, edit the resulting image in Gimp.

It’s a testimony to the software that I was able to produce a useful image in less than a day. Kudos to the software!

Some tools for lifting the patent data treasure

Monday, December 15th, 2014

Some tools for lifting the patent data treasure by Michele Peruzzi and Georg Zachmann.

From the post:

…Our work can be summarized as follows:

  1. We provide an algorithm that allows researchers to find the duplicates inside Patstat in an efficient way
  2. We provide an algorithm to connect Patstat to other kinds of information (CITL, Amadeus)
  3. We publish the results of our work in the form of source code and data for Patstat Oct. 2011.

More technically, we used or developed probabilistic supervised machine-learning algorithms that minimize the need for manual checks on the data, while keeping performance at a reasonably high level.

The post has links for source code and data for these three papers:

A flexible, scaleable approach to the international patent “name game” by Mark Huberty, Amma Serwaah, and Georg Zachmann

In this paper, we address the problem of having duplicated patent applicants’ names in the data. We use an algorithm that efficiently de-duplicates the data, needs minimal manual input and works well even on consumer-grade computers. Comparisons between entries are not limited to their names, and thus this algorithm is an improvement over earlier ones that required extensive manual work or overly cautious clean-up of the names.

A scaleable approach to emissions-innovation record linkage by Mark Huberty, Amma Serwaah, and Georg Zachmann

PATSTAT has patent applications as its focus. This means it lacks important information on the applicants and/or the inventors. In order to have more information on the applicants, we link PATSTAT to the CITL database. This way the patenting behaviour can be linked to climate policy. Because of the structure of the data, we can adapt the deduplication algorithm to use it as a matching tool, retaining all of its advantages.

Remerge: regression-based record linkage with an application to PATSTAT by Michele Peruzzi, Georg Zachmann, Reinhilde Veugelers

We further extend the information content in PATSTAT by linking it to Amadeus, a large database of companies that includes financial information. Patent microdata is now linked to financial performance data of companies. This algorithm compares records using multiple variables, learning their relative weights by asking the user to find the correct links in a small subset of the data. Since it is not limited to comparisons among names, it is an improvement over earlier efforts and is not overly dependent on the name-cleaning procedure in use. It is also relatively easy to adapt the algorithm to other databases, since it uses the familiar concept of regression analysis.

Record linkage is a form of merging that originated in epidemiology in the late 1940’s. To “link” (read merge) records across different formats, records were transposed into a uniform format and “linking” characteristics chosen to gather matching records together. A very powerful technique that has been in continuous use and development ever since.

One major difference from topic maps is that record linkage has undisclosed subjects, that is, the subjects that make up the common format and the association of the original data sets with that format. I assume in many cases the mapping is documented but it doesn’t appear as part of the final work product, thereby rendering the merging process opaque and inaccessible to future researchers. All you can say is “…this is the data set that emerged from the record linkage.”

Sufficient for some purposes but if you want to reduce the 80% of your time that is spent munging data that has been munged before, it is better to have the mapping documented and to use disclosed subjects with identifying properties.
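One way to keep the mapping from becoming opaque is to ship it with the linked output. The sketch below is my assumption about what that could look like, not a description of the PATSTAT tools: the source names, field names and sample values are invented.

```python
import json

# Hypothetical mapping of each source's fields to the common format.
MAPPING = {
    "patents": {"applicant_name": "name", "ctry": "country"},
    "companies": {"company": "name", "country_code": "country"},
}


def to_common(record, source):
    """Transpose a source record into the common format, keeping its provenance."""
    common = {target: record[field] for field, target in MAPPING[source].items()}
    common["_source"] = source
    common["_original"] = dict(record)  # original field names and values survive
    return common


linked = [
    to_common({"applicant_name": "ACME GMBH", "ctry": "DE"}, "patents"),
    to_common({"company": "Acme GmbH", "country_code": "DE"}, "companies"),
]

# The mapping travels with the data instead of living in someone's notes.
work_product = json.dumps({"mapping": MAPPING, "records": linked}, indent=2)
```

A later researcher reading `work_product` can see both the common record and how each original field was mapped into it.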

Having said all of that, these are tools you can use now on patents and/or extend them to other data sets. The disambiguation problems addressed for patents are the common ones you have encountered with other names for entities.

If a topic map underlies your analysis, you will spend less time on the next analysis of the same information. Think of it as reducing your intellectual overhead in subsequent data sets.

Income – Less overhead = Greater revenue for you. 😉

PS: Don’t be confused: you are looking for the EPO Worldwide Patent Statistical Database (PATSTAT). Naturally there is also a US organization, but it covers only patent litigation statistics.

PPS: Sam Hunting, the source of so many interesting resources, pointed me to this post.

German Record Linkage Center

Sunday, July 20th, 2014

German Record Linkage Center

From the webpage:

The German Record Linkage Center (GermanRLC) was established in 2011 to promote research on record linkage and to facilitate practical applications in Germany. The Center will provide several services related to record linkage applications as well as conduct research on central topics of the field. The services of the GermanRLC are open to all academic disciplines.

Wikipedia describes record linkage as:

Record linkage (RL) refers to the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, databases). Record linkage is necessary when joining data sets based on entities that may or may not share a common identifier (e.g., database key, URI, National identification number), as may be the case due to differences in record shape, storage location, and/or curator style or preference. A data set that has undergone RL-oriented reconciliation may be referred to as being cross-linked. Record Linkage is called Data Linkage in many jurisdictions, but is the same process.

While very similar to topic maps, record linkage relies upon the creation of a common record for further processing, as opposed to pointing into an infoverse to identify subjects in their natural surroundings.

Another difference in practice is that the subjects (headers, fields, etc.) that contain subjects are not themselves treated as subjects with identity. That is to say that how a mapping from an original form was made to the target form is opaque to a subsequent researcher.

I first saw this in a tweet by Lars Marius Garshol.

How Statistics lifts the fog of war in Syria

Monday, March 17th, 2014

How Statistics lifts the fog of war in Syria by David White.

From the post:

In a fascinating talk at Strata Santa Clara in February, HRDAG’s Director of Research Megan Price explained the statistical technique she used to make sense of the conflicting information. Each of the four agencies shown in the chart above published a list of identified victims. By painstakingly linking the records between the different agencies (no simple task, given incomplete information about each victim and variations in capturing names, ages etc.), HRDAG can get a more complete sense of the total number of casualties. But the real insight comes from recognizing that some victims were reported by no agency at all. By looking at the rates at which some known victims were not reported by all of the agencies, HRDAG can estimate the number of victims that were identified by nobody, and thereby get a more accurate count of total casualties. (The specific statistical technique used was Random Forests, using the R language. You can read more about the methodology here.)
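The idea of estimating unreported cases from the overlap between lists can be illustrated with the classic two-list capture-recapture (Lincoln-Petersen) estimator. This is a deliberately simplified stand-in and not HRDAG's method, which works with more lists and more elaborate models. The victim identifiers are invented, and the estimate assumes the two lists are independent and that records have already been linked correctly.

```python
def lincoln_petersen(list_a, list_b):
    """Estimate total population size and the number seen by neither list."""
    a, b = set(list_a), set(list_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("no overlap between lists; the estimate is undefined")
    estimated_total = len(a) * len(b) / overlap
    observed = len(a | b)
    return estimated_total, estimated_total - observed


# Hypothetical, already-linked victim identifiers reported by two agencies.
agency_a = {f"v{i}" for i in range(1, 7)}   # v1 .. v6
agency_b = {f"v{i}" for i in range(4, 10)}  # v4 .. v9

total, unreported = lincoln_petersen(agency_a, agency_b)
```

The smaller the overlap relative to the size of each list, the larger the estimated number of cases reported by nobody.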

Caution is always advisable with government issued data but especially so when it arises from an armed conflict.

A forerunner to topic maps, record linkage (which is still widely used) plays a central role in collating data recorded in various ways. It isn’t possible to collate heterogeneous data without either creating a uniform set of records (record linkage) or mapping the subjects of the original records together (topic maps).

The usual moniker, “big data,” should really be “big, homogeneous data” (BHD). Which, if that is what you have, works great. If that isn’t what you have, works less great. If at all.

BTW, groups like the Human Rights Data Analysis Group (HRDAG) would have far more credibility with me if their projects list didn’t read:

  • Africa
  • Asia
  • Europe
  • Middle East
  • Central America
  • South America

Do you notice anyone missing from that list?

I have always thought that “human rights” included cases of:

  • sexual abuse
  • child abuse
  • violence
  • discrimination
  • and any number of similar issues

I can think of another place where those conditions exist in epidemic proportions.

Can’t you?

Duke 1.2 Released!

Sunday, February 16th, 2014

Lars Marius Garshol has released Duke 1.2!

From the homepage:

Duke is a fast and flexible deduplication (or entity resolution, or record linkage) engine written in Java on top of Lucene. The latest version is 1.2 (see ReleaseNotes).

Duke can find duplicate customer records, or other kinds of records in your database. Or you can use it to connect records in one data set with other records representing the same thing in another data set. Duke has sophisticated comparators that can handle spelling differences, numbers, geopositions, and more. Using a probabilistic model Duke can handle noisy data with good accuracy.


  • High performance.
  • Highly configurable.
  • Support for CSV, JDBC, SPARQL, and NTriples.
  • Many built-in comparators.
  • Plug in your own data sources, comparators, and cleaners.
  • Genetic algorithm for automatically tuning configurations.
  • Command-line client for getting started.
  • API for embedding into any kind of application.
  • Support for batch processing and continuous processing.
  • Can maintain database of links found via JNDI/JDBC.
  • Can run in multiple threads.

The GettingStarted page explains how to get started and has links to further documentation. The examples of use page lists real examples of using Duke, complete with data and configurations. This presentation has more the big picture and background.


Until you know which two or more records are talking about the same subject, it’s very difficult to know what to map together.
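To give a feel for what "comparators" plus a "probabilistic model" can mean, here is a small sketch in Python. It is not Duke's API or its exact scoring: the string comparator comes from Python's `difflib`, the linear mapping from similarity to probability and the per-field low/high values are my assumptions, and the records are invented.

```python
from difflib import SequenceMatcher

# Hypothetical per-field settings: (probability if totally different, if identical).
FIELDS = {"name": (0.1, 0.9), "city": (0.3, 0.7)}


def similarity(a, b):
    """String comparator tolerant of small spelling differences (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def field_probability(sim, low, high):
    """Map a similarity score onto a match probability between low and high."""
    return low + sim * (high - low)


def combine(p1, p2):
    """Naive Bayes style combination of two independent probabilities."""
    return (p1 * p2) / (p1 * p2 + (1 - p1) * (1 - p2))


def match_probability(rec_a, rec_b, fields=FIELDS):
    prob = 0.5  # neutral starting point
    for name, (low, high) in fields.items():
        sim = similarity(rec_a[name], rec_b[name])
        prob = combine(prob, field_probability(sim, low, high))
    return prob


a = {"name": "Jon Smith", "city": "Oslo"}
b = {"name": "John Smith", "city": "Oslo"}
c = {"name": "Maria Garcia", "city": "Madrid"}

likely_same = match_probability(a, b)
likely_different = match_probability(a, c)
```

Evidence from each field pushes the probability up or down, so one noisy field does not decide the outcome alone.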

Master Indexing and the Unified View

Monday, March 25th, 2013

Master Indexing and the Unified View by David Loshin.

From the post:

1) Identity resolution – The master data environment catalogs the set of representations that each unique entity exhibits in the original source systems. Applying probabilistic aggregation and/or deterministic rules allows the system to determine that the data in two or more records refers to the same entity, even if the original contexts are different.

2) Data quality improvement – Linking records that share data about the same real-world entity enable the application of business rules to improve the quality characteristics of one or more of the linked records. This doesn’t specifically mean that a single “golden copy” record must be created to replace all instances of the entity’s data. Instead, depending on the scenario and quality requirements, the accessibility of the different sources and the ability to apply those business rules at the data user’s discretion will provide a consolidated view that best meets the data user’s requirements at the time the data is requested.

3) Inverted mapping – Because the scope of data linkage performed by the master index spans the breadth of both the original sources and the collection of data consumers, it holds a unique position to act as a map for a standardized canonical representation of a specific entity to the original source records that have been linked via the identity resolution processes.

In essence this allows you to use a master data index to support federated access to original source data while supporting the application of data quality rules upon delivery of the data.

It’s been a long day but does David’s output have all the attributes of a topic map?

  1. Identity resolution – Two or more representatives of the same subject
  2. Data quality improvement – Consolidated view of the data based on a subject and presented to the user
  3. Inverted mapping – Navigation based on a specific entity into original source records
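The inverted mapping in particular is easy to picture as a two-way index. A minimal sketch, assuming identity resolution has already decided which source records belong together; the class, the entity identifier and the source names are hypothetical.

```python
from collections import defaultdict


class MasterIndex:
    """Two-way index between canonical entities and original source records."""

    def __init__(self):
        self._by_entity = defaultdict(list)  # entity id -> [(source, key), ...]
        self._by_record = {}                 # (source, key) -> entity id

    def link(self, entity_id, source, key):
        """Record that a source record represents the given entity."""
        self._by_entity[entity_id].append((source, key))
        self._by_record[(source, key)] = entity_id

    def entity_for(self, source, key):
        """Identity resolution lookup: which entity does this record represent?"""
        return self._by_record.get((source, key))

    def sources_for(self, entity_id):
        """Inverted mapping: navigate from an entity to its original records."""
        return list(self._by_entity.get(entity_id, []))


index = MasterIndex()
index.link("E1", "crm", "c-17")
index.link("E1", "billing", "9001")
```

The source records stay where they are; the index only records where to find them.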


Duplicate Detection on GPUs

Saturday, March 23rd, 2013

Duplicate Detection on GPUs by Benedikt Forchhammer, Thorsten Papenbrock, Thomas Stening, Sven Viehmeier, Uwe Draisbach, Felix Naumann.


With the ever increasing volume of data and the ability to integrate different data sources, data quality problems abound. Duplicate detection, as an integral part of data cleansing, is essential in modern information systems. We present a complete duplicate detection workflow that utilizes the capabilities of modern graphics processing units (GPUs) to increase the efficiency of finding duplicates in very large datasets. Our solution covers several well-known algorithms for pair selection, attribute-wise similarity comparison, record-wise similarity aggregation, and clustering. We redesigned these algorithms to run memory-efficiently and in parallel on the GPU. Our experiments demonstrate that the GPU-based workflow is able to outperform a CPU-based implementation on large, real-world datasets. For instance, the GPU-based algorithm deduplicates a dataset with 1.8m entities 10 times faster than a common CPU-based algorithm using comparably priced hardware.

Synonyms: Duplicate detection = entity matching = record linkage (and all the other alternatives for those terms).
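For readers who have not met the workflow steps named in the abstract, here is a CPU-only Python sketch of pair selection, attribute-wise comparison and record-wise aggregation. It does not reproduce the paper's GPU implementation: sorted neighborhood is simply one well-known pair selection method that I picked for illustration, and the records, window size and threshold are invented.

```python
from difflib import SequenceMatcher


def sorted_neighborhood_pairs(records, key, window=3):
    """Pair selection: sort by a key, compare each record with the next window-1."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    for pos, i in enumerate(order):
        for j in order[pos + 1: pos + window]:
            yield (min(i, j), max(i, j))


def record_similarity(a, b, attributes):
    """Attribute-wise similarities aggregated into one record-wise score (mean)."""
    scores = [SequenceMatcher(None, a[attr].lower(), b[attr].lower()).ratio()
              for attr in attributes]
    return sum(scores) / len(scores)


def find_duplicates(records, attributes, key, threshold=0.85, window=3):
    """Return index pairs whose aggregated similarity reaches the threshold."""
    return sorted({
        (i, j) for i, j in sorted_neighborhood_pairs(records, key, window)
        if record_similarity(records[i], records[j], attributes) >= threshold
    })


records = [
    {"name": "Anna Berg", "city": "Bergen"},
    {"name": "Tom Hansen", "city": "Oslo"},
    {"name": "Ana Berg", "city": "Bergen"},
    {"name": "Liv Dahl", "city": "Tromso"},
]

duplicates = find_duplicates(records, ["name", "city"], key=lambda r: r["name"].lower())
```

Pair selection is what keeps the work below the all-pairs O(n^2) comparison; the comparisons that remain are independent of each other, which is what makes them a candidate for parallel hardware.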

This looks wicked cool!

I first saw this in a tweet by Stefano Bertolo.

Duke 1.0 Release!

Monday, March 4th, 2013

Duke 1.0 Release!

From the project page:

Duke is a fast and flexible deduplication (or entity resolution, or record linkage) engine written in Java on top of Lucene. The latest version is 1.0 (see ReleaseNotes).


  • High performance.
  • Highly configurable.
  • Support for CSV, JDBC, SPARQL, and NTriples DataSources.
  • Many built-in comparators.
  • Plug in your own data sources, comparators, and cleaners.
  • Command-line client for getting started.
  • API for embedding into any kind of application.
  • Support for batch processing and continuous processing.
  • Can maintain database of links found via JNDI/JDBC.
  • Can run in multiple threads.

The GettingStarted page explains how to get started and has links to further documentation. This blog post describes the basic approach taken to match records. It does not deal with the Lucene-based lookup, but describes an early, slow O(n^2) prototype. This early presentation describes the ideas behind the engine and the intended architecture; a later and more up to date presentation has more practical detail and examples. There's also the ExamplesOfUse page, which lists real examples of using Duke, complete with data and configurations.

Excellent news on the data deduplication front!

And for topic map authors as well (see the examples).

Kudos to Lars Marius Garshol!

Merging Data Sets Based on Partially Matched Data Elements

Friday, September 28th, 2012

Merging Data Sets Based on Partially Matched Data Elements by Tony Hirst.

From the post:

A tweet from @coneee yesterday about merging two datasets using columns of data that don’t quite match got me wondering about a possible R recipe for handling partial matching. The data in question related to country names in a datafile that needed fusing with country names in a listing of ISO country codes.

Reminds me of the techniques used in record linkage (epidemiology). There, unlike topic maps, the records from diverse sources were mapped to a target record layout and then analyzed.

Quite powerful but lossy with regard to the containers of the original data.
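Tony's recipe is in R. As a rough Python analogue of the same idea, the standard library's `difflib.get_close_matches` can fuse near-matching country names with a code list. The names, the tiny ISO excerpt and the 0.6 cutoff are illustrative assumptions, and fuzzy matches of this kind should be reviewed by hand.

```python
from difflib import get_close_matches

# Small hypothetical excerpt of an ISO country code listing.
iso_codes = {
    "Bolivia": "BO",
    "Russian Federation": "RU",
    "United Kingdom": "GB",
    "Viet Nam": "VN",
}


def match_country(name, candidates, cutoff=0.6):
    """Return the closest candidate name, or None when nothing is close enough."""
    hits = get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return hits[0] if hits else None


# Country names as they appear in the (invented) data file.
datafile_names = ["Vietnam", "United Kingdon", "Atlantis"]

merged = {}
for name in datafile_names:
    matched = match_country(name, list(iso_codes))
    # Keep the original spelling as the key so nothing from the source is lost.
    merged[name] = iso_codes.get(matched)
```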

Tracking Down an Epidemic’s Source

Tuesday, August 14th, 2012

Tracking Down an Epidemic’s Source (Physics 5, 89 (2012) | DOI: 10.1103/Physics.5.89)

From the post:

Epidemiologists often have to uncover the source of a disease outbreak with only limited information about who is infected. Mathematical models usually assume a complete dataset, but a team reporting in Physical Review Letters demonstrates how to find the source with very little data. Their technique is based on the principles used by telecommunication towers to pinpoint cell phone users, and they demonstrate its effectiveness with real data from a South African cholera outbreak. The system could also work with other kinds of networks to help governments locate contamination sources in water systems or find the leaders in a network of terrorist contacts.

A rumor can spread across a user network on Twitter, just as a disease spreads throughout a network of personal contacts. But there’s a big difference when it comes to tracking down the source: online social networks have volumes of time-stamped data, whereas epidemiologists usually have information from only a fraction of the infected individuals.

To address this problem, Pedro Pinto and his colleagues at the Swiss Federal Institute of Technology in Lausanne (EPFL) developed a model based on the standard network picture for epidemics. Individuals are imagined as points, or “nodes,” in a plane, connected by a network of lines. Each node has several lines connecting it to other nodes, and each node can be either infected or uninfected. In the team’s scenario, all nodes begin the process uninfected, and a single source node spreads the infection from neighbor to neighbor, with a random time delay for each transmission. Eventually, every node becomes infected and records both its time of infection and the identity of the infecting neighbor.

To trace back to the source using data from a fraction of the nodes, Pinto and his colleagues adapted methods used in wireless communications networks. When three or more base stations receive a signal from one cell phone, the system can measure the difference in the signal’s arrival time at each base station to triangulate a user’s position. Similarly, Pinto’s team combined the arrival times of the infection at a subset of “observer” nodes to find the source. But in the infection network, a given arrival time could correspond to multiple transmission paths, and the time from one transmission to the next varies randomly. To improve their chances of success, the team used the fact that the source had to be one of a finite set of nodes, unlike a cell phone user, who could have any of an infinite set of coordinates within the coverage area.

Summarizes: Locating the Source of Diffusion in Large-Scale Networks Pedro C. Pinto, Patrick Thiran, and Martin Vetterli Phys. Rev. Lett. 109, 068702 (2012).
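A toy version of the idea, under strong simplifying assumptions that are mine and not the paper's: the network is connected, every hop takes roughly the same mean delay, and the best candidate is the node for which the observers' implied start times agree most closely. The actual estimator in the paper is more sophisticated; the graph and arrival times below are invented.

```python
from collections import deque
from statistics import pvariance


def bfs_distances(graph, start):
    """Hop counts from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def locate_source(graph, arrival_times, mean_delay=1.0):
    """Pick the candidate whose implied start times vary least across observers."""
    best, best_score = None, float("inf")
    for candidate in graph:
        dist = bfs_distances(graph, candidate)
        # If candidate were the source, arrival minus travel time would be constant.
        starts = [t - mean_delay * dist[obs] for obs, t in arrival_times.items()]
        score = pvariance(starts)
        if score < best_score:
            best, best_score = candidate, score
    return best


# Hypothetical contact network (adjacency lists) and noisy observer arrival times.
graph = {
    "A": ["B"], "B": ["A", "C", "E"], "C": ["B", "D"],
    "D": ["C"], "E": ["B", "F"], "F": ["E"],
}
arrival_times = {"A": 1.1, "D": 2.2, "F": 1.9}

source = locate_source(graph, arrival_times)
```

Because the source must be one of a finite set of nodes, an exhaustive scan over candidates is possible, which is the point the quoted passage makes about the difference from locating a cell phone.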

One wonders if participation in multiple networks, some social, some electronic, some organizational, would be amenable to record linkage type techniques?

Leaks from government could be tracked using only one type of network but that is likely to be incomplete and misleading.

Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection

Sunday, July 8th, 2012

Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection by Peter Christen.

In the Foreword, William E. Winkler (U. S. Census Bureau and dean of record linkage), writes:

Within this framework of historical ideas and needed future work, Peter Christen’s monograph serves as an excellent compendium of the best existing work by computer scientists and others. Individuals can use the monograph as a basic reference to which they can gain insight into the most pertinent record linkage ideas. Interested researchers can use the methods and observations as building blocks in their own work. What I found very appealing was the high quality of the overall organization of the text, the clarity of the writing, and the extensive bibliography of pertinent papers. The numerous examples are quite helpful because they give real insight into a specific set of methods. The examples, in particular, prevent the researcher from going down some research directions that would often turn out to be dead ends.

I saw the alert for this volume today so haven’t had time to acquire and read it.

Given the high praise from Winkler, I expect it to be a pleasure to read and use.

Experiments in genetic programming

Monday, March 19th, 2012

Experiments in genetic programming

Lars Marius Garshol writes:

I made an engine called Duke that can automatically match records to see if they represent the same thing. For more background, see a previous post about it. The biggest problem people seem to have with using it is coming up with a sensible configuration. I stumbled across a paper that described using so-called genetic programming to configure a record linkage engine, and decided to basically steal the idea.

You need to read about the experiments in the post but I can almost hear Lars saying the conclusion:

The result is pretty clear: the genetic configurations are much the best. The computer can configure Duke better than I can. That’s almost shocking, but there you are. I guess I need to turn the script into an official feature.
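For a feel of how a computer can search for a configuration, here is a tiny genetic algorithm in Python. It is a sketch of the general technique only, not Lars's script and not Duke's configuration format: a "configuration" is reduced to two numbers (a name weight and a match threshold), fitness is the F1 score on a handful of invented labelled pairs, and the population size, mutation step and seed are arbitrary.

```python
import random

# Hypothetical labelled pairs: (name similarity, city similarity, true match?)
LABELLED = [
    (0.95, 1.0, True), (0.90, 0.2, True), (0.85, 1.0, True),
    (0.40, 1.0, False), (0.30, 0.1, False), (0.55, 1.0, False),
]


def fitness(config):
    """F1 score of a configuration (name_weight, threshold) on the labelled pairs."""
    weight, threshold = config
    tp = fp = fn = 0
    for name_sim, city_sim, is_match in LABELLED:
        predicted = weight * name_sim + (1 - weight) * city_sim >= threshold
        if predicted and is_match:
            tp += 1
        elif predicted:
            fp += 1
        elif is_match:
            fn += 1
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def mutate(config, rng, step=0.1):
    """Nudge every gene a little, clamped to the range 0..1."""
    return tuple(min(1.0, max(0.0, gene + rng.uniform(-step, step))) for gene in config)


def crossover(a, b, rng):
    """Take each gene from one parent or the other at random."""
    return tuple(rng.choice(pair) for pair in zip(a, b))


def evolve(generations=30, size=20, seed=0):
    rng = random.Random(seed)
    population = [(rng.random(), rng.random()) for _ in range(size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: size // 2]  # the fitter half survives unchanged
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
                    for _ in range(size - len(parents))]
        population = parents + children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


best_config, history = evolve()
```

Because the fitter half of each generation is kept, the best score in `history` never goes down from one generation to the next.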


Excellent post and approach by the way!

Lars also posted a link to Reddit about his experiments. Several links appear in comments that I have turned into short posts to draw more attention to them.

Another tool for your topic mapping toolbox.

Question: What would it look like to use the intermediate results for mapping, replacing them as “better” mappings become available? The process would have a terminating condition, but new content could trigger additional cycles, though only for mappings relevant to that content.

Or would queries count as new content? If they expressed synonymy or other relations?

Duke 0.4

Friday, January 13th, 2012

Duke 0.4

New release of deduplication software written in Java on top of Lucene by Lars Marius Garshol.

From the release notes:

This version of Duke introduces:

  • Added JNDI data source for connecting to databases via JNDI (thanks to FMitzlaff).
  • In-memory data source added (thanks to FMitzlaff).
  • Record linkage mode now more flexible: can implement different strategies for choosing optimal links (with FMitzlaff).
  • Record linkage API refactored slightly to be more flexible (with FMitzlaff).
  • Added utilities for building equivalence classes from Duke output.
  • Made the XML config loader more robust.
  • Added a special cleaner for English person names.
  • Fixed bug in NumericComparator (issue 66).
  • Uses own Lucene query parser to avoid issues with search strings.
  • Upgraded to Lucene 3.5.0.
  • Added many more tests.
  • Many small bug fixes to core, NTriples reader, etc.
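
For readers wondering what "building equivalence classes" from pairwise match output involves, the following is a minimal sketch using union-find. It is not Duke's utility and makes no claim about Duke's API; the function name and the input format (an iterable of matched ID pairs) are assumptions.

```python
def equivalence_classes(pairs):
    """Group IDs into classes, treating each matched pair as 'same entity'."""
    parent = {}

    def find(x):
        # Register unseen IDs as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two classes

    classes = {}
    for x in list(parent):
        classes.setdefault(find(x), set()).add(x)
    return sorted(sorted(members) for members in classes.values())
```

Given the pairs `("a", "b")`, `("b", "c")` and `("d", "e")`, this groups `a`, `b` and `c` together even though `a` and `c` were never directly matched. That transitive closure is the point of the exercise, and also its risk: one bad link merges two classes.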

BTW, the documentation is online only:

Journal of Computing Science and Engineering

Monday, December 19th, 2011

Journal of Computing Science and Engineering

From the webpage:

Journal of Computing Science and Engineering (JCSE) is a peer-reviewed quarterly journal that publishes high-quality papers on all aspects of computing science and engineering. The primary objective of JCSE is to be an authoritative international forum for delivering both theoretical and innovative applied researches in the field. JCSE publishes original research contributions, surveys, and experimental studies with scientific advances.

The scope of JCSE covers all topics related to computing science and engineering, with a special emphasis on the following areas: embedded computing, ubiquitous computing, convergence computing, green computing, smart and intelligent computing, and human computing.

I got here from following a sponsor link at a bioinformatics conference.

Then just picking at random from the current issue I see:

A Fast Algorithm for Korean Text Extraction and Segmentation from Subway Signboard Images Utilizing Smartphone Sensors by Igor Milevskiy, Jin-Young Ha.


We present a fast algorithm for Korean text extraction and segmentation from subway signboards using smart phone sensors in order to minimize computational time and memory usage. The algorithm can be used as preprocessing steps for optical character recognition (OCR): binarization, text location, and segmentation. An image of a signboard captured by smart phone camera while holding smart phone by an arbitrary angle is rotated by the detected angle, as if the image was taken by holding a smart phone horizontally. Binarization is only performed once on the subset of connected components instead of the whole image area, resulting in a large reduction in computational time. Text location is guided by user’s marker-line placed over the region of interest in binarized image via smart phone touch screen. Then, text segmentation utilizes the data of connected components received in the binarization step, and cuts the string into individual images for designated characters. The resulting data could be used as OCR input, hence solving the most difficult part of OCR on text area included in natural scene images. The experimental results showed that the binarization algorithm of our method is 3.5 and 3.7 times faster than Niblack and Sauvola adaptive-thresholding algorithms, respectively. In addition, our method achieved better quality than other methods.

Secure Blocking + Secure Matching = Secure Record Linkage by Alexandros Karakasidis, Vassilios S. Verykios.


Performing approximate data matching has always been an intriguing problem for both industry and academia. This task becomes even more challenging when the requirement of data privacy rises. In this paper, we propose a novel technique to address the problem of efficient privacy-preserving approximate record linkage. The secure framework we propose consists of two basic components. First, we utilize a secure blocking component based on phonetic algorithms statistically enhanced to improve security. Second, we use a secure matching component where actual approximate matching is performed using a novel private approach of the Levenshtein Distance algorithm. Our goal is to combine the speed of private blocking with the increased accuracy of approximate secure matching.
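
As a reminder of what the matching component approximates, here is the ordinary, non-private Levenshtein distance as a short sketch. The privacy-preserving variant the authors propose is not reproduced here, and neither is their phonetic blocking step. The function name is my own.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn s into t."""
    if len(s) < len(t):
        s, t = t, s  # keep the inner row as short as possible
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (cs != ct),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]
```

The classic check is `levenshtein("kitten", "sitting")`, which is 3: two substitutions and one insertion.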

A Survey of Transfer and Multitask Learning in Bioinformatics by Qian Xu, Qiang Yang.


Machine learning and data mining have found many applications in biological domains, where we look to build predictive models based on labeled training data. However, in practice, high quality labeled data is scarce, and to label new data incurs high costs. Transfer and multitask learning offer an attractive alternative, by allowing useful knowledge to be extracted and transferred from data in auxiliary domains helps counter the lack of data problem in the target domain. In this article, we survey recent advances in transfer and multitask learning for bioinformatics applications. In particular, we survey several key bioinformatics application areas, including sequence classification, gene expression data analysis, biological network reconstruction and biomedical applications.

And the ones I didn’t list from the current issue are just as interesting and relevant to identity/mapping issues.

This journal is a good example of people who have deliberately reached further across disciplinary boundaries than most.

About the only excuse for not doing so left is the discomfort of being the newbie in a field not your own.

Is that a good enough reason to miss possible opportunities to make critical advances in your home field? (Only you can answer that for yourself. No one can answer it for you.)

Surrogate Learning

Monday, November 28th, 2011

Surrogate Learning – From Feature Independence to Semi-Supervised Classification by Sriharsha Veeramachaneni and Ravi Kumar Kondadadi.


We consider the task of learning a classifier from the feature space $X$ to the set of classes $Y = \{0, 1\}$, when the features can be partitioned into class-conditionally independent feature sets $X_1$ and $X_2$. We show that the class-conditional independence can be used to represent the original learning task in terms of 1) learning a classifier from $X_2$ to $X_1$ (in the sense of estimating the probability $P(x_1|x_2)$) and 2) learning the class-conditional distribution of the feature set $X_1$. This fact can be exploited for semi-supervised learning because the former task can be accomplished purely from unlabeled samples. We present experimental evaluation of the idea in two real world applications.
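
One way to see why those two pieces are enough is the short derivation below. It is my reading of the abstract rather than a quotation from the paper, so the notation may differ from the authors'. It uses only the stated assumption that $x_1$ and $x_2$ are independent given the class $y \in \{0, 1\}$, so that $P(x_1 \mid y, x_2) = P(x_1 \mid y)$:

$$P(x_1 \mid x_2) = \sum_{y \in \{0,1\}} P(x_1 \mid y)\, P(y \mid x_2) = P(x_1 \mid y=0)\,\bigl(1 - P(y=1 \mid x_2)\bigr) + P(x_1 \mid y=1)\, P(y=1 \mid x_2)$$

Solving for the quantity we actually want, and assuming $P(x_1 \mid y=1) \neq P(x_1 \mid y=0)$:

$$P(y=1 \mid x_2) = \frac{P(x_1 \mid x_2) - P(x_1 \mid y=0)}{P(x_1 \mid y=1) - P(x_1 \mid y=0)}$$

The numerator's first term can be estimated from unlabeled data alone, while the class-conditional terms need only the distribution of $X_1$ within each class.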

The two “real world” applications are ones you are likely to encounter:


Our problem consisted of merging each of ≈ 20000 physician records, which we call the update database, to the record of the same physician in a master database of ≈ $10^6$ records.

Our old friends record linkage and entity resolution. The solution depends upon a clever choice of features for application of the technique. (The thought occurs to me that a repository of data analysis snippets for particular techniques would be as valuable, if not more so, than the techniques themselves. Techniques come and go. Data analysis and the skills it requires go on and on.)


Sentence classification is often a preprocessing step for event or relation extraction from text. One of the challenges posed by sentence classification is the diversity in the language for expressing the same event or relationship. We present a surrogate learning approach to generating paraphrases for expressing the merger-acquisition (MA) event between two organizations in financial news. Our goal is to find paraphrase sentences for the MA event from an unlabeled corpus of news articles, that might eventually be used to train a sentence classifier that discriminates between MA and non-MA sentences. (Emphasis added. This is one of the issues in the legal track at TREC.)

This test was against 700000 financial news records.

Both tests were quite successful.

Surrogate learning looks interesting for a range of NLP applications.

Concord: A Tool That Automates the Construction of Record Linkage Systems

Sunday, November 27th, 2011

Concord: A Tool That Automates the Construction of Record Linkage Systems by Christopher Dozier, Hugo Molina Salgado, Merine Thomas, Sriharsha Veeramachaneni, 2010.

From the webpage:

Concord is a system provided by Thomson Reuters R&D to enable the rapid creation of record resolution systems (RRS). Concord allows software developers to interactively configure a RRS by specifying match feature functions, master record retrieval blocking functions, and unsupervised machine learning methods tuned to a specific resolution problem. Based on a developer’s configuration process, the Concord system creates a Java based RRS that generates training data, learns a matching model and resolves record information contained in files of the same types used for training and configuration.

A nice way to start off the week! Deeply interesting paper and a new name for record linkage.

Several features of Concord that merit your attention (among many):

A choice of basic comparison operations with the ability to extend seems like a good design to me. No sense overwhelming users with all the general comparison operators, to say nothing of the domain specific ones.

The blocking functions, which operate just as you suspect by narrowing down the potential set of records for matching, are also appealing. Sometimes you may be better at saying what doesn’t match than what does. This gives you two bites at a successful match.

Surrogate learning also stands out. I have located the paper cited on this subject and will be covering it in another post.
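
The blocking idea mentioned above can be sketched generically as follows. This is not Concord's implementation, which I have not seen. The blocking key (first letter of the surname plus zip code) and the field names `surname` and `zip` are hypothetical choices made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical key: first letter of surname plus zip code."""
    return (record["surname"][:1].upper(), record["zip"])


def candidate_pairs(records, key=blocking_key):
    """Yield index pairs only for records that share a blocking key."""
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        blocks[key(record)].append(index)
    for members in blocks.values():
        yield from combinations(members, 2)
```

With four records, comparing everything against everything costs six comparisons. If only two of them share a key, blocking leaves a single pair for the expensive matching step. The price is that true matches whose keys differ (a mistyped zip code, say) are never compared at all.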

I have written to Thomson Reuters to ask about the availability of Concord and its ability to interchange mapping settings between instances of Concord or beyond. Will update when I hear back from them.

The Link King

Saturday, May 28th, 2011

The Link King: Record Linkage and Consolidation Software

From the website:

In the realm of public domain software for record linkage and unduplication (aka. dedupe software), The Link King reigns supreme. The Link King has fashioned a powerful alliance between sophisticated probabilistic record linkage and deterministic record linkage protocols incorporating features unavailable in many proprietary record linkage programs. (detailed overview (pdf))

The Link King’s probabilistic record linkage protocol was adapted from the algorithm developed by MEDSTAT for the Substance Abuse and Mental Health Services Administration’s (SAMHSA) Integrated Database Project. The deterministic record linkage protocols were developed at Washington State’s Division of Alcohol and Substance Abuse for use in a variety of evaluation and research projects.

The Link King’s graphical user interface (GUI) makes record linkage and unduplication easy for beginning and advanced users. The data linking neophyte will appreciate the easy-to-follow instructions. The Link King’s artificial intelligence will assist in the selection of the most appropriate linkage/unduplication protocol. The technical wizard will appreciate the discussion of data linkage/unduplication issues in The Link King’s user manual, the variety of user-specified options for blocking and linkage decisions, and the powerful interface for manual review of “uncertain” linkages.

Looks very interesting but requires an SAS “base license.”

I don’t have pricing information for an SAS “base license.”

Integrated Public Use Microdata Series

Friday, May 20th, 2011

Integrated Public Use Microdata Series (IPUMS-USA)

Lars Marius asked about some test data files for his Duke 0.1 release.

A lot of record linkage work is on medical records so there are disclosure agreements/privacy concerns, etc.

Just poking around for sample data sets and ran across this site.

From the website:

IPUMS-USA is a project dedicated to collecting and distributing United States census data. Its goals are to:

  • Collect and preserve data and documentation
  • Harmonize data
  • Disseminate the data absolutely free!

Goes back to the 1850 US Census and comes forward.

More data sets than I can easily describe and more are being produced.

Occurs to me that this could be good data for testing topic map techniques.


Duke 0.1 Release

Thursday, May 19th, 2011

Duke 0.1 Release

Lars Marius Garshol on Duke 0.1:

Duke is a fast and flexible deduplication (or entity resolution, or record linkage) engine written in Java on top of Lucene. At the moment (2011-04-07) it can process 1,000,000 records in 11 minutes on a standard laptop in a single thread.

Version 0.1 has been released, consisting of a command-line tool which can read CSV, JDBC, SPARQL, and NTriples data. There is also an API for programming incremental processing and storing the result of processing in a relational database.

The GettingStarted page explains how to get started and has links to further documentation. This blog post describes the basic approach taken to match records. It does not deal with the Lucene-based lookup, but describes an early, slow O(n^2) prototype. This presentation describes the ideas behind the engine and the intended architecture.

If you have questions, please contact the developer, Lars Marius Garshol, larsga at

I will look around for sample data files.


Thursday, April 14th, 2011


Lars Marius Garshol slides from an internal Bouvet conference on deduplication of data.

And, DUplicate KillEr, DUKE.

As Lars points out, people have been here before.

I am not sure I share Lars’ assessment of the current state of record linkage software.

Consider, for example, FRIL – Fine-Grained Record Integration and Linkage Tool, which is described as:

FRIL is FREE open source tool that enables fast and easy record linkage. The tool extends traditional record linkage tools with a richer set of parameters. Users may systematically and iteratively explore the optimal combination of parameter values to enhance linking performance and accuracy.
Key features of FRIL include:

  • Rich set of user-tunable parameters
  • Advanced features of schema/data reconciliation
  • User-tunable search methods (e.g. sorted neighborhood method, blocking method, nested loop join)
  • Transparent support for multi-core systems
  • Support for parameters configuration
  • Dynamic analysis of parameters
  • And many, many more…
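
Of the search methods named in that list, the sorted neighborhood method is the easiest to show in a few lines. The sketch below is a generic version, not FRIL's code: sort the records by some key, slide a fixed-size window over the sorted order, and compare only records that fall inside the same window. The function name, the `key` parameter and the default window size are assumptions.

```python
def sorted_neighborhood_pairs(records, key, window=3):
    """Return candidate index pairs for records that land within
    `window` positions of each other after sorting by `key`."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = []
    for position, i in enumerate(order):
        # Pair record i with the next (window - 1) records in sorted order.
        for j in order[position + 1 : position + window]:
            pairs.append((min(i, j), max(i, j)))
    return pairs
```

Compared with a nested loop join, which examines every pair, the number of candidates grows roughly linearly with the number of records for a fixed window. The method stands or falls on the sort key: records that should match but sort far apart are never compared.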

I haven’t used FRIL but do note that it has documentation, videos, etc. for user instruction.

I have reservations about record linkage in general, but those are concerns about re-use of semantic mappings and not record linkage per se.

Oyster: A Configurable ER Engine

Wednesday, February 9th, 2011

Oyster: A Configurable ER Engine

John Talburt writes a very enticing overview of an entity resolution engine he calls Oyster.

From the post:

OYSTER will be unique among freely available systems in that it supports identity management and identity capture. This allows the user to configure OYSTER to not only run as a typical merge-purge/record linking system, but also as an identity capture and identity resolution system. (Emphasis added)

Yes, record linking we have had since the late 1950’s in a variety of guises and over twenty (20) different names that I know of.

Adding identity management and identity capture (FYI, SW uses universal identifier assignment) will be something truly different.

As in topic map different.

Will be keeping a close watch on this project and suggest that you do the same.

Record Linkage: Similarity Measures and Algorithms

Thursday, January 20th, 2011

Record Linkage: Similarity Measures and Algorithms by Nick Koudas, Sunita Sarawagi, and Divesh Srivastava.

A little dated (2006) but still a very useful review of similarity measures under the rubric of record linkage.
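
For a taste of what a string similarity measure looks like in practice, here is a minimal sketch of Jaccard similarity over character q-grams. It is a common textbook formulation and is not taken from the review itself. The padding character `#` and the default `q=2` are my own choices.

```python
def qgrams(text, q=2):
    """Set of overlapping character q-grams, padded so that the
    first and last characters each appear in q grams."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b, q=2):
    """Size of the q-gram intersection divided by size of the union."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    if not grams_a and not grams_b:
        return 1.0  # two strings with no q-grams are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

Unlike edit distance, this measure ignores where in the string the shared q-grams occur, which makes it tolerant of reordered tokens but blind to some differences a human would spot at once.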

The R Journal, Issue 2/2

Friday, December 31st, 2010

The R Journal, Issue 2/2 has arrived!

Download complete issue.

Or Individual articles.

A number of topic map relevant papers are in this issue, ranging from stringr: modern, consistent string processing, Hadley Wickham; to the edgy cudaBayesreg: Bayesian Computation in CUDA, Adelino Ferreira da Silva; to a technique that started in the late 1950’s, The RecordLinkage Package: Detecting Errors in Data, Murat Sariyar and Andreas Borg.