Archive for the ‘Associations’ Category

Who Do You Love? (Visualizing Relationships/Associations)

Thursday, February 11th, 2016

This Chart Shows Who Marries CEOs, Doctors, Chefs and Janitors by Adam Pearce and Dorothy Gambrell.

From the post:

When it comes to falling in love, it’s not just fate that brings people together—sometimes it’s their jobs. We scanned data from the U.S. Census Bureau’s 2014 American Community Survey—which covers 3.5 million households—to find out how people are pairing up. Some of the matches seemed practical (the most common marriage is between grade-school teachers), and others had us questioning Cupid’s aim (why do female dancers have a thing for male welders?). High-earning women (doctors, lawyers) tend to pair up with their economic equals, while middle- and lower-tier women often marry up. In other words, female CEOs tend to marry other CEOs; male CEOs are OK marrying their secretaries.

The listing of occupations and spousal relationships is interactive on mouse-over, and you can type in the name of a profession. (Warning: the profession name you type must match the case of the term in the listing.)

Here’s a sample for Librarians:


The relationships are gender-coded:


Try to guess which occupations have “marries within occupation” and those which do not.

For each of the following, what is your guess about marrying within the occupation or not?

  • Ambulance Drivers
  • Atmospheric and Space Scientists
  • Economists
  • Postal Service

This looks like a great browsing technique for exploring relationships (associations).

Information Extraction framework in Python

Friday, November 7th, 2014

Information Extraction framework in Python

From the post:

IEPY is an open source tool for Information Extraction focused on Relation Extraction.

To give an example of Relation Extraction, if we are trying to find a birth date in:

“John von Neumann (December 28, 1903 – February 8, 1957) was a Hungarian and American pure and applied mathematician, physicist, inventor and polymath.”

then IEPY’s task is to identify “John von Neumann” and “December 28, 1903” as the subject and object entities of the “was born in” relation.

It’s aimed at:
  • users needing to perform Information Extraction on a large dataset.
  • scientists wanting to experiment with new IE algorithms.

Your success with recognizing relationships will vary but every one successfully recognized is one less that must be coded by hand.
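To make the birth date example concrete, here is a minimal sketch using a hand-written regular expression. It is not IEPY's API or method; the pattern, the function name extract_birth and the relation label are my own assumptions for illustration, and the pattern only covers sentences shaped like the one quoted above.

```python
import re

# Hypothetical pattern: "<Name> (<Month D, YYYY> ..." at the start of a sentence.
# This is an illustration only, not how IEPY works internally.
BIRTH_PATTERN = re.compile(
    r"^(?P<person>[^()]+?) \((?P<born>[A-Z][a-z]+ \d{1,2}, \d{4})"
)

def extract_birth(sentence):
    """Return a (subject, relation, object) triple, or None if nothing matches."""
    match = BIRTH_PATTERN.search(sentence)
    if match is None:
        return None
    return (match.group("person"), "was born in", match.group("born"))

sentence = ("John von Neumann (December 28, 1903 – February 8, 1957) was a "
            "Hungarian and American pure and applied mathematician.")
print(extract_birth(sentence))
```

Every sentence that misses the pattern returns None, which is the part that still has to be coded by hand.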

Speaking of relationships, I would prefer to also have the relationships between John von Neumann and “Hungarian and American pure and applied mathematician, physicist, inventor and polymath” recognized as well.

I first saw this in a tweet by Scientific Python.

A Guide To Who Hates Whom In The Middle East

Monday, September 15th, 2014

A Guide To Who Hates Whom In The Middle East by John Brown Lee.

John reviews an interactive visualization of players with an interest in the Middle East by David McCandless of Information is Beautiful.

The full interactive version of The Middle East Key players & notable relationships.

I would use this graphic with caution, mostly because if you select Jordan, it shows no relationship to Israel. As you know, Jordan signed a peace agreement with Israel twenty years ago and Israel recently agreed to sell gas to Jordan’s state-owned National Electric Power Co.

Nor does it show any relationship between Turkey and the United States. At the very least, the United States and Turkey have a complicated relationship. Would you include the reported pettiness of Senator John McCain towards Turkey in an enhanced map?

Not to take anything away from a useful way to explore the web of relationships in the Middle East but more in the nature of a request for a fuller story.

Chordalysis: a new method to discover the structure of data

Thursday, November 28th, 2013

Chordalysis: a new method to discover the structure of data by Francois Petitjean.

From the post:

…you can’t use log-linear analysis if your dataset has more than, say, 10 variables! This is because the process is exponential in the number of variables. That is where our new work makes a difference. The question was: how can we keep the rigorous statistical foundations of classical log-linear analysis but make it work for datasets with hundreds of variables?

The main part of the answer is “chordal graphs”, which are the graphs made of triangular structures. We showed that for this class of models, the theory is scalable for high-dimensional datasets. The rest of the solution involved melding the classical statistical machinery with advanced data mining techniques from association discovery and graphical modelling.

The result is Chordalysis: a log-linear analysis method for high-dimensional data. Chordalysis makes it possible to discover the structure of datasets with hundreds of variables on a standard computer. So far we’ve applied it successfully to datasets with up to 750 variables. (emphasis added)


Scaling log-linear analysis to high-dimensional data (PDF), by Francois Petitjean, Geoffrey I. Webb and Ann E. Nicholson.


Association discovery is a fundamental data mining task. The primary statistical approach to association discovery between variables is log-linear analysis. Classical approaches to log-linear analysis do not scale beyond about ten variables. We develop an efficient approach to log-linear analysis that scales to hundreds of variables by melding the classical statistical machinery of log-linear analysis with advanced data mining techniques from association discovery and graphical modeling.

Being curious about what was meant by “…a standard computer…” I searched the paper to find:

The conjunction of these features makes it possible to scale log-linear analysis to hundreds of variables on a standard desktop computer. (page 3 of the PDF, the pages are unnumbered)

Not a lot clearer but certainly encouraging!
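One rough way to get a feel for the "exponential in the number of variables" remark is to count the cells of a full contingency table. This is only my own back-of-the-envelope sketch, assuming every variable has the same number of levels; it says nothing about how Chordalysis itself works.

```python
def table_cells(n_variables, levels=2):
    """Cells in a full contingency table when every variable has `levels` values."""
    return levels ** n_variables

# Even with binary variables the table outgrows any machine quickly.
for n in (10, 20, 100):
    print(n, table_cells(n))
```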

The data used in the paper can be found at:

The Chordalysis wiki looks helpful.

So, are your clients going to be limited to 10 variables or a somewhat higher number?

Relationship Timelines

Sunday, September 22nd, 2013

Relationship Timelines by Skye Bender-deMoll.

From the post:

I finally had a chance to pull together a bunch of interesting timeline examples–mostly about the U.S. Congress. Although several of these are about networks, the primary features being visualized are changes in group structure and membership over time. Should these be called “alluvial diagrams”, “stream graphs” “Sankey charts”, “phase diagrams”, “cluster timelines”?

From the U.S. Congress to characters in the Lord of the Rings (movie version) and beyond, Skye explores visualization of dynamic relationships over time.

Raises the interesting issue of how do you represent a dynamic relationship in a topic map?

For example, at some point in a topic map of a family, the mother and father did not know each other. At some later point they met, but were not yet married. Still later they were married and later still, had children. Other events in their lives happened before or after those major events.

Scope could mark off a segment of events, but you would have to create a date/time datatype, or use one from the W3C's XML Schema Part 2: Datatypes Second Edition, to calculate which scope precedes or follows another scope.
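A minimal sketch of what such a calculation might look like, assuming each scope carries a start and end date written in the YYYY-MM-DD form that xsd:date also uses. The TimeScope class, the labels and all of the dates are hypothetical; nothing here comes from the TMDM.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class TimeScope:
    label: str
    start: date  # inclusive
    end: date    # exclusive

    def precedes(self, other):
        """True when this scope ends before (or exactly when) the other begins."""
        return self.end <= other.start

    def contains(self, day):
        return self.start <= day < self.end

# Hypothetical family timeline; the dates are made up for illustration.
scopes = [
    TimeScope("strangers", date.fromisoformat("1950-01-01"), date.fromisoformat("1975-06-01")),
    TimeScope("acquainted", date.fromisoformat("1975-06-01"), date.fromisoformat("1978-09-16")),
    TimeScope("married", date.fromisoformat("1978-09-16"), date.fromisoformat("9999-12-31")),
]

def scope_on(day):
    """Label of the scope in force on a given day, or None."""
    return next((s.label for s in scopes if s.contains(day)), None)

print(scope_on(date(1976, 1, 1)))
print(scopes[0].precedes(scopes[2]))
```

This assumes a single shared calendar, which is exactly the kind of assumption a real model would have to make explicit.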

A closely related problem is to show what facts were known to a person at some point in time. Or as put by Howard Baker:

“What did the President know and when did he know it?” [During the Watergate Hearings]

That may again be a relevant question in the not too distant future.

Suggestions for a robust topic map modeling solution would be most welcome!

Unknown Association Roles (TMDM specific)

Saturday, April 13th, 2013

As I was pondering the treatment of nulls in Neo4j (Null Values in Neo4j), it occurred to me that we have something quite similar in the TMDM.

The definition of association items includes this language:

[roles]: A non-empty set of association role items. The association roles for all the topics that participate in this relationship.

I read this as saying that if I don’t know their role, I can’t include a known player in an association.

For example, I am modeling an association between two players to a phone conversation, who are discussing a drone strike or terrorist attack by other means.

I know their identities but I don’t know their respective roles in relationship to each other or in the planned attack.

I want to capture this association because I may have other associations in which they are players and the roles are known, perhaps enabling me to infer possible roles in this association.

Newcomb has argued roles in associations are unique and in sum, constitute the type of the association. I appreciate the subtlety and usefulness of that position but it isn’t a universal model for associations.

By the same token, the TMDM restricts associations to use where all roles are known. Given that roles are often unknown, that also isn’t a universal model for associations.

I don’t think the problem can be solved by an “unknown role” topic because that would merge unknown roles across associations.

My preference would be to allow players to appear in associations without roles.

The lack of a role would then prevent the default merging of associations. That is, all unknown roles are presumed to be unique.
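A rough sketch of that preference, in plain Python rather than any topic map engine. The association helper, the UnknownRole class and the sample players are all hypothetical; set membership stands in for the duplicate-suppression that merging would perform.

```python
import itertools
from dataclasses import dataclass

_serials = itertools.count(1)

@dataclass(frozen=True)
class UnknownRole:
    """Placeholder role; each instance gets its own serial, so no two are equal."""
    serial: int

def association(assoc_type, *role_players):
    """Build an association from (role, player) pairs; a role of None means unknown."""
    return (assoc_type, frozenset(
        (role if role is not None else UnknownRole(next(_serials)), player)
        for role, player in role_players
    ))

# Two reports of a call whose roles are unknown: they must not collapse into one.
call_1 = association("phone-conversation", (None, "caller-A"), (None, "caller-B"))
call_2 = association("phone-conversation", (None, "caller-A"), (None, "caller-B"))

# Two identical associations with known roles: these do collapse.
job_1 = association("employment", ("employer", "XYZ"), ("employee", "caller-A"))
job_2 = association("employment", ("employer", "XYZ"), ("employee", "caller-A"))

print(len({call_1, call_2, job_1, job_2}))
```

Unlike a single shared "unknown role" topic, each placeholder is distinct, so nothing merges across associations by accident.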


50,000 Lessons on How to Read:…

Friday, April 12th, 2013

50,000 Lessons on How to Read: a Relation Extraction Corpus by Dave Orr, Product Manager, Google Research.

From the post:

One of the most difficult tasks in NLP is called relation extraction. It’s an example of information extraction, one of the goals of natural language understanding. A relation is a semantic connection between (at least) two entities. For instance, you could say that Jim Henson was in a spouse relation with Jane Henson (and in a creator relation with many beloved characters and shows).

The goal of relation extraction is to learn relations from unstructured natural language text. The relations can be used to answer questions (“Who created Kermit?”), learn which proteins interact in the biomedical literature, or to build a database of hundreds of millions of entities and billions of relations to try and help people explore the world’s information.

To help researchers investigate relation extraction, we’re releasing a human-judged dataset of two relations about public figures on Wikipedia: nearly 10,000 examples of “place of birth”, and over 40,000 examples of “attended or graduated from an institution”. Each of these was judged by at least 5 raters, and can be used to train or evaluate relation extraction systems. We also plan to release more relations of new types in the coming months.

Another step in the “right” direction.

This is a human-curated set of relation semantics.

Rather than trying to apply this as a universal “standard,” what if you were to create a similar data set for your domain/enterprise?

Using human curators to create and maintain a set of relation semantics?

Being a topic mappish sort of person, I suggest the basis for their identification of the relationship be explicit, for robust re-use.

But you can repeat the same analysis over and over again if you prefer.

Biological Database of Images and Genomes

Wednesday, April 3rd, 2013

Biological Database of Images and Genomes: tools for community annotations linking image and genomic information by Andrew T Oberlin, Dominika A Jurkovic, Mitchell F Balish and Iddo Friedberg. (Database (2013) 2013 : bat016 doi: 10.1093/database/bat016)


Genomic data and biomedical imaging data are undergoing exponential growth. However, our understanding of the phenotype–genotype connection linking the two types of data is lagging behind. While there are many types of software that enable the manipulation and analysis of image data and genomic data as separate entities, there is no framework established for linking the two. We present a generic set of software tools, BioDIG, that allows linking of image data to genomic data. BioDIG tools can be applied to a wide range of research problems that require linking images to genomes. BioDIG features the following: rapid construction of web-based workbenches, community-based annotation, user management and web services. By using BioDIG to create websites, researchers and curators can rapidly annotate a large number of images with genomic information. Here we present the BioDIG software tools that include an image module, a genome module and a user management module. We also introduce a BioDIG-based website, MyDIG, which is being used to annotate images of mycoplasmas.

Database URL: BioDIG website:

BioDIG source code repository:

The MyDIG database:

Linking image data to genomic data. Sounds like associations to me.


Not to mention the heterogeneity of genomic data.

Imagine extending an image/genomic data association by additional genomic data under a different identification.

Bacon, Pie and Pregnancy

Sunday, February 10th, 2013

Searching for “Biscuit Bliss,” a book of biscuit recipes, also had the result:

People also search for

The Glory of Southern Cooking James Villas

The Bacon Cookbook James Villas

Texas home cooking Cheryl Jamison

The Joy of Pregnancy

Pie Ken Haedrich

If I were writing associations for “Biscuit Bliss,” pie would not make the list.

Bacon I can see because it is a major food group alongside biscuits.

I suppose the general cooking books are super-classes of biscuit making.

Some female friends have suggested eating is associated with pregnancy.

True, but when I search for “joy of pregnancy,” it doesn’t suggest cookbooks in general or biscuits in particular.

If there is an association, is it non-commutative?*

Suggested associations of biscuits with pregnancy? (mindful of the commutative/non-commutative question)

* I am not altogether certain what a non-commutative association would look like. Partial ignorance from a point of view?

One player in the association having knowledge of the relationship and the other player does not?

Some search engines already produce that result, whether by design or not I don’t know.
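One way to picture a non-commutative association is as a directed edge: the suggestion runs from A to B without running back. A minimal sketch, with made-up suggestion data loosely following the search results above (the dictionary contents are my assumption, not actual search engine output):

```python
# Hypothetical "people also search for" results, keyed by the title searched.
also_search = {
    "Biscuit Bliss": {"The Bacon Cookbook", "Pie", "The Joy of Pregnancy"},
    "The Bacon Cookbook": {"Biscuit Bliss"},
    "The Joy of Pregnancy": set(),  # no cookbooks suggested in return
}

def is_commutative(a, b):
    """True when each title is suggested for the other."""
    return b in also_search.get(a, set()) and a in also_search.get(b, set())

def one_way(source):
    """Titles suggested for `source` that do not suggest `source` back."""
    return sorted(
        target for target in also_search.get(source, set())
        if source not in also_search.get(target, set())
    )

print(is_commutative("Biscuit Bliss", "The Bacon Cookbook"))
print(one_way("Biscuit Bliss"))
```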


Friday, January 18th, 2013

PPInterFinder—a mining tool for extracting causal relations on human proteins from literature by Kalpana Raja, Suresh Subramani and Jeyakumar Natarajan. (Database (2013) 2013 : bas052 doi: 10.1093/database/bas052)


One of the most common and challenging problem in biomedical text mining is to mine protein–protein interactions (PPIs) from MEDLINE abstracts and full-text research articles because PPIs play a major role in understanding the various biological processes and the impact of proteins in diseases. We implemented, PPInterFinder—a web-based text mining tool to extract human PPIs from biomedical literature. PPInterFinder uses relation keyword co-occurrences with protein names to extract information on PPIs from MEDLINE abstracts and consists of three phases. First, it identifies the relation keyword using a parser with Tregex and a relation keyword dictionary. Next, it automatically identifies the candidate PPI pairs with a set of rules related to PPI recognition. Finally, it extracts the relations by matching the sentence with a set of 11 specific patterns based on the syntactic nature of PPI pair. We find that PPInterFinder is capable of predicting PPIs with the accuracy of 66.05% on AIMED corpus and outperforms most of the existing systems.

Database URL:

I thought the shortened form of the title would catch your eye. 😉

Important work for bioinformatics but it is also an example of domain specific association mining.

By focusing on a specific domain and forswearing designs on being a universal association solution, PPInterFinder produces useful results today.

A lesson that should be taken and applied to semantic mappings more generally.

Machine Learning and Data Mining – Association Analysis with Python

Thursday, January 17th, 2013

Machine Learning and Data Mining – Association Analysis with Python by Marcel Caraciolo.

From the post:

Recently I’ve been working with recommender systems and association analysis. This last one, specially, is one of the most used machine learning algorithms to extract from large datasets hidden relationships.

The famous example related to the study of association analysis is the history of the baby diapers and beers. This history reports that a certain grocery store in the Midwest of the United States increased their beers sells by putting them near where the stippers were placed. In fact, what happened is that the association rules pointed out that men bought diapers and beers on Thursdays. So the store could have profited by placing those products together, which would increase the sales.

Association analysis is the task of finding interesting relationships in large data sets. There hidden relationships are then expressed as a collection of association rules and frequent item sets. Frequent item sets are simply a collection of items that frequently occur together. And association rules suggest a strong relationship that exists between two items.
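To make "frequent item sets" and "association rules" a little more concrete, here is a small sketch that computes support and confidence by brute force. The five transactions are invented for illustration and this is not the code from the quoted post.

```python
from itertools import combinations

# Made-up shopping baskets.
transactions = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

def frequent_pairs(min_support):
    items = sorted(set().union(*transactions))
    return [set(pair) for pair in combinations(items, 2)
            if support(set(pair)) >= min_support]

print(frequent_pairs(0.4))
print(confidence({"diapers"}, {"beer"}))
```

Note that the rule "diapers implies beer" comes out of the counting with no role or association type attached.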

When I think of associations in a topic map, I assume I am at least starting with the roles and the players of those roles.

As this post demonstrates, that may be overly optimistic on my part.

What if I discover an association but not its type or the roles in it? And yet I still want to preserve the discovery for later use?

An incomplete association as it were.


Fast rule-based bioactivity prediction using associative classification mining

Sunday, November 25th, 2012

Fast rule-based bioactivity prediction using associative classification mining by Pulan Yu and David J Wild. (Journal of Cheminformatics 2012, 4:29 )

Who moved my acronym? continues: ACM = Association for Computing Machinery or associative classification mining.


Relating chemical features to bioactivities is critical in molecular design and is used extensively in lead discovery and optimization process. A variety of techniques from statistics, data mining and machine learning have been applied to this process. In this study, we utilize a collection of methods, called associative classification mining (ACM), which are popular in the data mining community, but so far have not been applied widely in cheminformatics. More specifically, the classification based on predictive association rules (CPAR), classification based on multiple association rules (CMAR) and classification based on association rules (CBA) are employed on three datasets using various descriptor sets. Experimental evaluations on anti-tuberculosis (antiTB), mutagenicity and hERG (the human Ether-a-go-go-Related Gene) blocker datasets show that these three methods are computationally scalable and appropriate for high speed mining. Additionally, they provide comparable accuracy and efficiency to the commonly used Bayesian and support vector machines (SVM) method, and produce highly interpretable models.

An interesting lead on investigation of associations in large data sets. Pass on those meeting a threshold for further evaluation?

Visualising associations between paired `omics’ data sets

Saturday, November 17th, 2012

Visualising associations between paired `omics’ data sets by Ignacio González, Kim-Anh Lê Cao, Melissa J Davis and Sébastien Déjean.



Each omics platform is now able to generate a large amount of data. Genomics, proteomics, metabolomics, interactomics are compiled at an ever increasing pace and now form a core part of the fundamental systems biology framework. Recently, several integrative approaches have been proposed to extract meaningful information. However, these approaches lack of visualisation outputs to fully unravel the complex associations between different biological entities.


The multivariate statistical approaches ‘regularized Canonical Correlation Analysis’ and ‘sparse Partial Least Squares regression’ were recently developed to integrate two types of highly dimensional ‘omics’ data and to select relevant information. Using the results of these methods, we propose to revisit few graphical outputs to better understand the relationships between two ‘omics’ data and to better visualise the correlation structure between the different biological entities. These graphical outputs include Correlation Circle plots, Relevance Networks and Clustered Image Maps. We demonstrate the usefulness of such graphical outputs on several biological data sets and further assess their biological relevance using gene ontology analysis.


Such graphical outputs are undoubtedly useful to aid the interpretation of these promising integrative analysis tools and will certainly help in addressing fundamental biological questions and understanding systems as a whole.


The graphical tools described in this paper are implemented in the freely available R package mixOmics and in its associated web application.

Just in case you are looking for something a little more challenging this weekend than political feeds on Twitter. 😉

Is “higher dimensional” data everywhere? Just more obvious in the biological sciences?

If so, there are lessons here for manipulation/visualization of higher dimensional data in other areas as well.

Modeling Question: What Happens When Dots Don’t Connect?

Saturday, October 13th, 2012

I am working with a data set and have run across a different question from the vagueness/possibility of relationships. (See Topic Map Modeling of Sequestration Data (Help Pls!) if you want to help with that one.)

What if when analyzing the data I determine there is no association between two subjects?

I am assuming that if there is no association, there are no roles at play.

How do I record the absence of the association?

I don’t want to trust the next user will “notice” the absence of the association.
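One possibility, sketched very loosely in Python: record the finding itself, so that "examined and ruled out" is distinguishable from "never examined". The Status values, the record and lookup helpers and the sample subjects are all hypothetical, not anything from the TMDM.

```python
from enum import Enum

class Status(Enum):
    ASSERTED = "asserted"
    SUSPECTED = "suspected"      # believed, but no proof yet
    RULED_OUT = "ruled out"      # examined, no association found
    UNEXAMINED = "unexamined"    # nobody has looked

findings = {}

def record(a, b, status, note=""):
    """Store a finding about the (unordered) pair of subjects."""
    findings[frozenset((a, b))] = (status, note)

def lookup(a, b):
    """Return the recorded finding, defaulting to UNEXAMINED."""
    return findings.get(frozenset((a, b)), (Status.UNEXAMINED, ""))

record("XYZ corporation", "Agent A", Status.RULED_OUT, "contract files reviewed")
record("XYZ corporation", "Agent B", Status.SUSPECTED, "allegation only")

print(lookup("Agent A", "XYZ corporation"))
print(lookup("XYZ corporation", "Agent C"))
```

The next user then sees an explicit "ruled out" instead of having to notice a gap.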

A couple of use cases come to mind:

I suspect there is an association but have no proof. The cheating husband/wife scenario. (I suppose there I would know the “roles.”)

What about corporations or large organizations? Allegations are made but no connection to identifiable actors.

Corporations act only through agents. A charge that names the responsible agents is different from a general allegation.

How do I distinguish those? Or make it clear no agent has been named?

Wouldn’t that be interesting?

We read now: XYZ corporation pleaded guilty to government contract fraud.

We could read: A, B, and C, XYZ corporation and L, M, N, government contract officers managed the XYZ government contract. XYZ pleaded guilty to contract fraud and was fined $.

Could keep better score on private and public employees that keep turning up in contract fraud cases.

One test for transparency is accountability.

No accountability, no transparency.

Dilbert schematics

Monday, July 23rd, 2012

Dilbert schematics

In November of 2011, Dan Brickley writes:

How can we package, manage, mix and merge graph datasets that come from different contexts, without getting our data into a terrible mess?

During the last W3C RDF Working Group meeting, we were discussing approaches to packaging up ‘graphs’ of data into useful chunks that can be organized and combined. A related question, one always lurking in the background, was also discussed: how do we deal with data that goes out of date? Sometimes it is better to talk about events rather than changeable characteristics of something. So you might know my date of birth, and that is useful forever; with a bit of math and knowledge of today’s date, you can figure out my current age, whenever needed. So ‘date of birth’ on this measure has an attractive characteristic that isn’t shared by ‘age in years’.

At any point in time, I have at most one ‘age in years’ property; however, you can take two descriptions of me that were at some time true, and merge them to form a messy, self-contradictory description. With this in mind, how far should we be advocating that people model using time-invariant idioms, versus working on better packaging for our data so it is clearer when it was supposed to be true, or which parts might be more volatile?
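The contrast between the two idioms can be sketched in a few lines of Python. The birth date and the two snapshot dates are invented for illustration, and the merge function is a naive property-by-property union of my own, not anything from RDF tooling.

```python
from datetime import date

def age_in_years(born, today):
    """Derive the volatile value from the time-invariant one."""
    before_birthday = (today.month, today.day) < (born.month, born.day)
    return today.year - born.year - before_birthday

def merge(*descriptions):
    """Naive merge: collect every value seen for each property."""
    merged = {}
    for description in descriptions:
        for key, value in description.items():
            merged.setdefault(key, set()).add(value)
    return merged

born = date(1970, 5, 17)  # hypothetical date of birth

# Two descriptions, each true on the day it was written.
snapshot_2010 = {"born": born, "age": age_in_years(born, date(2010, 1, 1))}
snapshot_2011 = {"born": born, "age": age_in_years(born, date(2011, 11, 1))}

merged = merge(snapshot_2010, snapshot_2011)
print(merged["born"])  # one value, still consistent
print(merged["age"])   # two values, self-contradictory
```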

Interesting to read as an issue for RDF modeling.

Not difficult to solve using scopes on associations in a topic map.

Question: What difficulties do time-invariant idioms introduce for modeling? What difficulties do non-time-invariant idioms introduce for processing?*

Different concerns and it isn’t enough to have an answer to a modeling issue without understanding the implications of the answer.

*Hint: As I read the post, it assumes a shared, “objective” notion of time. Perhaps works for the cartoon world, but what about elsewhere?

Towards Bisociative Knowledge Discovery

Monday, July 2nd, 2012

Towards Bisociative Knowledge Discovery by Michael R. Berthold.


Knowledge discovery generally focuses on finding patterns within a reasonably well connected domain of interest. In this article we outline a framework for the discovery of new connections between domains (so called bisociations), supporting the creative discovery process in a more powerful way. We motivate this approach, show the difference to classical data analysis and conclude by describing a number of different types of domain-crossing connections.

What is a bisociation you ask?

Informally, bisociation can be defined as (sets of) concepts that bridge two otherwise not –or only very sparsely– connected domains whereas an association bridges concepts within a given domain. Of course, not all bisociation candidates are equally interesting and in analogy to how Boden assesses the interestingness of a creative idea as being new, surprising, and valuable [4], a similar measure for interestingness can be specified when the underlying set of domains and their concepts are known.

Berthold describes two forms of bisociation as bridging concepts and graphs, although saying subject identity and associations would be more familiar to topic map users.

This essay introduces more than four hundred pages of papers so there is much more to explore.

These materials are “open access” so take the opportunity to learn more about this developing field.

As always, terminology/identification is going to vary so there will be a role for topic maps.

Why were they laughing?

Monday, January 30th, 2012

Why were they laughing?

An amusing posting from Junk Charts with charts of laughter in the Federal Reserve’s FOMC meetings up to the recent crash.

Readers are cautioned about making comparisons based on time-series data.

The same caution applies to creating associations based on time-series data.

Still, an amusing post to start the week.

Mr. Pearson, meet Mr. Mandelbrot:…

Saturday, December 17th, 2011

Mr. Pearson, meet Mr. Mandelbrot: Detecting Novel Associations in Large Data Sets

Something you may enjoy along with: Detecting Novel Associations in Large Data Sets.

Jeremy Fox asks what I think about this paper by David N. Reshef, Yakir Reshef, Hilary Finucane, Sharon Grossman, Gilean McVean, Peter Turnbaugh, Eric Lander, Michael Mitzenmacher, and Pardis Sabeti which proposes a new nonlinear R-squared-like measure.

My quick answer is that it looks really cool!

From my quick reading of the paper, it appears that the method reduces on average to the usual R-squared when fit to data of the form y = a + bx + error, and that it also has a similar interpretation when “a + bx” is replaced by other continuous functions.

The Coron System

Sunday, December 11th, 2011

The Coron System

From the overview:

Coron is a domain and platform independent, multi-purposed data mining toolkit, which incorporates not only a rich collection of data mining algorithms, but also allows a number of auxiliary operations. To the best of our knowledge, a data mining toolkit designed specifically for itemset extraction and association rule generation like Coron does not exist elsewhere. Coron also provides support for preparing and filtering data, and for interpreting the extracted units of knowledge.

In our case, the extracted knowledge units are mainly association rules. At the present time, finding association rules is one of the most important tasks in data mining. Association rules allow one to reveal “hidden” relationships in a dataset. Finding association rules requires first the extraction of frequent itemsets.

Currently, there exist several freely available data mining algorithms and tools. For instance, the goal of the FIMI workshops is to develop more and more efficient algorithms in three categories: (1) frequent itemsets (FI) extraction, (2) frequent closed itemsets (FCI) extraction, and (3) maximal frequent itemsets (MFI) extraction. However, they tend to overlook one thing: the motivation to look for these itemsets. After having found them, what can be done with them? Extracting FIs, FCIs, or MFIs only is not enough to generate really useful association rules. The FIMI algorithms may be very efficient, but they are not always suitable for our needs. Furthermore, these algorithms are independent, i.e. they are not grouped together in a unified software platform. We also did experiments with other toolkits, like Weka. Weka covers a wide range of machine learning tasks, but it is not really suitable for finding association rules. The reason is that it provides only one algorithm for this task, the Apriori algorithm. Apriori finds FIs only, and is not efficient for large, dense datasets.

Because of all these reasons, we decided to group the most important algorithms into a software toolkit that is aimed at data mining. We also decided to build a methodology and a platform that implements this methodology in its entirety. Another advantage of the platform is that it includes the auxiliary operations that are often missing in the implementations of single algorithms, like filtering and pre-processing the dataset, or post-processing the found association rules. Of course, the usage of the methodology and the platform is not narrowed to one kind of dataset only, i.e. they can be generalized to arbitrary datasets.
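For readers who have not met the three categories before, here is a brute-force sketch of frequent, frequent closed and maximal frequent itemsets on a toy dataset. It enumerates every candidate, so it is only suitable for tiny examples; it is not Coron's code or any of the FIMI algorithms, and the transactions are made up.

```python
from itertools import combinations

# Toy transactions; each letter is an item.
transactions = [frozenset("abc"), frozenset("abc"), frozenset("ab"), frozenset("bd")]
MIN_COUNT = 2  # minimum number of supporting transactions

def count(itemset):
    return sum(itemset <= t for t in transactions)

items = sorted(set().union(*transactions))

# Frequent itemsets (FI): every itemset with enough supporting transactions.
frequent = {}
for size in range(1, len(items) + 1):
    for candidate in combinations(items, size):
        itemset = frozenset(candidate)
        if count(itemset) >= MIN_COUNT:
            frequent[itemset] = count(itemset)

# Closed (FCI): no proper superset has the same support.
closed = {s for s, n in frequent.items()
          if not any(s < t and frequent[t] == n for t in frequent)}

# Maximal (MFI): no proper superset is frequent at all.
maximal = {s for s in frequent if not any(s < t for t in frequent)}

print(len(frequent), sorted("".join(sorted(s)) for s in closed),
      sorted("".join(sorted(s)) for s in maximal))
```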

I found this too late in the weekend to do more than report it.

I have spent most of the weekend trying to avoid expanding a file to approximately 2 TB before parsing it. More on that saga later this week.

Anyway, Coron looks/sounds quite interesting.

Anyone using it that cares to comment on it?


Sunday, December 4th, 2011

FACTA – Finding Associated Concepts with Text Analysis

From the Quick Start Guide:

FACTA is a simple text mining tool to help discover associations between biomedical concepts mentioned in MEDLINE articles. You can navigate these associations and their corresponding articles in a highly interactive manner. The system accepts an arbitrary query term and displays relevant concepts on the spot. A broad range of concepts are retrieved by the use of large-scale biomedical dictionaries containing the names of important concepts such as genes, proteins, diseases, and chemical compounds.

A very good example of an exploration tool that isn’t overly complex to use.

Similarity as Association?

Tuesday, November 29th, 2011

I was listening to Ian Robinson’s recent presentation on Dr. Who and Neo4j when Ian remarked that similarity could be modeled as a relationship.

It seemed like an off-hand remark at the time but it struck me as having immediate relevance to using Neo4j with topic maps.

One of my concerns for using Neo4j with topic maps has been the TMDM specification of merging topic items as:

1. Create a new topic item C.
2. Replace A by C wherever it appears in one of the following properties of an information item: [topics], [scope], [type], [player], and [reifier].
3. Repeat for B.
4. Set C’s [topic names] property to the union of the values of A and B’s [topic names] properties.
5. Set C’s [occurrences] property to the union of the values of A and B’s [occurrences] properties.
6. Set C’s [subject identifiers] property to the union of the values of A and B’s [subject identifiers] properties.
7. Set C’s [subject locators] property to the union of the values of A and B’s [subject locators] properties.
8. Set C’s [item identifiers] property to the union of the values of A and B’s [item identifiers] properties.
(TMDM, 6.2 Merging Topic Items)
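To make the end result concrete, here is a minimal Python sketch of those eight steps. It is only an illustration of the outcome the TMDM describes, not how any topic map engine implements it. The `Topic` class, its field names, and the `references` dictionary standing in for the [topics], [scope], [type], [player] and [reifier] properties are all my own assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    # Field names are shorthand for the TMDM properties, not an official API.
    topic_names: set = field(default_factory=set)
    occurrences: set = field(default_factory=set)
    subject_identifiers: set = field(default_factory=set)
    subject_locators: set = field(default_factory=set)
    item_identifiers: set = field(default_factory=set)


def merge(a: Topic, b: Topic) -> Topic:
    """Steps 1 and 4-8: create C with the union of A's and B's properties."""
    return Topic(
        topic_names=a.topic_names | b.topic_names,
        occurrences=a.occurrences | b.occurrences,
        subject_identifiers=a.subject_identifiers | b.subject_identifiers,
        subject_locators=a.subject_locators | b.subject_locators,
        item_identifiers=a.item_identifiers | b.item_identifiers,
    )


def replace_references(references: dict, old: Topic, new: Topic) -> None:
    """Steps 2-3: re-point every reference to `old` at `new`."""
    for key, value in references.items():
        if value is old:
            references[key] = new


a = Topic(topic_names={"Dr. Who"}, subject_identifiers={"http://example.org/doctor"})
b = Topic(topic_names={"The Doctor"}, subject_identifiers={"http://example.org/doctor"})
c = merge(a, b)

# Hypothetical references held elsewhere in the map (e.g. a role player, a type).
references = {"player": a, "type": b}
replace_references(references, a, c)
replace_references(references, b, c)
```

Note that `replace_references` is exactly the pointer-rewriting work that gets expensive when many information items refer to A or B.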

Obviously the TMDM is specifying an end result and not how you get there but still, there has to be a mechanism by which a query that finds A or B, also results in the “union of the values of A and B’s [topic names] properties.” (And the other operations specified by the TMDM here and elsewhere.)

Ian’s reference to similarity being modeled as a relationship made me realize that similarity relationships could be created between nodes that share the same [subject identifiers] property value (and other conditions for merging). Thus, when querying a topic map, there should be the main query, followed by a query for “sameness” relationships for any returned objects.

This avoids the performance “hit” of having to update pointers to information items that are literally re-written with new identifiers. Not to mention that processing the information items that will be presented to the user as one object could even be off-loaded onto the client, with a further saving in server-side processing.
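A minimal sketch of that two-step lookup, in plain Python rather than Neo4j so as not to guess at its API. The node ids, property names and helper functions are all hypothetical, and "sameness" here covers only nodes that directly share a subject identifier.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for a graph store: node id -> properties.
nodes = {
    "n1": {"topic_names": {"Dr. Who"}, "subject_identifiers": {"http://example.org/doctor"}},
    "n2": {"topic_names": {"The Doctor"}, "subject_identifiers": {"http://example.org/doctor"}},
    "n3": {"topic_names": {"Dalek"}, "subject_identifiers": {"http://example.org/dalek"}},
}


def build_sameness_edges(nodes):
    """Link every pair of nodes that share a subject identifier."""
    by_identifier = defaultdict(set)
    for node_id, props in nodes.items():
        for identifier in props["subject_identifiers"]:
            by_identifier[identifier].add(node_id)
    same_as = defaultdict(set)
    for group in by_identifier.values():
        for node_id in group:
            same_as[node_id] |= group - {node_id}
    return same_as


def merged_view(node_id, nodes, same_as):
    """Main query result plus its "sameness" neighbours, unioned at read time."""
    members = {node_id} | same_as.get(node_id, set())
    view = defaultdict(set)
    for member in members:
        for prop, values in nodes[member].items():
            view[prop] |= values
    return dict(view)


same_as = build_sameness_edges(nodes)
view = merged_view("n1", nodes, same_as)
```

The stored nodes are never rewritten. The union is computed only when someone asks for it, which is also why it could be handed off to the client.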

There is an added bonus to this approach, particularly for complex merging conditions beyond the TMDM. Since the directed edges have properties, it would be possible to dynamically specify merging conditions beyond those of the TMDM based on those properties. Which means that “merging” operations could be “unrolled” as it were.

Or would that be “rolled down?” Thinking that a user could step through each addition of a “merging” condition and observe the values as they were added, along with their source. Perhaps even set “break points” as in debugging software.
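As a rough sketch of what stepping through merging conditions might look like, assume each “sameness” edge carries a property naming the condition that produced it. Everything here (the edge list, the `condition` property, the `same_email` rule) is invented for illustration.

```python
# Hypothetical "sameness" edges, each tagged with the condition that produced it.
sameness_edges = [
    ("n1", "n2", {"condition": "subject_identifier"}),
    ("n1", "n4", {"condition": "same_email"}),  # a condition beyond the TMDM
]
names = {"n1": {"Dr. Who"}, "n2": {"The Doctor"}, "n4": {"John Smith"}}


def step_through(start, conditions):
    """Yield (condition, source node, names so far) as each condition is applied in turn."""
    merged = set(names[start])
    for condition in conditions:
        for a, b, props in sameness_edges:
            if props["condition"] == condition and start in (a, b):
                other = b if a == start else a
                merged |= names[other]
                # Yield a copy so earlier steps are not changed by later ones.
                yield condition, other, set(merged)


steps = list(step_through("n1", ["subject_identifier", "same_email"]))
```

Each element of `steps` records which condition fired, which node the new values came from, and the accumulated names at that point, which is the raw material for a “break point” style display.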

Will have to think about this some more and work up some examples in Neo4j. Comments/suggestions most welcome!

PS: You know, if this works, Neo4j already has a query language, Cypher. I don’t know if Cypher supports declaration of routines that can be invoked as parts of queries, but I will investigate that possibility. Just to keep users from having to write really complex queries to gather up all the information items on a subject. That won’t help people using other key/value stores but there are some interesting possibilities there as well. Will depend on the use cases and nature of “merging” requirements.


Wednesday, November 9th, 2011


From the readme:


Multispective is an open source intelligence management system based on the neo4j graph database. By using a graph database to capture information, we can use its immensely flexible structure to store a rich relationship model and easily visualize the contents of the system as nodes with relationships to one another.


The main purpose for creating this system is to provide socially motivated groups with an open source software product for managing their own intelligence relating to target networks, such as corporations, governments and other organizations. Multispective will provide these groups with a collective/social mechanism for obtaining and sharing insights into their target networks. My intention is that Multispective’s use of social media paradigms combined with visualisations will provide a well-articulated user interface into working with complex network data.

Inspired by the types of intelligence management systems used by law enforcement and national security agencies, Multispective will be great for showing things like corporate ownership and interest, events like purchases, payments (bribes), property transfers and criminal acts. The system will make it easier to look at how seemingly unrelated information is actually connected.

Multispective will also allow groups to overlap in areas of interest, discovering commonalities between discrete datasets, and being able to make use of data which has already been collected. (emphasis added)

The last two lines would not be out of place in any topic map presentation.

A project that is going to run into subject identity issues sooner rather than later. Experience and suggestions from the topic map camp would be welcome I suspect.

I don’t have a lot of extra time but I am going to toss my hat into the ring as at least interested in helping. How about you?

WorldCat Identities Network

Monday, October 24th, 2011

WorldCat Identities Network

A project of OCLC Research, the WorldCat Identities Network is described as:

The WorldCat Identity Network uses the WorldCat Identities Web Service and the WorldCat Search API to create an interactive Related Identity Network Map for each Identity in the WorldCat Identities database. The Identity Maps can be used to explore the interconnectivity between WorldCat Identities.

A WorldCat Identity can be a person, a thing (e.g., the Titanic), a fictitious character (e.g., Harry Potter), or a corporation (e.g., IBM).

I can’t claim to be a fan of jumpy network node displays but that isn’t a criticism, more a matter of personal taste. Some people find that sort of display quite useful.

The information conveyed, leaving display to one side, is quite interesting. It has just enough fuzziness (to me at any rate) to approach the experience of serendipitous discovery using more traditional library tools. I suspect that will vary from topic to topic but that was my experience with briefly using the interface.

Despite my misgivings about the interface, I will be returning to explore this service fairly often.

BTW, the service is obviously mis-named. What is being delivered is what we used to call “see also” or related references, thus: WorldCat “See Also” Network would be a more accurate title.

For class:

  1. Spend at least an hour with the service and write a 2-page summary of what you liked/disliked about it. (no citations)
  2. What subject/relationship did you choose to follow? Discover anything you did not expect? 1 page (no citations)

Reflective Random Indexing and indirect inference…

Tuesday, August 16th, 2011

Reflective Random Indexing and indirect inference: A scalable method for discovery of implicit connections by Trevor Cohen, Roger Schvaneveldt, Dominic Widdows.


The discovery of implicit connections between terms that do not occur together in any scientific document underlies the model of literature-based knowledge discovery first proposed by Swanson. Corpus-derived statistical models of semantic distance such as Latent Semantic Analysis (LSA) have been evaluated previously as methods for the discovery of such implicit connections. However, LSA in particular is dependent on a computationally demanding method of dimension reduction as a means to obtain meaningful indirect inference, limiting its ability to scale to large text corpora. In this paper, we evaluate the ability of Random Indexing (RI), a scalable distributional model of word associations, to draw meaningful implicit relationships between terms in general and biomedical language. Proponents of this method have achieved comparable performance to LSA on several cognitive tasks while using a simpler and less computationally demanding method of dimension reduction than LSA employs. In this paper, we demonstrate that the original implementation of RI is ineffective at inferring meaningful indirect connections, and evaluate Reflective Random Indexing (RRI), an iterative variant of the method that is better able to perform indirect inference. RRI is shown to lead to more clearly related indirect connections and to outperform existing RI implementations in the prediction of future direct co-occurrence in the MEDLINE corpus.

The term “direct inference” is used for establishing a relationship between terms with a shared “bridging” term. That is, the terms don’t co-occur in a text but share a third term that occurs in both texts. “Indirect inference,” that is, finding terms with no shared “bridging” term, is the focus of this paper.

BTW, if you don’t have access to the Journal of Biomedical Informatics version, try the draft: Reflective Random Indexing and indirect inference: A scalable method for discovery of implicit connections

MKEM: a Multi-level Knowledge Emergence Model for mining undiscovered public knowledge

Tuesday, August 16th, 2011

MKEM: a Multi-level Knowledge Emergence Model for mining undiscovered public knowledge



Since Swanson proposed the Undiscovered Public Knowledge (UPK) model, there have been many approaches to uncover UPK by mining the biomedical literature. These earlier works, however, required substantial manual intervention to reduce the number of possible connections and are mainly applied to disease-effect relation. With the advancement in biomedical science, it has become imperative to extract and combine information from multiple disjoint researches, studies and articles to infer new hypotheses and expand knowledge.


We propose MKEM, a Multi-level Knowledge Emergence Model, to discover implicit relationships using Natural Language Processing techniques such as Link Grammar and Ontologies such as Unified Medical Language System (UMLS) MetaMap. The contribution of MKEM is as follows: First, we propose a flexible knowledge emergence model to extract implicit relationships across different levels such as molecular level for gene and protein and Phenomic level for disease and treatment. Second, we employ MetaMap for tagging biological concepts. Third, we provide an empirical and systematic approach to discover novel relationships.


We applied our system on 5000 abstracts downloaded from PubMed database. We performed the performance evaluation as a gold standard is not yet available. Our system performed with a good precision and recall and we generated 24 hypotheses.


Our experiments show that MKEM is a powerful tool to discover hidden relationships residing in extracted entities that were represented by our Substance-Effect-Process-Disease-Body Part (SEPDB) model.

From the article:

Swanson defined UPK is public and yet undiscovered in two complementary and non-interactive literature sets of articles (independently created fragments of knowledge), when they are considered together, can reveal useful information of scientific interest not apparent in either of the two sets alone [cites omitted].

Basis of UPK:

The underlying discovery method is based on the following principle: some links between two complementary passages of natural language texts can be largely a matter of form “A cause B” (association AB) and “B causes C” (association BC) (See Figure 1). From this, it can be seen that they are linked by B irrespective of the meaning of A, B, or C. However, perhaps nothing at all has been published concerning a possible connection between A and C, even though such link if validated would be of scientific interest. This allowed for the generation of several hypotheses such as “Fish’s oil can be used for treatment of Raynaud’s Disease” [cite omitted].
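The A-B, B-C principle is simple enough to sketch in a few lines of Python. This is only a toy illustration of the linking step, not the MKEM system. The term pairs are hand-written for the example and the function name is my own.

```python
from collections import defaultdict

# Toy "A causes B" assertions, as if extracted from two literatures that never cite each other.
# The term pairs are illustrative only, loosely following the fish oil / Raynaud's example.
literature_1 = [("fish oil", "blood viscosity")]
literature_2 = [("blood viscosity", "Raynaud's disease")]


def candidate_links(ab_pairs, bc_pairs, known=frozenset()):
    """Return {(A, C): {bridging B terms}} for A-C pairs that are not already known."""
    b_to_c = defaultdict(set)
    for b, c in bc_pairs:
        b_to_c[b].add(c)
    candidates = defaultdict(set)
    for a, b in ab_pairs:
        for c in b_to_c.get(b, ()):
            if a != c and (a, c) not in known:
                candidates[(a, c)].add(b)
    return dict(candidates)


hypotheses = candidate_links(literature_1, literature_2)
```

The link is made purely on the shared B term, “irrespective of the meaning of A, B, or C,” which is also why real systems need a way to prune the flood of candidates this produces.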

Fairly easy reading and interesting as well.

If you recognize TF*IDF, the primary basis for Lucene, you will be interested to learn it has some weaknesses for UPK. If I understand the authors correctly, ranking terms statistically is insufficient to mine implied relationships. Related terms aren’t ranked high enough. I don’t think “boosting” would help because the terms are not known ahead of time. I say that, although I suppose you could “boost” on the basis of implied relationships. Will have to think about that some more.
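For anyone who has not met it, here is a minimal sketch of the textbook TF*IDF weight on a toy corpus. This is the classroom formulation, not Lucene’s actual scoring, which differs in detail. The documents and the `tf_idf` function are made up for the example.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens.
docs = [
    "fish oil reduces blood viscosity".split(),
    "blood viscosity is raised in raynaud patients".split(),
    "fish oil is a dietary supplement".split(),
]


def tf_idf(term, doc, docs):
    """Term frequency in `doc` times log inverse document frequency across `docs`."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf


rare = tf_idf("raynaud", docs[1], docs)    # appears in one document
common = tf_idf("blood", docs[1], docs)    # appears in two documents
absent = tf_idf("raynaud", docs[0], docs)  # never appears in the fish oil document
```

A term that never occurs in a document scores zero there, so “raynaud” can never rank against the fish oil documents however strongly it is implied by way of “blood viscosity.”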

You will find “non-interactive literature sets of articles” in computer science, library science, mathematics, law, just about any field you can name. Although you can mine across those “literature sets,” it would be interesting to identify those sets, perhaps with a view towards refining UPK mining. Can you suggest ways to distinguish such “literature sets?”

Oh, link to the software: MKEM (Note to authors: Always include a link to your software, assuming it is available. Make it easy on readers to find and use your hard work!)

Objectivity Infinite Graph (timed associations?)

Monday, May 9th, 2011

Objectivity Infinite Graph

Curt Monash reports on his conversation with Darren Wood, the lead developer for the Infinite Graph database product.

From last June (2010) but I think after reading it, you will agree it was worth bringing up.

A couple of goodies from his thoughts on edges:

  • Edges are first-class citizens in Infinite Graph, just as nodes are.
  • In Infinite Graph, edges can also have effectiveness date intervals. E.g., if you live at an address for a certain period, that’s when the edge connecting you to it is valid.

The second point, edges with date intervals, may have a bearing on a recent series of posts by Robert Cerny to the Topicmapmail list. (See: “Temporal validitity of subject indicators?” in the second quarter archives, early May 2011)

Is that timing for an association?
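To make the idea of an edge with an effectiveness interval concrete, here is a small Python sketch. It does not use the Infinite Graph API. The `Edge` class, its fields and the sample addresses are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    label: str
    valid_from: date
    valid_to: Optional[date] = None  # None means "still valid"

    def valid_on(self, day: date) -> bool:
        """True if `day` falls inside this edge's validity interval (inclusive)."""
        return self.valid_from <= day and (self.valid_to is None or day <= self.valid_to)


# Hypothetical data: one person, two successive addresses.
dated_edges = [
    Edge("alice", "12 Elm St", "lives_at", date(2005, 1, 1), date(2009, 6, 30)),
    Edge("alice", "48 Oak Ave", "lives_at", date(2009, 7, 1)),
]


def addresses_on(person, day):
    """Return the addresses connected to `person` by a lives_at edge valid on `day`."""
    return [
        e.target
        for e in dated_edges
        if e.source == person and e.label == "lives_at" and e.valid_on(day)
    ]


then = addresses_on("alice", date(2008, 3, 1))
now = addresses_on("alice", date(2011, 5, 9))
```

The same question asked for two different dates gives two different answers, without either edge being deleted or rewritten.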

Tracking the relationships in Sex and the City would require such an ability.

Revealing the true challenges in fighting bank fraud

Friday, May 6th, 2011

Revealing the true challenges in fighting bank fraud

From the Infoglide blog:

The results of the survey are currently being compiled for general release, but it was extremely interesting to learn that the key challenges of fraud investigations include:

1. the inability to access data due to privacy concerns

2. a lack of real-time high performance data searching engine

3. and an inability to cross-reference and discover relationships between suspicious entities in different databases.

For regular readers of this blog, it comes as no surprise that identity resolution and entity analytics technology provides a solution to those challenges. An identity resolution engine glides across the different data within (or perhaps even external to) a bank’s infrastructure, delivering a view of possible identity matches and non-obvious relationships or hidden links between those identities… despite variations in attributes and/or deliberate attempts to deceive. (emphasis added)

It being an Infoglide blog, guess who they think has an identity resolution engine?

I looked at the data sheet on their Identity Resolution Engine.

I have a question:

If two separate banks using “Identity Resolution Engine” have each built up data mappings, on what basis do I merge those mappings, assuming there are name conflicts in the data mappings as well as in the data proper?

In an acquisition, for example, I should be able to leverage existing data mappings.

It’s All About the Librarian! New Paradigms in Enterprise Discovery and Awareness – Post

Tuesday, April 26th, 2011

It’s All About the Librarian! New Paradigms in Enterprise Discovery and Awareness

This is a must-read post by Jeff Jonas.

I won’t spoil your fun but Jeff defines terms such as:

  • Context-less Card Catalogs
  • Semantically Reconciled Directories
  • Semantically Reconciled and Relationship Aware Directories

and a number of others.

Looks very much like he is interested in the same issues as topic maps.

Take the time to read it and see what you think.

The Silent “a” In Mashup

Wednesday, February 9th, 2011

The “a” in mashup is silent because mashups are missing information that is represented in a topic map by associations.

That isn’t necessarily a criticism of mashups. How much or how little information you represent in any data set or application is up to you.

It is helpful to have a framework for understanding what information you have included or excluded by explicit choice. Why you made those choices or on what basis is entirely up to you.

As of 08-02-2010, there are fifteen definitions of mashup in English reported by define:Mashup in Google.

Most of the definitions of mashup do not exclude (necessarily) what is defined as an association in a topic map, but the general theme is one of juxtaposition of data from different resources.

That juxtaposition leaves the following subjects undefined (at least explicitly):

  1. role players in an association (play #2)
  2. roles in an association
  3. type of an association

Not to mention any TMCL (Topic Maps Constraint Language) constraints on those associations. (Something we will cover on another day.)

You can choose to leave subjects undefined, which is easier than defining them (witness the popularity of mashups), but there is a cost to leaving them undefined.

Whether to define subjects or leave them undefined is a decision that needs to take into account factors such as ease of authoring versus the very real cost of leaving subjects undefined, as well as your particular project’s requirements for maintenance, semantic integration and interchange.

For example, if the role players (#1 above) are left undefined in a mashup, what are the consequences?

From a topic map perspective, that means the role player subjects are not represented by topics, which means you cannot:

  1. attach other information about those subjects, such as variant names
  2. judge whether those are the same subjects as found in other associations
  3. find all the associations where those subjects are role players (since they are not explicitly identified)
  4. …among other things.

As I said, you can make that choice, but while it is easier (that is, less work), you also get less return from your mashup.

Another choice in a mashup, assuming that you identified the role players as topics, would be to simply not identify the roles they play in the mashup (read association).

If you don’t identify the roles as subjects (represented by topics), you can’t:

  1. compare those roles to roles in other mashups
  2. compare the roles being played by role players to roles they play in other associations
  3. discover associations with the same role players playing the same roles, but identified differently
  4. …among other things.

Assuming you have defined the role players and the roles they play, there remains the type of the association (read mashup), which could help you locate other associations (mashups) that would be of interest to you.

Even if you defined a type for a mashup, I am not real sure where you would put it. That’s not an issue with a topic map association. It has an explicit type.
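A small Python sketch of what an association carries that a mashup does not: an explicit type, explicit roles, and explicit role players. The class names, fields and sample data are my own shorthand and not taken from any topic map API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    role_type: str  # the role being played, itself a subject
    player: str     # the topic playing that role


@dataclass(frozen=True)
class Association:
    assoc_type: str  # the explicit type a mashup has no place for
    roles: tuple


# Hypothetical sample data.
associations = [
    Association("employment", (Role("employer", "IBM"), Role("employee", "alice"))),
    Association("marriage", (Role("spouse", "alice"), Role("spouse", "bob"))),
]


def associations_with_player(player, role_type=None):
    """Find associations where `player` plays any role, or one specific role."""
    return [
        a
        for a in associations
        if any(r.player == player and role_type in (None, r.role_type) for r in a.roles)
    ]


all_for_alice = associations_with_player("alice")
as_employee = associations_with_player("alice", "employee")
```

Because players, roles and types are all named, “find every association where this subject plays that role” becomes a one-line filter instead of guesswork over juxtaposed data.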

Mashups are easier to author than associations because they carry less information.

Which is a legitimate choice on your part.

What if after creating mashups we decide that it would be nice to add some more information?

Can topic maps help with that task?

We will take up the answer to that question tomorrow.

STRING – Known and Predicted Protein-Protein Interactions

Tuesday, February 1st, 2011

STRING – Known and Predicted Protein-Protein Interactions

From the website:

STRING is a database of known and predicted protein interactions.

The interactions include direct (physical) and indirect (functional) associations; they are derived from four sources:

  • Genomic Context
  • High-throughput Experiments
  • (Conserved) Coexpression
  • Previous Knowledge

STRING quantitatively integrates interaction data from these sources for a large number of organisms, and transfers information between these organisms where applicable. The database currently covers 2,590,259 proteins from 630 organisms. (Note: The website presents the sources for the interactions as a table; I have changed that to a list here.)

Looks like fertile ground for research on associations.