Archive for the ‘Identifiers’ Category
Monday, April 15th, 2013
Miriam Registry
From the homepage:
Persistent identification for life science data
The MIRIAM Registry provides a set of online services for the generation of unique and perennial identifiers, in the form of URIs. It provides the core data which is used by the Identifiers.org resolver.
The core of the Registry is a catalogue of data collections (corresponding to controlled vocabularies or databases), their URIs and the corresponding physical URLs or resources. Access to this data is made available via exports (XML) and Web Services (SOAP).
And from the FAQ:
What is MIRIAM, and what does it stand for?
MIRIAM is an acronym for the Minimal Information Required In the Annotation of Models. It is important to distinguish between the MIRIAM Guidelines, and the MIRIAM Registry. Both being part of the wider BioModels.net initiative.
What are the ‘MIRIAM Guidelines’?
The MIRIAM Guidelines are an effort to standardise upon the essential, minimal set of information that is sufficient to annotate a model in such a way as to enable its reuse. This includes a means to identify the model itself, the components of which it is composed, and formalises a means by which unambiguous annotation of components should be encoded. This is essential to allow collaborative working by different groups which may not be spatially co-located, and facilitates model sharing and reuse by the general modelling community. The goal of the project, initiated by the BioModels.net effort, was to produce a set of guidelines suitable for model annotation. These guidelines can be implemented in any structured format used to encode computational models, for example SBML, CellML, or NeuroML . MIRIAM is a member of the MIBBI family of community-developed ‘minimum information’ reporting guidelines for the biosciences.
More information on the requirements to achieve MIRIAM Guideline compliance is available on the MIRIAM Guidelines page.
What is the MIRIAM Registry?
The MIRIAM Registry provides the necessary information for the generation and resolving of unique and perennial identifiers for life science data. Those identifiers are of the URI form and make use of Identifiers.org for providing access to the identified data records on the Web. Examples of such identifiers: http://identifiers.org/pubmed/22140103, http://identifiers.org/uniprot/P01308, …
More identifiers for the life sciences, for those who choose to use them.
The curation may be helpful in terms of mappings to other identifiers.
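The two example identifiers quoted above share an obvious shape: resolver, collection, accession. A minimal sketch of building and splitting such URIs follows. The function names are mine, not part of any MIRIAM or Identifiers.org API, and I assume only the pattern visible in the two examples.

```python
from typing import Tuple
from urllib.parse import quote

# Resolver base, copied from the example URIs quoted in the FAQ.
IDENTIFIERS_ORG = "http://identifiers.org"


def identifiers_org_uri(collection: str, accession: str) -> str:
    """Build a URI shaped like http://identifiers.org/<collection>/<accession>.

    Hypothetical helper: it only mirrors the pattern of the quoted examples.
    """
    return f"{IDENTIFIERS_ORG}/{quote(collection, safe='')}/{quote(accession, safe=':')}"


def split_identifiers_org_uri(uri: str) -> Tuple[str, str]:
    """Split a URI of the same shape back into (collection, accession)."""
    prefix = IDENTIFIERS_ORG + "/"
    if not uri.startswith(prefix):
        raise ValueError(f"not an identifiers.org URI: {uri}")
    collection, _, accession = uri[len(prefix):].partition("/")
    return collection, accession
```

Keeping the collection and accession separate is what makes mapping to other identifier schemes practical: the accession survives even if the resolver changes.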
Posted in Identifiers, Science | No Comments »
Monday, April 1st, 2013
A recent comment made it clear that tooling, or the lack thereof, is a real issue for topic maps.
Here is my first suggestion of a tool you can use while authoring a topic map:
Wikipedia.
Seriously, think about it. You want a URL that identifies subject X.
Granting that Wikipedia is a fairly limited set of subjects, it is at least a starting point.
Example: I want a subject identifier for “Donald Duck,” a cartoon character.
I can use the search box at Wikipedia or I can type in a browser:
http://en.wikipedia.org/wiki/Donald%20Duck
Go ahead, try it.
If I don’t know the full name:
http://en.wikipedia.org/wiki/Donald
What do you think?
Allows you to disambiguate Donalds, at least the ones that Wikipedia knows about.
Not to mention giving you access to other subjects and relationships that may be of interest for your topic map.
To include foreign language materials (outside of English-only non-thinking zones in the U.S.), try a different language Wikipedia:
http://de.wikipedia.org/wiki/Donald%20Duck
Finding subject identifiers won’t write your topic map for you but can make the job easier.
There are other sources of subject identifiers so send in your suggestions and any syntax short-cuts for accessing them.
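As one syntax short-cut of my own, here is a small sketch that builds the Wikipedia URLs shown above from a title and a language code. The function name is hypothetical; the URL pattern is simply the one used in the examples in this post.

```python
from urllib.parse import quote


def wikipedia_subject_identifier(title: str, lang: str = "en") -> str:
    """Return a Wikipedia page URL to use as a candidate subject identifier.

    Spaces in the title are percent-encoded as %20, matching the URLs above.
    """
    return f"http://{lang}.wikipedia.org/wiki/{quote(title)}"


english = wikipedia_subject_identifier("Donald Duck")
german = wikipedia_subject_identifier("Donald Duck", lang="de")
```

It does not check that the page exists or that it is about the subject you meant. That judgment is still yours.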
You have no doubt read that URIs used as identifiers are supposed to be semi-permanent, “cool,” etc.
But identifiers change over time. It’s one of the reasons for historical semantic diversity.
URIs as identifiers will change as well.
Good thing topic maps enable you to have multiple identifiers for any subject.
Means old references to old identifiers still work.
Glad we dodged having to redo and reproof all those old connections.
Aren’t you?
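To make "multiple identifiers for any subject" concrete, here is a minimal sketch of an index in which any identifier, old or new, resolves to the same subject. It is an illustration only, not an implementation of the topic maps standard; the class and the "donald-duck" key are my own inventions.

```python
class SubjectIndex:
    """Toy index: many identifiers may point at one subject."""

    def __init__(self):
        self._subject_for = {}   # identifier -> subject key
        self._identifiers = {}   # subject key -> set of identifiers

    def add(self, subject, *identifiers):
        """Attach one or more identifiers to a subject, keeping any older ones."""
        known = self._identifiers.setdefault(subject, set())
        for identifier in identifiers:
            self._subject_for[identifier] = subject
            known.add(identifier)

    def resolve(self, identifier):
        """Return the subject for an identifier, or None if it is unknown."""
        return self._subject_for.get(identifier)

    def identifiers(self, subject):
        """Return every identifier recorded for a subject, sorted."""
        return sorted(self._identifiers.get(subject, ()))


index = SubjectIndex()
index.add("donald-duck", "http://en.wikipedia.org/wiki/Donald%20Duck")
# A later identifier is added alongside the first one, never in place of it.
index.add("donald-duck", "http://de.wikipedia.org/wiki/Donald%20Duck")
```

Because nothing is ever overwritten, references made with the earlier identifier keep resolving.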
Posted in Identification, Identifiers, Subject Identifiers | No Comments »
Tuesday, January 22nd, 2013
User evaluation of automatically generated keywords and toponyms for geo-referenced images by Frank O. Ostermann, Martin Tomko, Ross Purves. (Ostermann, F. O., Tomko, M. and Purves, R. (2013), User evaluation of automatically generated keywords and toponyms for geo-referenced images. J. Am. Soc. Inf. Sci. doi: 10.1002/asi.22738)
Abstract:
This article presents the results of a user evaluation of automatically generated concept keywords and place names (toponyms) for geo-referenced images. Automatically annotating images is becoming indispensable for effective information retrieval, since the number of geo-referenced images available online is growing, yet many images are insufficiently tagged or captioned to be efficiently searchable by standard information retrieval procedures. The Tripod project developed original methods for automatically annotating geo-referenced images by generating representations of the likely visible footprint of a geo-referenced image, and using this footprint to query spatial databases and web resources. These queries return raw lists of potential keywords and toponyms, which are subsequently filtered and ranked. This article reports on user experiments designed to evaluate the quality of the generated annotations. The experiments combined quantitative and qualitative approaches: To retrieve a large number of responses, participants rated the annotations in standardized online questionnaires that showed an image and its corresponding keywords. In addition, several focus groups provided rich qualitative information in open discussions. The results of the evaluation show that currently the annotation method performs better on rural images than on urban ones. Further, for each image at least one suitable keyword could be generated. The integration of heterogeneous data sources resulted in some images having a high level of noise in the form of obviously wrong or spurious keywords. The article discusses the evaluation itself and methods to improve the automatic generation of annotations.
An echo of Steve Newcomb’s semantic impedance appears at:
Despite many advances since Smeulders et al.’s (2002) classic paper that set out challenges in content-based image retrieval, the quality of both nonspecialist text-based and content-based image retrieval still appears to lag behind the quality of specialist text retrieval, and the semantic gap, identified by Smeulders et al. as a fundamental issue in content-based image retrieval, remains to be bridged. Smeulders defined the semantic gap as
the lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data have for a user in a given situation. (p. 1353)
In fact, text-based systems that attempt to index images based on text thought to be relevant to an image, for example, by using image captions, tags, or text found near an image in a document, suffer from an identical problem. Since text is being used as a proxy by an individual in annotating image content, those querying a system may or may not have similar worldviews or conceptualizations as the annotator. (emphasis added)
That last sentence could have come out of a topic map book.
Curious what you make of the authors’ claim that spatial locations provide an “external context” that bridges the “semantic gap”?
If we all use the same map of spatial locations, are you surprised by the lack of a “semantic gap?”
Posted in Geographic Data, Geographic Information Retrieval, Geography, Identifiers, Keywords, Mapping, Maps, Semantic Diversity, Semantic Inconsistency | 1 Comment »
Saturday, January 5th, 2013
The IUPAC International Chemical Identifier (InChI) and its influence on the domain of chemical information edited by Dr. Antony Williams.
From the webpage:
The International Chemical Identifier (InChI) has had a dramatic impact on providing a means by which to deduplicate, validate and link together chemical compounds and related information across databases. Its influence has been especially valuable as the internet has exploded in terms of the amount of chemistry related information available online. This thematic issue aggregates a number of contributions demonstrating the value of InChI as an enabling technology in the world of cheminformatics and its continuing value for linking chemistry data.
If you are interested in chemistry/cheminformatics or in the development and use of identifiers, this is an issue not to miss!
You will find:
InChIKey collision resistance: an experimental testing by Igor Pletnev, Andrey Erin, Alan McNaught, Kirill Blinov, Dmitrii Tchekhovskoi, Steve Heller.
Consistency of systematic chemical identifiers within and between small-molecule databases by Saber A Akhondi, Jan A Kors, Sorel Muresan.
InChI: a user’s perspective by Steven M Bachrach.
InChI: connecting and navigating chemistry by Antony J Williams.
I particularly enjoyed Steven Bachrach’s comment:
It is important to recognize that in no way does InChI replace or make outmoded any other chemical identifier. A company that has developed their own registry system or one that uses one of the many other identifiers, like a MOLfile [13], can continue to use their internal system. Adding the InChI to their system provides a means for connecting to external resources in a simple fashion, without exposing any of their own internal technologies.
Or to put it differently, InChI increased the value of existing chemical identifiers.
How’s that for a recipe for adoption?
Posted in Cheminformatics, Identifiers | No Comments »
Wednesday, December 12th, 2012
I am working on a draft about identifiers (using the standard <a> element) when it occurred to me that URLs could play an unexpected role in document security. (At least unexpected by me, your mileage may vary.)
What if I create a document that has URLs like:
<a href="http://server-exists.x/page-does-not.html">text content</a>
So that a user who attempts to follow the link gets a “404” message back.
Why is that important?
What if I am writing HTML pages at a nuclear weapon factory? I would be very interested in knowing if one of my pages had gotten off the reservation so to speak.
The server being accessed for a page that deliberately does not exist could route the contact information for an appropriate response.
Of course, I would use better names or have pages that load, while transmitting the same contact information.
Or have a very large uuencoded “password” file that burps, bumps and slowly downloads. (Always knew there was a reason to keep a 2400 baud modem around.)
Have suggestions on how to make a non-existent URL work but will save that for another day.
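In the meantime, here is a rough sketch of the server side of the idea, using only Python's standard library. The set of "canary" paths, the in-memory list of hits and the function names are all assumptions of mine; a real deployment would need proper logging, alerting and access control.

```python
import logging
from http.server import BaseHTTPRequestHandler, HTTPServer

# Paths that deliberately do not exist; the one below is from the example link.
CANARY_PATHS = {"/page-does-not.html"}

# In-memory record of who followed a canary link (illustration only).
hits = []


def record_hit(path, client_address, user_agent):
    """Record the request and return True if the path is a canary path."""
    if path not in CANARY_PATHS:
        return False
    hits.append({"path": path, "client": client_address, "user_agent": user_agent})
    return True


class CanaryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        client = self.client_address[0]
        if record_hit(self.path, client, self.headers.get("User-Agent", "")):
            logging.warning("canary link %s followed from %s", self.path, client)
        # The visitor sees an ordinary 404 either way.
        self.send_error(404)


def serve(port=8000):
    """Run the canary server until interrupted (not started automatically)."""
    HTTPServer(("localhost", port), CanaryHandler).serve_forever()
```

Calling serve() starts the listener. The visitor gets nothing but a 404, while the record of the contact stays on the server.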
Posted in HTML, Identifiers, Security | No Comments »
Friday, October 26th, 2012
Automattic Open Sources Natural Language Spell-Checker After the Deadline by Jolie O’Dell.
I am sure the original headline made sense to its author, but I wonder how a natural language processor would react to it?
My reaction, being innocent of any prior knowledge of the actors or the software, was: What deadline? I read it as a report of a missed deadline.
It is almost a “who’s on first” type headline. The software’s name is “After the Deadline.”
That confusion resolved, I read:
Matt Mullenweg has just announced on his blog that WordPress parent company Automattic is open sourcing After the Deadline, a natural-language spell-checking plugin for WordPress and TinyMCE that was only recently ushered into the Automattic fold.
Scarcely seven weeks after its acquisition was announced, After the Deadline’s core technology is being released under the GPL. Moreover, writes Mullenweg, “There’s also a new jQuery API that makes it easy to integrate with any text area.”
…
Interested parties can check out this demo or read the tech overview and grab the source code here.
I can use spelling/grammar suggestions. Particularly since I make the same mistakes over and over again.
Does that also mean I talk about the same subjects/entities over and over again? Or at least a limited range of subjects/entities?
Imagine a user configurable subject/entity “checker” that annotated recognized subjects/entities with an <a> element. Enabling the user to accept/reject the annotation.
Disambiguation at the point of origin.
The title of the original article could become:
“<a href="http://automattic.com/">Automattic</a> Open Sources Natural Language Spell-Checker <a href="http://www.afterthedeadline.com/">After the Deadline</a>”
Seems less ambiguous to me.
Certainly less ambiguous to a search engine.
You?
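A minimal sketch of such a "checker", assuming a user-maintained table of names and URLs (here just the two from the headline). The function and table names are hypothetical, the accept/reject step is left out, and text outside the matched names is passed through unescaped.

```python
import html
import re

# User-configurable table of recognized subjects: name -> identifying URL.
KNOWN_SUBJECTS = {
    "Automattic": "http://automattic.com/",
    "After the Deadline": "http://www.afterthedeadline.com/",
}


def annotate(text, subjects=KNOWN_SUBJECTS):
    """Wrap each recognized subject name in an <a> element pointing at its URL."""
    if not subjects:
        return text
    # Try longer names first so a long name wins over a shorter one inside it.
    names = sorted(subjects, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))

    def link(match):
        name = match.group(0)
        url = html.escape(subjects[name], quote=True)
        return f'<a href="{url}">{html.escape(name)}</a>'

    return pattern.sub(link, text)


headline = "Automattic Open Sources Natural Language Spell-Checker After the Deadline"
annotated = annotate(headline)
```

Plain string matching will over-match in real prose (every "Automattic" is not necessarily the company), which is exactly why the user's accept/reject step matters.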
Posted in Ambiguity, Disambiguation, Identification, Identifiers, Natural Language Processing, Semantics | No Comments »
Saturday, September 29th, 2012
You will read in When Your Eyes Tell Your Hands What to Think: You’re Far Less in Control of Your Brain Than You Think that:
You’ve probably never given much thought to the fact that picking up your cup of morning coffee presents your brain with a set of complex decisions. You need to decide how to aim your hand, grasp the handle and raise the cup to your mouth, all without spilling the contents on your lap.
A new Northwestern University study shows that, not only does your brain handle such complex decisions for you, it also hides information from you about how those decisions are made.
“Our study gives a salient example,” said Yangqing ‘Lucie’ Xu, lead author of the study and a doctoral candidate in psychology at Northwestern. “When you pick up an object, your brain automatically decides how to control your muscles based on what your eyes provide about the object’s shape. When you pick up a mug by the handle with your right hand, you need to add a clockwise twist to your grip to compensate for the extra weight that you see on the left side of the mug.
“We showed that the use of this visual information is so powerful and automatic that we cannot turn it off. When people see an object weighted in one direction, they actually can’t help but ‘feel’ the weight in that direction, even when they know that we’re tricking them,” Xu said. (emphasis added)
I never quite trusted my brain and now I have proof that it is untrustworthy. Hiding stuff indeed!
But that’s the trick of subject identification/identity isn’t it?
That our brains “recognize” all manner of subjects without any effort on our part.
It is another of our brains’ effortless features. But it hides, from ourselves and from others, the information we need to integrate information stores.
Or rather, it makes digging that information out more work than we are usually willing to do.
When called upon to be “explicit” about subject identification, or even worse, to imagine how other people identify subjects, we prefer to stay at home consuming passive entertainment.
Two quick points:
First, we need to think about how to incorporate this “feature” into delivery interfaces for users.
Second, what subjects would users pay others to mine/collate/identify for them? (Delivery being a separate issue.)
Posted in Identification, Identifiers, Image Processing, Image Recognition, Marketing, Subject Identifiers, Subject Identity | No Comments »
Saturday, September 22nd, 2012
Dancing With Dirty Data Thanks to SAP Visual Intelligence by Timo Elliott.
From the post:
(graphic omitted)
Here’s my entry for the SAP Ultimate Data Geek Challenge, a contest designed to “show off your inner geek and let the rest of world know your data skills are second to none.” There have already been lots of great submissions with people using the new SAP Visual Intelligence data discovery product.
I thought I’d focus on one of the things I find most powerful: the ability to create visualizations quickly and easily even from real-life, messy data sources. Since it’s election season in the US, I thought I’d use some polling data on whether voters believe the country is “headed in the right direction.” There is lots of different polling data on this (and other topics) available at pollingreport.com.
Below you can see the data set I grabbed: as you can see, the polling date field is particularly messy, since it has extra letters (e.g. RV for “registered voter”), includes polls that were carried out over several days, and is not consistent (the month is not always included, sometimes spaces around the middle dash, sometimes not…).
Take a closer look at Timo’s definition of “dirty” data: “…polling date field is particularly messy, since it has extra letters (e.g. RV for “registered voter”), includes polls that were carried out over several days, and is not consistent….”
Sure, that’s “dirty” data all right, but only one form of dirty data. It is dirty data that arises from typographical inconsistency. Inconsistency that prevents reliable automated processing.
Another form of dirty data arises from identifier inconsistency. That is, two or more identifiers are used for the same subject, and/or the same identifier is used for different subjects.
I take the second form, identifier inconsistency, to be distinct from typographical inconsistency. The two can turn out to overlap, but conceptually I find it helpful to distinguish them.
Resolution of either form of inconsistency requires judgement about the reference being made by the identifiers.
Question: If you are resolving typographical inconsistency, do you keep a map of the resolution? If not, why not?
Question: Same questions for identifier inconsistency.
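What "keeping a map of the resolution" could look like, in a minimal sketch. The cleanup rules and the sample values are hypothetical (I have not seen the actual data set); the point is only that each judgment is recorded alongside the raw value instead of overwriting it.

```python
import re


def normalize_poll_date(raw):
    """Hypothetical cleanup: tighten spaces around the dash, drop a trailing code such as 'RV'."""
    cleaned = re.sub(r"\s*-\s*", "-", raw.strip())
    cleaned = re.sub(r"\s+[A-Z]{1,3}$", "", cleaned)
    return cleaned


def resolve_all(values, normalize):
    """Return a map from each raw value to the value it was resolved to."""
    resolution_map = {}
    for value in values:
        resolution_map.setdefault(value, normalize(value))
    return resolution_map


# Made-up sample values in the spirit of the messy polling date field.
raw_dates = ["9/13-16/12 RV", "9/7 - 9/12", "9/13-16/12 RV"]
resolution_map = resolve_all(raw_dates, normalize_poll_date)
```

The same shape works for identifier inconsistency: swap the normalizer for whatever decides that two identifiers name the same subject, and the map becomes an auditable record of those decisions.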
Posted in Identifiers, SAP, SAP Visual Intelligence | No Comments »
Monday, September 17th, 2012
Identities and Identifications: Politicized Uses of Collective Identities
Deadline for Panels 15 January 2013
Deadline for Papers 1 March 2013
Conference 18-20 April 2013, Zagreb, Croatia
From the call for panels and papers:
Identity is one of the crown jewelleries in the kingdom of ‘contested concepts’. The idea of identity is conceived to provide some unity and recognition while it also exists by separation and differentiation. Few concepts were used as much as identity for contradictory purposes. From the fragile individual identities as self-solidifying frameworks to layered in-group identifications in families, orders, organizations, religions, ethnic groups, regions, nation-states, supra-national entities or any other social entities, the idea of identity always shows up in the core of debates and makes everything either too dangerously simple or too complicated. Constructivist and de-constructivist strategies have led to the same result: the eternal return of the topic. Some say we should drop the concept, some say we should keep it and refine it, some say we should look at it in a dynamic fashion while some say it’s the reason for resistance to change.
If identities are socially constructed and not genuine formations, they still hold some responsibility for inclusion/exclusion – self/other nexuses. Looking at identities in a research oriented manner provides explanatory tolls for a wide variety of events and social dynamics. Identities reflect the complex nature of human societies and generate reasonable comprehension for processes that cannot be explained by tracing pure rational driven pursuit of interests. The feelings of attachment, belonging, recognition, the processes of values’ formation and norms integration, the logics of appropriateness generated in social organizations are all factors relying on a certain type of identity or identification. Multiple identifications overlap, interact, include or exclude, conflict or enhance cooperation. Identities create boundaries and borders; define the in-group and the out-group, the similar and the excluded, the friend and the threatening, the insider and the ‘other’.
Beyond their dynamic fuzzy nature that escapes exhaustive explanations, identities are effective instruments of politicization of social life. The construction of social forms of organization and of specific social practices together with their imaginary significations requires all the time an essentialist or non-essentialist legitimating act of belonging; a social glue that extracts its cohesive function from the identification of the in-group and the power of naming the other. Identities are political. Multicultural slogans populate extensively the twenty-first century yet the distance between the ideal and the real multiculturalism persists while the virtues of inclusion coexist with the adversity of exclusion. Dealing with the identities means to integrate contestation into contestation until potentially a n degree of contestation. Due to the confusion between identities and identifications some scholars demanded that the concept of identity shall be abandoned. Identitarian issues turned out to be efficient tools for politicization of a ‘constraining dissensus’ while universalizing terms included in the making of the identities usually tend or intend to obscure the localized origins of any identitarian project. Identities are often conceptually used as rather intentional concepts: they don’t say anything about their sphere but rather defining the sphere makes explicit the aim of their usage. It is not ‘identity of’ but ‘identity to’.
Quick! Someone get them a URL!
Just teasing.
Enjoy the conference!
Posted in Identification, Identifiers, Identity | No Comments »
Monday, September 10th, 2012
I happened upon an author who used an arXiv.org author identifier at their webpage.
From the arXiv.org page:
It is a long-term goal of arXiv to accurately identify and disambiguate all authors of all articles in arXiv. Such identification would provide accurate results for queries such as "show me all the other papers by the particular John Smith that wrote this paper", something that can be done only approximately with text-based searches. It would also permit construction of an author-article graph which is useful for relevance assessment and bibliometric analysis.
Since 2005 arXiv has used authority records that associate user accounts with articles authored by that user. These records support the endorsement system. The use of public author identifiers as a way to build services upon this data is new in 2009. Initially, users must opt-in to have a public author identifier and to expose the record of their articles on arXiv for use in other services. At some later date we hope to be able to improve our authority records to the point where we can create public author identifiers for all authors of arXiv articles without needing to enlist the help of each author to check their record before opting in.
The services we offer based on author identifiers are:
Significant enough in its own right but note the plans for the future:
The following enhancements and interoperability features are planned:
- arXiv will permit authors to record other identifiers they have in other schemes and include these in the data feeds. This will allow agents and systems to link together the same author in different databases.
- arXiv will support mechanisms for handling name changes, combination of accidentally created duplicates and separation of accidentally combined identifiers.
Recording other identifiers? What? Acknowledge that there can be more than one identifier (yours) per subject? Blasphemy!
Go arXiv.org!
I am sure suggestions, support, contributions, etc., would be most welcome.
Posted in Identification, Identifiers, Subject Identifiers | No Comments »
Tuesday, September 4th, 2012
I enhanced the VLDB 2012 program with author queries to the DBLP Computer Science Bibliography for my own purposes.
After using that listing myself for a few days, it occurred to me that I should be using DBLP entries as author identifiers throughout my posts, at least when such entries exist.
For several reasons, but mostly:
- DBLP maintains the publication listings (not by me!)
- DBLP maintains pointers to other databases and resources (also not by me!)
- DBLP maintains advanced search capabilities beyond authors (again, not by me!)
If you noticed “not by me” forming a pattern, you would be correct. There is a pattern.
The pattern?
Using DBLP author pages as identifiers, I leverage (not duplicate) the work of the DBLP project.
To the benefit of my readers. (Not to mention myself.)
The DBLP link brings an author’s publication history, their co-authors, and additional bibliographic resources. (That’s a triple I like.)
It takes a moment to insert the link but the payoff is substantial.
When you cite a CS author in your blog, include their DBLP link. We will all thank you for it.
(I did that once upon a time but lapsed. Will be cleaning up older entries and trying to do better in the future.)
PS: Similar sources of identifiers for other disciplines?
Posted in Bibliography, Identifiers | No Comments »
Friday, June 29th, 2012
Bruce: How Well Does Current Legislative Identifier Practice Measure Up?
From Legal Informatics:
Tom Bruce of the Legal Information Institute at Cornell University Law School (LII) has posted Identifiers, Part 3: How Well Does Current Practice Measure Up?, on LII’s new legislative metadata blog, Making Metasausage.
In this post, Tom surveys legislative identifier systems currently in use. He recommends the use of URIs for legislative identifiers, rather than URLs or URNs.
He cites favorably the URI-based identifier system that John Sheridan and Dr. Jeni Tennison developed for the Legislation.gov.uk system. Tom praises Sheridan’s (here) and Tennison’s (here and here) writings on legislative URIs and Linked Data.
Tom also praises the URI system implemented by Dr. Rinke Hoekstra in the Leibniz Center for Law‘s Metalex Document Server for facilitating point-in-time as well as point-in-process identification of legislation.
Tom concludes by making a series of recommendations for a legislative identifier system:
See the post for his recommendations (in case you are working on such a system) and for other links.
I would point out that existing legislation has identifiers from before it receives the “better” identifiers specified here.
And those “old” identifiers will have been incorporated into other texts, legal decisions and the like.
Oh.
We can’t re-write existing identifiers so it’s a good thing topic maps accept subjects having identifiers, plural.
Posted in Identifiers, Law, Law - Sources, Legal Informatics | No Comments »
Sunday, June 10th, 2012
Deconstructing the Google Knowledge Graph
Mike Bergman has some interesting observations on the Google Knowledge Graph, first on its coverage and then on how it is constructing URLs for nodes in its graph.
I have to second his call for Google to release its identifiers via an API. That would be a real boon for common entities.
I say common entities because having “millions” of identifiers is fairly trivial when you consider the number of objects captured every night by optical astronomers alone. Or sequencing genomes.
Not to discount the value of a common identifier for Lady Gaga but uncommon entities need identifiers too.
Gabriel Hopmans pointed me to this post. (Morpheus)
Posted in Google Knowledge Graph, Identifiers | No Comments »
Saturday, May 26th, 2012
Outlier detection in two review articles (Part 2) by Sandro Saitta.
From the post:
Here we go with the second review article about outlier detection (this post is the continuation of Part I).
A Survey of Outlier Detection Methodologies
This paper, from Hodge and Austin, is also an excellent review of the field. Authors give a list of keywords in the field: outlier detection, novelty detection, anomaly detection, noise detection, deviation detection and exception mining. For the authors, “An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs (Grubbs, 1969)”. Before listing several application in the field, authors mention that an outlier can be “surprising veridical data“. It may only be situated in the wrong class.
An interesting list of possible reasons for outliers is given: human error, instrument error, natural deviations in population, fraudulent behavior, changes in behavior of system and faults in system. Like in the first article, Hodge and Austin define three types of approaches to outlier detection (unsupervised, supervised and semi-supervised). In the last one, they mention that some algorithms can allow a confidence in the fact that the observation is an outlier. Main drawback of the supervised approach is its inability to discover new types of outliers.
While you are examining the techniques, do note the alternative ways to identify the problem.
Can you say topic map?
Simple query expansion, assuming that any single term returns hundreds of papers, isn’t all that helpful. Instead of several hundred papers you get several thousand. Gee, thanks.
But that isn’t an indictment of alternative identifications of subjects, that is a problem of granularity.
Returning documents forces users to wade through large amounts of potentially irrelevant content.
The question is how to retain alternative identifications of subjects while returning a manageable (or configurable) amount of content?
Suggestions?
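For anyone who wants to poke at the problem, here is a toy sketch of simple query expansion over the alternative terms listed in the review. The three "documents" are invented for illustration, and matching is plain substring search. Note that it still returns whole documents, which is the granularity problem described above.

```python
# Alternative identifications of the same subject, from the keywords in the review.
ALTERNATIVE_TERMS = {
    "outlier detection": [
        "outlier detection",
        "novelty detection",
        "anomaly detection",
        "noise detection",
        "deviation detection",
        "exception mining",
    ],
}


def expand(query):
    """Return every known alternative term for a query, or the query itself."""
    key = query.lower()
    return ALTERNATIVE_TERMS.get(key, [key])


def search(query, documents):
    """Map each matching document id to the alternative terms found in it."""
    terms = expand(query)
    results = {}
    for doc_id, text in documents.items():
        lowered = text.lower()
        matched = [term for term in terms if term in lowered]
        if matched:
            results[doc_id] = matched
    return results


# Invented sample documents.
documents = {
    "a": "A survey of outlier detection methodologies",
    "b": "Novelty detection in sensor data",
    "c": "An introduction to topic maps",
}
results = search("outlier detection", documents)
```

Recording which alternative term matched each document is a small first step toward returning something narrower than the document itself.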
Posted in Identifiers, Outlier Detection, Topic Maps | No Comments »
Friday, May 25th, 2012
Bruce on Legislative Identifier Granularity
From the post:
In this post, Tom [Bruce] explores legislative identifier granularity, or the level of specificity at which such an identifier functions. The post discusses related issues such as the incorporation of semantics in identifiers; the use of “pure” (semantics-free) legislative identifiers; and how government agency authority and procedural rules influence the use, “persistence, and uniqueness” of identifiers. The latter discussion leads Tom to conclude that
a “gold standard” system of identifiers, specified and assigned by a relatively independent body, is needed at the core. That gold standard can then be extended via known, stable relationships with existing identifier systems, and designed for extensible use by others outside the immediate legislative community.
Interesting and useful reading.
Even though a “gold standard” of identifiers for something as dynamic as legislation isn’t likely.
Or rather, isn’t going to happen.
There are too many stakeholders in present systems for any proposal to carry the day.
Not to mention decades, if not centuries, of references in other systems.
Posted in Identifiers, Law, Law - Sources, Legal Informatics | No Comments »
Wednesday, May 9th, 2012
Bruce on the Functions of Legislative Identifiers
From Legal Informatics:
In this post, Tom [Bruce] discusses the multiple functions that legislative document identifiers serve. These include “unique naming,” “navigational reference,” “retrieval hook / container label,” “thread tag / associative marker,” “process milestone,” and several more.
A promised second post will examine issues of identifier design.
Enjoy and pass along!
Posted in Identifiers, Law, Law - Sources, Legal Informatics | No Comments »
Thursday, May 3rd, 2012
Well, Ed Lindsey did call his column 20 Common Data Errors and Variation but when you see the PNG of the 20 errors, here, you will agree my title works better (for topic maps anyway).
Not only that, but Ed’s opening paragraphs work for identifying a subject by more than one attribute (although this is “subject” in the police sense of the word):
A good friend of mine’s husband is a sergeant on the Chicago police force. Recently a crime was committed and a witness insisted that the perpetrator was a woman with blond hair about five nine weighing 160 pounds. She was wearing a gray pinstriped business suit with an Armani scarf and carrying a Gucci handbag.
So what does this sergeant have to do? Start looking at the women of Chicago. He only needs the women. Actually, he would start with women with blond hair (but judging from my daughter’s constant change of hair color he might skip that attribute). So he might start with women in a certain height range and in a certain weight group. He would bring those women in to the station for questioning.
As it turns out, when they finally arrested the woman at her son’s soccer game, she had brown hair, was 5’5″ tall and weighed 120 pounds. She was wearing an Oklahoma University sweatshirt, jeans and sneakers. When the original witness saw her she said yes that’s the same woman. It turns out she was wearing four inch heels and the pantsuit made her look bigger.
So what can we learn from this episode that has to do with matching? Well the first thing we need to understand is that each of the attributes of the witness can be used in matching the suspect and then immediately we must also recognize that not all the attributes that the witness gave the sergeant were extremely accurate. So later on when we start talking about matching, will use the term fuzzy matching. This means that when you look at an address, there could be a number of different types of errors in the address from one system that are not identical to an address in another system. Figure 1 shows a number of the common errors that can happen.
So, there you have it: 20 more reasons to use topic maps, a lesson on identifying a subject and proof that yes, a pinstriped pantsuit can make you look bigger.
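To make the “fuzzy matching” of addresses mentioned in the quote a little more concrete, here is a minimal sketch using Python's standard difflib module. It is not Ed Lindsey's method. The normalization rules, the function names and the 0.85 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(address: str) -> str:
    """Lower-case, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in address.lower()
    )
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means identical after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_probable_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two addresses as the same subject when they are similar enough."""
    # The threshold is an arbitrary illustration, not a recommended value.
    return similarity(a, b) >= threshold


# A transposition ("Mian" for "Main") still scores above the threshold.
print(is_probable_match("123 Main Street", "123 Mian Street"))
```

Like the witness, a matcher of this kind works from imperfect attributes, so the threshold trades missed matches against false ones.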
Posted in Identification, Identifiers, Identity, Marketing, Topic Maps | No Comments »
Sunday, April 29th, 2012
Legal Entity Identifier – Preparing for the Inevitable by Peter Ku.
From the post:
Most of the buzz around the water cooler for those responsible for enterprise reference data in financial services has been around the recent G20 meeting in Switzerland on the details of the proposed Legal Entity Identifier (LEI). The LEI is designed to help regulators manage and monitor systemic risk in the financial markets by creating a unique ID to recognize legal entities/counterparties shared by the global financial companies and government regulators. Agreement to adoption is expected to be decided at the G20 leaders’ summit coming up in June in Mexico as regulators decide the details as to the administration, implementation and enforcement of the standard. Will the new LEI solve the issues that led to the recent financial crisis?
Looking back at history, this is not the first time the financial industry has attempted to create a unique ID system for legal entities, remember the Data Universal Numbering System (DUNS) identifier as an example? What is different from the past is that the new LEI standard is set at a global vs. regional level which had caused past attempts to fail. Unfortunately, the LEI standard will not replace existing IDs that firms deal with every day. Instead, it creates further challenges requiring companies to map existing IDs to the new LEI, reconciling naming differences, maintain legal hierarchy relationships between parent and subsidiary entities from ongoing corporate actions, and also link it to the securities and loans to the legal entities.
….
While many within the industry are waiting to see what the regulators decide in June, existing issues related to the quality, consistency, and delivery of counterparty reference data and the downstream impact on managing risk needs to be dealt with regardless if LEI is passed. In the same report, I shared the challenges firms will face incorporating the LEI including:
- Accessing, reconciling, and relating existing counterparty information and IDs to the new LEI
- Effectively identifying and resolving data quality issues from external and internal systems
- Accurately identifying legal hierarchy relationships which LEI will not maintain in its first instantiation.
- Cross referencing legal entities with financial and securities instruments
- Extending both counterparty and securities instruments to downstream front, mid, and back office systems.
As a topic map person, do any of these issues sound familiar to you?
In particular creating a new identifier to solve problems with resolving multiple “old” ones?
Being mindful that all data systems are capable of and/or contain errors, intentional (dishonest) and otherwise.
Presuming perfect records, and perfect data in those records, not only guarantees failure, but avenues for abuse.
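As a rough sketch of what “relating existing counterparty information and IDs to the new LEI” involves, the class below keeps every identifier and records which ones identify the same entity, rather than replacing old identifiers with a new one. The class name, the scheme labels and the identifier values are hypothetical, and nothing here reflects the actual LEI specification.

```python
from __future__ import annotations

from collections import defaultdict


class IdentifierCrosswalk:
    """Track which identifiers, from any scheme, refer to the same entity."""

    def __init__(self) -> None:
        self._entity_of: dict[tuple[str, str], int] = {}
        self._ids_of: dict[int, set[tuple[str, str]]] = defaultdict(set)
        self._next_entity = 0

    def link(self, *identifiers: tuple[str, str]) -> int:
        """Record that all given (scheme, value) pairs identify one entity."""
        known = {self._entity_of[i] for i in identifiers if i in self._entity_of}
        if known:
            entity = min(known)
        else:
            entity = self._next_entity
            self._next_entity += 1
        # Fold any previously separate entities into the surviving one.
        for other in known - {entity}:
            for ident in self._ids_of.pop(other):
                self._entity_of[ident] = entity
                self._ids_of[entity].add(ident)
        for ident in identifiers:
            self._entity_of[ident] = entity
            self._ids_of[entity].add(ident)
        return entity

    def aliases(self, identifier: tuple[str, str]) -> set[tuple[str, str]]:
        """Return every identifier known for the same entity, or an empty set."""
        entity = self._entity_of.get(identifier)
        return set(self._ids_of[entity]) if entity is not None else set()


# Hypothetical values: an old identifier, a new one, and an in-house key.
xwalk = IdentifierCrosswalk()
xwalk.link(("DUNS", "000000001"), ("LEI", "EXAMPLE0000000000001"))
xwalk.link(("INTERNAL", "C-42"), ("DUNS", "000000001"))
print(sorted(xwalk.aliases(("INTERNAL", "C-42"))))
```

The sketch trusts every link it is given. Since records can be wrong, whether by accident or by intent, a real system would also need to record who asserted each link and allow links to be withdrawn.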
Peter cites resources you will need to read.
Posted in Identifiers, Law, Legal Entity Identifier (LEI), Legal Informatics | No Comments »
Friday, April 20th, 2012
Without restricting it to being machine readable, I think we would all agree there are three ages of data:
- Past data
- Present data
- Future data
And we have common goals for data (or parts of it):
- Past data – To understand past data.
- Present data – To be understood by others.
- Future data – For our present data to persist and be understood by then-current users.
Common to those ages and goals is the need for management of identifiers for our data. (Where identifiers may be data as well.)
I say “management of identifiers” because we cannot control identifiers used in the past, identifiers used by others in the present, or identifiers that may be used in the future.
You would think in an obviously multi-lingual world that multiple identifier identification would be the default position.
Just a personal observation but hardly a day passes without someone or some group saying the equivalent of:
“I know! I will create a list of identifiers that everyone must use! That’s the answer to the confusion (Babel) of identifiers.”
Such efforts are always defeated by past identifiers, other identifiers in the present and future identifiers.
Managing tides of identifiers is a partial solution but more workable than trying to stop the tide.
What do you think?
Posted in Identification, Identifiers, Identity | No Comments »
Wednesday, April 18th, 2012
David Loshin has a series of posts going at the Data Roundtable:
The Perils of Bad Names
and
The Impact of Data Element Renaming…
In “Bad Names,” David cites this example:
An example of this might be a column named “STREET_ADDRESS,” but that instead of that field holding a street number and name, it contains a set of flags indicating the types of customer correspondences that are to be sent to a home address instead of an email address. From one perspective, our assumption about what was stored in that field were mistaken, but on the other hand, conventional wisdom might have suggested otherwise.
I would agree that at least looks like a bad name. Moreover, it’s one that is likely to trip up successors who have to deal with the data set.
David goes on to argue in “Renaming,” that finding and replacing all the uses of this name may lead to worse problems.
Ah, after thinking about it for a bit, I can see he has a point.
How about you?
Posted in Identifiers, Names | No Comments »
Friday, April 6th, 2012
I was reminded of the title quote when I read Richard Wallis’s: A Fundamental Linked Data Debate.
Contrary to Richard’s imaginings, the vast majority of people on and off the Web are not waiting for the debates on the W3C’s Technical Architecture (TAG) or Linked Open Data (public-lod) mailing lists to be resolved.
Why?
They had identifiers for subjects long before the WWW, Semantic Web, Linked Data or whatever and will have identifiers for subjects long after those efforts and their successors are long forgotten.
Some of those identifiers are still in use today and will survive well into the future. Others are historical curiosities.
Moreover, when it was necessary to distinguish between identifiers and the things identified, that need was met.
Enter the WWW and its poster child, Tim Berners-Lee.
It was Tim Berners-Lee who created the problem Richard frames as: “the difference between a thing and a description of that thing.”
Amazing how much fog of discussion there has been to cover up that amateurish mistake.
The problem isn’t one of conflicting world views (a la Jeni Tennison) but rather, given a bare URI, how to interpret it? Given the bad choices made in the Garden of the Web as it were.
That we simply abandon bare URIs as a solution has never darkened their counsel. They would rather impose the 303/TBL burden on everyone than admit to fundamental error.
I have a better solution.
The rest of us should carry on with the identifiers that we want to use, whether they be URIs or not. Whether they are prior identifiers or new ones. And we should put forth statements/standards/documents to establish how in our contexts, those identifiers should be used.
If IBM, Oracle, Microsoft and a few other adventurers decide that IT can benefit from some standard terminology, I am sure they can influence others to use it. Whether composed of URIs or not. And the same can be said for many other domains, most of whom will do far better than the W3C at fashioning identifiers for themselves.
Take heart TAG and LOD advocates.
As the poem says: “Give me your tired, your poor, your huddled identifiers yearning to be used.”
Someday your identifiers will be preserved as well.
Posted in Identifiers, RDF, Semantic Web | No Comments »
Friday, April 6th, 2012
URN:LEX: New Version 06 Available
From the purpose of the namespace “lex:”
The purpose of the “lex” namespace is to assign an unequivocal identifier, in standard format, to documents that are sources of law. To the extent of this namespace, “sources of law” include any legal document within the domain of legislation, case law and administrative acts or regulations; moreover potential “sources of law” (acts under the process of law formation, as bills) are included as well. Therefore “legal doctrine” is explicitly not covered.
The identifier is conceived so that its construction depends only on the characteristics of the document itself and is, therefore, independent from the document’s on-line availability, its physical location, and access mode.
This identifier will be used as a way to represent the references (and more generally, any type of relation) among the various sources of law. In an on-line environment with resources distributed among different Web publishers, uniform resource names allow simplified global interconnection of legal documents by means of automated hypertext linking.
If creating names just for law “sources” sounds like low-hanging fruit to you, take some time to become familiar with the latest draft.
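The separation of name from location described in the draft can be illustrated with a small sketch. It only splits a URN into its namespace identifier and namespace-specific string and checks for the “lex” namespace. It does not implement the internal structure that the draft defines for that string. The example name and the example.org locations are made up.

```python
def split_urn(urn: str) -> tuple[str, str]:
    """Split a URN into (namespace identifier, namespace-specific string)."""
    scheme, _, rest = urn.partition(":")
    nid, sep, nss = rest.partition(":")
    if scheme.lower() != "urn" or not sep or not nid or not nss:
        raise ValueError(f"not a URN: {urn!r}")
    # The "urn" prefix and the namespace identifier are case-insensitive.
    return nid.lower(), nss


def is_lex_name(urn: str) -> bool:
    """True when the URN belongs to the "lex" namespace."""
    try:
        nid, _ = split_urn(urn)
    except ValueError:
        return False
    return nid == "lex"


# Hypothetical name and locations: the name stays fixed while copies move.
name = "urn:lex:xx:example.authority:act:2012-01-01;1"
locations = {
    name: [
        "https://publisher-a.example.org/acts/2012/1",
        "https://publisher-b.example.org/law?id=1",
    ]
}
print(is_lex_name(name), len(locations[name]))
```

References between sources of law would cite the name. The lookup from a name to its current locations can then change without touching those references.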
Posted in Identifiers, Law, Law - Sources, Legal Informatics | No Comments »
Thursday, March 15th, 2012
Data and Reality: A Timeless Perspective on Data Management by Steve Hoberman.
I remember William Kent, the original author of “Data and Reality” from a presentation he made in 2003, entitled: “The unsolvable identity problem.”
His abstract there read:
The identity problem is intractable. To shed light on the problem, which currently is a swirl of interlocking problems that tend to get tumbled together in any discussion, we separate out the various issues so they can be rationally addressed one at a time as much as possible. We explore various aspects of the problem, pick one aspect to focus on, pose an idealized theoretical solution, and then explore the factors rendering this solution impractical. The success of this endeavor depends on our agreement that the selected aspect is a good one to focus on, and that the idealized solution represents a desirable target to try to approximate as well as we can. If we achieve consensus here, then we at least have a unifying framework for coordinating the various partial solutions to fragments of the problem.
I haven’t read the “new” version of “Data and Reality” (just ordered a copy) but I don’t recall the original needing much in the way of changes.
The original carried much the same message, that all of our solutions are partial even within a domain, temporary, chronologically speaking, and at best “useful” for some particular purpose. I rather doubt you will find that degree of uncertainty being confessed by the purveyors of any current semantic solution.
I did pull my second edition off the shelf and with free shipping (5-8 days), I should have time to go over my notes and highlights before the “new” version appears.
More to follow.
Posted in Books, Data Models, Data Science, Identifiers, Identity, Subject Identifiers, Subject Identity | No Comments »
Tuesday, March 13th, 2012
Then BI and Data Science Thinking Are Flawed, Too
Steve Miller writes:
I just finished an informative read entitled “Everything is Obvious: *Once You Know the Answer – How Common Sense Fails Us,” by social scientist Duncan Watts.
Regular readers of Open Thoughts on Analytics won’t be surprised I found a book with a title like this noteworthy. I’ve written quite a bit over the years on challenges we face trying to be the rational, objective, non-biased actors and decision-makers we think we are.
So why is a book outlining the weaknesses of day-to-day, common sense thinking important for business intelligence and data science? Because both BI and DS are driven from a science of business framework that formulates and tests hypotheses on the causes and effects of business operations. If the thinking that produces that testable understanding is flawed, then so will be the resulting BI and DS.
According to Watts, common sense is “exquisitely adapted to handling the kind of complexity that arises in everyday situations … But ‘situations’ involving corporations, cultures, markets, nation-states, and global institutions exhibit a very different kind of complexity from everyday situations. And under these circumstances, common sense turns out to suffer from a number of errors that systematically mislead us. Yet because of the way we learn from experience … the failings of commonsense reasoning are rarely apparent to us … The paradox of common sense, therefore, is that even as it helps us make sense of the world, it can actively undermine our ability to understand it.”
The author argues that common sense explanations to complex behavior fail in three ways. The first error is that the mental model of individual behavior is systematically flawed. The second centers on explanations for collective behavior that are even worse, often missing the “emergence” – one plus one equals three – of social behavior. And finally, “we learn less from history than we think we do, and that misperception skews our perception of the future.”
Reminds me of Thinking, Fast and Slow by Daniel Kahneman.
Not that two books with a similar “take” prove anything but you should put them on your reading list.
I wonder when/where our perceptions of CS practices have been skewed?
Or where that has played a role in our decision making about information systems?
Posted in Identification, Identifiers, Marketing, Subject Identifiers, Subject Identity | No Comments »
Wednesday, February 1st, 2012
Yesterday I closed with these lines:
Requirement: A system of identification must support the same identifiers resolving to different identifications.
The consequences of deciding otherwise on such a requirement, I will try to take up tomorrow. (Multiple Recognitions)
Rereading that for today’s post, I don’t agree with myself.
The requirement isn’t a requirement at all but an observation that the same identifier may have multiple resolutions.
Better to say that the designer of systems of identification should be aware of that observation, to avoid situations like the one I posed yesterday with the “I will call you a cab” example.
A fortuitous mistake because it leads to the next issue that I wanted to address: Do identifiers have contexts in which they have only a single resolution?
Yesterday’s mistake has made me more wary of sweeping pronouncements so I am posing the context issue as a question.
Can you think of any counter-examples?
The easiest place to look would be in comedy, where mistaken identity (such as in Shakespeare), double meanings, etc., are bread and butter of the art. Two or more people hear or see the same identifier and reach different resolutions.
In those cases, if we had a rule that identifiers could only have a single resolution, we would have to simply skip over those cases. That seems like an inelegant solution.
Or would you shrink the context down to the individuals who had the different resolutions of an identifier?
Perhaps, perhaps but then what is your solution when later in the play one or more individuals discover their mistake and now hold a common resolution but still remember the one that was in error? Or perhaps more than one that was in error? How do we describe the context(s) there?
There is a long history of such situations in comedy. You may be tempted to say that recreational literature can be excluded. That “fictional” work isn’t the first place we want semantic technologies to work.
Perhaps but remember that comedy and “fiction” have their origin in our day to day affairs. The misunderstandings they parody are our misunderstandings.
The saying: “what did X know and when did they know it?” takes on new meaning when we talk about the interpretation of identifiers. Perhaps “freedom fighter” is a more sympathetic term until you “know” those forces are operating death squads. And may have different legal consequences.
How do you think boundaries for contexts should be set/designated? Seems like that would be an important issue to take up.
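One way to experiment with the context question is to record resolutions per (identifier, context) pair and then ask whether any context really has only one. The sketch below is a toy, not a proposal for how contexts should be bounded. The class, the context labels and the resolutions are all hypothetical.

```python
from collections import defaultdict


class Resolutions:
    """Record what an identifier resolves to, per context."""

    def __init__(self) -> None:
        self._by_context = defaultdict(set)

    def record(self, identifier: str, context: str, subject: str) -> None:
        """Note that, in this context, the identifier was taken to mean subject."""
        self._by_context[(identifier, context)].add(subject)

    def resolve(self, identifier: str, context: str) -> set:
        """Return every recorded resolution; the set may be empty."""
        return set(self._by_context.get((identifier, context), set()))

    def is_unambiguous(self, identifier: str, context: str) -> bool:
        """True only when exactly one resolution is recorded for the context."""
        return len(self.resolve(identifier, context)) == 1


r = Resolutions()
phrase = "I will call you a cab"
r.record(phrase, "hotel lobby", "summon a taxi for the listener")
r.record(phrase, "comedy sketch", "summon a taxi for the listener")
r.record(phrase, "comedy sketch", "address the listener as 'a cab'")
print(r.is_unambiguous(phrase, "hotel lobby"))
print(r.is_unambiguous(phrase, "comedy sketch"))
```

Shrinking the context to an individual listener only adds more keys. Remembering a resolution that was later found to be mistaken would need a time dimension that this sketch does not have.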
Posted in Context, Identification, Identifiers, Semantics | No Comments »
Wednesday, January 18th, 2012
Flake: A Decentralized, K-Ordered Unique ID Generator in Erlang
From the post:
At Boundary we have developed a system for unique id generation. This started with two basic goals:
- Id generation at a node should not require coordination with other nodes.
- Ids should be roughly time-ordered when sorted lexicographically. In other words they should be k-ordered 1, 2.
All that is required to construct such an id is a monotonically increasing clock and a location 3. K-ordering dictates that the most-significant bits of the id be the timestamp. UUID-1 contains this information, but arranges the pieces in such a way that k-ordering is lost. Still other schemes offer k-ordering with either a questionable representation of ‘location’ or one that requires coordination among nodes.
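The idea of putting the timestamp in the most-significant bits can be sketched in a few lines. This is Python rather than Erlang, and it is not Boundary's implementation. The 128-bit layout (64-bit millisecond timestamp, 48-bit node id, 16-bit sequence), the class name and the use of uuid.getnode() as the “location” are assumptions made for illustration.

```python
import threading
import time
import uuid


class KOrderedIdGenerator:
    """Issue ids whose high bits are a timestamp, so they sort by time."""

    def __init__(self, node_id=None, clock=None):
        # Assumed "location": the hardware address, truncated to 48 bits.
        # uuid.getnode() may fall back to a random value on some systems.
        node = uuid.getnode() if node_id is None else node_id
        self._node = node & ((1 << 48) - 1)
        self._clock = clock or (lambda: time.time_ns() // 1_000_000)
        self._lock = threading.Lock()
        self._last_ms = -1
        self._seq = 0

    def next_id(self) -> int:
        """Return a 128-bit integer: timestamp, then node id, then sequence."""
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # K-ordering depends on a clock that never goes backwards.
                raise RuntimeError("clock moved backwards; refusing to issue ids")
            if now == self._last_ms:
                self._seq += 1
                if self._seq >= 1 << 16:
                    raise RuntimeError("sequence exhausted for this millisecond")
            else:
                self._last_ms = now
                self._seq = 0
            return (now << 64) | (self._node << 16) | self._seq


def to_sortable_text(id_: int) -> str:
    """Fixed-width hex, so lexicographic order matches numeric order."""
    return format(id_, "032x")


gen = KOrderedIdGenerator()
print(to_sortable_text(gen.next_id()))
```

No coordination between nodes is needed because the node id keeps ids from different nodes distinct. Ids from different nodes are only roughly time-ordered, since their clocks are not synchronized by this code.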
Just in case you are looking for a decentralized source of K-ordered unique IDs.
First seen at: myNoSQL as: Flake: A Decentralized, K-Ordered Unique ID Generator in Erlang.
Posted in Erlang, Identifiers | No Comments »
Tuesday, December 27th, 2011
Thinking, Fast and Slow by Daniel Kahneman, Farrar, Straus and Giroux, New York, 2011.
I got a copy of “Thinking, Fast and Slow” for Christmas and it has already proven to be an enjoyable read.
Kahneman says early on (page 28):
The premise of this book is that it is easier to recognize other people’s mistakes than our own.
I thought about that line when I read a note from a friend that topic maps needed more than my:
tagging everything with “Topic Maps….”
Which means I haven’t been clear about the reasons for the breadth of materials I have and will be covering in this blog.
One premise of this blog is that the use and recognition of identifiers is essential for communication.
Another premise of this blog is that it is easier for us to study the use and recognition of identifiers by others, much for the same reasons we can recognize the mistakes of others more easily.
The use and recognition of identifiers by others aren’t mistakes but they may be different from those we would make. In cases where they differ from ours, we have a unique opportunity to study the choices made and the impacts of those choices. And we may learn patterns in those choices that we can eventually see in our own choices.
Understanding the use and recognition of identifiers in a particular circumstance, and the requirements for that use and recognition, is the first step towards deciding whether, and in what way, topic maps would be useful in that circumstance.
For example, when processing social security records in the United States, anything other than “bare” identifiers like a social security number may be unnecessary and add load with no corresponding benefit. Aligning social security records with bank records might require reconsidering the judgement to use only social security numbers. (Some information sharing is “against the law.” But as the Sheriff in “O Brother, Where Art Thou?” says: “The law is a man made thing.” Laws change, or you can commission absurdist interpretations of them.)
Topic maps aren’t everywhere but identifiers and recognition of identifiers are.
Understanding identifiers and their recognition will help you choose the most appropriate solution to a problem.
Posted in Identification, Identifiers, Marketing, Subject Identifiers, Subject Identity, Topic Maps | 3 Comments »
Monday, October 24th, 2011
OCLC Developer Network
From the webpage:
The OCLC Developer Network is a community of developers collaborating to propose, discuss and test OCLC Web Services. This open source, code-sharing infrastructure improves the value of OCLC data for all users by encouraging new OCLC Web Service uses.
Thought while I was looking at OCLC resources I might as well give a shout out to the OCLC Developer Network. A community that has an interest in identifiers and identification for the purpose of furthering access to information. Who could be more sympathetic to topic maps?
Posted in Identification, Identifiers, Library Associations, OCLC Number | No Comments »
Monday, October 24th, 2011
WorldCat Identities Network
A project of OCLC Research, the WorldCat Identities Network is described as:
The WorldCat Identity Network uses the WorldCat Identities Web Service and the WorldCat Search API to create an interactive Related Identity Network Map for each Identity in the WorldCat Identities database. The Identity Maps can be used to explore the interconnectivity between WorldCat Identities.
A WorldCat Identity can be a person, a thing (e.g., the Titanic), a fictitious character (e.g., Harry Potter), or a corporation (e.g., IBM).
I can’t claim to be a fan of jumpy network node displays but that isn’t a criticism, more a matter of personal taste. Some people find that sort of display quite useful.
The information conveyed, leaving display to one side, is quite interesting. It has just enough fuzziness (to me at any rate) to approach the experience of serendipitous discovery using more traditional library tools. I suspect that will vary from topic to topic but that was my experience with briefly using the interface.
Despite my misgivings about the interface, I will be returning to explore this service fairly often.
BTW, the service is obviously mis-named. What is being delivered is what we used to call “see also” or related references, thus: WorldCat “See Also” Network would be a more accurate title.
For class:
- Spend at least an hour with the service and write a 2 page summary of what you liked/disliked about it. (no citations)
- What subject/relationship did you choose to follow? Discover anything you did not expect? 1 page (no citations)
Posted in Associations, Identification, Identifiers | No Comments »
Thursday, September 8th, 2011
I was picking tomatoes in the garden when I thought about telling Carol (my wife) the plants are about to stop producing.
Those plants are at a particular address, in the backyard, middle garden bed of three, are of three different varieties, but I am going to sum up those properties by saying: “The tomatoes are about to stop producing.”
It occurred to me that a subjectIdentifier could be assigned to a topic element on the basis of summing up properties of the topic.* That would have the advantage of enabling merging on the basis of subjectIdentifiers as opposed to more complex tests upon properties of a topic.
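A minimal sketch of that idea, with topics modelled as plain dictionaries rather than topic map elements: selected properties are normalized and hashed into a single identifier, and merging then needs only an identifier comparison. The function names, the choice of SHA-256, the urn:example: prefix and the sample properties are all assumptions for illustration.

```python
import hashlib
import json


def derive_subject_identifier(properties: dict, keys: list) -> str:
    """Sum up the selected properties of a topic as one opaque identifier."""
    # Normalize case and whitespace so trivial variations do not block merging.
    # A topic that lacks one of the keys raises KeyError.
    summary = {k: " ".join(properties[k].lower().split()) for k in sorted(keys)}
    encoded = json.dumps(summary, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(encoded).hexdigest()
    return f"urn:example:subject:{digest[:16]}"


def merge_by_identifier(topics: list, keys: list) -> dict:
    """Merge topics whose derived identifiers are equal."""
    merged = {}
    for topic in topics:
        sid = derive_subject_identifier(topic, keys)
        # Later topics overwrite earlier values for the same property name.
        merged.setdefault(sid, {}).update(topic)
    return merged


topics = [
    {"address": "12 Garden Lane", "bed": "middle", "crop": "tomatoes"},
    {"address": "12  garden lane", "bed": "Middle", "status": "about to stop producing"},
]
result = merge_by_identifier(topics, keys=["address", "bed"])
print(len(result))
```

Consumers of the identifiers see only the digest, so the properties and tests behind it can stay private or be published, as the producer prefers.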
Disclosure of the basis for assignment of a subjectIdentifier is an interesting question.
It could be that a service wishes to produce subjectIdentifiers and index information based upon complex property measures, producing for consumption, the subjectIdentifiers and merge-capable indexes on one or more information sets. The basis for merging being the competitive edge offered by the service.
If promoting merging with a vendor’s process or format, which is seeking to become the TCP/IP of some area, the basis for merging and tools to assist with it will be supplied.
Or if you are an intelligence agency and you want an inward and outward facing interface that promotes merging of information but does not disclose your internal basis for identification, variants of this technique may be of interest.
*The notion of summing up imposes no prior constraints on the tests used or the location of the information subjected to those tests.
Posted in Identification, Identifiers, Intelligence, Subject Identifiers, Subject Identity | No Comments »