Archive for the ‘Identifiers’ Category

The Secret Life of Bar Codes

Thursday, June 8th, 2017

The Secret Life of Bar Codes by Richard Baguley.

From the post:

Some technologies you use every day, but without thinking about them. The bar code is one of these: everything you buy has one of these black and white striped codes on it. We’ve all seen how they are used: the cashier scans the code and the details and price pop up on the screen. The bar code identifies one product of the millions that are on sale at any one time. How does it do that? Let’s find out.

The bar code

The bar code itself is very simple: a series of black and white stripes of varying width. These are scanned by a bar code reader. Here, a rapidly moving laser passes over the code, and a sensor detects the reflection, picking up the alternating pattern of light and dark. A computer translates the differences between the widths of the patterns into numbers. One pattern is translated into a 0, another into a 1, another into a 2 and so on. You don’t have to have a laser to read a bar code, though. There are plenty of apps that can find a bar code in a picture taken with the onboard camera.

(image omitted)

The type of bar code used on products is known as a linear code, because you read it from left to right in a straight line. There are many other types that work differently and that can encode more data, from the QR codes that often contain website addresses to more arcane types such as the Maxicode that UPS uses to store the delivery address on package labels. These aren’t used on products you buy in a store, though, because all the bar code needs to contain for a product sold in the US is a single 12-digit number, called the Universal Product Code (UPC).

The structure of the bar code on things that you buy at the store is based on the UPC-A standard UPC code, This is 12 digits long, but there is a version that shortens this down to seven digits (called the UPC-E standard) by removing some of the data. This makes the bar code smaller, which is useful for smaller products like candy bars or chewing gum.

Be careful with Baguley’s post. The links I saw while copying are full of tracking BS, which I removed from the quote you see.

Still, a nice treatment of bar codes, one form of identifiers and a common one.

The Barcode Island is a treasure trove of information on bar codes. See also Bar Code 1, which has a link to Magazines on Barcodes.

I can understand and appreciate people learning assembly but magazines on bar codes? Yikes! Different strokes I guess.

Enjoy!

URLs Are Porn Vulnerable

Monday, June 22nd, 2015

Graham Cluley reports in Heinz takes the heat over saucy porn QR code that some bottles of Heinz Hot Ketchup provide more than “hot” ketchup. The QR code on the bottle leads to a porn site. (It is hard to put a “prize” in a ketchup bottle.)

Graham observes a domain registration lapsed for Heinz and the new owner wasn’t in the same line of work.

Are you presently maintaining every domain you have ever registered?

The lesson here is that URLs (as identifiers) are porn vulnerable.

Identifiers as Shorthand for Identifications

Tuesday, June 2nd, 2015

I closed Identifiers vs. Identifications? saying:

Many questions remain, such as how to provide for collections of sets “of properties which provide clues for establishing identity?,” how to make those collections extensible?, how to provide for constraints on such sets?, where to record “matching” (read “merging”) rules?, what other advantages can be offered?

In answering those questions, I think we need to keep in mind that identifiers and identifications lie along a continuum that runs from where we “know” what is meant by an identifier to where we ourselves need a full identification to know what is being discussed. A useful answer won’t be one or the other, but a pairing that suits a particular circumstance and use case.

You can also think of identifiers as a form of shorthand for an identification. If we were working together in a fairly small office, you would probably ask, “Is Patrick in?” rather than listing all the properties that would serve as an identification for me. So all the properties that make up an identification are unspoken but invoked by the use of the identifier.

Works quite well in a small office because to some varying degree, we would all share the identifications that are represented by the identifiers we use in everyday conversation.

That sharing of identifications behind identifiers doesn’t happen in information systems, unless we have explicitly added identifications behind those identifiers.

One problem we need to solve is how to associate an identification with an identifier or identifiers. Looking only slightly ahead, we could use an explicit mechanism like a TMDM association, if we wanted to be able to talk about the subject of the relationship between an identifier and the identification that lies behind it.

But we are not compelled to talk about such a subject and could declare by rule that within a container, an identifier is a shorthand for properties of an identification in the same container. That assumes the identifier is distinguished from the properties that make up the identification. I don’t think we need to reinvent the notions of essential vs. accidental properties but merging rules should call out what properties are required for merging.

The wary reader will have suspected before now that many (if not all) of the terms in such a container could be considered as identifiers in and of themselves. Suddenly they are trying to struggle uphill from a swamp of subject recursion. It is “elephants all the way down.”

Have no fear! Just as we can avoid using TMDM associations to mark the relationship between an identifier and the properties making up an identification, we need use containers for identifiers and identifications only when and where we choose.

In some circumstances we may use bare identifiers, sans any identifications and yet add identifications when circumstances warrant it.

No level, identifiers, an identification, an identification that explodes other identifiers, etc., is right for every purpose. Each may be appropriate for some particular purpose.

We need to allow for downward expansion in the form of additional containers along side the containers we author, as well as extension of containers to add sub-containers for identifiers and identifications we did not or chose not to author.

I do have an underlying assumption that may reassure you about the notion of downward expansion of identifier/identification containers:

Processing of one or more containers of identifiers and identifications can choose the level of identifiers + identifications to be processed.

For some purposes I may only want to choose “top level” identifiers and identifications or even just parts of identifications. For example, think of the simple mapping of identifiers that happens in some search systems. You may examine the identifications for identifiers and then produce a bare mapping of identifiers for processing purposes. Or you may have rules for identifications that produce a mapping of identifiers.

Let’s assume that I want to create a set of the identifiers for Pentane and so I query for the identifiers that have the molecular property C5H12. Some of the identifiers (with their scopes) returned will be: Beilstein Reference 969132, CAS Registry Number 109-66-0, ChEBI CHEBI:37830, ChEMBL ChEMBL16102, ChemSpider 7712, DrugBank DB03119.

Each one of those identifiers may have other properties in their associated identifications, but there is no requirement that I produce them.

I mentioned that identifiers have scope. If you perform a search on “109-66-0” (CAS Registry Number) or 7712 (ChemSpider) you will quickly find garbage. Some identifiers are useful only with particular data sources or in circumstances where the data source is identified. (The idea of “universal” identifiers is a recurrent human fiction. See The Search for the Perfect Language, Eco.)

Which means, of course, we will need to capture the scope of identifiers.

I Is For Identifier

Thursday, May 28th, 2015

As you saw yesterday, Sam Hunting and I have a presentation at Balisage 2015 (Wednesday, August 12, 2015, 9:00 AM, if you are buying a one-day ticket), “Spreadsheets – 90+ million end user programmers with no comment tracking or version control.”

If you suspect the presentation has something to do with topic maps, take one mark for your house!

You will have to attend the conference to get the full monty but there are some ideas and motifs that I will be testing here before incorporating them into the paper and possibly the presentation.

The first one is a short riff on identifiers.

Omitting the hyperlinks, the Wikipedia article in identifiers says in part:

An identifier is a name that identifies (that is, labels the identity of) either a unique object or a unique class of objects, where the “object” or class may be an idea, physical [countable] object (or class thereof), or physical [noncountable] substance (or class thereof). The abbreviation ID often refers to identity, identification (the process of identifying), or an identifier (that is, an instance of identification). An identifier may be a word, number, letter, symbol, or any combination of those.

(emphasis in original)

It goes on to say:


In computer science, identifiers (IDs) are lexical tokens that name entities. Identifiers are used extensively in virtually all information processing systems. Identifying entities makes it possible to refer to them, which is essential for any kind of symbolic processing.

There is an interesting shift in that last quote. Did you catch it?

The first two sentences are talking about identifiers but the third shifts to “[i]identifying entities makes it possible to refer to them….” But single token identifiers aren’t the only means to identify an entity.

For example, a police record may identify someone by their Social Security Number and permit searching by that number, but it can also identify an individual by height, weight, eye/hair color, age, tatoos, etc.

But we have been taught from a very young age that I stands for Identifier, a single token that identifies an entity. Thus:

identifier2

Single identifiers are found in “virtually all information systems,” not to mention writing from all ages and speech as well. They save us a great deal of time by allowing us to say “President Obama” without having to enumerate all the other qualities that collectively identify that subject.

Of course, the problem with single token identifiers is that we don’t all use the same ones and sometimes use the same ones for different things.

So long as we remain fixated on bare identifiers:

identifier2

we will continue to see efforts to create new “persistent” identifiers. Not a bad idea for some purposes, but a rather limited one.

Instead of bare identifiers, what if we understood that identifiers stand in the place of all the qualities of the entities we wish to identify?

That is our identifiers were seen as being pregnant with the qualities of the entities they represent:

identifier-pregnant

For some purposes, like unique keys in a database, our identifiers can be seen as opaque identifiers, that’s all there is to see.

For other purposes, such as indexing across different identifiers, then our identifiers are pregnant with the qualities that identify the entities they represent.

If we look at the qualities of the entities represented by two or more identifiers, we may discover that the same identifier represents two different entities, or we may discover that two (or more) identifiers represent the same entities.

I think we need to acknowledge the allure of bare identifiers (the ones we think we understand) and their usefulness in many circumstances. We should also observe that identifiers are in fact pregnant with the qualities of the entities they represent, enabling use to distinguish the same identifier but different entity case and match different identifiers for the same entity.

Which type of identifier you need, bare or pregnant, depends upon your use case and requirements. Neither one is wholly suited for all purposes.

(Comments and suggestions are always welcome but especially on these snippets of material that will become part of a larger whole. On the artwork as well. I am trying to teach myself Gimp.)

Ephemeral identifiers for life science data

Wednesday, May 27th, 2015

10 Simple rules for design, provision, and reuse of persistent identifiers for life science data by Julie A. McMurray, et al. (35 others).

From the introduction:

When we interact, we use names to identify things. Usually this works well, but there are many familiar pitfalls. For example , the “morning star” and “evening star” are both names for the planet Venus. “The luminiferous ether” is a name for an entity which no one still thinks exists. There are many women named “Margaret”, some of whom go by “Maggie” and some of whom have changed their surnames. We use everyday conversational mechanisms to work around these problems successfully. Naming problems have plagued the life sciences since Linnaeus pondered the Norway spruce; in the much larger conversation that underlies the life sciences, problems with identifiers (Box 1) impede the flow and integrity of information. This is especially challenging within “synthesis research” disciplines such as systems biology, translational medicine, and ecology. Implementation – driven initiatives such as ELIXIR , BD2K, and others (Text S1) have therefore been actively working to understand and address underlying problems with identifiers.

Good, global-scale, persistent identifier design is harder than it appears, and is essential for data to be Findable, Accessible, Interoperable, and Reusable (Data FAIRport principles [1]). Digital entities (e.g., files), physical entities (e.g., biosamples), and descriptive entities (e.g., ‘mitosis’) have different requirements for identifiers. Identifiers are further complicated by imprecise terminology and different forms (Box 1).

Of the identifier forms, Local Resource Identifiers (LRI) and their corresponding full Uniform Resource Identifiers (URIs) are still among the most commonly used and most problematic identifiers in the bio-data ecosystem. Other forms of identifiers such as Uniform Resource Name (URNs) are less impactful because of their current lack of uptake. Here, we build on emerging conventions and existing general recommendations [2,3] and summarise the identifier characteristics most important to optimising the flow and integrity of life-science data (Table 1). We propose actions to take in the identifier ‘green field’ and offer guidance for using real-world identifiers from diverse sources.

Truth be told, global, persistent identifier design is overreaching.

First, some identifiers are more widely used than others, but there are no globally accepted identifiers of any sort.

Second, “persistent” is undefined. Present identifiers (curies or URIs) have not persisted pre-Web identifiers. On what basis would you claim that future generations will persist our identifiers?

However, systems expect to be able to make references by single, opaque, identifiers and so the hunt goes on for a single identifier.

The more robust and in fact persistent approach is to have a bag of identifiers for any subject, where each identifier itself has a bag of properties associated with it.

That avoids the exclusion of old identifiers and hence historical records and avoids pre-exclusion of future identifiers, which come into use long after our identifier is no long the most popular one.

Systems can continue to use a single identifier, locally as it were but software where semantic integration is important, should use sets of identifiers to facilitate integration across data sources.

One Subject, Three Locators

Tuesday, May 5th, 2015

As you may know, the Library of Congress actively maintains its subject headings. Not surprising to anyone other than purveyors of fixed ontologies. New subjects appear, terminology changes, old subjects have new names, etc.

The Subject Authority Cooperative Program (SACO) has a mailing list:

About the SACO Listserv (sacolist@loc.gov)

The SACO Program welcomes all interested parties to subscribe to the SACO listserv. This listserv was established first and foremost to facilitate communication with SACO contributors throughout the world. The Summaries of the Weekly Subject Editorial Review Meeting are posted to enable SACO contributors to keep abreast of changes and know if proposed headings have been approved or not. The listserv may also be used as a vehicle to foster discussions on the construction, use, and application of subject headings. Questions posted may be answered by any list member and not necessarily by staff in the Cooperative Programs Section (Coop) or PSD. Furthermore, participants are encouraged to provide comments, share examples, experiences, etc.

On the list this week was the question:

Does anyone know how these three sites differ as sources for consulting approved subject lists?

http://www.loc.gov/aba/cataloging/subject/weeklylists/

http://www.loc.gov/aba/cataloging/subject/

http://classificationweb.net/approved-subjects/

Janis L. Young, Policy and Standards Division, Library of Congress replied:

Just to clarify: all of the links that you and Paul listed take you to the same Approved Lists. We provide multiple access points to the information in order to accommodate users who approach our web site in different ways.

Depending upon your goals, the Approved Lists could be treated as a subject that has three locators.

Fifty Words for Databases

Saturday, March 7th, 2015

Fifty Words for Databases by Phil Factor

From the post:

Almost every human endeavour seems simple from a distance: even database deployment. Reality always comes as a shock, because the closer you get to any real task, the more you come to appreciate the skills that are necessary to accomplish it.

One of the big surprises I have when I attend developer conferences is to be told by experts how easy it is to take a database from development and turn it into a production system, and then implement the processes that allow it to be upgraded safely. Occasionally, I’ve been so puzzled that I’ve drawn the speakers to one side after the presentation to ask them for the details of how to do it so effortlessly, mentioning a few of the tricky aspects I’ve hit. Invariably, it soon becomes apparent from their answers that their experience, from which they’ve extrapolated, is of databases the size of a spreadsheet with no complicated interdependencies, compliance issues, security complications, high-availability mechanisms, agent tasks, alerting systems, complex partitioning, queuing, replication, downstream analysis dependencies and so on about which you, the readers, know more than I. At the vast international enterprise where I once worked in IT, we had a coded insult for such people: ‘They’ve catalogued their CD collection in a database’. Unfair, unkind, but even a huge well-used ‘Big Data’ database dealing in social media is a tame and docile creature compared with a heavily- used OLTP trading system where any downtime or bug means figures for losses where you have to count the trailing zeros. The former has unique problems, of course, but the two types of database are so different.

I wonder if the problem is one of language. Just as the English have fifty ways of describing rainfall, and the Inuit have many ways of describing pack ice, it is about time that we created the language for a variety of databases from a mild drizzle (‘It is a soft morning to be sure’) to a cloud-burst. Until anyone pontificating about the database lifecycle can give their audience an indication of the type of database they’re referring to, we will continue to suffer the sort of misunderstandings that so frustrate the development process. Though I’m totally convinced that the development culture should cross-pollinate far more with the science of IT operations, It will need more than a DevOps group-hug; it will require a change in the technical language so that it can accurately describe the rich variety of databases in operational use and their widely- varying requirements. The current friction is surely due more to misunderstandings on both sides, because it is so difficult to communicate these requirements. Any suggestions for suitable descriptive words for types of database? (emphasis added)

If you have “descriptive words” to suggest to Phil, comment on his post.

With the realization that your “descriptive words” may be different from my “descriptive words” for the same database or mean a different database altogether or have nothing to do with databases at all (when viewed by others).

Yes, I have been thinking about identifiers, again, and will start off the coming week with a new series of posts on subject identification. I hope to include a proposal for a metric of subject identification.

Rethinking set theory

Monday, December 22nd, 2014

Rethinking set theory by Tom Leinster.

From the introduction:

Mathematicians manipulate sets with con fidence almost every day of their working lives. We do so whenever we work with sets of real or complex numbers, or with vector spaces, topological spaces, groups, or any of the many other set-based structures. These underlying set-theoretic manipulations are so automatic that we seldom give them a thought, and it is rare that we make mistakes in what we do with sets.

However, very few mathematicians could accurately quote what are often referred to as `the’ axioms of set theory. We would not dream of working with, say, Lie algebras without first learning the axioms. Yet many of us will go our whole lives without learning `the’ axioms for sets, with no harm to the accuracy of our work. This suggests that we all carry around with us, more or
less subconsciously, a reliable body of operating principles that we use when manipulating sets.

What if we were to write down some of these principles and adopt them as our axioms for sets? The message of this article is that this can be done, in a simple, practical way. We describe an
axiomatization due to F. William Lawvere [3, 4], informally summarized in Fig. 1. The axioms suffice for very nearly everything mathematicians ever do with sets. So we can, if we want, abandon the classical axioms entirely and use these instead.

Don’t try to read this after a second or third piece of pie. 😉

What captured my interest was the following:

The root of the problem is that in the frame-work of ZFC, the elements of a set are always sets too. Thus, given a set X, it always makes sense in ZFC to ask what the elements of the elements of X; are. Now, a typical set in ordinary mathematics is ℝ. But accost a mathematician at random and ask them `what are the elements of Π?’, and they will probably assume they misheard you, or ask you what you’re talking about, or else tell you that your question makes no sense. If forced to answer, they might reply that real numbers have no elements. But this too is in conflict with ZFC’s usage of `set’: if all elements of ℝ are sets, and they all have no elements, then they are all the empty set, from which it follows that all real numbers are equal. (emphasis added)

The author explores the perils of using “set” with two different meanings in ZFC and what it could mean to define “set” as it is used in practice by mathematicians.

For my part, the “…elements of a set are always sets too” resonates with the concept that all identifiers can be resolved into identifiers.

For example: firstName = Patrick.

The token firstName, despite its popularity on customs forms, is not a semantic primitive recognized by all readers. While for some processing purposes, by agents hired to delay, harass and harry tired travelers, firstName is sufficient, it can in fact be resolved into tuples that represent equivalences to firstName or provide additional information about that identifier.

For example:

name = "firstName"

alt = "given name"

alt = "forename"

alt = "Christian name"

Which slightly increases my chances of finding an equivalent, if I am not familiar with firstName. I say “slightly increases” because names of individual people are subject to a rich heritage of variation based on language, culture, custom, all of which have changed over time. The example is just a tiny number of possible alternatives possible in English.

When I say “…it can in fact be resolved…” should not be taken to require that every identifier be so resolved or that the resulting identifiers extend to some particular level of resolution. Noting that we could similarly expand forename or alt and the identifiers we find in their expansions.

The question that a topic maps designer has to answer is “what expansions of identifiers are useful for a particular set of uses?” Do the identifiers need to survive their current user? (Think legacy ETL.) Will the data need to be combined with data using other identifiers? Will queries need to be made across data sets with conflicting identifiers? Is there data that could be merged on a subject by subject basis? Is there any value in a subject by subject merging?

To echo a sentiment that I heard in Leading from the Back: Making Data Science Work at a UX-driven Business, it isn’t the fact you can merge information about a subject that’s important. It is the value-add to a customer that results from that merging that is important.

Value-add for customers before toys for IT.*

I first saw this in a tweet by onepaperperday.

*This is a tough one for me, given my interests in language and theory. But I am trying to do better.

The Coming Era of Egocentric Video Analysis

Tuesday, December 9th, 2014

The Coming Era of Egocentric Video Analysis

From the post:

Head-mounted cameras are becoming de rigueur for certain groups—extreme sportsters, cyclists, law enforcement officers, and so on. It’s not hard to find content generated in this way on the Web.

So it doesn’t take a crystal ball to predict that egocentric recording is set to become ubiquitous as devices such as Go-Pros and Google Glass become more popular. An obvious corollary to this will be an explosion of software for distilling the huge volumes of data this kind of device generates into interesting and relevant content.

Today, Yedid Hoshen and Shmuel Peleg at the Hebrew University of Jerusalem in Israel reveal one of the first applications. Their goal: to identify the filmmaker from biometric signatures in egocentric videos.

A tidbit that I was unaware of:

Some of these are unique, such as the gait of the filmmaker as he or she walks, which researchers have long known is a remarkably robust biometric indicator.”Although usually a nuisance, we show that this information can be useful for biometric feature extraction and consequently for identifying the user,” say Hoshen and Peleg.

Makes me wonder if I should wear a prosthetic device to alter my gait when I do appear in range of cameras. 😉

Works great with topic maps. All you may know about an actor is that they have some gait with X characteristics. And a perchance for not getting caught planting explosive devices. With a topic map we can keep their gait as a subject identifier and record all the other information we have on such an individual.

If we ever match the gait to a known individual, then the information from both records, both as the anonymous gait owner and the known known individual will be merged together.

It works with other characteristics as well, which enables you to work from “I was attacked…,” to more granular information that narrows the pool of suspects down to a manageable size.

Traditionally the job of veterans on the police force who know their communities and who are the usual suspects but a topic map enhances their value by capturing their observations for use by the department long after a veterans retirement.

From arXiv: Egocentric Video Biometrics

Abstract:

Egocentric cameras are being worn by an increasing number of users, among them many security forces worldwide. GoPro cameras already penetrated the mass market, and Google Glass may follow soon. As head-worn cameras do not capture the face and body of the wearer, it may seem that the anonymity of the wearer can be preserved even when the video is publicly distributed.
We show that motion features in egocentric video provide biometric information, and the identity of the user can be determined quite reliably from a few seconds of video. Biometrics are extracted by training Convolutional Neural Network (CNN) architectures on coarse optical flow.

Egocentric video biometrics can prevent theft of wearable cameras by locking the camera when worn by people other than the owner. In video sharing services, this Biometric measure can help to locate automatically all videos shot by the same user. An important message in this paper is that people should be aware that sharing egocentric video will compromise their anonymity.

Now if we could just get members of Congress to always carry their cellphones and wear body cameras.

Introduction to Basic Legal Citation (online ed. 2014)

Sunday, November 2nd, 2014

Introduction to Basic Legal Citation (online ed. 2014) by Peter W. Martin.

From the post:

This work first appeared in 1993. It was most recently revised in the fall of 2014 following a thorough review of the actual citation practices of judges and lawyers, the relevant rules of appellate practice of federal and state courts, and the latest edition of the ALWD Guide to Legal Citation, released earlier in the year. As has been true of all editions released since 2010, it is indexed to both the ALWD guide and the nineteenth edition of The Bluebook. However, it also documents the many respects in which contemporary legal writing, very often following guidelines set out in court rules, diverges from the citation formats specified by those academic texts.

The content of this guide is also available in three different e-book formats: 1) a pdf version that can be printed out in whole or part and also used with hyperlink navigation on an iPad or other tablet, indeed, on any computer; 2) a version designed specifically for use on the full range of Kindles as well as other readers or apps using the Mobi format; and 3) a version in ePub format for the Nook and other readers or apps that work with it. To access any of them, click here. (Over 50,000 copies of the 2013 edition were downloaded.)

Since the guide is online, its further revision is not tied to a rigid publication cycle. Any user seeing a need for clarification, correction, or other improvement is encouraged to “speak up.” What doesn’t work, isn’t clear, is missing, appears to be in error? Has a change occurred in one of the fifty states that should be reported? Comments of these and other kinds can sent by email addressed to peter.martin@cornell.edu. (Please include “Citation” in the subject line.) Many of the features and some of the coverage of this reference are the direct result of past user questions and advice.

A complementary series of video tutorials offers a quick start introduction to citation of the major categories of legal sources. They may also be useful for review. Currently, the following are available:

  1. Citing Judicial Opinions … in Brief (8.5 minutes)
  2. Citing Constitutional and Statutory Provisions … in Brief (14 minutes)
  3. Citing Agency Material … in Brief (12 minutes)

Finally, for those with an interest in current issues of citation practice, policy, and instruction, there is a companion blog, “Citing Legally,” at: http://citeblog.access-to-law.com.

Obviously legal citations are identifiers but Peter helpfully expands on the uses of legal citations:

A reference properly written in “legal citation” strives to do at least three things, within limited space:

  • identify the document and document part to which the writer is referring
  • provide the reader with sufficient information to find the document or document part in the sources the reader has available (which may or may not be the same sources as those used by the writer), and
  • furnish important additional information about the referenced material and its connection to the writer’s argument to assist readers in deciding whether or not to pursue the reference.

I would quibble with Peter’s description of a legal citation “identif[ing] a document or document part,” in part because of his second point, that a reader can find an alternative source for the document.

To me it is easier to say that legal citation identifies a legal decision, legislation or agency decision/rule, which may be reported by any number of sources. Some sources have their own unique references systems that are mapped to other systems. Making a legal decision, legislation or agency decision/rule an abstraction identified by the citation, avoids confusion with a particular source.

A must read for law students, practitioners, judges and potential inventors of the Nth citation system for legal materials.

Improving GitHub for science

Thursday, May 15th, 2014

Improving GitHub for science

From the post:

GitHub is being used today to build scientific software that’s helping find Earth-like planets in other solar systems, analyze DNA, and build open source rockets.

Seeing these projects and all this momentum within academia has pushed us to think about how we can make GitHub a better tool for research. As scientific experiments become more complex and their datasets grow, researchers are spending more of their time writing tools and software to analyze the data they collect. Right now though, these efforts often happen in isolation.

Citable code for academic software

Sharing your work is good, but collaborating while also getting required academic credit is even better. Over the past couple of months we’ve been working with the Mozilla Science Lab and data archivers, Figshare and Zenodo, to make it possible to get a Digital Object Identifier (DOI) for any GitHub repository archive.

DOIs form the backbone of the academic reference and metrics system. With a DOI for your GitHub repository archive, your code becomes citable. Our newest Guide explains how to create a DOI for your repository.

A great step forward, but like http: pointing to entire resources, it is of limited utility.

Assume that I am using a DOI for a software archive and I want to point to and identify a code snippet in the archive that implements Fast Fourier Transform (FFT). My first task is to point to that snippet. A second task would be to create an association between the snippet and my annotation that it implements the Fast Fourier Transform. Yet a third task would be to gather up all the pointers that point to implementations of the Fast Fourier Transform (FFT).

For all of those tasks, I need to identify and point to a particular part of the underlying source code.

Unfortunately, a DOI is limited to identifying a single entity.

Each DOI® name is a unique “number”, assigned to identify only one entity. Although the DOI system will assure that the same DOI name is not issued twice, it is a primary responsibility of the Registrant (the company or individual assigning the DOI name) and its Registration Agency to identify uniquely each object within a DOI name prefix. (DOI Handbook

How would you extend the DOIs being used by GitHub to identify code fragments within source code repositories?

I first saw this in a tweet by Peter Desmet.

On InChI and evaluating the quality of cross-reference links

Saturday, April 19th, 2014

On InChI and evaluating the quality of cross-reference links by Jakub Galgonek and Jiří Vondrášek. (Journal of Cheminformatics 2014, 6:15 doi:10.1186/1758-2946-6-15)

Abstract:

Background

There are many databases of small molecules focused on different aspects of research and its applications. Some tasks may require integration of information from various databases. However, determining which entries from different databases represent the same compound is not straightforward. Integration can be based, for example, on automatically generated cross-reference links between entries. Another approach is to use the manually curated links stored directly in databases. This study employs well-established InChI identifiers to measure the consistency and completeness of the manually curated links by comparing them with the automatically generated ones.

Results

We used two different tools to generate InChI identifiers and observed some ambiguities in their outputs. In part, these ambiguities were caused by indistinctness in interpretation of the structural data used. InChI identifiers were used successfully to find duplicate entries in databases. We found that the InChI inconsistencies in the manually curated links are very high (.85% in the worst case). Even using a weaker definition of consistency, the measured values were very high in general. The completeness of the manually curated links was also very poor (only 93.8% in the best case) compared with that of the automatically generated links.

Conclusions

We observed several problems with the InChI tools and the files used as their inputs. There are large gaps in the consistency and completeness of manually curated links if they are measured using InChI identifiers. However, inconsistency can be caused both by errors in manually curated links and the inherent limitations of the InChI method.

Another use case for topic maps don’t you think?

Rather than a mapping keyed on recognition of a single identifier, have the mapping keyed to the recognition of several key/value pairs.

I don’t think there is an abstract answer as to the optimum number of key/value pairs that must match for identification. Experience would be a much better guide.

Developing a 21st Century Global Library for Mathematics Research

Thursday, April 3rd, 2014

Developing a 21st Century Global Library for Mathematics Research by Committee on Planning a Global Library of the Mathematical Sciences.

Care to guess what one of the major problems facing mathematical research might be?

Currently, there are no satisfactory indexes of many mathematical objects, including symbols and their uses, formulas, equations, theorems, and proofs, and systematically labeling them is challenging and, as of yet, unsolved. In many fields where there are more specialized objects (such as groups, rings, fields), there are community efforts to index these, but they are typically not machine-readable, reusable, or easily integrated with other tools and are often lacking editorial efforts. So, the issue is how to identify existing lists that are useful and valuable and provide some central guidance for further development and maintenance of such lists. (p. 26)

Does that surprise you?

What do you think the odds are of mathematical research slowing down enough for committees to decide on universal identifiers for all the subjects in mathematical publications?

That’s about what I thought.

I have a different solution: Why not ask mathematicians who are submitting articles for publication to identity (specify properties for) what they consider to be the important subjects in their article?

The authors have the knowledge and skill, not to mention the motivation of wanting their research to be easily found by others.

Over time I suspect that particular fields will develop standard identifications (sets of properties per subject) that mathematicians can reuse to save themselves time when publishing.

Mappings across those sets of properties will be needed but that can be the task of journals, researchers and indexers who have an interest and skill in that sort of enterprise.

As opposed to having a “boil the ocean” approach that tries to do more than any one project is capable of doing competently.

Distributed subject identification is one way to think about it. We already do it, this would be a semi-formalization of that process and writing down what each author already knows.

Thoughts?

PS: I suspect the condition recited above is true for almost any sufficiently large field of study. A set of 150 million entities sounds large only without context. In the context of of science, it is a trivial number of entities.

Search Gets Smarter with Identifiers

Wednesday, March 19th, 2014

Search Gets Smarter with Identifiers

From the post:

The future of computing is based on Big Data. The vast collections of information available on the web and in the cloud could help prevent the next financial crisis, or even tell you exactly when your bus is due. The key lies in giving everything – whether it’s a person, business or product – a unique identifier.

Imagine if everything you owned or used had a unique code that you could scan, and that would bring you a wealth of information. Creating a database of billions of unique identifiers could revolutionise the way we think about objects. For example, if every product that you buy can be traced through every step in the supply chain you can check whether your food has really come from an organic farm or whether your car is subject to an emergency recall.

….

The difficulty with using big data is that the person or business named in one database might have a completely different name somewhere else. For example, news reports talk about Barack Obama, The US President, and The White House interchangeably. For a human being, it’s easy to know that these names all refer to the same person, but computers don’t know how to make these connections. To address the problem, Okkam has created a Global Open Naming System: essentially an index of unique entities like people, organisations and products, that lets people share data.

“We provide a very fast and effective way of discovering data about the same entities across a variety of sources. We do it very quickly,” says Paolo Bouquet. “And we do it in a way that it is incremental so you never waste the work you’ve done. Okkam’s entity naming system allows you to share the same identifiers across different projects, different companies, different data sets. You can always build on top of what you have done in the past.”

The benefits of a unique name for everything

http://www.okkam.org/

The community website: http://community.okkam.org/ reports 8.5+ million entities.

When the EU/CORDIS show up late for a party, it’s really late.

A multi-lingual organization like the EU, kudos on their efforts in that direction, should know uniformity of language or identifiers is only found in dystopian fiction.

I prefer the language and cultural richness of Europe over the sterile uniformity of American fast food chains. Same issue.

You?

I first saw this in a tweet by Stefano Bertolo.

Hash-URIs for Verifiable, Immutable, and Permanent Digital Artifacts

Thursday, January 23rd, 2014

Hash-URIs for Verifiable, Immutable, and Permanent Digital Artifacts by Tobias Kuhn and Michel Dumontier.

Abstract:

To make digital resources on the web verifiable, immutable, and permanent, we propose a technique to include cryptographic hash values in URIs. We show how such hash-URIs can be used for approaches like nanopublications to make not only specific resources but their entire reference trees verifiable. Digital resources can be identified not only on the byte level but on more abstract levels, which means that resources keep their hash values even when presented in a different format. Our approach sticks to the core principles of the web, namely openness and decentralized architecture, is fully compatible with existing standards and protocols, and can therefore be used right away. Evaluation of our reference implementations shows that these desired properties are indeed accomplished by our approach, and that it remains practical even for very large files.

I rather like the author’s summary of their approach:

our proposed approach boils down to the idea that references can be made completely unambiguous and veri able if they contain a hash value of the referenced digital artifact.

Hash-URIs (assuming proper generation) would be completely unambiguous and verifiable for digital artifacts.

However, the authors fail to notice two important issues with Hash-URIs:

  1. Hash-URIs are not human readable.
  2. Not being human readable means that mappings between Hash-URIs and other references to digital artifacts will be fragile and hard to maintain.

For example,

In prose an author will not say, “As found by “http://example.org/r1.RA5AbXdpz5DcaYXCh9l3eI9ruBosiL5XDU3rxBbBaUO70” (from the article).

In some publishing styles, authors will say: “…as a new way of scientifi c publishing [8].”

In other styles, authors will say: “Computable functions are therefore those “calculable by finite means” (Turing, 1936: 230).”

That is to say of necessity there will be a mapping between the unambiguous and verifiable reference (UVR) and the ones used by human authors/readers.

Moreover, should the mapping between UVRs and their human consumable equivalents be lost, recovery is possible but time consuming.

The author’s go to some lengths to demonstrate the use of Hash-URIs with RDF files. RDF is one approach among many to digital artifacts.

If the mapping issues between Hash-URIs and other identifiers can be addressed, a more general approach to digital artifacts would make this proposal more viable.

I first saw this in a tweet by Tobias Kuhn.

Wikidata in 2014 [stable identifiers]

Wednesday, January 22nd, 2014

Wikidata in 2014

From the development plans for Wikidata in 2014, it looks like a busy year.

There are a number of interesting work items but one in particular caught my attention:

Merges and redirects

bugzilla:57744 and bugzilla:38664

When two different items about the same topic are created they can be merged. Labels, descriptions, aliases, sitelinks and statements are merged if they do not conflict. The item that is left empty can then be turned into a redirect to the other. This way, Wikidata IDs can be regarded as stable identifiers by 3rd-parties.

As more data sets come online, preserving “stable identifiers” from each data set is going to be important. You can’t know in advance which data set a particular researcher may have used as a source of identifiers.

Here of course they are talking about “stable identifiers” inside of Wikidata.

In principle though, I don’t see any reason we can treat “foreign” identifiers as stable.

You?

Announcing Open LEIs:…

Tuesday, December 3rd, 2013

Announcing Open LEIs: a user-friendly interface to the Legal Entity Identifier system

From the post:

Today, OpenCorporates announces a new sister website, Open LEIs, a user-friendly interface on the emerging Global Legal Entity Identifier System.

At this point many, possibly most, of you will be wondering: what on earth is the Global Legal Entity Identifier System? And that’s one of the reasons why we built Open LEIs.

The Global Legal Entity Identifier System (aka the LEI system, or GLEIS) is a G20/Financial Stability Board-driven initiative to solve the issues of identifiers in the financial markets. As we’ve explained in the past, there are a number of identifiers out there, nearly all of them proprietary, and all of them with quality issues (specifically not mapping one-to-one with legal entities). Sometimes just company names are used, which are particularly bad identifiers, as not only can they be represented in many ways, they frequently change, and are even reused between different entities.

This problem is particularly acute in the financial markets, meaning that regulators, banks, market participants often don’t know who they are dealing with, affecting everything from the ability to process trades automatically to performing credit calculations to understanding systematic risk.

The LEI system aims to solve this problem, by providing permanent, IP-free, unique identifiers for all entities participating in the financial markets (not just companies but also municipalities who issue bonds, for example, and mutual funds whose legal status is a little greyer than companies).

The post cites five key features for Open LEIs:

  1. Search on names (despite slight misspellings) and addresses
  2. Browse the entire (100,000 record) database and/or filter by country, legal form, or the registering body
  3. A permanent URL for each LEI
  4. Links to OpenCorporate for additional data
  5. Data is available as XML or JSON

As the post points out, the data isn’t complete but dragging legal entities out into the light is never easy.

Use this resource and support it if you are interested in more and not less financial transparency.

Thinking, Fast and Slow (Review) [And Subject Identity]

Friday, November 15th, 2013

A statistical review of ‘Thinking, Fast and Slow’ by Daniel Kahneman by Patrick Burns.

From the post:

We are good intuitive grammarians — even quite small children intuit language rules. We can see that from mistakes. For example: “I maked it” rather than the irregular “I made it”.

In contrast those of us who have training and decades of experience in statistics often get statistical problems wrong initially.

Why should there be such a difference?

Our brains evolved for survival. We have a mind that is exquisitely tuned for finding things to eat and for avoiding being eaten. It is a horrible instrument for finding truth. If we want to get to the truth, we shouldn’t start from here.

A remarkable aspect of your mental life is that you are rarely stumped. … you often have answers to questions that you do not completely understand, relying on evidence that you can neither explain nor defend.

The review focuses mainly on statistical issues in “Thinking Fast and Slow” but I think you will find it very entertaining.

I deeply appreciate Patrick’s quoting of:

A remarkable aspect of your mental life is that you are rarely stumped. … you often have answers to questions that you do not completely understand, relying on evidence that you can neither explain nor defend.

In particular:

…relying on evidence that you can neither explain nor defend.

which resonates with me on subject identification.

Think about how we search for subjects, which of necessity involves some notion of subject identity.

What if a colleague asks if they should consult the records of the Order of the Garter to find more information on “Lady Gaga?”

Not entirely unreasonable since “Lady” is conferred upon female recipients of the Order of the Garter.

No standard search technique would explain why your colleague should not start with the Order of the Garter records.

Although I think most of us would agree such a search would be far afield. 😉

Every search starts with a searcher relying upon what they “know,” suspect or guess to be facts about a “subject” to search on.

At the end of the search, the characteristics of the subject as found, turn out to be the characteristics we were searching for all along.

I say all that to suggest that we need not bother users to say how in fact to identity the objects of their searches.

Rather the question should be:

What pointers or contexts are the most helpful to you when searching? (May or may not be properties of the search objective.)

Recalling that properties of the search objective are how we explain successful searches, not how we perform them.

Calling upon users to explain or make explicit what they themselves don’t understand, seems like a poor strategy for adoption of topic maps.

Capturing what “works” for a user, without further explanation or difficulty seems like the better choice.


PS: Should anyone ask about “Lady Gaga,” you can mention that Glamour magazine featured her on its cover, naming her Woman of the Year (December 2013 issue). I know that only because of a trip to the local drug store for a flu shot.

Promised I would be “in and out” in minutes. Literally true I suppose, it only took 50 minutes with four other people present when I arrived.

I have a different appreciation of “minutes” from the pharmacy staff. 😉

…OCLC Control Numbers Public Domain

Tuesday, September 24th, 2013

OCLC Declare OCLC Control Numbers Public Domain by Richard Wallis.

From the post:

I signalled that OCLC were looking at this, in my session (Linked Data Progress), at IFLA in Singapore a few weeks ago. I am now pleased to say that the wording I was hinting at has now appeared on the relevant pages of the OCLC web site:

Use of the OCLC Control Number (OCN)
OCLC considers the OCLC Control Number (OCN) to be an important data element, separate from the rest of the data included in bibliographic records. The OCN identifies the record, but is not part of the record itself. It is used in a variety of human and machine-readable processes, both on its own and in subsequent manipulations of catalog data. OCLC makes no copyright claims in individual bibliographic elements nor does it make any intellectual property claims to the OCLC Control Number. Therefore, the OCN can be treated as if it is in the public domain and can be included in any data exposure mechanism or activity as public domain data. OCLC, in fact, encourages these uses as they provide the opportunity for libraries to make useful connections between different bibliographic systems and services, as well as to information in other domains.

The announcement of this confirmation/clarification of the status of OCNs was made yesterday by my colleague Jim Michalko on the Hanging Together blog.

See: OCLC Control Number if you are interested in the details of OCNs (which are interesting in and of themselves).

Unlike the Perma.cc links, OCNs are not tied to any particular network protocol.

However you deliver an OCN, by postcard, phone or network query, an information system can respond with the information that corresponds to that OCN.

No one can promise you “forever,” but not tying identifiers to ephemeral network protocols is one way to get closer to “forever.”

Miriam Registry [More Identifiers For Science]

Monday, April 15th, 2013

Miriam Registry

From the homepage:

Persistent identification for life science data

The MIRIAM Registry provides a set of online services for the generation of unique and perennial identifiers, in the form of URIs. It provides the core data which is used by the Identifiers.org resolver.

The core of the Registry is a catalogue of data collections (corresponding to controlled vocabularies or databases), their URIs and the corresponding physical URLs or resources. Access to this data is made available via exports (XML) and Web Services (SOAP).

And from the FAQ:

What is MIRIAM, and what does it stand for?

MIRIAM is an acronym for the Minimal Information Required In the Annotation of Models. It is important to distinguish between the MIRIAM Guidelines, and the MIRIAM Registry. Both being part of the wider BioModels.net initiative.

What are the ‘MIRIAM Guidelines’?

The MIRIAM Guidelines are an effort to standardise upon the essential, minimal set of information that is sufficient to annotate a model in such a way as to enable its reuse. This includes a means to identify the model itself, the components of which it is composed, and formalises a means by which unambiguous annotation of components should be encoded. This is essential to allow collaborative working by different groups which may not be spatially co-located, and facilitates model sharing and reuse by the general modelling community. The goal of the project, initiated by the BioModels.net effort, was to produce a set of guidelines suitable for model annotation. These guidelines can be implemented in any structured format used to encode computational models, for example SBML, CellML, or NeuroML . MIRIAM is a member of the MIBBI family of community-developed ‘minimum information’ reporting guidelines for the biosciences.

More information on the requirements to achieve MIRIAM Guideline compliance is available on the MIRIAM Guidelines page.

What is the MIRIAM Registry?

The MIRIAM Registry provides the necessary information for the generation and resolving of unique and perennial identifiers for life science data. Those identifiers are of the URI form and make use of Identifiers.org for providing access to the identified data records on the Web. Examples of such identifiers: http://identifiers.org/pubmed/22140103, http://identifiers.org/uniprot/P01308, …

More identifiers for the life sciences, for those who choose to use them.

The curation may be helpful in terms of mappings to other identifiers.

Finding Subject Identifiers

Monday, April 1st, 2013

A recent comment made it clear that tooling, or the lack thereof, is a real issue for topic maps.

Here is my first suggestion of a tool you can use while authoring a topic map:

Wikipedia.

Seriously, think about it. You want a URL that identifies subject X.

Granting that Wikipedia is a fairly limited set of subjects, it is at least a starting point.

Example: I want a subject identifier for “Donald Duck,” a cartoon character.

I can use the search box at Wikipedia or I can type in a browser:

http://en.wikipedia.org/wiki/Donald%20Duck

Go ahead, try it.

If I don’t know the full name:

http://en.wikipedia.org/wiki/Donald

What do you think?

Allows you to disambiguate Donalds, at least the ones that Wikipedia knows about.

Not to mention giving you access to other subjects and relationships that may be of interest for your topic map.

To include foreign language materials (outside of English only non-thinking zones in the U.S.), try a different language Wikipedia:

http://de.wikipedia.org/wiki/Donald%20Duck

Finding subject identifiers won’t write your topic map for you but can make the job easier.

There are other sources of subject identifiers so send in your suggestions and any syntax short-cuts for accessing them.


You have no doubt read that URIs used as identifiers are supposed to be semi-permanent, “cool,” etc.

But identifiers change over time. It’s one of the reasons for historical semantic diversity.

URIs as identifiers will change as well.

Good thing topic maps enable you to have multiple identifiers for any subject.

Means old references to old identifiers still work.

Glad we dodged having to redo and reproof all those old connections.

Aren’t you?

User evaluation of automatically generated keywords and toponyms… [of semantic gaps]

Tuesday, January 22nd, 2013

User evaluation of automatically generated keywords and toponyms for geo-referenced images by Frank O. Ostermann, Martin Tomko, Ross Purves. (Ostermann, F. O., Tomko, M. and Purves, R. (2013), User evaluation of automatically generated keywords and toponyms for geo-referenced images. J. Am. Soc. Inf. Sci.. doi: 10.1002/asi.22738)

Abstract:

This article presents the results of a user evaluation of automatically generated concept keywords and place names (toponyms) for geo-referenced images. Automatically annotating images is becoming indispensable for effective information retrieval, since the number of geo-referenced images available online is growing, yet many images are insufficiently tagged or captioned to be efficiently searchable by standard information retrieval procedures. The Tripod project developed original methods for automatically annotating geo-referenced images by generating representations of the likely visible footprint of a geo-referenced image, and using this footprint to query spatial databases and web resources. These queries return raw lists of potential keywords and toponyms, which are subsequently filtered and ranked. This article reports on user experiments designed to evaluate the quality of the generated annotations. The experiments combined quantitative and qualitative approaches: To retrieve a large number of responses, participants rated the annotations in standardized online questionnaires that showed an image and its corresponding keywords. In addition, several focus groups provided rich qualitative information in open discussions. The results of the evaluation show that currently the annotation method performs better on rural images than on urban ones. Further, for each image at least one suitable keyword could be generated. The integration of heterogeneous data sources resulted in some images having a high level of noise in the form of obviously wrong or spurious keywords. The article discusses the evaluation itself and methods to improve the automatic generation of annotations.

An echo of Steve Newcomb’s semantic impedance appears at:

Despite many advances since Smeulders et al.’s (2002) classic paper that set out challenges in content-based image retrieval, the quality of both nonspecialist text-based and content-based image retrieval still appears to lag behind the quality of specialist text retrieval, and the semantic gap, identified by Smeulders et al. as a fundamental issue in content-based image retrieval, remains to be bridged. Smeulders defined the semantic gap as

the lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data have for a user in a given situation. (p. 1353)

In fact, text-based systems that attempt to index images based on text thought to be relevant to an image, for example, by using image captions, tags, or text found near an image in a document, suffer from an identical problem. Since text is being used as a proxy by an individual in annotating image content, those querying a system may or may not have similar worldviews or conceptualizations as the annotator. (emphasis added)

That last sentence could have come out of a topic map book.

Curious what you make of the author’s claim that spatial locations provide an “external context” that bridges the “semantic gap?”

If we all use the same map of spatial locations, are you surprised by the lack of a “semantic gap?”

The IUPAC International Chemical Identifier (InChI)….

Saturday, January 5th, 2013

The IUPAC International Chemical Identifier (InChI) and its influence on the domain of chemical information edited by Dr. Anthony Williams.

From the webpage:

The International Chemical Identifier (InChI) has had a dramatic impact on providing a means by which to deduplicate, validate and link together chemical compounds and related information across databases. Its influence has been especially valuable as the internet has exploded in terms of the amount of chemistry related information available online. This thematic issue aggregates a number of contributions demonstrating the value of InChI as an enabling technology in the world of cheminformatics and its continuing value for linking chemistry data.

If you are interested in chemistry/cheminformatics or in the development and use of identifers, this is an issue to not miss!

You will find:

InChIKey collision resistance: an experimental testing by Igor Pletnev, Andrey Erin, Alan McNaught, Kirill Blinov, Dmitrii Tchekhovskoi, Steve Heller.

Consistency of systematic chemical identifiers within and between small-molecule databases by Saber A Akhondi, Jan A Kors, Sorel Muresan.

InChI: a user’s perspective by Steven M Bachrach.

InChI: connecting and navigating chemistry by Antony J Williams.

I particularly enjoyed Steven Bachrach’s comment:

It is important to recognize that in no way does InChI replace or make outmoded any other chemical identifier. A company that has developed their own registry system or one that uses one of the many other identifiers, like a MOLfile [13], can continue to use their internal system. Adding the InChI to their system provides a means for connecting to external resources in a simple fashion, without exposing any of their own internal technologies.

Or to put it differently, InChl increased the value of existing chemical identifiers.

How’s that for a recipe for adoption?

Identifiers, 404s and Document Security

Wednesday, December 12th, 2012

I am working on a draft about identifiers (using the standard <a> element) when it occurred to me that URLs could play an unexpected role in document security. (At least unexpected by me, your mileage may vary.)

What if I create a document that has URLs like:

<a href="http://server-exists.x/page-does-not.html>text content</a>

So that a user who attempts to follow the link, gets a “404” message back.

Why is that important?

What if I am writing HTML pages at a nuclear weapon factory? I would be very interested in knowing if one of my pages had gotten off the reservation so to speak.

The server being accessed for a page that deliberately does not exist could route the contact information for an appropriate response.

Of course, I would use better names or have pages that load, while transmitting the same contact information.

Or have a very large uuencoded “password” file that burps, bumps and slowly downloads. (Always knew there was a reason to keep a 2400 baud modem around.)

Have suggestions on how to make a non-existent URL work but will save that for another day.

Open Source Natural Language Spell-Checker [Disambiguation at the point of origin.]

Friday, October 26th, 2012

Automattic Open Sources Natural Language Spell-Checker After the Deadline by Jolie O’Dell.

I am sure the original headline made sense to its author, but I wonder how a natural language processor would react to it?

My reaction, being innocent of any prior knowledge of the actors or the software was: What deadline? Reading it as a report of a missed deadline.

It is almost a “who’s on first” type headline. The software’s name is “After the Deadline.”

That confusion resolved, I read:

Matt Mullenweg has just annouced on his blog that WordPress parent company Automattic is open sourcing After the Deadline, a natural-language spell-checking plugin for WordPress and TinyMCE that was only recently ushered into the Automattic fold.

Scarcely seven weeks after its acquisition was announced, After the Deadline’s core technology is being released under the GPL. Moreover, writes Mullenweg, “There’s also a new jQuery API that makes it easy to integrate with any text area.”

Interested parties can check out this demo or read the tech overview and grab the source code here.

I can use spelling/grammar suggestions. Particularly since I make the same mistakes over and over again.

Does that also mean I talk about the same subjects/entities over and over again? Or at least a limited range of subjects/entities?

Imagine a user configurable subject/entity “checker” that annotated recognized subjects/entities with an <a> element. Enabling the user to accept/reject the annotation.

Disambiguation at the point of origin.

The title of the original article could become:

“<a href=”http://automattic.com/”>Automattic</a> Open Sources Natural Language Spell-Checker <a href=”http://www.afterthedeadline.com/”>After the Deadline</a>”

Seems less ambiguous to me.

Certainly less ambiguous to a search engine.

You?

Visual Clues: A Brain “feature,” not a “bug”

Saturday, September 29th, 2012

You will read in When Your Eyes Tell Your Hands What to Think: You’re Far Less in Control of Your Brain Than You Think that:

You’ve probably never given much thought to the fact that picking up your cup of morning coffee presents your brain with a set of complex decisions. You need to decide how to aim your hand, grasp the handle and raise the cup to your mouth, all without spilling the contents on your lap.

A new Northwestern University study shows that, not only does your brain handle such complex decisions for you, it also hides information from you about how those decisions are made.

“Our study gives a salient example,” said Yangqing ‘Lucie’ Xu, lead author of the study and a doctoral candidate in psychology at Northwestern. “When you pick up an object, your brain automatically decides how to control your muscles based on what your eyes provide about the object’s shape. When you pick up a mug by the handle with your right hand, you need to add a clockwise twist to your grip to compensate for the extra weight that you see on the left side of the mug.

“We showed that the use of this visual information is so powerful and automatic that we cannot turn it off. When people see an object weighted in one direction, they actually can’t help but ‘feel’ the weight in that direction, even when they know that we’re tricking them,” Xu said. (emphasis added)

I never quite trusted my brain and now I have proof that it is untrustworthy. Hiding stuff indeed! 😉

But that’s the trick of subject identification/identity isn’t it?

That our brains “recognize” all manner of subjects without any effort on our part.

Another part of the effortless features of our brains. But it hides the information we need to integrate information stores from ourselves and others.

Or rather, making it more work than we are usually willing to devote to digging it out.

When called upon to be “explicit” about subject identification, or even worse, to imagine how other people identify subjects, we prefer to stay at home consuming passive entertainment.

Two quick points:

First, need to think about how to incorporate this “feature” into delivery interfaces for users.

Second, what subjects would users pay others to mine/collate/identify for them? (Delivery being a separate issue.)

Dancing With Dirty Data Thanks to SAP Visual Intelligence [Kinds of Dirty?]

Saturday, September 22nd, 2012

Dancing With Dirty Data Thanks to SAP Visual Intelligence by Timo Elliott.

From the post:

(graphic omitted)

Here’s my entry for the SAP Ultimate Data Geek Challenge, a contest designed to “show off your inner geek and let the rest of world know your data skills are second to none.” There have already been lots of great submissions with people using the new SAP Visual Intelligence data discovery product.

I thought I’d focus on one of the things I find most powerful: the ability to create visualizations quickly and easily even from real-life, messy data sources. Since it’s election season in the US, I thought I’d use some polling data on whether voters believe the country is “headed in the right direction.” There is lots of different polling data on this (and other topics) available at pollingreport.com.

Below you can see the data set I grabbed: as you can see, the polling date field is particularly messy, since it has extra letters (e.g. RV for “registered voter”), includes polls that were carried out over several days, and is not consistent (the month is not always included, sometimes spaces around the middle dash, sometimes not…).

Take a closer look at Timo’s definition of “dirty” data: “…polling date field is particularly messy, since it has extra letters (e.g. RV for “registered voter”), includes polls that were carried out over several days, and is not consistent….”

Sure, that’s “dirty” data all right, but only one form of dirty data. It is dirty data that arises from typographical inconsistency. Inconsistency that prevents reliable automated processing.

Another form of dirty data arises from identifier inconsistency. That is one or more identifiers are used for the same subject, and/or the same identifier is used for different subjects.

I take the second form, identifier inconsistency to be distinct from typographical inconsistency. Can turn out to overlap but conceptually I find it helpful to distinguish the two.

Resolution of either form of inconsistency requires judgement about the reference being made by the identifiers.

Question: If you are resolving typographical inconsistency, do you keep a map of the resolution? If not, why not?

Question: Same questions for identifier inconsistency.

Identities and Identifications: Politicized Uses of Collective Identities

Monday, September 17th, 2012

Identities and Identifications: Politicized Uses of Collective Identities

Deadline for Panels 15 January 2013
Deadline for Papers 1 March 2013
Conference 18-20 April 2013, Zagreb, Croatia

From the call for panels and papers:

Identity is one of the crown jewelleries in the kingdom of ‘contested concepts’. The idea of identity is conceived to provide some unity and recognition while it also exists by separation and differentiation. Few concepts were used as much as identity for contradictory purposes. From the fragile individual identities as self-solidifying frameworks to layered in-group identifications in families, orders, organizations, religions, ethnic groups, regions, nation-states, supra-national entities or any other social entities, the idea of identity always shows up in the core of debates and makes everything either too dangerously simple or too complicated. Constructivist and de-constructivist strategies have led to the same result: the eternal return of the topic. Some say we should drop the concept, some say we should keep it and refine it, some say we should look at it in a dynamic fashion while some say it’s the reason for resistance to change.

If identities are socially constructed and not genuine formations, they still hold some responsibility for inclusion/exclusion – self/other nexuses. Looking at identities in a research oriented manner provides explanatory tolls for a wide variety of events and social dynamics. Identities reflect the complex nature of human societies and generate reasonable comprehension for processes that cannot be explained by tracing pure rational driven pursuit of interests. The feelings of attachment, belonging, recognition, the processes of values’ formation and norms integration, the logics of appropriateness generated in social organizations are all factors relying on a certain type of identity or identification. Multiple identifications overlap, interact, include or exclude, conflict or enhance cooperation. Identities create boundaries and borders; define the in-group and the out-group, the similar and the excluded, the friend and the threatening, the insider and the ‘other’.

Beyond their dynamic fuzzy nature that escapes exhaustive explanations, identities are effective instruments of politicization of social life. The construction of social forms of organization and of specific social practices together with their imaginary significations requires all the time an essentialist or non-essentialist legitimating act of belonging; a social glue that extracts its cohesive function from the identification of the in-group and the power of naming the other. Identities are political. Multicultural slogans populate extensively the twenty-first century yet the distance between the ideal and the real multiculturalism persists while the virtues of inclusion coexist with the adversity of exclusion. Dealing with the identities means to integrate contestation into contestation until potentially a n degree of contestation. Due to the confusion between identities and identifications some scholars demanded that the concept of identity shall be abandoned. Identitarian issues turned out to be efficient tools for politicization of a ‘constraining dissensus’ while universalizing terms included in the making of the identities usually tend or intend to obscure the localized origins of any identitarian project. Identities are often conceptually used as rather intentional concepts: they don’t say anything about their sphere but rather defining the sphere makes explicit the aim of their usage. It is not ‘identity of’ but ‘identity to’.

Quick! Someone get them a URL! 😉 Just teasing.

Enjoy the conference!

Author Identifiers (arXiv.org) [> one (1) identifier per subject]

Monday, September 10th, 2012

I happened upon an author who used an arXiv.org author identifier at their webpage.

From the arXiv.org page:

It is a long-term goal of arXiv to accurately identify and disambiguate all authors of all articles in arXiv. Such identification would provide accurate results for queries such as "show me all the other papers by the particular John Smith that wrote this paper", something that can be done only approximately with text-based searches. It would also permit construction of an author-article graph which is useful for relevance
assessment and bibliometric analysis.

Since 2005 arXiv has used authority records that associate user accounts with articles authored by that user. These records support the endorsement system. The use of public author identifiers as a way to build services upon this data is new in 2009. Initially, users must opt-in to have a public author identifier and to expose the record of their articles on arXiv for use in other services. At some later date we hope to be able to improve our authority records to the point where we can create public author identifiers for all authors of arXiv articles without needing to enlist the help of each author to check their record before opting in.

The services we offer based on author identifiers are:

Significant enough in its own right but note the plans for the future:

The following enhancements and interoperability features are planned:

  • arXiv will permit authors to record other identifiers they have in other schemes and include these in the data feeds. This will allow agents and systems to link together the same author in different databases.
  • arXiv will support mechanisms for handling name changes, combination of accidentally created duplicates and separation of accidentally combined identifiers.

Recoding other identifiers? What? Acknowledge that there can be more than one identifier (yours) per subject? Blasphemy! 😉

Go arXiv.org!

I am sure suggestions, support, contributions, etc., would be most welcome.

Author Identifiers (At Least for CS)

Tuesday, September 4th, 2012

I enhanced the VLDB 2012 program with author queries to the DBLP Computer Science Bibliography for my own purposes.

After using that listing myself for a few days, it occurred to me that I should be using DBLP entries as author identifiers throughout my posts, at least when such entries exist.

For several reasons, but mostly:

  • DBLP maintains the publication listings (not by me!)
  • DBLP maintains pointers to other databases and resources (also not by me!)
  • DBLP maintains advanced search capabilities beyond authors (again, not by me!)

If you noticed not by me forming a pattern, you would be correct. There is a pattern.

The pattern?

Using DBLP author pages as identifiers, I leverage on (not duplicate) the work of the DBLP project.

To the benefit of my readers. (Not to mention myself.)

The DBLP link brings an author’s publication history, their co-authors, and additional bibliographic resources. (That’s a triple I like.)

It takes a moment to insert the link but the payoff is substantial.

When you cite a CS author in your blog, include their DBLP link. We will all thank you for it.

(I did that once upon a time but lapsed. Will be cleaning up older entries and trying to do better in the future.)

PS: Similar sources of identifiers for other disciplines?