Archive for the ‘Identification’ Category

Open Ownership Project

Thursday, November 9th, 2017

Open Ownership Project

From about page:

OpenOwnership is driven by a steering group composed of leading transparency NGOs, including Global Witness, Open Contracting Partnership, Web Foundation, Transparency International, the ONE Campaign, and the B Team, as well as OpenCorporates.

OpenOwnership’s central goal is to build an open Global Beneficial Ownership Register, which will serve as an authoritative source of data about who owns companies, for the benefit of all. This data will be global and linked across jurisdictions, industries, and linkable to other datasets too.

Alongside the register, OpenOwnership is developing a universal and open data standard for beneficial ownership, providing a solid conceptual and practical foundation for collecting and publishing beneficial ownership data.

I first visited the Open Ownership Project site following two (of four) posts on verifying beneficial ownership.

What we really mean when we talk about verification (Part 1 of 4) by Zosia Sztykowski and Chris Taggart.

From the post:

This is the first of a series of blog posts in which we will discuss the critical but tricky issue of verification, particularly with respect to beneficial ownership.

‘Verification’ is frequently said to be a critical step in generating high-quality beneficial ownership information. What’s less clear is what is actually meant by verification, and what are the key factors in the process. In fact, verification is not one step, but three:

  1. Ensuring that the person making a statement about beneficial ownership is who they say they are, and that they have the right to make the claim (authentication and authorization);

  2. Ensuring that the data submitted is a legitimate possible value (validation);

  3. Verifying that the statement made is actually true (which we will call truth verification).

Another critical factor is whether these processes are done on individual filings, typically hand-written pieces of paper, or their PDF equivalents, or whole datasets of beneficial ownership data. While verification processes are possible on individual filings, this series will show that that public, digital, structured beneficial ownership data adds an additional layer of verification not possible with traditional filings.

Understanding precisely how verification takes place in the lifecycle of a beneficial ownership datum is an important step in knowing what beneficial ownership data can tell us about the world. Each of the stages above will be covered in more detail in this series, but let’s linger on the final one for a moment.

What we really mean when we talk about verification: Authentication & authorization (Part 2 of 4)

In the first post in this series on the principles of verification, particularly relating to beneficial ownership, we explained why there is no guarantee that any piece of beneficial ownership data is the absolute truth.

The data collected is still valuable, however, providing it is made available publicly as open data, as it exposes lies and half-truths to public scrutiny, raising red flags that indicate potential criminal or unethical activity.

We discussed a three-step process of verification:

  1. Ensuring that the person making a statement about beneficial ownership is who they say they are (authentication), and that they have the right to make the claim (authorization);

  2. Ensuring that the data submitted is a legitimate possible value (validation);

  3. Verifying that the statement made is actually true (which we will call truth verification).

In this blog post, we will discuss the first of these, focusing on how to tell who is actually making the claims, and whether they are authorized to do so.

When authentication and authorization have been done, you can approach the information with more confidence. Without them, you may have little better than anonymous statements. Critically, with them, you can also increase the risks for those who wish to hide their true identities and the nature of their control of companies.

Parts 3 and 4 are forthcoming (as of 9 November 2017).

A beta version of the Beneficial Ownership Data Standard (BODS) was released last April (2017). A general overview appeared in June, 2017: Introducing the Beneficial Ownership Data Standard.

Identity issues are rife in ownership data so when planning your volunteer activity for 2018, keep the Open Ownership project in mind.

Another Betrayal By Cellphone – Personal Identity

Sunday, June 26th, 2016

Normal operation of the cell phone in your pocket betrays your physical location. Your location is calculated by a process known as cell phone tower triangulation. In addition to giving away your location, research shows your cell phone can betray your personal identity as well.

The abstract from: Person Identification Based on Hand Tremor Characteristics by Oana Miu, Adrian Zamfir, Corneliu Florea, reads:

A plethora of biometric measures have been proposed in the past. In this paper we introduce a new potential biometric measure: the human tremor. We present a new method for identifying the user of a handheld device using characteristics of the hand tremor measured with a smartphone built-in inertial sensors (accelerometers and gyroscopes). The main challenge of the proposed method is related to the fact that human normal tremor is very subtle while we aim to address real-life scenarios. To properly address the issue, we have relied on weighted Fourier linear combiner for retrieving only the tremor data from the hand movement and random forest for actual recognition. We have evaluated our method on a database with 10 000 samples from 17 persons reaching an accuracy of 76%.

The authors emphasize the limited size of their dataset and unexplored issues, but with an accuracy of 76% in identification mode and 98% in authentication (matching tremor to user in the database) mode, this approach merits further investigation.

Recording tremor data required no physical modification of the cell phones, only installation of an application that captured gyroscope and accelerometer data.

Before the targeting community gets too excited about having cell phone location and personal identify via tremor data, the authors do point out that personal tremor data can be recorded and used to defeat identification.

It maybe that hand tremor isn’t the killer identification mechanism but what if it were considered to be one factor of identification?

That is that hand tremor, plus location (say root terminal), plus a password, are all required for a successful login.

Building on our understanding from topic maps that identification isn’t ever a single factor, but can be multiple factors in different perspectives.

In that sense, two-factor identification demonstrates how lame our typical understanding of identity is in fact.

Knowing the Name of Something vs. Knowing How To Identify Something

Wednesday, November 18th, 2015

Richard Feynman: The Difference Between Knowing the Name of Something and Knowing Something

From the post:

In this short clip (below), Feynman articulates the difference between knowing the name of something and understanding it.

See that bird? It’s a brown-throated thrush, but in Germany it’s called a halzenfugel, and in Chinese they call it a chung ling and even if you know all those names for it, you still know nothing about the bird. You only know something about people; what they call the bird. Now that thrush sings, and teaches its young to fly, and flies so many miles away during the summer across the country, and nobody knows how it finds its way.

Knowing the name of something doesn’t mean you understand it. We talk in fact-deficient, obfuscating generalities to cover up our lack of understanding.

You won’t get to see the Feynman quote live because it has been blocked by BBC Worldwide on copyright grounds. No doubt they make a bag full of money every week off that 179 second clip of Feynman.

The stronger point for Feynman would be to point out that you can’t recognize anything on the basis of knowing a name.

I may be sitting next to Cindy Lou Who on the bus but knowing her name isn’t going to help me to recognize her.

Knowing the name of someone or something isn’t useful unless you know something about the person or thing you associate with a name.

That is you know when it is appropriate to use the name you have learned and when to say: “Sorry, I don’t know your name or the name of (indicating in some manner).” At which point you will learn a new name and store a new set of properties to know when to use that name, instead of any other name you know.

Everyone does that exercise, learning new names and the properties that establish when it is appropriate to use a particular name. And we do so seamlessly.

So seamlessly that when called upon to make explicit “how” we know which name to use, subject identification in other words, it takes a lot of effort.

It’s enough effort that it should be done only when necessary and when we can show the user an immediate semantic ROI for their effort.

More on this to follow.

You do not want to be an edge case [The True Skynet: Your Homogenized Future]

Friday, November 13th, 2015

You do not want to be an edge case.

John D. Cook writes:

Hilary Mason made an important observation on Twitter a few days ago:

You do not want to be an edge case in this future we are building.

Systems run by algorithms can be more efficient on average, but make life harder on the edge cases, people who are exceptions to the system developers’ expectations.

Algorithms, whether encoded in software or in rigid bureaucratic processes, can unwittingly discriminate against minorities. The problem isn’t recognized minorities, such as racial minorities or the disabled, but unrecognized minorities, people who were overlooked.

For example, two twins were recently prevented from getting their drivers licenses because DMV software couldn’t tell their photos apart. Surely the people who wrote the software harbored no malice toward twins. They just didn’t anticipate that two drivers licence applicants could have indistinguishable photos.

I imagine most people reading this have had difficulty with software (or bureaucratic procedures) that didn’t anticipate something about them; everyone is an edge case in some context. Maybe you don’t have a middle name, but a form insists you cannot leave the middle name field blank. Maybe there are more letters in your name or more children in your family than a programmer anticipated. Maybe you choose not to use some technology that “everybody” uses. Maybe you happen to have a social security number that hashes to a value that causes a program to crash.

When software routinely fails, there obviously has to have a human override. But as software improves for most people, there’s less apparent need to make provision for the exceptional cases. So things could get harder for edge cases as they get better for more people.

Recent advances in machine learning have led reputable thinkers (Steven Hawking for example) to envision a future where an artificial intelligence will arise to dispense with humanity.

If you think you have heard that theme before, you have, most recently as Skynet, an entirely fictional creation in the Terminator science fiction series.

Given that no one knows how the human brain works, much less how intelligence arises, despite such alarmist claims making good press, the risk is less than a rogue black hole or a gamma-ray burst. I don’t lose sleep over either one of those, do you?

The greater “Skynet” threat to people and their cultures is the enforced homogenization of language and culture.

John mentions lacking a middle name but consider the complexities of Japanese names. Due to the creeping infection of Western culture and computer-based standardization, many Japanese list their names in Western order, given name, family name, instead of the Japanese order of family name, given name.

Even languages can start the slide to being “edge cases,” as you will see from the erosion of Hangul (Korean alphabet) from public signs in Seoul.

Computers could be preserving languages and cultural traditions, they have the capacity and infinite patience.

But they are not being used for that purpose.

Cellphones, for example, are linking humanity into a seething mass of impoverished social interaction. Impoverished social interaction that is creating more homogenized languages, not preserving diverse ones.

Not only should you be an edge case but you should push back against the homogenizing impact of computers. The diversity we lose could well be your own.

Identifiers as Shorthand for Identifications

Tuesday, June 2nd, 2015

I closed Identifiers vs. Identifications? saying:

Many questions remain, such as how to provide for collections of sets “of properties which provide clues for establishing identity?,” how to make those collections extensible?, how to provide for constraints on such sets?, where to record “matching” (read “merging”) rules?, what other advantages can be offered?

In answering those questions, I think we need to keep in mind that identifiers and identifications lie along a continuum that runs from where we “know” what is meant by an identifier to where we ourselves need a full identification to know what is being discussed. A useful answer won’t be one or the other, but a pairing that suits a particular circumstance and use case.

You can also think of identifiers as a form of shorthand for an identification. If we were working together in a fairly small office, you would probably ask, “Is Patrick in?” rather than listing all the properties that would serve as an identification for me. So all the properties that make up an identification are unspoken but invoked by the use of the identifier.

Works quite well in a small office because to some varying degree, we would all share the identifications that are represented by the identifiers we use in everyday conversation.

That sharing of identifications behind identifiers doesn’t happen in information systems, unless we have explicitly added identifications behind those identifiers.

One problem we need to solve is how to associate an identification with an identifier or identifiers. Looking only slightly ahead, we could use an explicit mechanism like a TMDM association, if we wanted to be able to talk about the subject of the relationship between an identifier and the identification that lies behind it.

But we are not compelled to talk about such a subject and could declare by rule that within a container, an identifier is a shorthand for properties of an identification in the same container. That assumes the identifier is distinguished from the properties that make up the identification. I don’t think we need to reinvent the notions of essential vs. accidental properties but merging rules should call out what properties are required for merging.

The wary reader will have suspected before now that many (if not all) of the terms in such a container could be considered as identifiers in and of themselves. Suddenly they are trying to struggle uphill from a swamp of subject recursion. It is “elephants all the way down.”

Have no fear! Just as we can avoid using TMDM associations to mark the relationship between an identifier and the properties making up an identification, we need use containers for identifiers and identifications only when and where we choose.

In some circumstances we may use bare identifiers, sans any identifications and yet add identifications when circumstances warrant it.

No level, identifiers, an identification, an identification that explodes other identifiers, etc., is right for every purpose. Each may be appropriate for some particular purpose.

We need to allow for downward expansion in the form of additional containers along side the containers we author, as well as extension of containers to add sub-containers for identifiers and identifications we did not or chose not to author.

I do have an underlying assumption that may reassure you about the notion of downward expansion of identifier/identification containers:

Processing of one or more containers of identifiers and identifications can choose the level of identifiers + identifications to be processed.

For some purposes I may only want to choose “top level” identifiers and identifications or even just parts of identifications. For example, think of the simple mapping of identifiers that happens in some search systems. You may examine the identifications for identifiers and then produce a bare mapping of identifiers for processing purposes. Or you may have rules for identifications that produce a mapping of identifiers.

Let’s assume that I want to create a set of the identifiers for Pentane and so I query for the identifiers that have the molecular property C5H12. Some of the identifiers (with their scopes) returned will be: Beilstein Reference 969132, CAS Registry Number 109-66-0, ChEBI CHEBI:37830, ChEMBL ChEMBL16102, ChemSpider 7712, DrugBank DB03119.

Each one of those identifiers may have other properties in their associated identifications, but there is no requirement that I produce them.

I mentioned that identifiers have scope. If you perform a search on “109-66-0” (CAS Registry Number) or 7712 (ChemSpider) you will quickly find garbage. Some identifiers are useful only with particular data sources or in circumstances where the data source is identified. (The idea of “universal” identifiers is a recurrent human fiction. See The Search for the Perfect Language, Eco.)

Which means, of course, we will need to capture the scope of identifiers.

Fifty Words for Databases

Saturday, March 7th, 2015

Fifty Words for Databases by Phil Factor

From the post:

Almost every human endeavour seems simple from a distance: even database deployment. Reality always comes as a shock, because the closer you get to any real task, the more you come to appreciate the skills that are necessary to accomplish it.

One of the big surprises I have when I attend developer conferences is to be told by experts how easy it is to take a database from development and turn it into a production system, and then implement the processes that allow it to be upgraded safely. Occasionally, I’ve been so puzzled that I’ve drawn the speakers to one side after the presentation to ask them for the details of how to do it so effortlessly, mentioning a few of the tricky aspects I’ve hit. Invariably, it soon becomes apparent from their answers that their experience, from which they’ve extrapolated, is of databases the size of a spreadsheet with no complicated interdependencies, compliance issues, security complications, high-availability mechanisms, agent tasks, alerting systems, complex partitioning, queuing, replication, downstream analysis dependencies and so on about which you, the readers, know more than I. At the vast international enterprise where I once worked in IT, we had a coded insult for such people: ‘They’ve catalogued their CD collection in a database’. Unfair, unkind, but even a huge well-used ‘Big Data’ database dealing in social media is a tame and docile creature compared with a heavily- used OLTP trading system where any downtime or bug means figures for losses where you have to count the trailing zeros. The former has unique problems, of course, but the two types of database are so different.

I wonder if the problem is one of language. Just as the English have fifty ways of describing rainfall, and the Inuit have many ways of describing pack ice, it is about time that we created the language for a variety of databases from a mild drizzle (‘It is a soft morning to be sure’) to a cloud-burst. Until anyone pontificating about the database lifecycle can give their audience an indication of the type of database they’re referring to, we will continue to suffer the sort of misunderstandings that so frustrate the development process. Though I’m totally convinced that the development culture should cross-pollinate far more with the science of IT operations, It will need more than a DevOps group-hug; it will require a change in the technical language so that it can accurately describe the rich variety of databases in operational use and their widely- varying requirements. The current friction is surely due more to misunderstandings on both sides, because it is so difficult to communicate these requirements. Any suggestions for suitable descriptive words for types of database? (emphasis added)

If you have “descriptive words” to suggest to Phil, comment on his post.

With the realization that your “descriptive words” may be different from my “descriptive words” for the same database or mean a different database altogether or have nothing to do with databases at all (when viewed by others).

Yes, I have been thinking about identifiers, again, and will start off the coming week with a new series of posts on subject identification. I hope to include a proposal for a metric of subject identification.

Deep Visual-Semantic Alignments for Generating Image Descriptions

Friday, November 21st, 2014

Deep Visual-Semantic Alignments for Generating Image Descriptions by Andrej Karpathy and Li Fei-Fei.

From the webpage:

We present a model that generates free-form natural language descriptions of image regions. Our model leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between text and visual data. Our approach is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate the effectiveness of our alignment model with ranking experiments on Flickr8K, Flickr30K and COCO datasets, where we substantially improve on the state of the art. We then show that the sentences created by our generative model outperform retrieval baselines on the three aforementioned datasets and a new dataset of region-level annotations.

Excellent examples with generated text. Code and other predictions “coming soon.”

For the moment you can also read the research paper: Deep Visual-Semantic Alignments for Generating Image Descriptions

Serious potential in any event but even more so if the semantics of the descriptions could be captured and mapped across natural languages.

Improving GitHub for science

Thursday, June 19th, 2014

Improving GitHub for science

From the post:

GitHub is being used today to build scientific software that’s helping find Earth-like planets in other solar systems, analyze DNA, and build open source rockets.

Seeing these projects and all this momentum within academia has pushed us to think about how we can make GitHub a better tool for research. As scientific experiments become more complex and their datasets grow, researchers are spending more of their time writing tools and software to analyze the data they collect. Right now though, these efforts often happen in isolation.

Citable code for academic software

Sharing your work is good, but collaborating while also getting required academic credit is even better. Over the past couple of months we’ve been working with the Mozilla Science Lab and data archivers, Figshare and Zenodo, to make it possible to get a Digital Object Identifier (DOI) for any GitHub repository archive.

DOIs form the backbone of the academic reference and metrics system. With a DOI for your GitHub repository archive, your code becomes citable. Our newest Guide explains how to create a DOI for your repository.

A move in the right direction to be sure but how much of a move is open to question.

Think of a DOI as the equivalent to a International Standard Book Number (ISBN). Using that as an identifier, you are sure to find a book that I cite.

But if the book is several hundred pages long, you may find my “citing it” by an ISBN identifier alone isn’t quite good enough.

The same will be true for some citations using DOIs for Github repositories. Better than nothing at all, but falls short of a robust identifier for material within a Github archive.

I first saw this in a tweet by Peter Kraker.

Developing a 21st Century Global Library for Mathematics Research

Thursday, April 3rd, 2014

Developing a 21st Century Global Library for Mathematics Research by Committee on Planning a Global Library of the Mathematical Sciences.

Care to guess what one of the major problems facing mathematical research might be?

Currently, there are no satisfactory indexes of many mathematical objects, including symbols and their uses, formulas, equations, theorems, and proofs, and systematically labeling them is challenging and, as of yet, unsolved. In many fields where there are more specialized objects (such as groups, rings, fields), there are community efforts to index these, but they are typically not machine-readable, reusable, or easily integrated with other tools and are often lacking editorial efforts. So, the issue is how to identify existing lists that are useful and valuable and provide some central guidance for further development and maintenance of such lists. (p. 26)

Does that surprise you?

What do you think the odds are of mathematical research slowing down enough for committees to decide on universal identifiers for all the subjects in mathematical publications?

That’s about what I thought.

I have a different solution: Why not ask mathematicians who are submitting articles for publication to identity (specify properties for) what they consider to be the important subjects in their article?

The authors have the knowledge and skill, not to mention the motivation of wanting their research to be easily found by others.

Over time I suspect that particular fields will develop standard identifications (sets of properties per subject) that mathematicians can reuse to save themselves time when publishing.

Mappings across those sets of properties will be needed but that can be the task of journals, researchers and indexers who have an interest and skill in that sort of enterprise.

As opposed to having a “boil the ocean” approach that tries to do more than any one project is capable of doing competently.

Distributed subject identification is one way to think about it. We already do it, this would be a semi-formalization of that process and writing down what each author already knows.


PS: I suspect the condition recited above is true for almost any sufficiently large field of study. A set of 150 million entities sounds large only without context. In the context of of science, it is a trivial number of entities.

Hash-URIs for Verifiable, Immutable, and Permanent Digital Artifacts

Thursday, January 23rd, 2014

Hash-URIs for Verifiable, Immutable, and Permanent Digital Artifacts by Tobias Kuhn and Michel Dumontier.


To make digital resources on the web verifiable, immutable, and permanent, we propose a technique to include cryptographic hash values in URIs. We show how such hash-URIs can be used for approaches like nanopublications to make not only specific resources but their entire reference trees verifiable. Digital resources can be identified not only on the byte level but on more abstract levels, which means that resources keep their hash values even when presented in a different format. Our approach sticks to the core principles of the web, namely openness and decentralized architecture, is fully compatible with existing standards and protocols, and can therefore be used right away. Evaluation of our reference implementations shows that these desired properties are indeed accomplished by our approach, and that it remains practical even for very large files.

I rather like the author’s summary of their approach:

our proposed approach boils down to the idea that references can be made completely unambiguous and veri able if they contain a hash value of the referenced digital artifact.

Hash-URIs (assuming proper generation) would be completely unambiguous and verifiable for digital artifacts.

However, the authors fail to notice two important issues with Hash-URIs:

  1. Hash-URIs are not human readable.
  2. Not being human readable means that mappings between Hash-URIs and other references to digital artifacts will be fragile and hard to maintain.

For example,

In prose an author will not say, “As found by “” (from the article).

In some publishing styles, authors will say: “…as a new way of scientifi c publishing [8].”

In other styles, authors will say: “Computable functions are therefore those “calculable by finite means” (Turing, 1936: 230).”

That is to say of necessity there will be a mapping between the unambiguous and verifiable reference (UVR) and the ones used by human authors/readers.

Moreover, should the mapping between UVRs and their human consumable equivalents be lost, recovery is possible but time consuming.

The author’s go to some lengths to demonstrate the use of Hash-URIs with RDF files. RDF is one approach among many to digital artifacts.

If the mapping issues between Hash-URIs and other identifiers can be addressed, a more general approach to digital artifacts would make this proposal more viable.

I first saw this in a tweet by Tobias Kuhn.

Wikibase DataModel released!

Friday, January 3rd, 2014

Wikibase DataModel released! by Jeroen De Dauw.

From the post:

I’m happy to announce the 0.6 release of Wikibase DataModel. This is the first real release of this component.


Wikibase is the software behind At its core, this software is about describing entities. Entities are collections of claims, which can have qualifiers, references and values of various different types. How this all fits together is described in the DataModel document written by Markus and Denny at the start of the project. The Wikibase DataModel component contains (PHP) domain objects representing entities and their various parts, as well as associated domain logic.

I wanted to draw your attention to this discussion of “items:”

Items are Entities that are typically represented by a Wikipage (at least in some Wikipedia languages). They can be viewed as “the thing that a Wikipage is about,” which could be an individual thing (the person Albert Einstein), a general class of things (the class of all Physicists), and any other concept that is the subject of some Wikipedia page (including things like History of Berlin).

The IRI of an Item will typically be closely related to the URL of its page on Wikidata. It is expected that Items store a shorter ID string (for example, as a title string in MediaWiki) that is used in both cases. ID strings might have a standardized technical format such as “wd1234567890” and will usually not be seen by users. The ID of an Item should be stable and not change after it has been created.

The exact meaning of an Item cannot be captured in Wikidata (or any technical system), but is discussed and decided on by the community of editors, just as it is done with the subject of Wikipedia articles now. It is possible that an Item has multiple “aspects” to its meaning. For example, the page Orca describes a species of whales. It can be viewed as a class of all Orca whales, and an individual whale such as Keiko would be an element of this class. On the other hand, the species Orca is also a concept about which we can make individual statements. For example, one could say that the binomial name (a Property) of the Orca species has the Value “Orcinus orca (Linnaeus, 1758).”

However, it is intended that the information stored in Wikidata is generally about the topic of the Item. For example, the Item for History of Berlin should store data about this history (if there is any such data), not about Berlin (the city). It is not intended that data about one subject is distributed across multiple Wikidata Items: each Item fully represents one thing. This also helps for data integration across languages: many languages have no separate article about Berlin’s history, but most have an article about Berlin.

What do you make of the claim:

The exact meaning of an Item cannot be captured in Wikidata (or any technical system), but is discussed and decided on by the community of editors, just as it is done with the subject of Wikipedia articles now. It is possible that an Item has multiple “aspects” to its meaning. For example, the page Orca describes a species of whales. It can be viewed as a class of all Orca whales, and an individual whale such as Keiko would be an element of this class. On the other hand, the species Orca is also a concept about which we can make individual statements. For example, one could say that the binomial name (a Property) of the Orca species has the Value “Orcinus orca (Linnaeus, 1758).”

I may write an information system that fails to distinguish between a species of whales, a class of whales and a particular whale, but that is a design choice, not a foregone conclusion.

In the case of Wikipedia, which relies upon individuals repeating the task of extracting relevant information from loosely gathered data, that approach words quite well.

But there isn’t one degree of precision of identification that works for all cases.

My suspicion is that for more demanding search applications, such as drug interactions, less precise identifications could lead to unfortunate, even fatal, results.


Thinking, Fast and Slow (Review) [And Subject Identity]

Friday, November 15th, 2013

A statistical review of ‘Thinking, Fast and Slow’ by Daniel Kahneman by Patrick Burns.

From the post:

We are good intuitive grammarians — even quite small children intuit language rules. We can see that from mistakes. For example: “I maked it” rather than the irregular “I made it”.

In contrast those of us who have training and decades of experience in statistics often get statistical problems wrong initially.

Why should there be such a difference?

Our brains evolved for survival. We have a mind that is exquisitely tuned for finding things to eat and for avoiding being eaten. It is a horrible instrument for finding truth. If we want to get to the truth, we shouldn’t start from here.

A remarkable aspect of your mental life is that you are rarely stumped. … you often have answers to questions that you do not completely understand, relying on evidence that you can neither explain nor defend.

The review focuses mainly on statistical issues in “Thinking Fast and Slow” but I think you will find it very entertaining.

I deeply appreciate Patrick’s quoting of:

A remarkable aspect of your mental life is that you are rarely stumped. … you often have answers to questions that you do not completely understand, relying on evidence that you can neither explain nor defend.

In particular:

…relying on evidence that you can neither explain nor defend.

which resonates with me on subject identification.

Think about how we search for subjects, which of necessity involves some notion of subject identity.

What if a colleague asks if they should consult the records of the Order of the Garter to find more information on “Lady Gaga?”

Not entirely unreasonable since “Lady” is conferred upon female recipients of the Order of the Garter.

No standard search technique would explain why your colleague should not start with the Order of the Garter records.

Although I think most of us would agree such a search would be far afield. 😉

Every search starts with a searcher relying upon what they “know,” suspect or guess to be facts about a “subject” to search on.

At the end of the search, the characteristics of the subject as found, turn out to be the characteristics we were searching for all along.

I say all that to suggest that we need not bother users to say how in fact to identity the objects of their searches.

Rather the question should be:

What pointers or contexts are the most helpful to you when searching? (May or may not be properties of the search objective.)

Recalling that properties of the search objective are how we explain successful searches, not how we perform them.

Calling upon users to explain or make explicit what they themselves don’t understand, seems like a poor strategy for adoption of topic maps.

Capturing what “works” for a user, without further explanation or difficulty seems like the better choice.

PS: Should anyone ask about “Lady Gaga,” you can mention that Glamour magazine featured her on its cover, naming her Woman of the Year (December 2013 issue). I know that only because of a trip to the local drug store for a flu shot.

Promised I would be “in and out” in minutes. Literally true I suppose, it only took 50 minutes with four other people present when I arrived.

I have a different appreciation of “minutes” from the pharmacy staff. 😉

Targeting Phishing Victims

Friday, July 26th, 2013

Profile of Likely E-mail Phishing Victims Emerges in Human Factors/Ergonomics Research

From the webpage:

The author of a paper to be presented at the upcoming 2013 International Human Factors and Ergonomics Society Annual Meeting has described behavioral, cognitive, and perceptual attributes of e-mail users who are vulnerable to phishing attacks. Phishing is the use of fraudulent e-mail correspondence to obtain passwords and credit card information, or to send viruses.

In “Keeping Up With the Joneses: Assessing Phishing Susceptibility in an E-mail Task,” Kyung Wha Hong, Christopher M. Kelley, Rucha Tembe, Emergson Murphy-Hill, and Christopher B. Mayhorn, discovered that people who were overconfident, introverted, or women were less able to accurately distinguish between legitimate and phishing e-mails. She had participants complete a personality survey and then asked them to scan through both legitimate and phishing e-mails and either delete suspicious or spam e-mails, leave legitimate e-mails as is, or mark e-mails that required actions or responses as “important.”

“The results showed a disconnect between confidence and actual skill, as the majority of participants were not only susceptible to attacks but also overconfident in their ability to protect themselves,” says Hong. Although 89% of the participants indicted they were confident in their ability to identify malicious e-mails, 92% of them misclassified phishing e-mails. Almost 52% in the study misclassified more than half the phishing e-mails, and 54% deleted at least one authentic e-mail.

I would say that “behavioral, cognitive, and perceptual attributes” are a basis for identifying users. Or at least a certain type of users as a class.

Or to put it another way, a class of users is just as much a subject for discussion in a topic map as any of user individually.

It may be more important, either for targeting users for exploitation or protection to treat them as a class than as individuals.

BTW, these attributes don’t sound amenable to IRI identifiers or binary assignment choices.

Connecting the Dots. Or not.

Friday, June 7th, 2013

Why Verizon?

The first question that came to mind when the Guardian broke the NSA-Verizon news.

Here’s why I ask:

Verizon market share


Verizon over 2011-2012 had only 34% of the cell phone market.

Unless terrorists prefer Verizon for ideological reasons, why Verizon?

Choosing only Verizon means the NSA is missing 66% of potential terrorist cell traffic.

That sounds like a bad plan.

What other reason could there be for picking Verizon?

Consider some other known players:

President Barack Obama, candidate for President of the United States, 2012.

“Bundlers” who gathered donations for Barack Obama:

Min Max Name City State Employer
$200,000 $500,000 Hill, David Silver Spring MD Verizon Communications
$200,000 $500,000 Brown, Kathryn Oakton VA Verizon Communications
$50,000 $100,000 Milch, Randal Bethesda MD Verizon Communications

(Source: – 2012 Presidential – Bundlers)

BTW, the Max category means more money may have been given, but that is the top reporting category.

I have informally “identified” the bundlers as follows:

  • Kathryn C. Brown

    Kathryn C. Brown is senior vice president – Public Policy Development and Corporate Responsibility. She has been with the company since June 2002. She is responsible for policy development and issues management, public policy messaging, strategic alliances and public affairs programs, including Verizon Reads.

    Ms. Brown is also responsible for federal, state and international public policy development and international government relations for Verizon. In that role she develops public policy positions and is responsible for project management on emerging domestic and international issues. She also manages relations with think tanks as well as consumer, industry and trade groups important to the public policy process.

  • David A. Hill, Bloomberg Business Week reports: David A. Hill serves as Director of Verizon Maryland Inc.

    LinkedIn profile reports David A. Hill worked for Verizon, VP & General Counsel (2000 – 2006), Associate General Counsel (March 2006 – 2009), Vice President & Associate General Counsel (March 2009 – September 2011) “Served as a liaison between Verizon and the Obama Administration”

  • Randal S. Milch Executive Vice President – Public Policy and General Counsel

What is Verizon making for each data delivery? Is this cash for cash given?

If someone gave your more than $1 million (how much more is unknown), would you talk to them about such a court order?

If you read the “secret” court order, you will notice it was signed on April 23, 2013.

There isn’t a Kathryn C. Brown in Oakton in the White House visitor’s log, but I did find this record, where a “Kathryn C. Brown” made an appointment at the Whitehouse and was seen two (2) days later on the 17th of January 2013.

BROWN,KATHRYN,C,U69535,,VA,,,,,1/15/13 0:00,1/17/13 9:30,1/17/13 23:59,,176,CM,WIN,1/15/13 11:27,CM,,POTUS/FLOTUS,WH,State Floo,MCNAMARALAWDER,CLAUDIA,,,04/26/2013

I don’t have all the dots connected because I am lacking some unknown # of the players, internal Verizon communications, Verizon accounting records showing government payments, but it is enough to make you wonder about the purpose of the “secret” court order.

Was it a serious attempt at gathering data for national security reasons?

Or was it gathering data as a pretext for payments to Verizon or other contractors?

My vote goes for “pretext for payments.”

I say that because using data from different sources has always been hard.

In fact, about 60 to 80% of the time of a data analyst is spent “cleaning up data” for further processing.

The phrase “cleaning up data” is the colloquial form of “semantic impedance.”

Semantic impedance happens when the same people are known by different names in different data sets or different people are known by the same names in the same or different data sets.

Remember Kathryn Brown, of Oakton, VA? One of the Obama bundlers. Let’s use her as an example of “semantic impedance.”

The FEC has a record for Kathryn Brown of Oakton, VA.

But a search engine found:

Kathryn C. Brown

Same person? Or different?

I found another Kathryn Brown at Facebook:

And an image of Facebook Kathryn Brown:

Kathryn Brown, Facebook

And a photo from a vacation she took:


Not to mention the Kathryn Brown that I found at Twitter.

Kathryn Brown, Twitter

That’s only four (4) data sources and I have at least four (4) different Kathryn Browns.

Across the United States, a quick search shows 227,000 Kathryn Browns.

Remember that is just a personal name. What about different forms of addresses? Or names of employers? Or job descriptions? Or simple errors, like the 20% error rate in credit report records.

Take all the phones, plus names, addresses, employers, job descriptions, errors + other data and multiply that times 311.6 million Americans.

Can that problem be solved with petabytes of data and teraflops of processing?

Not a chance.

Remember that my identification of Kathryn “bundler” Brown with the Kathryn C. Brown of Verison was a human judgement, not an automatic rule. Nor would a computer think to check the White House visitor logs to see if another, possibly the same Kathryn C. Brown visited the White House before the secret order was signed.

Human judgement is required because all the data that the NSA has been collecting is “dirty” data, from one perspective or other. Either is is truly “dirty” in the sense of having errors or it is “dirty” in the sense it doesn’t play well with other data.

The Orwellian fearists can stop huffing and puffing about the coming eclipse of civil liberties. Those passed from view a short time after 9/11 with the passage of the Patriot Act.

That wasn’t the fault of ineffectual NSA data collection. American voters bear responsibility for the loss of civil liberties not voting leadership into office that would repeal the Patriot Act.

Ineffectual NSA data collection impedes the development of techniques that for a sanely scoped data collection effort could make a difference.

A sane scope for preventing terrorist attacks could be starting with a set of known or suspected terrorist phone numbers. Using all phone data (not just from Obama contributors), only numbers contacting or being contacted by those numbers would be subject to further analysis.

Using that much smaller set of phone numbers as identifiers, we could then collect other data, such as names and addresses associated with that smaller set of phone numbers. That doesn’t make the data any cleaner but it does give us a starting point for mapping “dirty” data sets into our starter set.

The next step would be create mappings from other data sets. If we say why we have created a mapping, others can evaluate the accuracy of our mappings.

Those tasks would require computer assistance, but they ultimately would be matters of human judgement.

Examples of such judgements exist, say for example in Palantir product line. If you watch Palantir Gotham being used to model biological relationships, take note of the results that were tagged by another analyst. And how the presenter tags additional material that becomes available to other researchers.

Computer assisted? Yes. Computer driven? No.

To be fair, human judgement is also involved in ineffectual NSA data collection efforts.

But it is human judgement that rewards sycophants and supporters, not serving the public interest.

Reidentification as Basic Science

Friday, May 31st, 2013

Reidentification as Basic Science by Arvind Narayanan.

From the post:

What really drives reidentification researchers? Do we publish these demonstrations to alert individuals to privacy risks? To shame companies? For personal glory? If our goal is to improve privacy, are we doing it in the best way possible?

In this post I’d like to discuss my own motivations as a reidentification researcher, without speaking for anyone else. Certainly I care about improving privacy outcomes, in the sense of making sure that companies, governments and others don’t get away with mathematically unsound promises about the privacy of consumers’ data. But there is a quite different goal I care about at least as much: reidentification algorithms. These algorithms are my primary object of study, and so I see reidentification research partly as basic science.

Let me elaborate on why reidentification algorithms are interesting and important. First, they yield fundamental insights about people — our interests, preferences, behavior, and connections — as reflected in the datasets collected about us. Second, as is the case with most basic science, these algorithms turn out to have a variety of applications other than reidentification, both for good and bad. Let us consider some of these.


A nice introduction to the major contours of reidentification, which the IT Law Wiki defines as:

Data re-identification is the process by which personal data is matched with its true owner.

Although in topic map speak I would usually say that personal data was used to identify its owner.

In a reidentification context, some effort has been made to obscure that relationship, so matching may be the better usage.

Depending on your data sources, something you may encounter when building a topic map.

I first saw this at Pete Warden’s Five short links.

Finding Subject Identifiers

Monday, April 1st, 2013

A recent comment made it clear that tooling, or the lack thereof, is a real issue for topic maps.

Here is my first suggestion of a tool you can use while authoring a topic map:


Seriously, think about it. You want a URL that identifies subject X.

Granting that Wikipedia is a fairly limited set of subjects, it is at least a starting point.

Example: I want a subject identifier for “Donald Duck,” a cartoon character.

I can use the search box at Wikipedia or I can type in a browser:

Go ahead, try it.

If I don’t know the full name:

What do you think?

Allows you to disambiguate Donalds, at least the ones that Wikipedia knows about.

Not to mention giving you access to other subjects and relationships that may be of interest for your topic map.

To include foreign language materials (outside of English only non-thinking zones in the U.S.), try a different language Wikipedia:

Finding subject identifiers won’t write your topic map for you but can make the job easier.

There are other sources of subject identifiers so send in your suggestions and any syntax short-cuts for accessing them.

You have no doubt read that URIs used as identifiers are supposed to be semi-permanent, “cool,” etc.

But identifiers change over time. It’s one of the reasons for historical semantic diversity.

URIs as identifiers will change as well.

Good thing topic maps enable you to have multiple identifiers for any subject.

Means old references to old identifiers still work.

Glad we dodged having to redo and reproof all those old connections.

Aren’t you?

New DataCorps Project: Refugees United

Sunday, January 27th, 2013

New DataCorps Project: Refugees United

From the post:

We are thrilled to announce the kick-off of a new DataKind project with Refugees United! Refugees United is a fantastic organization that uses mobile and web technologies to help refugees find their missing loved ones. Currently, RU’s system allows people to post descriptions of their family and friends as well as to search for them on the site. As you might imagine, lots of data flows through this system – data that could be used to greatly improve the way people find each other. Lead by the ever-brilliant Max Shron, the DataKind team is collaborating with Refugees United to explore what their data can tell them about how people are using the site, how they’re connecting to one another and, ultimately, how it can be used to help people find each other more effectively.

We are incredibly excited to work on this project and will be posting updates for you all as things unfoled. In the meantime, learn a bit more about Max and Refugees United.

I can’t comment on the identity practices because:

Q: 1.08 Why isn’t Refugees United open source yet?

Refugees United was born as an “offline” open source project. When we started, we were two guys (now six guys and a girl in Copenhagen, joined by a much larger team worldwide) with a great idea that had the potential to positively impact thousands, if not millions, of lives. The open source approach came from the fact that we wanted to build the world’s smallest refugee agency with the largest outreach, and to have the highest impact at the lowest cost.

One way to reach our objectives is to work with corporations around that world, including Ericsson, SAP, FedEx and others. The invaluable advice and expertise provided by these successful businesses – both the largest corporations and the smallest companies – have helped us to apply the structure and strategy of business to the passion and vision of an NGO.

Now the time has come for us to apply same structure to our software, and we have begun to collaborate with some of the wonderfully brilliant minds out there who wish to contribute and help us make a difference in the development of our technologies.

I am not sure what ‘”offline” open source’ means? The rest of the quoted prose doesn’t help.

Perhaps the software will become available online. At some point.

Would be a interesting data point to see how they are managing personal subject identity.

The Adams Workflow

Saturday, January 26th, 2013

The Adams Workflow

From the webpage:

The Advanced Data mining And Machine learning System (ADAMS) is a novel, flexible workflow engine aimed at quickly building and maintaining real-world, complex knowledge workflows.

Same source as WEKA.

What if we think about identification as workflow?

Whatever stability we attribute to an identification is the absence of additional data that would create a change.

Looking backwards over prior identifications, we fit them into the schema of our present identification and that eliminates any movement from the past. The past is fixed and terminates in our present identification.

That view fails to appreciate the world isn’t going to end with any of us individually. The world and its information systems will continue, as will the workflow that defines identifications.

Replacing our identifications with newer ones.

The question we face is whether our actions will support or impede re-use of our identifications in the future.

I first saw Adams Workflow at Nat Torkington’s Four short links: 24 January 2013.

Chemical datuments as scientific enablers

Friday, January 25th, 2013

Chemical datuments as scientific enablers by Henry S Rzepa. (Journal of Cheminformatics 2013, 5:6 doi:10.1186/1758-2946-5-6)


This article is an attempt to construct a chemical datument as a means of presenting insights into chemical phenomena in a scientific journal. An exploration of the interactions present in a small fragment of duplex Z-DNA and the nature of the catalytic centre of a carbon-dioxide/alkene epoxide alternating co-polymerisation is presented in this datument, with examples of the use of three software tools, one based on Java, the other two using Javascript and HTML5 technologies. The implications for the evolution of scientific journals are discussed.

From the background:

Chemical sciences are often considered to stand at the crossroads of paths to many disciplines, including molecular and life sciences, materials and polymer sciences, physics, mathematical and computer sciences. As a research discipline, chemistry has itself evolved over the last few decades to focus its metaphorical microscope on both far larger and more complex molecular systems than previously attempted, as well as uncovering a far more subtle understanding of the quantum mechanical underpinnings of even the smallest of molecules. Both these extremes, and everything in between, rely heavily on data. Data in turn is often presented in the form of visual or temporal models that are constructed to illustrate molecular behaviour and the scientific semantics. In the present article, I argue that the mechanisms for sharing both the underlying data, and the (semantic) models between scientists need to evolve in parallel with the increasing complexity of these models. Put simply, the main exchange mechanism, the scientific journal, is accepted [1] as seriously lagging behind in its fitness for purpose. It is in urgent need of reinvention; one experiment in such was presented as a data-rich chemical exploratorium [2]. My case here in this article will be based on my recent research experiences in two specific areas. The first involves a detailed analysis of the inner kernel of the Z-DNA duplex using modern techniques for interpreting the electronic properties of a molecule. The second recounts the experiences learnt from modelling the catalysed alternating co-polymerisation of an alkene epoxide and carbon dioxide.

Effective sharing of data, in scientific journals or no, requires either a common semantic (we know that’s uncommon) or a mapping between semantics (how may times must we repeat the same mappings, separately?).

Embedding notions of subject identity and mapping between identifications in chemical datuments could increase the reuse of data, as well as its longevity.

Open Source Natural Language Spell-Checker [Disambiguation at the point of origin.]

Friday, October 26th, 2012

Automattic Open Sources Natural Language Spell-Checker After the Deadline by Jolie O’Dell.

I am sure the original headline made sense to its author, but I wonder how a natural language processor would react to it?

My reaction, being innocent of any prior knowledge of the actors or the software was: What deadline? Reading it as a report of a missed deadline.

It is almost a “who’s on first” type headline. The software’s name is “After the Deadline.”

That confusion resolved, I read:

Matt Mullenweg has just annouced on his blog that WordPress parent company Automattic is open sourcing After the Deadline, a natural-language spell-checking plugin for WordPress and TinyMCE that was only recently ushered into the Automattic fold.

Scarcely seven weeks after its acquisition was announced, After the Deadline’s core technology is being released under the GPL. Moreover, writes Mullenweg, “There’s also a new jQuery API that makes it easy to integrate with any text area.”

Interested parties can check out this demo or read the tech overview and grab the source code here.

I can use spelling/grammar suggestions. Particularly since I make the same mistakes over and over again.

Does that also mean I talk about the same subjects/entities over and over again? Or at least a limited range of subjects/entities?

Imagine a user configurable subject/entity “checker” that annotated recognized subjects/entities with an <a> element. Enabling the user to accept/reject the annotation.

Disambiguation at the point of origin.

The title of the original article could become:

“<a href=””>Automattic</a> Open Sources Natural Language Spell-Checker <a href=””>After the Deadline</a>”

Seems less ambiguous to me.

Certainly less ambiguous to a search engine.


Visual Clues: A Brain “feature,” not a “bug”

Saturday, September 29th, 2012

You will read in When Your Eyes Tell Your Hands What to Think: You’re Far Less in Control of Your Brain Than You Think that:

You’ve probably never given much thought to the fact that picking up your cup of morning coffee presents your brain with a set of complex decisions. You need to decide how to aim your hand, grasp the handle and raise the cup to your mouth, all without spilling the contents on your lap.

A new Northwestern University study shows that, not only does your brain handle such complex decisions for you, it also hides information from you about how those decisions are made.

“Our study gives a salient example,” said Yangqing ‘Lucie’ Xu, lead author of the study and a doctoral candidate in psychology at Northwestern. “When you pick up an object, your brain automatically decides how to control your muscles based on what your eyes provide about the object’s shape. When you pick up a mug by the handle with your right hand, you need to add a clockwise twist to your grip to compensate for the extra weight that you see on the left side of the mug.

“We showed that the use of this visual information is so powerful and automatic that we cannot turn it off. When people see an object weighted in one direction, they actually can’t help but ‘feel’ the weight in that direction, even when they know that we’re tricking them,” Xu said. (emphasis added)

I never quite trusted my brain and now I have proof that it is untrustworthy. Hiding stuff indeed! 😉

But that’s the trick of subject identification/identity isn’t it?

That our brains “recognize” all manner of subjects without any effort on our part.

Another part of the effortless features of our brains. But it hides the information we need to integrate information stores from ourselves and others.

Or rather, making it more work than we are usually willing to devote to digging it out.

When called upon to be “explicit” about subject identification, or even worse, to imagine how other people identify subjects, we prefer to stay at home consuming passive entertainment.

Two quick points:

First, need to think about how to incorporate this “feature” into delivery interfaces for users.

Second, what subjects would users pay others to mine/collate/identify for them? (Delivery being a separate issue.)

Identities and Identifications: Politicized Uses of Collective Identities

Monday, September 17th, 2012

Identities and Identifications: Politicized Uses of Collective Identities

Deadline for Panels 15 January 2013
Deadline for Papers 1 March 2013
Conference 18-20 April 2013, Zagreb, Croatia

From the call for panels and papers:

Identity is one of the crown jewelleries in the kingdom of ‘contested concepts’. The idea of identity is conceived to provide some unity and recognition while it also exists by separation and differentiation. Few concepts were used as much as identity for contradictory purposes. From the fragile individual identities as self-solidifying frameworks to layered in-group identifications in families, orders, organizations, religions, ethnic groups, regions, nation-states, supra-national entities or any other social entities, the idea of identity always shows up in the core of debates and makes everything either too dangerously simple or too complicated. Constructivist and de-constructivist strategies have led to the same result: the eternal return of the topic. Some say we should drop the concept, some say we should keep it and refine it, some say we should look at it in a dynamic fashion while some say it’s the reason for resistance to change.

If identities are socially constructed and not genuine formations, they still hold some responsibility for inclusion/exclusion – self/other nexuses. Looking at identities in a research oriented manner provides explanatory tolls for a wide variety of events and social dynamics. Identities reflect the complex nature of human societies and generate reasonable comprehension for processes that cannot be explained by tracing pure rational driven pursuit of interests. The feelings of attachment, belonging, recognition, the processes of values’ formation and norms integration, the logics of appropriateness generated in social organizations are all factors relying on a certain type of identity or identification. Multiple identifications overlap, interact, include or exclude, conflict or enhance cooperation. Identities create boundaries and borders; define the in-group and the out-group, the similar and the excluded, the friend and the threatening, the insider and the ‘other’.

Beyond their dynamic fuzzy nature that escapes exhaustive explanations, identities are effective instruments of politicization of social life. The construction of social forms of organization and of specific social practices together with their imaginary significations requires all the time an essentialist or non-essentialist legitimating act of belonging; a social glue that extracts its cohesive function from the identification of the in-group and the power of naming the other. Identities are political. Multicultural slogans populate extensively the twenty-first century yet the distance between the ideal and the real multiculturalism persists while the virtues of inclusion coexist with the adversity of exclusion. Dealing with the identities means to integrate contestation into contestation until potentially a n degree of contestation. Due to the confusion between identities and identifications some scholars demanded that the concept of identity shall be abandoned. Identitarian issues turned out to be efficient tools for politicization of a ‘constraining dissensus’ while universalizing terms included in the making of the identities usually tend or intend to obscure the localized origins of any identitarian project. Identities are often conceptually used as rather intentional concepts: they don’t say anything about their sphere but rather defining the sphere makes explicit the aim of their usage. It is not ‘identity of’ but ‘identity to’.

Quick! Someone get them a URL! 😉 Just teasing.

Enjoy the conference!

Author Identifiers ( [> one (1) identifier per subject]

Monday, September 10th, 2012

I happened upon an author who used an author identifier at their webpage.

From the page:

It is a long-term goal of arXiv to accurately identify and disambiguate all authors of all articles in arXiv. Such identification would provide accurate results for queries such as "show me all the other papers by the particular John Smith that wrote this paper", something that can be done only approximately with text-based searches. It would also permit construction of an author-article graph which is useful for relevance
assessment and bibliometric analysis.

Since 2005 arXiv has used authority records that associate user accounts with articles authored by that user. These records support the endorsement system. The use of public author identifiers as a way to build services upon this data is new in 2009. Initially, users must opt-in to have a public author identifier and to expose the record of their articles on arXiv for use in other services. At some later date we hope to be able to improve our authority records to the point where we can create public author identifiers for all authors of arXiv articles without needing to enlist the help of each author to check their record before opting in.

The services we offer based on author identifiers are:

Significant enough in its own right but note the plans for the future:

The following enhancements and interoperability features are planned:

  • arXiv will permit authors to record other identifiers they have in other schemes and include these in the data feeds. This will allow agents and systems to link together the same author in different databases.
  • arXiv will support mechanisms for handling name changes, combination of accidentally created duplicates and separation of accidentally combined identifiers.

Recoding other identifiers? What? Acknowledge that there can be more than one identifier (yours) per subject? Blasphemy! 😉


I am sure suggestions, support, contributions, etc., would be most welcome.

‘The Algorithm That Runs the World’ [Optimization, Identity and Polytopes]

Tuesday, August 28th, 2012

“The Algorithm That Runs the World” by Erwin Gianchandani.

From the post:

New Scientist published a great story last week describing the history and evolution of the simplex algorithm — complete with a table capturing “2000 years of algorithms”:

The simplex algorithm directs wares to their destinations the world over [image courtesy PlainPicture/Gozooma via New Scientist].Its services are called upon thousands of times a second to ensure the world’s business runs smoothly — but are its mathematics as dependable as we thought?

YOU MIGHT not have heard of the algorithm that runs the world. Few people have, though it can determine much that goes on in our day-to-day lives: the food we have to eat, our schedule at work, when the train will come to take us there. Somewhere, in some server basement right now, it is probably working on some aspect of your life tomorrow, next week, in a year’s time.

Perhaps ignorance of the algorithm’s workings is bliss. The door to Plato’s Academy in ancient Athens is said to have borne the legend “let no one ignorant of geometry enter”. That was easy enough to say back then, when geometry was firmly grounded in the three dimensions of space our brains were built to cope with. But the algorithm operates in altogether higher planes. Four, five, thousands or even many millions of dimensions: these are the unimaginable spaces the algorithm’s series of mathematical instructions was devised to probe.

Perhaps, though, we should try a little harder to get our heads round it. Because powerful though it undoubtedly is, the algorithm is running into a spot of bother. Its mathematical underpinnings, though not yet structurally unsound, are beginning to crumble at the edges. With so much resting on it, the algorithm may not be quite as dependable as it once seemed [more following the link].

A fund manager might similarly want to arrange a portfolio optimally to balance risk and expected return over a range of stocks; a railway timetabler to decide how best to roster staff and trains; or a factory or hospital manager to work out how to juggle finite machine resources or ward space. Each such problem can be depicted as a geometrical shape whose number of dimensions is the number of variables in the problem, and whose boundaries are delineated by whatever constraints there are (see diagram). In each case, we need to box our way through this polytope towards its optimal point.

This is the job of the algorithm.

Its full name is the simplex algorithm, and it emerged in the late 1940s from the work of the US mathematician George Dantzig, who had spent the second world war investigating ways to increase the logistical efficiency of the U.S. air force. Dantzig was a pioneer in the field of what he called linear programming, which uses the mathematics of multidimensional polytopes to solve optimisation problems. One of the first insights he arrived at was that the optimum value of the “target function” — the thing we want to maximise or minimise, be that profit, travelling time or whatever — is guaranteed to lie at one of the corners of the polytope. This instantly makes things much more tractable: there are infinitely many points within any polytope, but only ever a finite number of corners.

If we have just a few dimensions and constraints to play with, this fact is all we need. We can feel our way along the edges of the polytope, testing the value of the target function at every corner until we find its sweet spot. But things rapidly escalate. Even just a 10-dimensional problem with 50 constraints — perhaps trying to assign a schedule of work to 10 people with different expertise and time constraints — may already land us with several billion corners to try out.

Apologies but I saw this article too late to post within the “free” days allowed by New Scientist.

But, I think from Erwin’s post and long quote from the original article, you can see how the simplex algorithm may be very useful where identity is defined in multidimensional space.

The literature in this area is vast and it may not offer an appropriate test for all questions of subject identity.

For example, the possessor of a credit card is presumed to be the owner of the card. Other assumptions are possible, but fraud costs are recouped from fees paid by customers. Creating a lack of interest in more stringent identity tests.

On the other hand, if your situation requires multidimensional identity measures, this may be a useful approach.

PS: Be aware that naming confusion, the sort that can be managed (not solved) by topic maps abounds even in mathematics:

The elements of a polytope are its vertices, edges, faces, cells and so on. The terminology for these is not entirely consistent across different authors. To give just a few examples: Some authors use face to refer to an (n−1)-dimensional element while others use face to denote a 2-face specifically, and others use j-face or k-face to indicate an element of j or k dimensions. Some sources use edge to refer to a ridge, while H. S. M. Coxeter uses cell to denote an (n−1)-dimensional element. (Polytope)

UCR Insect Classification Contest [Classification by Ear]

Friday, July 6th, 2012

UCR Insect Classification Contest Ends November 16, 2012

As I have said before, subject identity is everywhere! 😉

From the details PDF file:

Phase I: July to November 16th 2012 (this contest)

  • The task is to produce the best distance (similarity) measure for insect flight sounds.
  • The contest will be scored by 1-nearest neighbor classification.
  • The prizes include $500 cash and engraved trophies.

I was amused to read in the FAQ:

Note that the “sound” is measured with an optical sensor, rather than an acoustic one. This is done for various pragmatic reasons, however we don’t believe it makes any difference to the task at hand. The sampling rate is 16000 Hz

If you have a bee keeper nearby, can you do an empirical comparison of optical versus acoustic sensors for the capturing the “sound” of insects?

That seems like a first step in establishing computational entomology. BTW, use a range of frequencies, from sub to super sonic. (You are aware they have discovered sub-sonic sounds from whales can travel thousands of miles? Unlikely with insects but just because our ears can’t hear something doesn’t mean other ears cannot as well.)

I first saw this at KDNuggets.

Who Do You Say You Are?

Friday, May 11th, 2012

In Data Governance in Context, Jim Ericson outlines several paths of data governance, or as I put it: Who Do You Say You Are?:

On one path, more enterprises are dead serious about creating and using data they can trust and verify. It’s a simple equation. Data that isn’t properly owned and operated can’t be used for regulatory work, won’t be trusted to make significant business decisions and will never have the value organizations keep wanting to ascribe it on the balance sheet. We now know instinctively that with correct and thorough information, we can jump on opportunities, unite our understanding and steer the business better than before.

On a similar path, we embrace tested data in the marketplace (see Experian, D&B, etc.) that is trusted for a use case even if it does not conform to internal standards. Nothing wrong with that either.

And on yet another path (and areas between) it’s exploration and discovery of data that might engage huge general samples of data with imprecise value.

It’s clear that we cannot and won’t have the same governance standards for all the different data now facing an enterprise.

For starters, crowd sourced and third party data bring a new dimension, because “fitness for purpose” is by definition a relative term. You don’t need or want the same standard for how many thousands or millions of visitors used a website feature or clicked on a bundle in the way you maintain your customer or financial info.

Do mortgage-backed securities fall into the “…huge general samples of data with imprecise value?” I ask because I don’t work in the financial industry. Or do they not practice data governance, except to generate numbers for the auditors?

I mention this because I suspect that subject identity governance would be equally useful for topic map authoring.

For some topic maps, say on drug trials, need to have a high degree of reliability and auditability. As well as precise identification (even if double-blind) of the test subjects.

Or there may be different tests for subject identity, some of which appear to be less precise than others.

For example, merging all the topics entered by a particular operator in a day to look for patterns that may indicate they are not following data entry protocols. (It is hard to be as random as real data.)

As with most issues, there isn’t any hard and fast rule that works for all cases. You do need to document the rules you are following and for how long. It will help you test old rules and to formulate new ones. (“Document” meaning to write down. The vagaries of memory are insufficient.)

20 More Reasons You Need Topic Maps

Thursday, May 3rd, 2012

Well, Ed Lindsey did call his column 20 Commom Data Errors and Variation but when you see the PNG of the 20 errors, here, you will agree my title works better (for topic maps anyway).

Not only that, but Ed’s opening paragraphs work for identifying a subject by more than one attribute (although this is “subject” in the police sense of the word):

A good friend of mine’s husband is a sergeant on the Chicago police force. Recenlty a crime was committed and a witness insisted that the perpetrator was a woman with blond hair about five nine weighing 160 pounds. She was wearing a gray pinstriped business suit with an Armani scarf and carrying a Gucci handbag.

So what does this sergeant have to do? Start looking at the women of Chicago. He only needs the women. Actually, he would start with women with blond hair (but judging from my daughter’s constant change of hair color he might skip that attribute). So he might start with women in a certain height range and in a certain weight group. He would bring those women in to the station for questioning.

As it turns out, when they finally arrested the woman at her son’s soccer game, she had brown hair, was 5’5″ tall and weighed 120 pounds. She was wearing an Oklahoma University sweatshirt, jeans and sneakers. When the original witness saw her she said yes that’s the same woman. Iit turns out she was wearing four inch heels and the pantsuit made her look bigger.

So what can we learn from this episode that has to do with matching? Well the first thing we need to understand is that each of the attributes of the witness can be used in matching the suspect and then immediately we must also recognize that not all the attributes that the witness gave the sergeant were extremely accurate. So later on when we start talking about matching, will use the term fuzzy matching. This means that when you look at an address, there could be a number of different types of errors in the address from one system that are not identical to an address in another system. Figure 1 shows a number of the common errors that can happen.

So, there you have it: 20 more reasons to use topic maps, a lesson on identifying a subject and proof that yes, a pinstripped pantsuit can make you look bigger.

Past, Present and Future – The Quest to be Understood

Friday, April 20th, 2012

Without restricting it to being machine readable, I think we would all agree there are three ages of data:

  1. Past data
  2. Present data
  3. Future data

And we have common goals for data (or parts of it):

  1. Past data – To understand past data.
  2. Present data – To be understood by others.
  3. Future data – For our present data to persist and be understood by then users.

Common to those ages and goals is the need for management of identifiers for our data. (Where identifiers may be data as well.)

I say “management of identifiers” because we cannot control identifiers used in the past, identifiers used by others in the present, or identifiers that may be used in the future.

You would think in an obviously multi-lingual world that multiple identifier identification would be the default position.

Just a personal observation but hardly a day passes without someone or some group saying the equivalent of:

I know! I will create a list of identifiers that everyone must use! That’s the answer to the confusion (Babel) of identifiers.

Such efforts are always defeated by past identifiers, other identifiers in the present and future identifiers.

Managing tides of identifiers is a partial solution but more workable than trying to stop the tide.

What do you think?

Then BI and Data Science Thinking Are Flawed, Too

Tuesday, March 13th, 2012

Then BI and Data Science Thinking Are Flawed, Too

Steve Miller writes:

I just finished an informative read entitled “Everything is Obvious: *Once You Know the Answer – How Common Sense Fails Us,” by social scientist Duncan Watts.

Regular readers of Open Thoughts on Analytics won’t be surprised I found a book with a title like this noteworthy. I’ve written quite a bit over the years on challenges we face trying to be the rational, objective, non-biased actors and decision-makers we think we are.

So why is a book outlining the weaknesses of day-to-day, common sense thinking important for business intelligence and data science? Because both BI and DS are driven from a science of business framework that formulates and tests hypotheses on the causes and effects of business operations. If the thinking that produces that testable understanding is flawed, then so will be the resulting BI and DS.

According to Watts, common sense is “exquisitely adapted to handling the kind of complexity that arises in everyday situations … But ‘situations’ involving corporations, cultures, markets, nation-states, and global institutions exhibit a very different kind of complexity from everyday situations. And under these circumstances, common sense turns out to suffer from a number of errors that systematically mislead us. Yet because of the way we learn from experience … the failings of commonsense reasoning are rarely apparent to us … The paradox of common sense, therefore, is that even as it helps us make sense of the world, it can actively undermine our ability to understand it.”

The author argues that common sense explanations to complex behavior fail in three ways. The first error is that the mental model of individual behavior is systematically flawed. The second centers on explanations for collective behavior that are even worse, often missing the “emergence” – one plus one equals three – of social behavior. And finally, “we learn less from history than we think we do, and that misperception skews our perception of the future.”

Reminds me of Thinking, Fast and Slow by Daniel Kahneman.

Not that two books with a similar “take” proves anything but you should put them on your reading list.

I wonder when/where our perceptions of CS practices have been skewed?

Or where that has played a role in our decision making about information systems?

Identification – A Step Towards Semantics

Monday, February 6th, 2012

Entity resolution, also known as name resolution, refers to resolving a reference. The usual language continues with something to the effect “…real world object, person, etc.” I omit the usual language because a reference can be to anything.

Do you disagree? Just curious.

I have more comments on “resolution” but omit them here to reach the necessity for identification.

I say the “necessity for identification” as a prerequisite before semantics can be assigned to any identifier. Of whatever nature. Could be text, photograph, digital image, URI, etc.

I say that because identification (or recognition, a closely related task) may have different requirements (processing and otherwise) than the assignment of semantics, or the assignment of other properties.

For example, with legacy text, written on read-only media, if my only concern at the first step in the process is identification, I can create a list of terms that represent the terms I wish to have recognized. (Think of the TLG for example.) The semantics that I or anyone else wishes to associate with those terms becomes an entirely separate matter. Whatever means or system of semantics is used.

That only becomes possible if the notion of assigning semantics is separated from the task of identification. And separated from the organization of what has been identified and assigned semantics into a system (think ontology).

(You might want to read what was possible twenty-two (22) years ago with transient nodes and edges before responding to this post. I would extend that type of mechanisms to recognition, assignments of semantics and other properties. Having a fixed identification, assignment of semantics and other properties being only one choice among many.)