Archive for the ‘Identity’ Category

Open Ownership Project

Thursday, November 9th, 2017

Open Ownership Project

From about page:

OpenOwnership is driven by a steering group composed of leading transparency NGOs, including Global Witness, Open Contracting Partnership, Web Foundation, Transparency International, the ONE Campaign, and the B Team, as well as OpenCorporates.

OpenOwnership’s central goal is to build an open Global Beneficial Ownership Register, which will serve as an authoritative source of data about who owns companies, for the benefit of all. This data will be global and linked across jurisdictions, industries, and linkable to other datasets too.

Alongside the register, OpenOwnership is developing a universal and open data standard for beneficial ownership, providing a solid conceptual and practical foundation for collecting and publishing beneficial ownership data.

I first visited the Open Ownership Project site following two (of four) posts on verifying beneficial ownership.

What we really mean when we talk about verification (Part 1 of 4) by Zosia Sztykowski and Chris Taggart.

From the post:

This is the first of a series of blog posts in which we will discuss the critical but tricky issue of verification, particularly with respect to beneficial ownership.

‘Verification’ is frequently said to be a critical step in generating high-quality beneficial ownership information. What’s less clear is what is actually meant by verification, and what are the key factors in the process. In fact, verification is not one step, but three:

  1. Ensuring that the person making a statement about beneficial ownership is who they say they are, and that they have the right to make the claim (authentication and authorization);

  2. Ensuring that the data submitted is a legitimate possible value (validation);

  3. Verifying that the statement made is actually true (which we will call truth verification).

Another critical factor is whether these processes are done on individual filings, typically hand-written pieces of paper, or their PDF equivalents, or whole datasets of beneficial ownership data. While verification processes are possible on individual filings, this series will show that public, digital, structured beneficial ownership data adds an additional layer of verification not possible with traditional filings.

Understanding precisely how verification takes place in the lifecycle of a beneficial ownership datum is an important step in knowing what beneficial ownership data can tell us about the world. Each of the stages above will be covered in more detail in this series, but let’s linger on the final one for a moment.

What we really mean when we talk about verification: Authentication & authorization (Part 2 of 4)

In the first post in this series on the principles of verification, particularly relating to beneficial ownership, we explained why there is no guarantee that any piece of beneficial ownership data is the absolute truth.

The data collected is still valuable, however, providing it is made available publicly as open data, as it exposes lies and half-truths to public scrutiny, raising red flags that indicate potential criminal or unethical activity.

We discussed a three-step process of verification:

  1. Ensuring that the person making a statement about beneficial ownership is who they say they are (authentication), and that they have the right to make the claim (authorization);

  2. Ensuring that the data submitted is a legitimate possible value (validation);

  3. Verifying that the statement made is actually true (which we will call truth verification).

In this blog post, we will discuss the first of these, focusing on how to tell who is actually making the claims, and whether they are authorized to do so.

When authentication and authorization have been done, you can approach the information with more confidence. Without them, you may have little better than anonymous statements. Critically, with them, you can also increase the risks for those who wish to hide their true identities and the nature of their control of companies.

Parts 3 and 4 are forthcoming (as of 9 November 2017).

A beta version of the Beneficial Ownership Data Standard (BODS) was released in April 2017. A general overview appeared in June 2017: Introducing the Beneficial Ownership Data Standard.

Identity issues are rife in ownership data, so when planning your volunteer activity for 2018, keep the Open Ownership project in mind.

V Sign Biometrics [Building Privacy Zones a/k/a Unobserved Spaces]

Tuesday, March 8th, 2016

Machine-Learning Algorithm Aims to Identify Terrorists Using the V Signs They Make

From the post:

Every age has its iconic images. One of the more terrifying ones of the 21st century is the image of a man in desert or army fatigues making a “V for victory” sign with raised arm while standing over the decapitated body of a Western victim. In most of these images, the perpetrator’s face and head are covered with a scarf or hood to hide his identity.

That has forced military and law enforcement agencies to identify these individuals in other ways, such as with voice identification. This is not always easy or straightforward, so there is significant interest in finding new ways.

Today, Ahmad Hassanat at Mu’tah University in Jordan and a few pals say they have found just such a method. These guys say they have worked out how to distinguish people from the unique way they make V signs; finger size and the angle between the fingers is a useful biometric measure like a fingerprint.

The idea of using hand geometry as a biometric indicator is far from new. Many anatomists have recognized that hand shape varies widely between individuals and provides a way to identify them, if the details can be measured accurately. (emphasis in original)

The review notes this won’t give you personal identity but would have to be combined with other data.

Overview of: Victory Sign Biometric for Terrorists Identification by Ahmad B. A. Hassanat, Mahmoud B. Alhasanat, Mohammad Ali Abbadi, Eman Btoush, Mouhammd Al-Awadi.

Abstract:

Covering the face and all body parts, sometimes the only evidence to identify a person is their hand geometry, and not the whole hand- only two fingers (the index and the middle fingers) while showing the victory sign, as seen in many terrorists videos. This paper investigates for the first time a new way to identify persons, particularly (terrorists) from their victory sign. We have created a new database in this regard using a mobile phone camera, imaging the victory signs of 50 different persons over two sessions. Simple measurements for the fingers, in addition to the Hu Moments for the areas of the fingers were used to extract the geometric features of the shown part of the hand shown after segmentation. The experimental results using the KNN classifier were encouraging for most of the recorded persons; with about 40% to 93% total identification accuracy, depending on the features, distance metric and K used.

All of which makes me suspect that giving a surveillance camera the “finger,” or indeed your height, gait, or any other physical mannerism, is fodder for surveillance systems.

Hotels and businesses need to construct privacy zones for customers to arrive and depart free from surveillance.

Fifty Words for Databases

Saturday, March 7th, 2015

Fifty Words for Databases by Phil Factor

From the post:

Almost every human endeavour seems simple from a distance: even database deployment. Reality always comes as a shock, because the closer you get to any real task, the more you come to appreciate the skills that are necessary to accomplish it.

One of the big surprises I have when I attend developer conferences is to be told by experts how easy it is to take a database from development and turn it into a production system, and then implement the processes that allow it to be upgraded safely. Occasionally, I’ve been so puzzled that I’ve drawn the speakers to one side after the presentation to ask them for the details of how to do it so effortlessly, mentioning a few of the tricky aspects I’ve hit. Invariably, it soon becomes apparent from their answers that their experience, from which they’ve extrapolated, is of databases the size of a spreadsheet with no complicated interdependencies, compliance issues, security complications, high-availability mechanisms, agent tasks, alerting systems, complex partitioning, queuing, replication, downstream analysis dependencies and so on about which you, the readers, know more than I. At the vast international enterprise where I once worked in IT, we had a coded insult for such people: ‘They’ve catalogued their CD collection in a database’. Unfair, unkind, but even a huge well-used ‘Big Data’ database dealing in social media is a tame and docile creature compared with a heavily-used OLTP trading system where any downtime or bug means figures for losses where you have to count the trailing zeros. The former has unique problems, of course, but the two types of database are so different.

I wonder if the problem is one of language. Just as the English have fifty ways of describing rainfall, and the Inuit have many ways of describing pack ice, it is about time that we created the language for a variety of databases from a mild drizzle (‘It is a soft morning to be sure’) to a cloud-burst. Until anyone pontificating about the database lifecycle can give their audience an indication of the type of database they’re referring to, we will continue to suffer the sort of misunderstandings that so frustrate the development process. Though I’m totally convinced that the development culture should cross-pollinate far more with the science of IT operations, it will need more than a DevOps group-hug; it will require a change in the technical language so that it can accurately describe the rich variety of databases in operational use and their widely-varying requirements. The current friction is surely due more to misunderstandings on both sides, because it is so difficult to communicate these requirements. Any suggestions for suitable descriptive words for types of database? (emphasis added)

If you have “descriptive words” to suggest to Phil, comment on his post.

With the realization that your “descriptive words” may be different from my “descriptive words” for the same database or mean a different database altogether or have nothing to do with databases at all (when viewed by others).

Yes, I have been thinking about identifiers, again, and will start off the coming week with a new series of posts on subject identification. I hope to include a proposal for a metric of subject identification.

The Coming Era of Egocentric Video Analysis

Tuesday, December 9th, 2014

The Coming Era of Egocentric Video Analysis

From the post:

Head-mounted cameras are becoming de rigueur for certain groups—extreme sportsters, cyclists, law enforcement officers, and so on. It’s not hard to find content generated in this way on the Web.

So it doesn’t take a crystal ball to predict that egocentric recording is set to become ubiquitous as devices such as Go-Pros and Google Glass become more popular. An obvious corollary to this will be an explosion of software for distilling the huge volumes of data this kind of device generates into interesting and relevant content.

Today, Yedid Hoshen and Shmuel Peleg at the Hebrew University of Jerusalem in Israel reveal one of the first applications. Their goal: to identify the filmmaker from biometric signatures in egocentric videos.

A tidbit that I was unaware of:

Some of these are unique, such as the gait of the filmmaker as he or she walks, which researchers have long known is a remarkably robust biometric indicator. “Although usually a nuisance, we show that this information can be useful for biometric feature extraction and consequently for identifying the user,” say Hoshen and Peleg.

Makes me wonder if I should wear a prosthetic device to alter my gait when I do appear in range of cameras. 😉

Works great with topic maps. All you may know about an actor is that they have some gait with X characteristics. And a penchant for not getting caught planting explosive devices. With a topic map we can keep their gait as a subject identifier and record all the other information we have on such an individual.

If we ever match the gait to a known individual, then the information from both records, the anonymous gait owner and the known individual, will be merged together.
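A rough sketch of that merge in Python, assuming each record carries a set of subject identifiers and that two records merge when they share at least one. The `Topic` class, the `gait:` and `person:` identifier strings, and the sample properties are all hypothetical; a real topic map engine does considerably more.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    # Hypothetical minimal topic: identifiers plus whatever else is known.
    identifiers: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)


def merge_if_same(a: Topic, b: Topic):
    """Return a merged topic when a and b share an identifier, else None."""
    if not (a.identifiers & b.identifiers):
        return None
    # On a property-name clash the value from b wins; a fuller version
    # would keep both values.
    return Topic(a.identifiers | b.identifiers, {**a.properties, **b.properties})


anonymous = Topic({"gait:X"}, {"seen_at": "incident 1"})
known = Topic({"person:J. Doe", "gait:X"}, {"address": "unknown"})

merged = merge_if_same(anonymous, known)
print(sorted(merged.identifiers))  # ['gait:X', 'person:J. Doe']
print(merged.properties)
```

Until the gait is matched, the anonymous record simply accumulates observations under its own identifier.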

It works with other characteristics as well, which enables you to work from “I was attacked…,” to more granular information that narrows the pool of suspects down to a manageable size.

Traditionally that has been the job of veterans on the police force, who know their communities and the usual suspects, but a topic map enhances their value by capturing their observations for use by the department long after a veteran’s retirement.

From arXiv: Egocentric Video Biometrics

Abstract:

Egocentric cameras are being worn by an increasing number of users, among them many security forces worldwide. GoPro cameras already penetrated the mass market, and Google Glass may follow soon. As head-worn cameras do not capture the face and body of the wearer, it may seem that the anonymity of the wearer can be preserved even when the video is publicly distributed.
We show that motion features in egocentric video provide biometric information, and the identity of the user can be determined quite reliably from a few seconds of video. Biometrics are extracted by training Convolutional Neural Network (CNN) architectures on coarse optical flow.

Egocentric video biometrics can prevent theft of wearable cameras by locking the camera when worn by people other than the owner. In video sharing services, this Biometric measure can help to locate automatically all videos shot by the same user. An important message in this paper is that people should be aware that sharing egocentric video will compromise their anonymity.

Now if we could just get members of Congress to always carry their cellphones and wear body cameras.

GDS unveils ‘Gov.UK Verify’ public services identity assurance scheme

Friday, September 19th, 2014

GDS unveils ‘Gov.UK Verify’ public services identity assurance scheme

From the post:

Gov.UK Verify is designed to overcome concerns about government setting up a central database of citizens’ identities to enable access to online public services – similar criticism led to the demise of the hugely unpopular identity card scheme set up under the Labour government.

Instead, users will register their details with one of several independent identity assurance providers – certified companies which will establish and verify a user’s identity outside government systems. When the user then logs in to a digital public service, the Verify system will electronically “ask” the external third-party provider to confirm the person is who they claim to be.

HELP!

Help me make sure I am reading this story of citizen identity correctly.

Citizens are fearful of their government having a central database of citizens’ identities but are comfortable with commercial firms, regulated by the same government, managing those identities?

Do you think citizens of the UK are aware that commercial firms betray their customers to the U.S. government at the drop of a secret subpoena every day?

To say nothing of the failures of commercial firms to protect their customers’ data, when they aren’t using that data to directly manipulate their customers.

Strikes me as damned odd that anyone would trust commercial firms more than they would trust the government. Neither one is actually trustworthy.

Am I reading this story correctly?

I first saw this in a tweet by Richard Copley.

Improving GitHub for science

Thursday, May 15th, 2014

Improving GitHub for science

From the post:

GitHub is being used today to build scientific software that’s helping find Earth-like planets in other solar systems, analyze DNA, and build open source rockets.

Seeing these projects and all this momentum within academia has pushed us to think about how we can make GitHub a better tool for research. As scientific experiments become more complex and their datasets grow, researchers are spending more of their time writing tools and software to analyze the data they collect. Right now though, these efforts often happen in isolation.

Citable code for academic software

Sharing your work is good, but collaborating while also getting required academic credit is even better. Over the past couple of months we’ve been working with the Mozilla Science Lab and data archivers, Figshare and Zenodo, to make it possible to get a Digital Object Identifier (DOI) for any GitHub repository archive.

DOIs form the backbone of the academic reference and metrics system. With a DOI for your GitHub repository archive, your code becomes citable. Our newest Guide explains how to create a DOI for your repository.

A great step forward, but like http: pointing to entire resources, it is of limited utility.

Assume that I am using a DOI for a software archive and I want to point to and identify a code snippet in the archive that implements Fast Fourier Transform (FFT). My first task is to point to that snippet. A second task would be to create an association between the snippet and my annotation that it implements the Fast Fourier Transform. Yet a third task would be to gather up all the pointers that point to implementations of the Fast Fourier Transform (FFT).

For all of those tasks, I need to identify and point to a particular part of the underlying source code.

Unfortunately, a DOI is limited to identifying a single entity.

Each DOI® name is a unique “number”, assigned to identify only one entity. Although the DOI system will assure that the same DOI name is not issued twice, it is a primary responsibility of the Registrant (the company or individual assigning the DOI name) and its Registration Agency to identify uniquely each object within a DOI name prefix. (DOI Handbook)

How would you extend the DOIs being used by GitHub to identify code fragments within source code repositories?
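One possible shape of an answer, purely as a sketch: pair the archive’s DOI with a path and line range inside the archive, plus a hash of the snippet text so a pointer can be checked later. Nothing here is part of the DOI system or of GitHub. The `CodePointer` class, its field names, the file path, and the DOI value `10.0000/example.archive` are all invented for illustration.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CodePointer:
    doi: str         # DOI of the whole repository archive (placeholder below)
    path: str        # file inside the archive
    first_line: int  # 1-based, inclusive
    last_line: int   # 1-based, inclusive
    sha256: str      # hash of the snippet text, to detect a stale pointer


def point_to(doi, path, source, first_line, last_line):
    """Task 1: build a pointer to lines first_line..last_line of source."""
    lines = source.splitlines()[first_line - 1:last_line]
    digest = hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()
    return CodePointer(doi, path, first_line, last_line, digest)


# Tasks 2 and 3: associate pointers with a subject, then gather them all.
annotations = defaultdict(set)

source = "import cmath\n\ndef fft(xs):\n    ...\n"
ptr = point_to("10.0000/example.archive", "dsp/fft.py", source, 3, 4)
annotations["Fast Fourier Transform"].add(ptr)

for pointer in annotations["Fast Fourier Transform"]:
    print(pointer.doi, pointer.path, pointer.first_line, pointer.last_line)
```

Line ranges are fragile across versions, which is why the sketch leans on the DOI naming one fixed archive rather than a moving repository.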

I first saw this in a tweet by Peter Desmet.

Developing a 21st Century Global Library for Mathematics Research

Thursday, April 3rd, 2014

Developing a 21st Century Global Library for Mathematics Research by Committee on Planning a Global Library of the Mathematical Sciences.

Care to guess what one of the major problems facing mathematical research might be?

Currently, there are no satisfactory indexes of many mathematical objects, including symbols and their uses, formulas, equations, theorems, and proofs, and systematically labeling them is challenging and, as of yet, unsolved. In many fields where there are more specialized objects (such as groups, rings, fields), there are community efforts to index these, but they are typically not machine-readable, reusable, or easily integrated with other tools and are often lacking editorial efforts. So, the issue is how to identify existing lists that are useful and valuable and provide some central guidance for further development and maintenance of such lists. (p. 26)

Does that surprise you?

What do you think the odds are of mathematical research slowing down enough for committees to decide on universal identifiers for all the subjects in mathematical publications?

That’s about what I thought.

I have a different solution: Why not ask mathematicians who are submitting articles for publication to identify (specify properties for) what they consider to be the important subjects in their article?

The authors have the knowledge and skill, not to mention the motivation of wanting their research to be easily found by others.

Over time I suspect that particular fields will develop standard identifications (sets of properties per subject) that mathematicians can reuse to save themselves time when publishing.

Mappings across those sets of properties will be needed but that can be the task of journals, researchers and indexers who have an interest and skill in that sort of enterprise.

As opposed to having a “boil the ocean” approach that tries to do more than any one project is capable of doing competently.

Distributed subject identification is one way to think about it. We already do it; this would be a semi-formalization of that process, writing down what each author already knows.

Thoughts?

PS: I suspect the condition recited above is true for almost any sufficiently large field of study. A set of 150 million entities sounds large only without context. In the context of science, it is a trivial number of entities.

Thinking, Fast and Slow (Review) [And Subject Identity]

Friday, November 15th, 2013

A statistical review of ‘Thinking, Fast and Slow’ by Daniel Kahneman by Patrick Burns.

From the post:

We are good intuitive grammarians — even quite small children intuit language rules. We can see that from mistakes. For example: “I maked it” rather than the irregular “I made it”.

In contrast those of us who have training and decades of experience in statistics often get statistical problems wrong initially.

Why should there be such a difference?

Our brains evolved for survival. We have a mind that is exquisitely tuned for finding things to eat and for avoiding being eaten. It is a horrible instrument for finding truth. If we want to get to the truth, we shouldn’t start from here.

A remarkable aspect of your mental life is that you are rarely stumped. … you often have answers to questions that you do not completely understand, relying on evidence that you can neither explain nor defend.

The review focuses mainly on statistical issues in “Thinking Fast and Slow” but I think you will find it very entertaining.

I deeply appreciate Patrick’s quoting of:

A remarkable aspect of your mental life is that you are rarely stumped. … you often have answers to questions that you do not completely understand, relying on evidence that you can neither explain nor defend.

In particular:

…relying on evidence that you can neither explain nor defend.

which resonates with me on subject identification.

Think about how we search for subjects, which of necessity involves some notion of subject identity.

What if a colleague asks if they should consult the records of the Order of the Garter to find more information on “Lady Gaga?”

Not entirely unreasonable since “Lady” is conferred upon female recipients of the Order of the Garter.

No standard search technique would explain why your colleague should not start with the Order of the Garter records.

Although I think most of us would agree such a search would be far afield. 😉

Every search starts with a searcher relying upon what they “know,” suspect or guess to be facts about a “subject” to search on.

At the end of the search, the characteristics of the subject as found, turn out to be the characteristics we were searching for all along.

I say all that to suggest that we need not bother users to say how in fact to identify the objects of their searches.

Rather the question should be:

What pointers or contexts are the most helpful to you when searching? (May or may not be properties of the search objective.)

Recalling that properties of the search objective are how we explain successful searches, not how we perform them.

Calling upon users to explain or make explicit what they themselves don’t understand, seems like a poor strategy for adoption of topic maps.

Capturing what “works” for a user, without further explanation or difficulty seems like the better choice.


PS: Should anyone ask about “Lady Gaga,” you can mention that Glamour magazine featured her on its cover, naming her Woman of the Year (December 2013 issue). I know that only because of a trip to the local drug store for a flu shot.

Promised I would be “in and out” in minutes. Literally true I suppose, it only took 50 minutes with four other people present when I arrived.

I have a different appreciation of “minutes” from the pharmacy staff. 😉

Reidentification as Basic Science

Friday, May 31st, 2013

Reidentification as Basic Science by Arvind Narayanan.

From the post:

What really drives reidentification researchers? Do we publish these demonstrations to alert individuals to privacy risks? To shame companies? For personal glory? If our goal is to improve privacy, are we doing it in the best way possible?

In this post I’d like to discuss my own motivations as a reidentification researcher, without speaking for anyone else. Certainly I care about improving privacy outcomes, in the sense of making sure that companies, governments and others don’t get away with mathematically unsound promises about the privacy of consumers’ data. But there is a quite different goal I care about at least as much: reidentification algorithms. These algorithms are my primary object of study, and so I see reidentification research partly as basic science.

Let me elaborate on why reidentification algorithms are interesting and important. First, they yield fundamental insights about people — our interests, preferences, behavior, and connections — as reflected in the datasets collected about us. Second, as is the case with most basic science, these algorithms turn out to have a variety of applications other than reidentification, both for good and bad. Let us consider some of these.

(…)

A nice introduction to the major contours of reidentification, which the IT Law Wiki defines as:

Data re-identification is the process by which personal data is matched with its true owner.

Although in topic map speak I would usually say that personal data was used to identify its owner.

In a reidentification context, some effort has been made to obscure that relationship, so matching may be the better usage.

Depending on your data sources, something you may encounter when building a topic map.

I first saw this at Pete Warden’s Five short links.

Construction of Controlled Vocabularies

Tuesday, April 2nd, 2013

Construction of Controlled Vocabularies: A Primer by Marcia Lei Zeng.

From the “why” page:

Vocabulary control is used to improve the effectiveness of information storage and retrieval systems, Web navigation systems, and other environments that seek to both identify and locate desired content via some sort of description using language. The primary purpose of vocabulary control is to achieve consistency in the description of content objects and to facilitate retrieval.

1.1 Need for Vocabulary Control (1.1)

The need for vocabulary control arises from two basic features of natural language, namely:

• Two or more words or terms can be used to represent a single concept

Example:
  salinity/saltiness
  VHF/Very High Frequency

• Two or more words that have the same spelling can represent different concepts

Example:
  Mercury (planet)
  Mercury (metal)
  Mercury (automobile)
  Mercury (mythical being)

Great examples for vocabulary control but for topic maps as well!

The topic map question is:

What do you know about the subject(s) in either case, that would make you say the words mean the same subject or different subjects?

If we can capture the information you think makes them represent the same or different subjects, there is a basis for repeating that comparison.

Perhaps even automatically.
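As a sketch of what repeating that comparison automatically might look like, assume each use of a word is recorded together with the properties someone relied on to tell subjects apart. The property names (`kind`, `expands_to`) and the rule that two uses denote the same subject when their recorded properties are equal are my assumptions for illustration, not anything from the primer.

```python
def denote_same_subject(use_a: dict, use_b: dict) -> bool:
    """Two uses denote the same subject when their recorded
    distinguishing properties agree; the words themselves may differ."""
    return use_a["properties"] == use_b["properties"]


vhf_short = {
    "term": "VHF",
    "properties": {"kind": "radio band", "expands_to": "Very High Frequency"},
}
vhf_long = {
    "term": "Very High Frequency",
    "properties": {"kind": "radio band", "expands_to": "Very High Frequency"},
}
mercury_planet = {"term": "Mercury", "properties": {"kind": "planet"}}
mercury_metal = {"term": "Mercury", "properties": {"kind": "metal"}}

# Different words, one subject.
print(denote_same_subject(vhf_short, vhf_long))            # True
# One word, two subjects.
print(denote_same_subject(mercury_planet, mercury_metal))  # False
```

Strict equality is the crudest possible rule; the point is only that once the deciding properties are written down, the comparison can be rerun by anyone.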

Mary Jane pointed out this resource in a recent comment.

O Knowledge Graph, Where Art Thou?

Monday, February 11th, 2013

O Knowledge Graph, Where Art Thou? by Matthew Hurst.

From the post:

The web search community, in recent months and years, has heard quite a bit about the ‘knowledge graph’. The basic concept is reasonably straightforward – instead of a graph of pages, we propose a graph of knowledge where the nodes are atoms of information of some form and the links are relationships between those statements. The knowledge graph concept has become established enough for it to be used as a point of comparison between Bing and Google.

….

Much of what we see out there in the form of knowledge returned for searches is really isolated pockets of related information (the date and place of birth of a person, for example). The really interesting things start happening when the graphs of information become unified across type, allowing – as suggested by this example – the user to traverse from a performer to a venue to all the performers at that venue, etc. Perhaps ‘knowledge engineer’ will become a popular resume-buzz word in the near future as ‘data scientist’ has become recently.

Read Matthew’s post for the details of the comparison.

+1! to going from graphs of pages to graphs of “atoms of information.”

I am less certain about “…graphs of information become unified across type….”

What I am missing is the reason to think that “type,” unlike any other subject, will have a uniform identification.

If we solve the problem of not requiring “type” to have a uniform identification, why not apply that to other subjects as well?

Without an express or implied requirement for uniform identification, all manner of “interesting things” will be happening in knowledge graphs.

(Note the plural, knowledge graphs, not knowledge graph.)

The Semantic Web Is Failing — But Why? (Part 5)

Thursday, February 7th, 2013

Impoverished Identification by URI

There is one final piece of the puzzle of the Semantic Web’s failure to explore before we can talk about a solution.

In owl:sameAs and Linked Data: An Empirical Study, Ding, Shinavier, Finin and McGuinness write:

Our experimental results have led us to identify several issues involving the owl:sameAs property as it is used in practice in a linked data context. These include how best to manage owl:sameAs assertions from “third parties”, problems in merging assertions from sources with different contexts, and the need to explore an operational semantics distinct from the strict logical meaning provided by OWL.

To resolve varying usages of owl:sameAs, the authors go beyond identifications provided by a URI to look to other properties. For example:

Many owl:sameAs statements are asserted due to the equivalence of the primary feature of resource description, e.g. the URIs of FOAF profiles of a person may be linked just because they refer to the same person even if the URIs refer the person at different ages. The odd mashup on job-title in previous section is a good example for why the URIs in different FOAF profiles are not fully equivalent. Therefore, the empirical usage of owl:sameAs only captures the equivalence semantics on the projection of the URI on social entity dimension (removing the time and space dimensions). In this way, owl:sameAs is used to indicate partial equivalence between two different URIs, which should not be considered as full equivalence.

Knowing the dimensions covered by a URI and the dimensions covered by a property, it is possible to conduct better data integration using owl:sameAs. For example, since we know a URI of a person provides a temporal-spatial identity, descriptions using time-sensitive properties, e.g. age, height and workplace, should not be aggregated, while time-insensitive properties, such as eye color and social security number, may be aggregated in most cases.

When an identification is insufficient based on a single URI, additional properties can be considered.
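To make the quoted rule concrete, here is a small sketch, assuming descriptions are plain Python dicts and that the list of time-sensitive properties is supplied by hand. The `aggregate` helper, the property names, and the two sample profiles are illustrative only and do not come from the paper.

```python
# Time-sensitive properties named as examples in the quoted passage.
TIME_SENSITIVE = {"age", "height", "workplace"}


def aggregate(descriptions):
    """Merge descriptions linked by owl:sameAs, keeping only the
    time-insensitive properties. Time-sensitive ones are left out of
    the merged view because they may describe different moments."""
    merged = {}
    for description in descriptions:
        for prop, value in description.items():
            if prop in TIME_SENSITIVE:
                continue
            merged.setdefault(prop, set()).add(value)
    return merged


# Two hypothetical FOAF-style profiles of one person at different ages.
profile_earlier = {"eye_color": "brown", "age": 25, "workplace": "University A"}
profile_later = {"eye_color": "brown", "age": 32, "workplace": "Company B"}

print(aggregate([profile_earlier, profile_later]))
# {'eye_color': {'brown'}}
```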

My question then is why do ordinary users have to wait for experts to decide their identifications are insufficient? Why can’t we empower users to declare multiple properties, including URIs, as a means of identification?

It could be something as simple as JSON key/value pairs with a notation of “+” for must match, “-” for must not match, and “?” for optional to match.
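A minimal sketch of that notation in Python, under assumptions of my own: the marker is the first character of each key, a “+” key must be present with an equal value in the candidate, a “-” key must not have the stated value, and a “?” key never blocks a match. The function name `same_subject`, the sample keys, and the example.org URI are hypothetical.

```python
import json


def same_subject(declaration: dict, candidate: dict) -> bool:
    """Return True if candidate satisfies the declared identification.

    Assumed convention: "+key" must match, "-key" must not match,
    "?key" is optional.
    """
    for marked_key, value in declaration.items():
        marker, key = marked_key[:1], marked_key[1:]
        if marker == "+":
            if candidate.get(key) != value:
                return False
        elif marker == "-":
            if key in candidate and candidate[key] == value:
                return False
        elif marker == "?":
            continue  # optional: could be used for ranking, never for rejection
        else:
            raise ValueError(f"unknown marker in key: {marked_key!r}")
    return True


declaration = json.loads("""
{
  "+name": "Mercury",
  "-category": "metal",
  "?uri": "http://example.org/mercury"
}
""")

planet = {"name": "Mercury", "category": "planet"}
metal = {"name": "Mercury", "category": "metal"}

print(same_subject(declaration, planet))  # True
print(same_subject(declaration, metal))   # False
```

Putting the marker in the key keeps the declaration valid JSON that any parser can read before anyone agrees on what the markers mean.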

A declaration of identity by users about the subjects in their documents. Who better to ask?

Not to mention that the more information users supply for an identification, the more likely they are to communicate, successfully, with other users.

URIs may be Tim Berners-Lee’s nails, but they are insufficient to support the scaffolding required for robust communication.


The next series starts with Saving the “Semantic” Web (Part 1)

The Semantic Web Is Failing — But Why? (Part 1)

Thursday, February 7th, 2013

Introduction

Before proposing yet another method for identification and annotation of entities in digital media, it is important to draw lessons from existing systems. Failing systems in particular, so their mistakes are not repeated or compounded. The Semantic Web is an example of such a system.

Doubters of that claim should read the report Additional Statistics and Analysis of the Web Data Commons August 2012 Corpus by Web Data Commons.

Web Data Commons is a structured data research project based at the Research Group Data and Web Science at the University of Mannheim and the Institute AIFB at the Karlsruhe Institute of Technology. Supported by PlanetData and LOD2 research projects, the Web Data Commons is not opposed to the Semantic Web.

But the Additional Statistics and Analysis of the Web Data Commons August 2012 Corpus document reports:

Altogether we discovered structured data within 369 million of the 3 billion pages contained in the Common Crawl corpus (12.3%). The pages containing structured data originate from 2.29 million among the 40.5 million websites (PLDs) contained in the corpus (5.65%). Approximately 519 thousand websites use RDFa, while only 140 thousand websites use Microdata. Microformats are used on 1.7 million websites. It is interesting to see that Microformats are used by approximately 2.5 times as many websites as RDFa and Microdata together. (emphasis added)

To sharpen the point, RDFa is 1.28% of the 40.5 million websites, eight (8) years after its introduction (2004) and four (4) years after reaching Recommendation status (2008).

Or more generally:

Parsed HTML URLs 3,005,629,093
URLs with Triples 369,254,196

Or, in layperson’s terms, for this web corpus, parsed HTML URLs outnumber URLs with Triples by approximately eight to one.
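
For anyone who wants to check the arithmetic, a short sketch using only the figures quoted above (the variable names are mine):

```python
# Figures as quoted from the Web Data Commons report.
websites_total = 40_500_000
rdfa_sites = 519_000
microdata_sites = 140_000
microformat_sites = 1_700_000
parsed_urls = 3_005_629_093
urls_with_triples = 369_254_196

rdfa_share = rdfa_sites / websites_total * 100         # percent of websites using RDFa
triple_share = urls_with_triples / parsed_urls * 100   # percent of URLs with triples
url_ratio = parsed_urls / urls_with_triples            # parsed URLs per URL with triples
mf_ratio = microformat_sites / (rdfa_sites + microdata_sites)

print(f"RDFa share of websites: {rdfa_share:.2f}%")
print(f"URLs with triples: {triple_share:.1f}%")
print(f"Parsed URLs per URL with triples: {url_ratio:.2f}")
print(f"Microformats vs. RDFa + Microdata: {mf_ratio:.2f}x")
```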

Being mindful that the corpus is only web accessible data and excludes “dark data,” the need for a more robust solution than the Semantic Web is self-evident.

The failure of the Semantic Web is no assurance that any alternative proposal will fare better. Understanding why the Semantic Web is failing is a prerequisite to any successful alternative.


Before you “flame on,” you might want to read the entire series. I end up with a suggestion based on work by Ding, Shinavier, Finin and McGuinness.


The next series starts with Saving the “Semantic” Web (Part 1)

Bill Gates is naive, data is not objective [Neither is Identification]

Tuesday, February 5th, 2013

Bill Gates is naive, data is not objective by Cathy O’Neil.

From the post:

In his recent essay in the Wall Street Journal, Bill Gates proposed to “fix the world’s biggest problems” through “good measurement and a commitment to follow the data.” Sounds great!

Unfortunately it’s not so simple.

Gates describes a positive feedback loop when good data is collected and acted on. It’s hard to argue against this: given perfect data-collection procedures with relevant data, specific models do tend to improve, according to their chosen metrics of success. In fact this is almost tautological.

As I’ll explain, however, rather than focusing on how individual models improve with more data, we need to worry more about which models and which data have been chosen in the first place, why that process is successful when it is, and – most importantly – who gets to decide what data is collected and what models are trained.

Cathy makes a compelling case for data not being objective and concludes:

Don’t be fooled by the mathematical imprimatur: behind every model and every data set is a political process that chose that data and built that model and defined success for that model.

Sounds a lot like identifying subjects.

No identification is objective. They all occur as part of social processes and are bound by those processes.

No identification is “better” than another one, although in some contexts, particular identifications may be more useful than others.

I first saw this in Four short links: 4 February 2013 by Nat Torkington.

G2 | Sensemaking – Two Years Old Today

Sunday, February 3rd, 2013

G2 | Sensemaking – Two Years Old Today by Jeff Jonas.

From the post:

What is G2?

When I speak about Context Accumulation, Data Finds Data and Relevance Finds You, and Sensemaking I am describing various aspects of G2.

In simple terms G2 software is designed to integrate diverse observations (data) as it arrives, in real-time.  G2 does this incrementally, piece by piece, much in the same way you would put a puzzle together at home.  And just like at home, the more puzzle pieces integrated into the puzzle, the more complete the picture.  The more complete the picture, the better the ability to make sense of what has happened in the past, what is happening now, and what may come next.  Users of G2 technology will be more efficient, deliver high quality outcomes, and ultimately will be more competitive.

Early adopters seem to be especially interested in one specific use case: Using G2 to help organizations better direct the attention of its finite workforce.  With the workforce now focusing on the most important things first, G2 is then used to improve the quality of analysis while at the same time reducing the amount of time such analysis takes.  The bigger the organization, the bigger the observation space, the more essential sensemaking is.

About Sensemaking

One of the things G2 can already do pretty darn well – considering she just turned two years old – is ”Sensemaking.”  Imagine a system capable of paying very close attention to every observation that comes its way.  Each observation incrementally improving upon the picture and using this emerging picture in real-time to make higher quality business decisions; for example, the selection of the perfect ad for a web page (in sub-200 milliseconds as the user navigates to the page) or raising an alarm to a human for inspection (an alarm sufficiently important to be placed top of the queue).  G2, when used this way, enables Enterprise Intelligence.

Of course there is no magic.  Sensemaking engines are limited by their available observation space.  If a sentient being would be unable to make sense of the situation based on the available observation space, neither would G2.  I am not talking about Fantasy Analytics here.

I would say “subject identity” instead of “sensemaking” and after reading Jeff’s post, consider them to be synonyms.

Read the section General Purpose Context Accumulation very carefully.

As well as “Privacy by Design (PbD).”

BTW, G2 uses Universal Message Format XML for input/output.

Not to argue from authority but Jeff is one of only 77 active IBM Research Fellows.

Someone to listen to, even if we may disagree on some of the finer points.

Making Sense of Others’ Data Structures

Sunday, February 3rd, 2013

Making Sense of Others’ Data Structures by Eruditio Loginquitas.

From the post:

Coming in as an outsider to others’ research always requires an investment of time and patience. After all, how others conceptualize their fields, and how they structure their questions and their probes, and how they collect information, and then how they represent their data all reflect their understandings, their theoretical and analytical approaches, their professional training, and their interests. When professionals collaborate, they will approach a confluence of understandings and move together in a semi-united way. Individual researchers—not so much. But either way, for an outsider, there will have to be some adjustment to understand the research and data. Professional researchers strive to control for error and noise at every stage of the research: the hypothesis, literature review, design, execution, publishing, and presentation.

Coming into a project after the data has been collected and stored in Excel spreadsheets means that the learning curve is high in yet another way: data structures. While the spreadsheet itself seems pretty constrained and defined, there is no foregone conclusion that people will necessarily represent their data a particular way.

Data structures as subjects. What a concept! 😉

Data structures, contrary to some, are not self-evident or self-documenting.

Not to mention that, like ourselves, they are in a constant state of evolution as our understanding or perception of data changes.

Mine is not the counsel of despair, but of encouragement to consider the costs/benefits of capturing data structure subject identities just as more traditional subjects.

It may be that costs or other constraints prevent such capture, but you may also miss benefits if you don’t ask.

How much did it cost for each transition in episodic data governance efforts to re-establish data structure subject identities?

Could be that more money spent now would get an enterprise off the perpetual cycle of data governance.

New DataCorps Project: Refugees United

Sunday, January 27th, 2013

New DataCorps Project: Refugees United

From the post:

We are thrilled to announce the kick-off of a new DataKind project with Refugees United! Refugees United is a fantastic organization that uses mobile and web technologies to help refugees find their missing loved ones. Currently, RU’s system allows people to post descriptions of their family and friends as well as to search for them on the site. As you might imagine, lots of data flows through this system – data that could be used to greatly improve the way people find each other. Lead by the ever-brilliant Max Shron, the DataKind team is collaborating with Refugees United to explore what their data can tell them about how people are using the site, how they’re connecting to one another and, ultimately, how it can be used to help people find each other more effectively.

We are incredibly excited to work on this project and will be posting updates for you all as things unfold. In the meantime, learn a bit more about Max and Refugees United.

I can’t comment on the identity practices because:

Q: 1.08 Why isn’t Refugees United open source yet?

Refugees United was born as an “offline” open source project. When we started, we were two guys (now six guys and a girl in Copenhagen, joined by a much larger team worldwide) with a great idea that had the potential to positively impact thousands, if not millions, of lives. The open source approach came from the fact that we wanted to build the world’s smallest refugee agency with the largest outreach, and to have the highest impact at the lowest cost.

One way to reach our objectives is to work with corporations around that world, including Ericsson, SAP, FedEx and others. The invaluable advice and expertise provided by these successful businesses – both the largest corporations and the smallest companies – have helped us to apply the structure and strategy of business to the passion and vision of an NGO.

Now the time has come for us to apply same structure to our software, and we have begun to collaborate with some of the wonderfully brilliant minds out there who wish to contribute and help us make a difference in the development of our technologies.

I am not sure what ‘“offline” open source’ means. The rest of the quoted prose doesn’t help.

Perhaps the software will become available online. At some point.

Would be an interesting data point to see how they are managing personal subject identity.

The Adams Workflow

Saturday, January 26th, 2013

The Adams Workflow

From the webpage:

The Advanced Data mining And Machine learning System (ADAMS) is a novel, flexible workflow engine aimed at quickly building and maintaining real-world, complex knowledge workflows.

Same source as WEKA.

What if we think about identification as workflow?

Whatever stability we attribute to an identification is the absence of additional data that would create a change.

Looking backwards over prior identifications, we fit them into the schema of our present identification and that eliminates any movement from the past. The past is fixed and terminates in our present identification.

That view fails to appreciate the world isn’t going to end with any of us individually. The world and its information systems will continue, as will the workflow that defines identifications.

Replacing our identifications with newer ones.

The question we face is whether our actions will support or impede re-use of our identifications in the future.

I first saw Adams Workflow at Nat Torkington’s Four short links: 24 January 2013.

XQuery 3.0: An XML Query Language [Subject Identity Equivalence Language?]

Tuesday, January 15th, 2013

XQuery 3.0: An XML Query Language – W3C Candidate Recommendation

Abstract:

XML is a versatile markup language, capable of labeling the information content of diverse data sources including structured and semi-structured documents, relational databases, and object repositories. A query language that uses the structure of XML intelligently can express queries across all these kinds of data, whether physically stored in XML or viewed as XML via middleware. This specification describes a query language called XQuery, which is designed to be broadly applicable across many types of XML data sources.

Just starting to read the XQuery CR but the thought occurred to me that it could be a basis for a “subject identity equivalence language.”

Rather than duplicating the work on expressions, paths, data types, operators, etc., why not take all that as given?

Suffice it to define a “subject equivalence function,” the variables of which are XQuery statements that identify values (or value expressions) as required, optional or forbidden and the definition of the results of the function.
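
A rough sketch of the shape such a function might take. I am using Python's ElementTree and its limited XPath subset purely as a stand-in for XQuery, and the rule list, element names and the `equivalent` function are hypothetical. I also assume that "forbidden" means the expression must yield no value in either subject:

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: a requirement level paired with a path expression.
RULES = [
    ("required", "name"),
    ("required", "identifier[@scheme='isbn']"),
    ("optional", "publisher"),
    ("forbidden", "withdrawn"),
]

def value(elem, path):
    """Return the stripped text selected by path, or None if absent."""
    found = elem.find(path)
    if found is None or found.text is None:
        return None
    return found.text.strip()

def equivalent(a, b, rules=RULES):
    """Decide whether two XML descriptions identify the same subject."""
    for level, path in rules:
        va, vb = value(a, path), value(b, path)
        if level == "required" and (va is None or va != vb):
            return False
        if level == "forbidden" and (va is not None or vb is not None):
            return False
        # "optional" values do not affect the decision in this sketch.
    return True

first = ET.fromstring(
    "<book><name>X</name><identifier scheme='isbn'>123</identifier>"
    "<publisher>P</publisher></book>")
second = ET.fromstring(
    "<book><name>X</name><identifier scheme='isbn'>123</identifier></book>")
third = ET.fromstring(
    "<book><name>X</name><identifier scheme='isbn'>123</identifier>"
    "<withdrawn>yes</withdrawn></book>")

print(equivalent(first, second), equivalent(first, third))
```

A real version would hand the path expressions to an XQuery processor and would need to define what the function returns beyond a yes or no.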

Reusing a well-tested query language seems preferable to writing an entirely new one from scratch.

Suggestions?

I first saw this in a tweet by Michael Kay.

The future of programming [A Cacophony of Semantic Primitives]

Tuesday, January 15th, 2013

The future of programming by Edd Dumbill.

You need to read Edd’s post on the future of programming in full, but there are two points I would like to pull out for your attention:

  1. Expansion of people engaged in programming:

    In our age of exploding data, the ability to do some kind of programming is increasingly important to every job, and programming is no longer the sole preserve of an engineering priesthood.

  2. Data as first class citizen

    As data and its analysis grow in importance, there’s a corresponding rise in use and popularity of languages that treat data as a first class citizen. Obviously, statistical languages such as R are rising on this tide, but within general purpose programming there’s a bias to languages such as Python or Clojure, which make data easier to manipulate.

The most famous occasion when a priesthood lost the power of sole interpretation was the Protestant Reformation.

Although there was already a wide range of interpretations, as the priesthood of believers grew over the centuries, so did the diversity of interpretation and semantics.

Even though there is a wide range of semantics in programming already, the broader participation becomes, the broader the semantics of programming will grow. Not in terms of the formal semantics as defined by language designers but as used by programmers.

Semantics being the province of usage, I am betting on semantics as used being the clear winner.

Data being treated as a first class citizen carries with it the seeds of even more semantic diversity. Data, after all, originates with users and is only meaningful when some user interprets it.

Users are going to “see” data as having the semantics they attribute to it, not the semantics as defined by other programmers or sources.

To use another analogy from religion, the Old Testament/Hebrew Bible can be read in the context of Ancient Near Eastern religions and practices or taken as a day by day calendar from the point of creation. And several variations in between. All relying on the same text.

For decades programmers have pretended programming was based on semantic primitives. Semantic primitives that could be reliably interchanged, albeit sometimes with difficulty, with other systems. But users and their data are shattering the illusion of semantic primitives.

More accurately they are putting other notions of semantic primitives into play.

A cacophony of semantic primitives bodes poorly for a future of distributed, device, data and democratized computing.

Avoidable to the degree that we choose to not silently rely upon others “knowing what we meant.”

I first saw this at The four D’s of programming’s future: data, distributed, device, democratized by David Smith.

Constructing Topological Spaces — A Primer

Friday, November 16th, 2012

Constructing Topological Spaces — A Primer by Jeremy Kun.

From the post:

Last time we investigated the (very unintuitive) concept of a topological space as a set of “points” endowed with a description of which subsets are open. Now in order to actually arrive at a discussion of interesting and useful topological spaces, we need to be able to take simple topological spaces and build them up into more complex ones. This will take the form of subspaces and quotients, and through these we will make rigorous the notion of “gluing” and “building” spaces.

More heavy sledding but pay special attention to the discussion of sets and equivalences.

Jeremy concludes with pointers to books for additional reading.

Any personal favorites you would like to add to the list?

Topological Spaces — A Primer

Friday, November 16th, 2012

Topological Spaces — A Primer by Jeremy Kun.

From the post:

In our last primer we looked at a number of interesting examples of metric spaces, that is, spaces in which we can compute distance in a reasonable way. Our goal for this post is to relax this assumption. That is, we want to study the geometric structure of space without the ability to define distance. That is not to say that some notion of distance necessarily exists under the surface somewhere, but rather that we include a whole new class of spaces for which no notion of distance makes sense. Indeed, even when there is a reasonable notion of a metric, we’ll still want to blur the lines as to what kinds of things we consider “the same.”

The reader might wonder how we can say anything about space if we can’t compute distances between things. Indeed, how could it even really be “space” as we know it? The short answer is: the reader shouldn’t think of a topological space as a space in the classical sense. While we will draw pictures and say some very geometric things about topological spaces, the words we use are only inspired by their classical analogues. In fact the general topological space will be a much wilder beast, with properties ranging from absolute complacency to rampant hooliganism. Even so, topological spaces can spring out of every mathematical cranny. They bring at least a loose structure to all sorts of problems, and so studying them is of vast importance.

Just before we continue, we should give a short list of how topological spaces are applied to the real world. In particular, this author is preparing a series of posts dedicated to the topological study of data. That is, we want to study the loose structure of data potentially embedded in a very high-dimensional metric space. But in studying it from a topological perspective, we aim to eliminate the dependence on specific metrics and parameters (which can be awfully constricting, and even impertinent to the overall structure of the data). In addition, topology has been used to study graphics, image analysis and 3D modelling, networks, semantics, protein folding, solving systems of polynomial equations, and loads of topics in physics.

Topology offers an alternative to the fiction of metric distances between the semantics of words. It is a useful fiction, but a fiction none the less.

Deep sledding but well worth the time.

Identities and Identifications: Politicized Uses of Collective Identities

Monday, September 17th, 2012

Identities and Identifications: Politicized Uses of Collective Identities

Deadline for Panels 15 January 2013
Deadline for Papers 1 March 2013
Conference 18-20 April 2013, Zagreb, Croatia

From the call for panels and papers:

Identity is one of the crown jewelleries in the kingdom of ‘contested concepts’. The idea of identity is conceived to provide some unity and recognition while it also exists by separation and differentiation. Few concepts were used as much as identity for contradictory purposes. From the fragile individual identities as self-solidifying frameworks to layered in-group identifications in families, orders, organizations, religions, ethnic groups, regions, nation-states, supra-national entities or any other social entities, the idea of identity always shows up in the core of debates and makes everything either too dangerously simple or too complicated. Constructivist and de-constructivist strategies have led to the same result: the eternal return of the topic. Some say we should drop the concept, some say we should keep it and refine it, some say we should look at it in a dynamic fashion while some say it’s the reason for resistance to change.

If identities are socially constructed and not genuine formations, they still hold some responsibility for inclusion/exclusion – self/other nexuses. Looking at identities in a research oriented manner provides explanatory tools for a wide variety of events and social dynamics. Identities reflect the complex nature of human societies and generate reasonable comprehension for processes that cannot be explained by tracing pure rational driven pursuit of interests. The feelings of attachment, belonging, recognition, the processes of values’ formation and norms integration, the logics of appropriateness generated in social organizations are all factors relying on a certain type of identity or identification. Multiple identifications overlap, interact, include or exclude, conflict or enhance cooperation. Identities create boundaries and borders; define the in-group and the out-group, the similar and the excluded, the friend and the threatening, the insider and the ‘other’.

Beyond their dynamic fuzzy nature that escapes exhaustive explanations, identities are effective instruments of politicization of social life. The construction of social forms of organization and of specific social practices together with their imaginary significations requires all the time an essentialist or non-essentialist legitimating act of belonging; a social glue that extracts its cohesive function from the identification of the in-group and the power of naming the other. Identities are political. Multicultural slogans populate extensively the twenty-first century yet the distance between the ideal and the real multiculturalism persists while the virtues of inclusion coexist with the adversity of exclusion. Dealing with the identities means to integrate contestation into contestation until potentially a n degree of contestation. Due to the confusion between identities and identifications some scholars demanded that the concept of identity shall be abandoned. Identitarian issues turned out to be efficient tools for politicization of a ‘constraining dissensus’ while universalizing terms included in the making of the identities usually tend or intend to obscure the localized origins of any identitarian project. Identities are often conceptually used as rather intentional concepts: they don’t say anything about their sphere but rather defining the sphere makes explicit the aim of their usage. It is not ‘identity of’ but ‘identity to’.

Quick! Someone get them a URL! 😉 Just teasing.

Enjoy the conference!

Context-Aware Recommender Systems 2012 [Identity and Context?]

Tuesday, September 11th, 2012

Context-Aware Recommender Systems 2012 (In conjunction with the 6th ACM Conference on Recommender Systems (RecSys 2012))

I usually think of recommender systems as attempts to deliver content based on clues about my interests or context. If I dial 911, the location of the nearest pizza vendor probably isn’t high on my list of interests, etc.

As I looked over these proceedings, it occurred to me that subject identity, for merging purposes, isn’t limited to the context of the subject in question.

That is, some merging tests could depend upon my context as a user.

Take my 911 call for instance. For many purposes, a police substation, fire station, 24 hour medical clinic and a hospital are different subjects.

In a medical emergency situation, for which a 911 call might be a clue, all of those could be treated as a single subject – places for immediate medical attention.

What other subjects do you think might merge (or not) depending upon your context?
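
As a sketch of the 911 example above, assuming a hypothetical table of context-dependent merging rules (the place kinds, context names and `merge` function are mine, for illustration only):

```python
from collections import defaultdict

# (name, kind) pairs for the places in the 911 example.
PLACES = [
    ("Police substation", "police"),
    ("Fire station", "fire"),
    ("24 hour medical clinic", "clinic"),
    ("Hospital", "hospital"),
]

# Hypothetical rules: in a given user context, these kinds of place
# are treated as one subject.
CONTEXT_RULES = {
    "medical emergency": {
        frozenset({"police", "fire", "clinic", "hospital"}):
            "immediate medical attention",
    },
}

def merge(places, context):
    """Group places into subjects according to the user's context."""
    groups = defaultdict(list)
    rules = CONTEXT_RULES.get(context, {})
    for name, kind in places:
        subject = kind  # default: each kind is its own subject
        for kinds, merged_subject in rules.items():
            if kind in kinds:
                subject = merged_subject
        groups[subject].append(name)
    return dict(groups)

print(merge(PLACES, "everyday"))
print(merge(PLACES, "medical emergency"))
```

The same four places come out as four subjects or one, depending only on who is asking and why.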

Table of Contents

  1. Optimal Feature Selection for Context-Aware Recommendation Using Differential Relaxation
    Yong Zheng, Robin Burke, Bamshad Mobasher.
  2. Relevant Context in a Movie Recommender System: Users’ Opinion vs. Statistical Detection
    Ante Odic, Marko Tkalcic, Jurij Franc Tasic, Andrej Kosir.
  3. Improving Novelty in Streaming Recommendation Using a Context Model
    Doina Alexandra Dumitrescu, Simone Santini.
  4. Towards a Context-Aware Photo Recommender System
    Fabricio Lemos, Rafael Carmo, Windson Viana, Rossana Andrade.
  5. Context and Intention-Awareness in POIs Recommender Systems
    Hernani Costa, Barbara Furtado, Durval Pires, Luis Macedo, F. Amilcar Cardoso.
  6. Evaluation and User Acceptance Issues of a Bayesian-Classifier-Based TV Recommendation System
    Benedikt Engelbert, Karsten Morisse, Kai-Christoph Hamborg.
  7. From Online Browsing to Offline Purchases: Analyzing Contextual Information in the Retail Business
    Simon Chan, Licia Capra.

‘The Algorithm That Runs the World’ [Optimization, Identity and Polytopes]

Tuesday, August 28th, 2012

“The Algorithm That Runs the World” by Erwin Gianchandani.

From the post:

New Scientist published a great story last week describing the history and evolution of the simplex algorithm — complete with a table capturing “2000 years of algorithms”:

The simplex algorithm directs wares to their destinations the world over [image courtesy PlainPicture/Gozooma via New Scientist].

Its services are called upon thousands of times a second to ensure the world’s business runs smoothly — but are its mathematics as dependable as we thought?

YOU MIGHT not have heard of the algorithm that runs the world. Few people have, though it can determine much that goes on in our day-to-day lives: the food we have to eat, our schedule at work, when the train will come to take us there. Somewhere, in some server basement right now, it is probably working on some aspect of your life tomorrow, next week, in a year’s time.

Perhaps ignorance of the algorithm’s workings is bliss. The door to Plato’s Academy in ancient Athens is said to have borne the legend “let no one ignorant of geometry enter”. That was easy enough to say back then, when geometry was firmly grounded in the three dimensions of space our brains were built to cope with. But the algorithm operates in altogether higher planes. Four, five, thousands or even many millions of dimensions: these are the unimaginable spaces the algorithm’s series of mathematical instructions was devised to probe.

Perhaps, though, we should try a little harder to get our heads round it. Because powerful though it undoubtedly is, the algorithm is running into a spot of bother. Its mathematical underpinnings, though not yet structurally unsound, are beginning to crumble at the edges. With so much resting on it, the algorithm may not be quite as dependable as it once seemed [more following the link].

A fund manager might similarly want to arrange a portfolio optimally to balance risk and expected return over a range of stocks; a railway timetabler to decide how best to roster staff and trains; or a factory or hospital manager to work out how to juggle finite machine resources or ward space. Each such problem can be depicted as a geometrical shape whose number of dimensions is the number of variables in the problem, and whose boundaries are delineated by whatever constraints there are (see diagram). In each case, we need to box our way through this polytope towards its optimal point.

This is the job of the algorithm.

Its full name is the simplex algorithm, and it emerged in the late 1940s from the work of the US mathematician George Dantzig, who had spent the second world war investigating ways to increase the logistical efficiency of the U.S. air force. Dantzig was a pioneer in the field of what he called linear programming, which uses the mathematics of multidimensional polytopes to solve optimisation problems. One of the first insights he arrived at was that the optimum value of the “target function” — the thing we want to maximise or minimise, be that profit, travelling time or whatever — is guaranteed to lie at one of the corners of the polytope. This instantly makes things much more tractable: there are infinitely many points within any polytope, but only ever a finite number of corners.

If we have just a few dimensions and constraints to play with, this fact is all we need. We can feel our way along the edges of the polytope, testing the value of the target function at every corner until we find its sweet spot. But things rapidly escalate. Even just a 10-dimensional problem with 50 constraints — perhaps trying to assign a schedule of work to 10 people with different expertise and time constraints — may already land us with several billion corners to try out.

Apologies but I saw this article too late to post within the “free” days allowed by New Scientist.

But, I think from Erwin’s post and long quote from the original article, you can see how the simplex algorithm may be very useful where identity is defined in multidimensional space.

The literature in this area is vast and it may not offer an appropriate test for all questions of subject identity.

For example, the possessor of a credit card is presumed to be the owner of the card. Other assumptions are possible, but fraud costs are recouped from fees paid by customers, creating a lack of interest in more stringent identity tests.

On the other hand, if your situation requires multidimensional identity measures, this may be a useful approach.
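
To illustrate the corner property described in the quoted article, here is a small sketch on a made-up two-variable problem. Note that this is brute-force corner enumeration, not the simplex algorithm itself, which walks from corner to corner instead of testing them all. The constraints and objective are invented for the example:

```python
from itertools import combinations
from math import comb

# Made-up constraints of the form a*x + b*y <= c.
CONSTRAINTS = [
    (1, 0, 4),    # x <= 4
    (0, 1, 3),    # y <= 3
    (1, 1, 5),    # x + y <= 5
    (-1, 0, 0),   # x >= 0
    (0, -1, 0),   # y >= 0
]
OBJECTIVE = (3, 2)  # maximise 3x + 2y

def corners(constraints, eps=1e-9):
    """Intersect every pair of boundary lines and keep the feasible points."""
    points = []
    for (a1, b1, c1), (a2, b2, c2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < eps:
            continue  # parallel boundaries never meet
        x = (c1 * b2 - c2 * b1) / det
        y = (a1 * c2 - a2 * c1) / det
        if all(a * x + b * y <= c + eps for a, b, c in constraints):
            points.append((x, y))
    return points

def score(point):
    return OBJECTIVE[0] * point[0] + OBJECTIVE[1] * point[1]

best = max(corners(CONSTRAINTS), key=score)
print(best, score(best))

# In 10 dimensions a corner is fixed by 10 of the 50 constraints, so the
# number of candidate corners is bounded above by "50 choose 10".
print(comb(50, 10))
```

Five constraints in two dimensions give ten pairs to test. The last line shows why testing every candidate stops being practical at the size mentioned in the article.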


PS: Be aware that naming confusion, the sort that can be managed (not solved) by topic maps, abounds even in mathematics:

The elements of a polytope are its vertices, edges, faces, cells and so on. The terminology for these is not entirely consistent across different authors. To give just a few examples: Some authors use face to refer to an (n−1)-dimensional element while others use face to denote a 2-face specifically, and others use j-face or k-face to indicate an element of j or k dimensions. Some sources use edge to refer to a ridge, while H. S. M. Coxeter uses cell to denote an (n−1)-dimensional element. (Polytope)

Modern Shape-Shifters

Monday, July 9th, 2012

Someday, in the not too distant future, you will be able to tell your grandchildren about fixed data structures and values. How queries returned the results imagined by the architects of data systems. Back in the old days of “small data.”

Quite different from the scene imagined in Sifting Through a Trillion Electrons:

Because FastQuery is built on the FastBit bitmap indexing technology, Byna notes that researchers can search their data based on an arbitrary range of conditions that is defined by available data values. This essentially means that a researcher can now feasibly search a trillion particle dataset and sift out electrons by their energy values.

Researchers, not data architects, get to decide on the questions to pose.
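To make the idea of a range search over a bitmap index concrete, here is a minimal sketch in Python. It is not the FastBit or FastQuery API; the function names, the bin edges, and the sample energy values are all invented for illustration, and production systems compress their bitmaps rather than using plain integers as done here.

```python
from bisect import bisect_right

def build_bitmap_index(values, bin_edges):
    """Build one integer bitmap per bin; bit i is set when values[i] falls in that bin."""
    bitmaps = [0] * (len(bin_edges) + 1)
    for row, value in enumerate(values):
        # bisect_right counts the edges <= value, which is the bin number
        bitmaps[bisect_right(bin_edges, value)] |= 1 << row
    return bitmaps

def range_query(values, bin_edges, bitmaps, low, high):
    """Return the row numbers whose value satisfies low <= value < high."""
    first = bisect_right(bin_edges, low)
    last = bisect_right(bin_edges, high)
    candidates = 0
    for b in range(first, last + 1):
        candidates |= bitmaps[b]  # OR together every bin that overlaps the range
    # The edge bins can hold values outside the range, so candidates are
    # checked against the raw data before being returned.
    return [
        row
        for row in range(len(values))
        if (candidates >> row) & 1 and low <= values[row] < high
    ]

# Hypothetical particle energies and bin edges, chosen only for the example
energies = [0.5, 1.2, 3.7, 2.1, 9.4, 1.9]
edges = [1.0, 2.0, 5.0]
index = build_bitmap_index(energies, edges)
hits = range_query(energies, edges, index, 1.5, 4.0)
```

The point for subject identity is that the index is built once, while the condition (here, an energy range) is supplied at query time by whoever is asking the question.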

Not hard to imagine that “small data” experiments too will be making their data available. In a variety of forms and formats.

Are you ready to consolidate those data sources based on your identification of subjects? Subjects both in content and in formalisms/structure?

To have data that shifts its shape depending upon the demands upon it?

Will you be a master of modern shape-shifters?

PS: Do read the “Trillion Electron” piece. A view of this year’s data processing options. Likely to be succeeded by technology X in the next year or so if the past is any guide.

The observational roots of reference of the semantic web

Sunday, July 1st, 2012

The observational roots of reference of the semantic web by Simon Scheider, Krzysztof Janowicz, and Benjamin Adams.

Abstract:

Shared reference is an essential aspect of meaning. It is also indispensable for the semantic web, since it enables to weave the global graph, i.e., it allows different users to contribute to an identical referent. For example, an essential kind of referent is a geographic place, to which users may contribute observations. We argue for a human-centric, operational approach towards reference, based on respective human competences. These competences encompass perceptual, cognitive as well as technical ones, and together they allow humans to inter-subjectively refer to a phenomenon in their environment. The technology stack of the semantic web should be extended by such operations. This would allow establishing new kinds of observation-based reference systems that help constrain and integrate the semantic web bottom-up.

In arguing for recasting the problem of semantics as one of reference, the authors say:

Reference systems. Solutions to the problem of reference should transgress syntax as well as technology. They cannot solely rely on computers but must also rely on human referential competences. This requirement is met by reference systems [22]. Reference systems are different from ontologies in that they constrain meaning bottom-up [11]. Most importantly, they are not “yet another chimera” invented by ontology engineers, but already exist in various successful variants.

I rather like the “human referential competences….”

After all, useful semantic systems are about references that we recognize.

SkyQuery: …Parallel Probabilistic Join Engine… [When Static Mapping Isn’t Enough]

Sunday, July 1st, 2012

SkyQuery: An Implementation of a Parallel Probabilistic Join Engine for Cross-Identification of Multiple Astronomical Databases by László Dobos, Tamás Budavári, Nolan Li, Alexander S. Szalay, and István Csabai.

Abstract:

Multi-wavelength astronomical studies require cross-identification of detections of the same celestial objects in multiple catalogs based on spherical coordinates and other properties. Because of the large data volumes and spherical geometry, the symmetric N-way association of astronomical detections is a computationally intensive problem, even when sophisticated indexing schemes are used to exclude obviously false candidates. Legacy astronomical catalogs already contain detections of more than a hundred million objects while the ongoing and future surveys will produce catalogs of billions of objects with multiple detections of each at different times. The varying statistical error of position measurements, moving and extended objects, and other physical properties make it necessary to perform the cross-identification using a mathematically correct, proper Bayesian probabilistic algorithm, capable of including various priors. One time, pair-wise cross-identification of these large catalogs is not sufficient for many astronomical scenarios. Consequently, a novel system is necessary that can cross-identify multiple catalogs on-demand, efficiently and reliably. In this paper, we present our solution based on a cluster of commodity servers and ordinary relational databases. The cross-identification problems are formulated in a language based on SQL, but extended with special clauses. These special queries are partitioned spatially by coordinate ranges and compiled into a complex workflow of ordinary SQL queries. Workflows are then executed in a parallel framework using a cluster of servers hosting identical mirrors of the same data sets.

Astronomy is a cool area to study and has data out the wazoo, but I was struck by:

One time, pair-wise cross-identification of these large catalogs is not sufficient for many astronomical scenarios.

Is identity with sharp edges, susceptible to pair-wise mapping, the common case?

Or do we just see some identity issues that way?

Commend the paper to you as an example of dynamic merging practice.
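For a feel of what "probabilistic" identity looks like in code, the following Python sketch scores one candidate pair of detections. It is not SkyQuery's implementation. It assumes a two-catalog Bayes factor of the form commonly used for circular Gaussian positional errors at small separations, and the prior and the error values in the example are hypothetical.

```python
import math

ARCSEC = math.pi / (180.0 * 3600.0)  # one arcsecond in radians

def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in radians (haversine form); inputs in radians."""
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    h = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return 2.0 * math.asin(min(1.0, math.sqrt(h)))

def bayes_factor(psi, sigma1, sigma2):
    """Assumed two-catalog Bayes factor for separation psi and positional
    errors sigma1, sigma2 (all in radians, small-angle limit)."""
    s2 = sigma1 ** 2 + sigma2 ** 2
    return 2.0 / s2 * math.exp(-psi ** 2 / (2.0 * s2))

def match_probability(bayes, prior):
    """Posterior probability of 'same object' from a Bayes factor and a prior."""
    odds = bayes * prior / (1.0 - prior)
    return odds / (1.0 + odds)

# Hypothetical pair: 0.2 arcsec apart, 0.1 and 0.3 arcsec positional errors
psi = angular_separation(0.0, 0.0, 0.2 * ARCSEC, 0.0)
b = bayes_factor(psi, 0.1 * ARCSEC, 0.3 * ARCSEC)
p = match_probability(b, prior=1e-9)  # the prior is an assumed value
```

The result is a probability, not a yes/no answer, and it changes whenever the errors or the prior change. That is one reason a one-time, pair-wise mapping does not settle the question.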

Happy Go Lucky Identification/Merging?

Tuesday, May 22nd, 2012

MIT News: New mathematical framework formalizes oddball programming techniques

From the post:

Two years ago, Martin Rinard’s group at MIT’s Computer Science and Artificial Intelligence Laboratory proposed a surprisingly simple way to make some computer procedures more efficient: Just skip a bunch of steps. Although the researchers demonstrated several practical applications of the technique, dubbed loop perforation, they realized it would be a hard sell. “The main impediment to adoption of this technique,” Imperial College London’s Cristian Cadar commented at the time, “is that developers are reluctant to adopt a technique where they don’t exactly understand what it does to the program.”

I like that for making topic maps scale, “…skip a bunch of steps….”

Topic maps, the semantic web and similar semantic ventures are erring on the side of accuracy.

We are often mistaken about facts, faces, identifications in semantic terminology.

Why think we can build programs or machines that can do better?

Let’s stop rolling the identification stone up the hill.

Ask “how accurate does the identification/merging need to be?”

The answer for aiming a missile is probably different from the answer for sorting emails in a discovery process.
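As a toy illustration of "skip a bunch of steps" applied to identification, the Python sketch below perforates a loop that counts matching records. It is not the technique as formalized in the MIT work, only a minimal instance of skipping loop iterations; the function name, the records, and the match test are all hypothetical.

```python
def perforated_match_rate(records, is_match, stride=1):
    """Estimate the fraction of records that match, testing only every
    `stride`-th record. stride=1 runs the full loop and is exact."""
    checked = 0
    matched = 0
    for i in range(0, len(records), stride):  # perforation: skipped iterations
        checked += 1
        if is_match(records[i]):
            matched += 1
    return matched / checked if checked else 0.0

# Hypothetical data: 100 record ids, of which those below 30 "match"
records = list(range(100))
is_match = lambda record: record < 30

exact = perforated_match_rate(records, is_match)             # every record tested
approx = perforated_match_rate(records, is_match, stride=4)  # a quarter of the work
```

With a stride of 4 the match test runs a quarter as often, at the cost of an estimate rather than an exact figure. A regular stride can also line up with periodic patterns in the data and skew the estimate, which is exactly the sort of thing you want to know before deciding how accurate is accurate enough.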

If you believe in hyperlinks:

Proving Acceptability Properties of Relaxed Nondeterministic Approximate Programs, by Michael Carbin, Deokhwan Kim, Sasa Misailovic, and Martin Rinard, Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2012), Beijing, China, June 2012.

From Martin Rinard’s publication page.

Has other interesting reading.

Who Do You Say You Are?

Friday, May 11th, 2012

In Data Governance in Context, Jim Ericson outlines several paths of data governance, or as I put it: Who Do You Say You Are?:

On one path, more enterprises are dead serious about creating and using data they can trust and verify. It’s a simple equation. Data that isn’t properly owned and operated can’t be used for regulatory work, won’t be trusted to make significant business decisions and will never have the value organizations keep wanting to ascribe it on the balance sheet. We now know instinctively that with correct and thorough information, we can jump on opportunities, unite our understanding and steer the business better than before.

On a similar path, we embrace tested data in the marketplace (see Experian, D&B, etc.) that is trusted for a use case even if it does not conform to internal standards. Nothing wrong with that either.

And on yet another path (and areas between) it’s exploration and discovery of data that might engage huge general samples of data with imprecise value.

It’s clear that we cannot and won’t have the same governance standards for all the different data now facing an enterprise.

For starters, crowd sourced and third party data bring a new dimension, because “fitness for purpose” is by definition a relative term. You don’t need or want the same standard for how many thousands or millions of visitors used a website feature or clicked on a bundle in the way you maintain your customer or financial info.

Do mortgage-backed securities fall into the “…huge general samples of data with imprecise value?” I ask because I don’t work in the financial industry. Or do they not practice data governance, except to generate numbers for the auditors?

I mention this because I suspect that subject identity governance would be equally useful for topic map authoring.

Some topic maps, say on drug trials, need to have a high degree of reliability and auditability, as well as precise identification (even if double-blind) of the test subjects.

Or there may be different tests for subject identity, some of which appear to be less precise than others.

For example, merging all the topics entered by a particular operator in a day to look for patterns that may indicate they are not following data entry protocols. (It is hard to be as random as real data.)
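A rough Python sketch of that kind of merge follows. The field names ("operator", "date", "value"), the sample entries, and the 0.6 threshold are invented for illustration; a real test for careless data entry would need to be worked out for the data at hand.

```python
from collections import Counter, defaultdict

def merge_by_operator_day(entries):
    """Merge entries on the deliberately coarse identity (operator, date)."""
    merged = defaultdict(list)
    for entry in entries:
        merged[(entry["operator"], entry["date"])].append(entry["value"])
    return merged

def most_common_share(values):
    """Share of a group taken up by its single most frequent value."""
    return Counter(values).most_common(1)[0][1] / len(values)

def suspicious_groups(entries, threshold=0.6):
    """Return the (operator, date) keys whose values repeat more than the threshold allows."""
    merged = merge_by_operator_day(entries)
    return sorted(key for key, values in merged.items()
                  if most_common_share(values) > threshold)

# Hypothetical entries: operator "b" keeps typing the same reading
entries = [
    {"operator": "a", "date": "2012-05-10", "value": 98.6},
    {"operator": "a", "date": "2012-05-10", "value": 99.1},
    {"operator": "a", "date": "2012-05-10", "value": 97.9},
    {"operator": "b", "date": "2012-05-10", "value": 98.6},
    {"operator": "b", "date": "2012-05-10", "value": 98.6},
    {"operator": "b", "date": "2012-05-10", "value": 98.6},
]
flagged = suspicious_groups(entries)
```

Nobody would claim the operator and the day identify the subjects of the individual entries. The merge is useful precisely because it uses a looser test of identity, chosen for a different purpose.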

As with most issues, there isn’t any hard and fast rule that works for all cases. You do need to document the rules you are following and for how long. It will help you test old rules and to formulate new ones. (“Document” meaning to write down. The vagaries of memory are insufficient.)