Construction of Controlled Vocabularies

Tuesday, April 2nd, 2013

Construction of Controlled Vocabularies: A Primer by Marcia Lei Zeng.

From the “why” page:

Vocabulary control is used to improve the effectiveness of information storage and retrieval systems, Web navigation systems, and other environments that seek to both identify and locate desired content via some sort of description using language. The primary purpose of vocabulary control is to achieve consistency in the description of content objects and to facilitate retrieval.

1.1 Need for Vocabulary Control (1.1)

The need for vocabulary control arises from two basic features of natural language, namely:

• Two or more words or terms can be used to represent a single concept

 Example: salinity/saltiness; VHF/Very High Frequency

• Two or more words that have the same spelling can represent different concepts

 Example: Mercury (planet); Mercury (metal); Mercury (automobile); Mercury (mythical being)

Great examples for vocabulary control but for topic maps as well!

The topic map question is:

What do you know about the subject(s) in either case, that would make you say the words mean the same subject or different subjects?

If we can capture the information you think makes them represent the same or different subjects, there is a basis for repeating that comparison.

Perhaps even automatically.
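As a minimal sketch of the two features quoted above, here is how synonyms and homographs might be handled in Python. The table contents come from the primer's examples; the names PREFERRED, HOMOGRAPHS and normalize are hypothetical and only for illustration.

```python
# Synonyms: several entry terms lead to one preferred term (hypothetical table).
PREFERRED = {
    "salinity": "salinity",
    "saltiness": "salinity",
    "vhf": "very high frequency",
    "very high frequency": "very high frequency",
}

# Homographs: one spelling, several concepts, told apart by a qualifier.
HOMOGRAPHS = {
    "mercury": {"planet", "metal", "automobile", "mythical being"},
}


def normalize(term, qualifier=None):
    """Return the controlled form of a term, or raise if it is ambiguous."""
    key = term.strip().lower()
    if key in HOMOGRAPHS:
        if qualifier not in HOMOGRAPHS[key]:
            raise ValueError(f"{term!r} is ambiguous; a known qualifier is required")
        return f"{key} ({qualifier})"
    # Terms outside the vocabulary are passed through unchanged.
    return PREFERRED.get(key, key)
```

Note what the tables do not record: why "saltiness" and "salinity" were judged to be the same subject. That is the information the topic map question asks for.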

Mary Jane pointed out this resource in a recent comment.

O Knowledge Graph, Where Art Thou?

Monday, February 11th, 2013

O Knowledge Graph, Where Art Thou? by Matthew Hurst.

From the post:

The web search community, in recent months and years, has heard quite a bit about the ‘knowledge graph’. The basic concept is reasonably straightforward – instead of a graph of pages, we propose a graph of knowledge where the nodes are atoms of information of some form and the links are relationships between those statements. The knowledge graph concept has become established enough for it to be used as a point of comparison between Bing and Google.

….

Much of what we see out there in the form of knowledge returned for searches is really isolated pockets of related information (the date and place of birth of a person, for example). The really interesting things start happening when the graphs of information become unified across type, allowing – as suggested by this example – the user to traverse from a performer to a venue to all the performers at that venue, etc. Perhaps ‘knowledge engineer’ will become a popular resume-buzz word in the near future as ‘data scientist’ has become recently.

Read Matthew’s post for the details of the comparison.

+1! to going from graphs of pages to graphs of “atoms of information.”

I am less certain about “…graphs of information become unified across type….”

What I am missing is the reason to think that “type,” unlike any other subject, will have a uniform identification.

If we solve the problem of not requiring “type” to have a uniform identification, why not apply that to other subjects as well?

Without an express or implied requirement for uniform identification, all manner of “interesting things” will be happening in knowledge graphs.

(Note the plural, knowledge graphs, not knowledge graph.)

The Semantic Web Is Failing — But Why? (Part 5)

Thursday, February 7th, 2013

Impoverished Identification by URI

There is one final part of the failure of the Semantic Web puzzle to explore before we can talk about a solution.

In owl:sameAs and Linked Data: An Empirical Study, Ding, Shinavier, Finin and McGuinness write:

Our experimental results have led us to identify several issues involving the owl:sameAs property as it is used in practice in a linked data context. These include how best to manage owl:sameAs assertions from “third parties”, problems in merging assertions from sources with different contexts, and the need to explore an operational semantics distinct from the strict logical meaning provided by OWL.

To resolve varying usages of owl:sameAs, the authors go beyond identifications provided by a URI to look to other properties. For example:

Many owl:sameAs statements are asserted due to the equivalence of the primary feature of resource description, e.g. the URIs of FOAF profiles of a person may be linked just because they refer to the same person even if the URIs refer the person at different ages. The odd mashup on job-title in previous section is a good example for why the URIs in different FOAF profiles are not fully equivalent. Therefore, the empirical usage of owl:sameAs only captures the equivalence semantics on the projection of the URI on social entity dimension (removing the time and space dimensions). In this way, owl:sameAs is used to indicate partial equivalence between two different URIs, which should not be considered as full equivalence.

Knowing the dimensions covered by a URI and the dimensions covered by a property, it is possible to conduct better data integration using owl:sameAs. For example, since we know a URI of a person provides a temporal-spatial identity, descriptions using time-sensitive properties, e.g. age, height and workplace, should not be aggregated, while time-insensitive properties, such as eye color and social security number, may be aggregated in most cases.

When an identification is insufficient based on a single URI, additional properties can be considered.
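The aggregation rule in the last quoted paragraph can be sketched in Python. This is only an illustration under my own assumptions: the example.org URIs, the property names and the hand-written TIME_SENSITIVE set are hypothetical, not taken from the paper.

```python
# Assumption: someone has declared which properties are time-sensitive.
TIME_SENSITIVE = {"age", "height", "workplace"}


def aggregate(descriptions):
    """Combine descriptions linked by owl:sameAs.

    Time-insensitive values are pooled across all URIs; time-sensitive
    values stay attached to the URI that asserted them.
    """
    merged = {}
    per_source = {}
    for uri, props in descriptions.items():
        for name, value in props.items():
            if name in TIME_SENSITIVE:
                per_source.setdefault(uri, {})[name] = value
            else:
                merged.setdefault(name, set()).add(value)
    return merged, per_source


# Two hypothetical FOAF-style profiles of one person at different times.
profiles = {
    "http://example.org/alice-2005": {"eye_color": "brown", "age": 25, "workplace": "Acme"},
    "http://example.org/alice-2012": {"eye_color": "brown", "age": 32, "workplace": "Initech"},
}
merged, per_source = aggregate(profiles)
```

Here merged holds the single shared eye color, while the two ages and workplaces remain separate under their own URIs.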

My question then is why do ordinary users have to wait for experts to decide their identifications are insufficient? Why can’t we empower users to declare multiple properties, including URIs, as a means of identification?

It could be something as simple as JSON key/value pairs with a notation of “+” for must match, “-” for must not match, and “?” for optional to match.
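A rough sketch of what such a declaration might look like, in Python. The prefix-on-the-key convention, the property names and the same_subject function are all my assumptions for illustration; nothing here is an existing format.

```python
import json

# Hypothetical declaration: the first character of each key says how the
# pair takes part in matching ("+" must match, "-" must not match,
# "?" optional to match).
declaration = json.loads("""
{
  "+label": "Mercury",
  "+type": "planet",
  "-symbol": "Hg",
  "?uri": "http://example.org/mercury-planet"
}
""")


def same_subject(declaration, candidate):
    """Return True if the candidate's properties satisfy the declaration."""
    for key, expected in declaration.items():
        mark, name = key[0], key[1:]
        actual = candidate.get(name)
        if mark == "+":
            if actual != expected:
                return False
        elif mark == "-":
            if actual == expected:
                return False
        elif mark != "?":
            raise ValueError(f"unknown marker in key {key!r}")
        # "?" pairs never decide the outcome on their own.
    return True


planet = {"label": "Mercury", "type": "planet"}
metal = {"label": "Mercury", "type": "metal", "symbol": "Hg"}
```

With these values, same_subject(declaration, planet) is True and same_subject(declaration, metal) is False, even though no URI was compared at all.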

A declaration of identity by users about the subjects in their documents. Who better to ask?

Not to mention that the more information users supply for an identification, the more likely they are to communicate, successfully, with other users.

URIs may be Tim Berners-Lee’s nails, but they are insufficient to support the scaffolding required for robust communication.

The next series starts with Saving the “Semantic” Web (Part 1)

The Semantic Web Is Failing — But Why? (Part 1)

Thursday, February 7th, 2013

Introduction

Before proposing yet another method for identification and annotation of entities in digital media, it is important to draw lessons from existing systems. Failing systems in particular, so their mistakes are not repeated or compounded. The Semantic Web is an example of such a system.

Doubters of that claim should read the report Additional Statistics and Analysis of the Web Data Commons August 2012 Corpus by Web Data Commons.

Web Data Commons is a structured data research project based at the Research Group Data and Web Science at the University of Mannheim and the Institute AIFB at the Karlsruhe Institute of Technology. Supported by PlanetData and LOD2 research projects, the Web Data Commons is not opposed to the Semantic Web.

Altogether we discovered structured data within 369 million of the 3 billion pages contained in the Common Crawl corpus (12.3%). The pages containing structured data originate from 2.29 million among the 40.5 million websites (PLDs) contained in the corpus (5.65%). Approximately 519 thousand websites use RDFa, while only 140 thousand websites use Microdata. Microformats are used on 1.7 million websites. It is interesting to see that Microformats are used by approximately 2.5 times as many websites as RDFa and Microdata together. (emphasis added)

To sharpen the point, RDFa is 1.28% of the 40.5 million websites, eight (8) years after its introduction (2004) and four (4) years after reaching Recommendation status (2008).

Or more generally:

 Parsed HTML URLs: 3,005,629,093
 URLs with Triples: 369,254,196

Or, in layperson’s terms, for this web corpus, parsed HTML URLs outnumber URLs with Triples by approximately eight to one.
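For anyone who wants to check the arithmetic, here is a short Python sketch using only the figures quoted above. The website counts are the rounded values from the report excerpt, so the derived percentages are approximate.

```python
# Figures as quoted from the Web Data Commons report.
parsed_urls = 3_005_629_093
urls_with_triples = 369_254_196
websites = 40_500_000          # "40.5 million websites (PLDs)"
rdfa_sites = 519_000           # "approximately 519 thousand"
microdata_sites = 140_000      # "140 thousand"
microformat_sites = 1_700_000  # "1.7 million"

share_with_triples = urls_with_triples / parsed_urls   # roughly 0.123
urls_per_triple_url = parsed_urls / urls_with_triples  # roughly 8.1
rdfa_share = rdfa_sites / websites                     # roughly 0.0128

# Microformats versus RDFa and Microdata combined; the rounded counts
# give a value between 2.5 and 2.6.
microformats_ratio = microformat_sites / (rdfa_sites + microdata_sites)

print(f"{share_with_triples:.1%} of parsed URLs carry triples")
print(f"{urls_per_triple_url:.1f} parsed URLs per URL with triples")
print(f"RDFa on {rdfa_share:.2%} of websites")
```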

Being mindful that the corpus is only web accessible data and excludes “dark data,” the need for a more robust solution than the Semantic Web is self-evident.

The failure of the Semantic Web is no assurance that any alternative proposal will fare better. Understanding why the Semantic Web is failing is a prerequisite to any successful alternative.

Before you “flame on,” you might want to read the entire series. I end up with a suggestion based on work by Ding, Shinavier, Finin and McGuinness.

The next series starts with Saving the “Semantic” Web (Part 1)

Bill Gates is naive, data is not objective [Neither is Identification]

Tuesday, February 5th, 2013

Bill Gates is naive, data is not objective by Cathy O’Neil.

From the post:

In his recent essay in the Wall Street Journal, Bill Gates proposed to “fix the world’s biggest problems” through “good measurement and a commitment to follow the data.” Sounds great!

Unfortunately it’s not so simple.

Gates describes a positive feedback loop when good data is collected and acted on. It’s hard to argue against this: given perfect data-collection procedures with relevant data, specific models do tend to improve, according to their chosen metrics of success. In fact this is almost tautological.

As I’ll explain, however, rather than focusing on how individual models improve with more data, we need to worry more about which models and which data have been chosen in the first place, why that process is successful when it is, and – most importantly – who gets to decide what data is collected and what models are trained.

Cathy makes a compelling case for data not being objective and concludes:

Don’t be fooled by the mathematical imprimatur: behind every model and every data set is a political process that chose that data and built that model and defined success for that model.

Sounds a lot like identifying subjects.

No identification is objective. They all occur as part of social processes and are bound by those processes.

No identification is “better” than another one, although in some contexts, particular identifications may be more useful than others.

I first saw this in Four short links: 4 February 2013 by Nat Torkington.

G2 | Sensemaking – Two Years Old Today

Sunday, February 3rd, 2013

G2 | Sensemaking – Two Years Old Today by Jeff Jonas.

From the post:

What is G2?

When I speak about Context Accumulation, Data Finds Data and Relevance Finds You, and Sensemaking I am describing various aspects of G2.

In simple terms G2 software is designed to integrate diverse observations (data) as it arrives, in real-time.  G2 does this incrementally, piece by piece, much in the same way you would put a puzzle together at home.  And just like at home, the more puzzle pieces integrated into the puzzle, the more complete the picture.  The more complete the picture, the better the ability to make sense of what has happened in the past, what is happening now, and what may come next.  Users of G2 technology will be more efficient, deliver high quality outcomes, and ultimately will be more competitive.

Early adopters seem to be especially interested in one specific use case: Using G2 to help organizations better direct the attention of its finite workforce.  With the workforce now focusing on the most important things first, G2 is then used to improve the quality of analysis while at the same time reducing the amount of time such analysis takes.  The bigger the organization, the bigger the observation space, the more essential sensemaking is.

One of the things G2 can already do pretty darn well – considering she just turned two years old – is “Sensemaking.” Imagine a system capable of paying very close attention to every observation that comes its way. Each observation incrementally improving upon the picture and using this emerging picture in real-time to make higher quality business decisions; for example, the selection of the perfect ad for a web page (in sub-200 milliseconds as the user navigates to the page) or raising an alarm to a human for inspection (an alarm sufficiently important to be placed top of the queue). G2, when used this way, enables Enterprise Intelligence.

Of course there is no magic.  Sensemaking engines are limited by their available observation space.  If a sentient being would be unable to make sense of the situation based on the available observation space, neither would G2.  I am not talking about Fantasy Analytics here.

I would say “subject identity” instead of “sensemaking” and after reading Jeff’s post, consider them to be synonyms.

Read the section General Purpose Context Accumulation very carefully.

As well as “Privacy by Design (PbD).”

BTW, G2 uses Universal Message Format XML for input/output.

Not to argue from authority but Jeff is one of only 77 active IBM Research Fellows.

Someone to listen to, even if we may disagree on some of the finer points.

Making Sense of Others’ Data Structures

Sunday, February 3rd, 2013

Making Sense of Others’ Data Structures by Eruditio Loginquitas.

From the post:

Coming in as an outsider to others’ research always requires an investment of time and patience. After all, how others conceptualize their fields, and how they structure their questions and their probes, and how they collect information, and then how they represent their data all reflect their understandings, their theoretical and analytical approaches, their professional training, and their interests. When professionals collaborate, they will approach a confluence of understandings and move together in a semi-united way. Individual researchers—not so much. But either way, for an outsider, there will have to be some adjustment to understand the research and data. Professional researchers strive to control for error and noise at every stage of the research: the hypothesis, literature review, design, execution, publishing, and presentation.

Coming into a project after the data has been collected and stored in Excel spreadsheets means that the learning curve is high in yet another way: data structures. While the spreadsheet itself seems pretty constrained and defined, there is no foregone conclusion that people will necessarily represent their data a particular way.

Data structures as subjects. What a concept!

Data structures, contrary to some, are not self-evident or self-documenting.

Not to mention that, like ourselves, they are in a constant state of evolution as our understanding or perception of data changes.

Mine is not the counsel of despair, but of encouragement to consider the costs/benefits of capturing data structure subject identities just as more traditional subjects.

It may be that costs or other constraints prevent such capture, but you may also miss benefits if you don’t ask.

How much did it cost for each transition in episodic data governance efforts to re-establish data structure subject identities?

Could be that more money spent now would get an enterprise off the perpetual cycle of data governance.

New DataCorps Project: Refugees United

Sunday, January 27th, 2013

New DataCorps Project: Refugees United

From the post:

We are thrilled to announce the kick-off of a new DataKind project with Refugees United! Refugees United is a fantastic organization that uses mobile and web technologies to help refugees find their missing loved ones. Currently, RU’s system allows people to post descriptions of their family and friends as well as to search for them on the site. As you might imagine, lots of data flows through this system – data that could be used to greatly improve the way people find each other. Led by the ever-brilliant Max Shron, the DataKind team is collaborating with Refugees United to explore what their data can tell them about how people are using the site, how they’re connecting to one another and, ultimately, how it can be used to help people find each other more effectively.

We are incredibly excited to work on this project and will be posting updates for you all as things unfold. In the meantime, learn a bit more about Max and Refugees United.

I can’t comment on the identity practices because:

Q: 1.08 Why isn’t Refugees United open source yet?

Refugees United was born as an “offline” open source project. When we started, we were two guys (now six guys and a girl in Copenhagen, joined by a much larger team worldwide) with a great idea that had the potential to positively impact thousands, if not millions, of lives. The open source approach came from the fact that we wanted to build the world’s smallest refugee agency with the largest outreach, and to have the highest impact at the lowest cost.

One way to reach our objectives is to work with corporations around the world, including Ericsson, SAP, FedEx and others. The invaluable advice and expertise provided by these successful businesses – both the largest corporations and the smallest companies – have helped us to apply the structure and strategy of business to the passion and vision of an NGO.

Now the time has come for us to apply the same structure to our software, and we have begun to collaborate with some of the wonderfully brilliant minds out there who wish to contribute and help us make a difference in the development of our technologies.

I am not sure what ‘“offline” open source’ means. The rest of the quoted prose doesn’t help.

Perhaps the software will become available online. At some point.

It would be an interesting data point to see how they are managing personal subject identity.

Saturday, January 26th, 2013

From the webpage:

The Advanced Data mining And Machine learning System (ADAMS) is a novel, flexible workflow engine aimed at quickly building and maintaining real-world, complex knowledge workflows.

Same source as WEKA.

What if we think about identification as workflow?

Whatever stability we attribute to an identification is the absence of additional data that would create a change.

Looking backwards over prior identifications, we fit them into the schema of our present identification and that eliminates any movement from the past. The past is fixed and terminates in our present identification.

That view fails to appreciate the world isn’t going to end with any of us individually. The world and its information systems will continue, as will the workflow that defines identifications.

Replacing our identifications with newer ones.

The question we face is whether our actions will support or impede re-use of our identifications in the future.

I first saw Adams Workflow at Nat Torkington’s Four short links: 24 January 2013.

XQuery 3.0: An XML Query Language [Subject Identity Equivalence Language?]

Tuesday, January 15th, 2013

XQuery 3.0: An XML Query Language – W3C Candidate Recommendation

Abstract:

XML is a versatile markup language, capable of labeling the information content of diverse data sources including structured and semi-structured documents, relational databases, and object repositories. A query language that uses the structure of XML intelligently can express queries across all these kinds of data, whether physically stored in XML or viewed as XML via middleware. This specification describes a query language called XQuery, which is designed to be broadly applicable across many types of XML data sources.

Just starting to read the XQuery CR but the thought occurred to me that it could be a basis for a “subject identity equivalence language.”

Rather than duplicating the work on expressions, paths, data types, operators, etc., why not take all that as given?

Suffice it to define a “subject equivalence function,” the variables of which are XQuery statements that identify values (or value expressions) as required, optional or forbidden and the definition of the results of the function.
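To make the idea concrete, here is a minimal sketch in Python. It is not XQuery: the limited path syntax of the standard library's xml.etree.ElementTree stands in for XQuery expressions, and the function name, its parameters and the shape of its result are my assumptions.

```python
import xml.etree.ElementTree as ET


def subject_equivalence(a, b, required=(), optional=(), forbidden=()):
    """Hypothetical equivalence test over two XML elements.

    required:  paths whose values must be present in `a` and equal in `b`.
    forbidden: paths whose values must not be equal when present.
    optional:  paths that never decide the result; matches are reported
               as supporting evidence.
    Returns (equivalent, list_of_matching_optional_paths).
    """
    for path in required:
        value = a.findtext(path)
        if value is None or value != b.findtext(path):
            return False, []
    for path in forbidden:
        value = a.findtext(path)
        if value is not None and value == b.findtext(path):
            return False, []
    support = [
        path for path in optional
        if a.findtext(path) is not None and a.findtext(path) == b.findtext(path)
    ]
    return True, support


planet_a = ET.fromstring(
    "<subject><name>Mercury</name><kind>planet</kind><orbit>88 days</orbit></subject>")
planet_b = ET.fromstring(
    "<subject><name>Mercury</name><kind>planet</kind><orbit>88 days</orbit></subject>")
metal = ET.fromstring(
    "<subject><name>Mercury</name><kind>metal</kind></subject>")

same = subject_equivalence(planet_a, planet_b,
                           required=["name", "kind"], optional=["orbit"])
different = subject_equivalence(planet_a, metal,
                                required=["name", "kind"], optional=["orbit"])
```

A real version would take XQuery expressions in place of the path strings and leave their evaluation to an XQuery processor.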

Reusing a well-tested query language seems preferable to writing an entirely new one from scratch.

Suggestions?

I first saw this in a tweet by Michael Kay.

The future of programming [A Cacophony of Semantic Primitives]

Tuesday, January 15th, 2013

The future of programming by Edd Dumbill.

You need to read Edd’s post on the future of programming in full, but there are two points I would like to pull out for your attention:

1. Expansion of people engaged in programming:

In our age of exploding data, the ability to do some kind of programming is increasingly important to every job, and programming is no longer the sole preserve of an engineering priesthood.

2. Data as first class citizen

As data and its analysis grow in importance, there’s a corresponding rise in use and popularity of languages that treat data as a first class citizen. Obviously, statistical languages such as R are rising on this tide, but within general purpose programming there’s a bias to languages such as Python or Clojure, which make data easier to manipulate.

The most famous occasion when a priesthood lost the power of sole interpretation was the Protestant Reformation.

Although there was already a wide range of interpretations, as the priesthood of believers grew over the centuries, so did the diversity of interpretation and semantics.

Even though there is a wide range of semantics in programming already, the broader participation becomes, the broader the semantics of programming will grow. Not in terms of the formal semantics as defined by language designers but as used by programmers.

Semantics being the province of usage, I am betting on semantics as used being the clear winner.

Data being treated as a first class citizen carries with it the seeds of even more semantic diversity. Data, after all, originates with users and is only meaningful when some user interprets it.

Users are going to “see” data as having the semantics they attribute to it, not the semantics as defined by other programmers or sources.

To use another analogy from religion, the Old Testament/Hebrew Bible can be read in the context of Ancient Near Eastern religions and practices or taken as a day by day calendar from the point of creation. And several variations in between. All relying on the same text.

For decades programmers have pretended programming was based on semantic primitives. Semantic primitives that could be reliably interchanged, albeit sometimes with difficulty, with other systems. But users and their data are shattering the illusion of semantic primitives.

More accurately they are putting other notions of semantic primitives into play.

A cacophony of semantic primitives bodes poorly for a future of distributed, device, data and democratized computing.

Avoidable to the degree that we choose to not silently rely upon others “knowing what we meant.”

I first saw this at The four D’s of programming’s future: data, distributed, device, democratized by David Smith.

Constructing Topological Spaces — A Primer

Friday, November 16th, 2012

Constructing Topological Spaces — A Primer by Jeremy Kun.

From the post:

Last time we investigated the (very unintuitive) concept of a topological space as a set of “points” endowed with a description of which subsets are open. Now in order to actually arrive at a discussion of interesting and useful topological spaces, we need to be able to take simple topological spaces and build them up into more complex ones. This will take the form of subspaces and quotients, and through these we will make rigorous the notion of “gluing” and “building” spaces.

More heavy sledding but pay special attention to the discussion of sets and equivalences.
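For finite examples, the definitions can be tried out directly. The following Python sketch is my own illustration, not code from Jeremy's post: it checks the open-set axioms on a finite set and builds subspace and quotient topologies by brute force. The function names are hypothetical.

```python
from itertools import combinations


def is_topology(points, opens):
    """Check the open-set axioms for a finite family of subsets."""
    points = frozenset(points)
    opens = {frozenset(s) for s in opens}
    if any(not s <= points for s in opens):
        return False
    if frozenset() not in opens or points not in opens:
        return False
    # For a finite family, pairwise unions and intersections suffice.
    return all(u | v in opens and u & v in opens
               for u, v in combinations(opens, 2))


def subspace(opens, subset):
    """Subspace topology: intersect every open set with the subset."""
    subset = frozenset(subset)
    return {frozenset(s) & subset for s in opens}


def quotient(opens, classes):
    """Quotient topology for a partition into equivalence classes.

    A set of classes is open exactly when the union of its members
    is open in the original space.
    """
    opens = {frozenset(s) for s in opens}
    classes = [frozenset(c) for c in classes]
    result = set()
    for size in range(len(classes) + 1):
        for chosen in combinations(classes, size):
            if frozenset().union(*chosen) in opens:
                result.add(frozenset(chosen))
    return result


X = {1, 2, 3}
T = [set(), {1}, {1, 2}, {1, 2, 3}]
glued = quotient(T, [{1}, {2, 3}])  # "glue" points 2 and 3 together
```

The quotient is where equivalences enter: declaring 2 and 3 equivalent leaves a space whose points are the classes {1} and {2, 3}.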

Any personal favorites you would like to add to the list?

Topological Spaces — A Primer

Friday, November 16th, 2012

Topological Spaces — A Primer by Jeremy Kun.

From the post:

In our last primer we looked at a number of interesting examples of metric spaces, that is, spaces in which we can compute distance in a reasonable way. Our goal for this post is to relax this assumption. That is, we want to study the geometric structure of space without the ability to define distance. That is not to say that some notion of distance necessarily exists under the surface somewhere, but rather that we include a whole new class of spaces for which no notion of distance makes sense. Indeed, even when there is a reasonable notion of a metric, we’ll still want to blur the lines as to what kinds of things we consider “the same.”

The reader might wonder how we can say anything about space if we can’t compute distances between things. Indeed, how could it even really be “space” as we know it? The short answer is: the reader shouldn’t think of a topological space as a space in the classical sense. While we will draw pictures and say some very geometric things about topological spaces, the words we use are only inspired by their classical analogues. In fact the general topological space will be a much wilder beast, with properties ranging from absolute complacency to rampant hooliganism. Even so, topological spaces can spring out of every mathematical cranny. They bring at least a loose structure to all sorts of problems, and so studying them is of vast importance.

Just before we continue, we should give a short list of how topological spaces are applied to the real world. In particular, this author is preparing a series of posts dedicated to the topological study of data. That is, we want to study the loose structure of data potentially embedded in a very high-dimensional metric space. But in studying it from a topological perspective, we aim to eliminate the dependence on specific metrics and parameters (which can be awfully constricting, and even impertinent to the overall structure of the data). In addition, topology has been used to study graphics, image analysis and 3D modelling, networks, semantics, protein folding, solving systems of polynomial equations, and loads of topics in physics.

Topology offers an alternative to the fiction of metric distances between the semantics of words. It is a useful fiction, but a fiction nonetheless.

Deep sledding but well worth the time.

Identities and Identifications: Politicized Uses of Collective Identities

Monday, September 17th, 2012

Identities and Identifications: Politicized Uses of Collective Identities

Deadline for Panels 15 January 2013
Deadline for Papers 1 March 2013
Conference 18-20 April 2013, Zagreb, Croatia

From the call for panels and papers:

Identity is one of the crown jewelleries in the kingdom of ‘contested concepts’. The idea of identity is conceived to provide some unity and recognition while it also exists by separation and differentiation. Few concepts were used as much as identity for contradictory purposes. From the fragile individual identities as self-solidifying frameworks to layered in-group identifications in families, orders, organizations, religions, ethnic groups, regions, nation-states, supra-national entities or any other social entities, the idea of identity always shows up in the core of debates and makes everything either too dangerously simple or too complicated. Constructivist and de-constructivist strategies have led to the same result: the eternal return of the topic. Some say we should drop the concept, some say we should keep it and refine it, some say we should look at it in a dynamic fashion while some say it’s the reason for resistance to change.

If identities are socially constructed and not genuine formations, they still hold some responsibility for inclusion/exclusion – self/other nexuses. Looking at identities in a research oriented manner provides explanatory tools for a wide variety of events and social dynamics. Identities reflect the complex nature of human societies and generate reasonable comprehension for processes that cannot be explained by tracing pure rational driven pursuit of interests. The feelings of attachment, belonging, recognition, the processes of values’ formation and norms integration, the logics of appropriateness generated in social organizations are all factors relying on a certain type of identity or identification. Multiple identifications overlap, interact, include or exclude, conflict or enhance cooperation. Identities create boundaries and borders; define the in-group and the out-group, the similar and the excluded, the friend and the threatening, the insider and the ‘other’.

Beyond their dynamic fuzzy nature that escapes exhaustive explanations, identities are effective instruments of politicization of social life. The construction of social forms of organization and of specific social practices together with their imaginary significations requires all the time an essentialist or non-essentialist legitimating act of belonging; a social glue that extracts its cohesive function from the identification of the in-group and the power of naming the other. Identities are political. Multicultural slogans populate extensively the twenty-first century yet the distance between the ideal and the real multiculturalism persists while the virtues of inclusion coexist with the adversity of exclusion. Dealing with the identities means to integrate contestation into contestation until potentially an n degree of contestation. Due to the confusion between identities and identifications some scholars demanded that the concept of identity shall be abandoned. Identitarian issues turned out to be efficient tools for politicization of a ‘constraining dissensus’ while universalizing terms included in the making of the identities usually tend or intend to obscure the localized origins of any identitarian project. Identities are often conceptually used as rather intentional concepts: they don’t say anything about their sphere but rather defining the sphere makes explicit the aim of their usage. It is not ‘identity of’ but ‘identity to’.

Quick! Someone get them a URL! Just teasing.

Enjoy the conference!

Context-Aware Recommender Systems 2012 [Identity and Context?]

Tuesday, September 11th, 2012

Context-Aware Recommender Systems 2012 (In conjunction with the 6th ACM Conference on Recommender Systems (RecSys 2012))

I usually think of recommender systems as attempts to deliver content based on clues about my interests or context. If I dial 911, the location of the nearest pizza vendor probably isn’t high on my list of interests, etc.

As I looked over these proceedings, it occurred to me that subject identity, for merging purposes, isn’t limited to the context of the subject in question.

That is, some merging tests could depend upon my context as a user.

Take my 911 call for instance. For many purposes, a police substation, fire station, 24 hour medical clinic and a hospital are different subjects.

In a medical emergency situation, for which a 911 call might be a clue, all of those could be treated as a single subject – places for immediate medical attention.
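A small Python sketch of a merging test that takes the user's context into account. The subject records, the immediate_medical_attention flag and the context strings are hypothetical; they simply encode the 911 example.

```python
# Hypothetical subjects. The flag records whether a place can offer
# immediate medical attention, following the 911 example.
SUBJECTS = [
    {"name": "police substation", "immediate_medical_attention": True},
    {"name": "fire station", "immediate_medical_attention": True},
    {"name": "24 hour medical clinic", "immediate_medical_attention": True},
    {"name": "hospital", "immediate_medical_attention": True},
    {"name": "pizza vendor", "immediate_medical_attention": False},
]


def merge_key(subject, context):
    """Pick the identity a subject merges under for a given user context."""
    if context == "medical emergency" and subject["immediate_medical_attention"]:
        return "place for immediate medical attention"
    return subject["name"]


def merge(subjects, context):
    """Group subject names by their context-dependent merge key."""
    groups = {}
    for subject in subjects:
        groups.setdefault(merge_key(subject, context), []).append(subject["name"])
    return groups


everyday = merge(SUBJECTS, "everyday")
emergency = merge(SUBJECTS, "medical emergency")
```

In the everyday context the five subjects stay separate. In the emergency context four of them merge into one, and the pizza vendor stays on its own.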

What other subjects do you think might merge (or not) depending upon your context?

‘The Algorithm That Runs the World’ [Optimization, Identity and Polytopes]

Tuesday, August 28th, 2012

“The Algorithm That Runs the World” by Erwin Gianchandani.

From the post:

New Scientist published a great story last week describing the history and evolution of the simplex algorithm — complete with a table capturing “2000 years of algorithms”:

The simplex algorithm directs wares to their destinations the world over [image courtesy PlainPicture/Gozooma via New Scientist].

Its services are called upon thousands of times a second to ensure the world’s business runs smoothly — but are its mathematics as dependable as we thought?

YOU MIGHT not have heard of the algorithm that runs the world. Few people have, though it can determine much that goes on in our day-to-day lives: the food we have to eat, our schedule at work, when the train will come to take us there. Somewhere, in some server basement right now, it is probably working on some aspect of your life tomorrow, next week, in a year’s time.

Perhaps ignorance of the algorithm’s workings is bliss. The door to Plato’s Academy in ancient Athens is said to have borne the legend “let no one ignorant of geometry enter”. That was easy enough to say back then, when geometry was firmly grounded in the three dimensions of space our brains were built to cope with. But the algorithm operates in altogether higher planes. Four, five, thousands or even many millions of dimensions: these are the unimaginable spaces the algorithm’s series of mathematical instructions was devised to probe.

Perhaps, though, we should try a little harder to get our heads round it. Because powerful though it undoubtedly is, the algorithm is running into a spot of bother. Its mathematical underpinnings, though not yet structurally unsound, are beginning to crumble at the edges. With so much resting on it, the algorithm may not be quite as dependable as it once seemed [more following the link].

A fund manager might similarly want to arrange a portfolio optimally to balance risk and expected return over a range of stocks; a railway timetabler to decide how best to roster staff and trains; or a factory or hospital manager to work out how to juggle finite machine resources or ward space. Each such problem can be depicted as a geometrical shape whose number of dimensions is the number of variables in the problem, and whose boundaries are delineated by whatever constraints there are (see diagram). In each case, we need to box our way through this polytope towards its optimal point.

This is the job of the algorithm.

Its full name is the simplex algorithm, and it emerged in the late 1940s from the work of the US mathematician George Dantzig, who had spent the second world war investigating ways to increase the logistical efficiency of the U.S. air force. Dantzig was a pioneer in the field of what he called linear programming, which uses the mathematics of multidimensional polytopes to solve optimisation problems. One of the first insights he arrived at was that the optimum value of the “target function” — the thing we want to maximise or minimise, be that profit, travelling time or whatever — is guaranteed to lie at one of the corners of the polytope. This instantly makes things much more tractable: there are infinitely many points within any polytope, but only ever a finite number of corners.

If we have just a few dimensions and constraints to play with, this fact is all we need. We can feel our way along the edges of the polytope, testing the value of the target function at every corner until we find its sweet spot. But things rapidly escalate. Even just a 10-dimensional problem with 50 constraints — perhaps trying to assign a schedule of work to 10 people with different expertise and time constraints — may already land us with several billion corners to try out.
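The "test every corner" idea in the quote can be sketched in a few lines for a two-dimensional toy problem. To be clear, this is brute-force corner enumeration, not the simplex algorithm itself, and the constraints and objective below are invented for illustration.

```python
from itertools import combinations
from math import comb

# Toy problem (made up): maximise 3x + 2y subject to
#   x + y <= 4,  x + 3y <= 9,  x <= 3,  x >= 0,  y >= 0.
# Each constraint is stored as (a, b, c), meaning a*x + b*y <= c.
CONSTRAINTS = [(1, 1, 4), (1, 3, 9), (1, 0, 3), (-1, 0, 0), (0, -1, 0)]


def objective(x, y):
    """The "target function" we want to maximise."""
    return 3 * x + 2 * y


def corners(constraints, eps=1e-9):
    """Yield every feasible intersection of two constraint boundaries."""
    for (a1, b1, c1), (a2, b2, c2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < eps:
            continue  # parallel boundaries never meet
        # Cramer's rule for the 2x2 system of the two boundary lines.
        x = (c1 * b2 - c2 * b1) / det
        y = (a1 * c2 - a2 * c1) / det
        if all(a * x + b * y <= c + eps for a, b, c in constraints):
            yield (x, y)


best = max(corners(CONSTRAINTS), key=lambda p: objective(*p))

# Upper bound on corners for 50 constraints in 10 dimensions:
# each corner is fixed by choosing 10 of the 50 constraints to be tight.
corner_bound = comb(50, 10)
```

With five constraints in two dimensions there are only ten pairs to try. The same counting argument for 50 constraints in 10 dimensions gives `comb(50, 10)` candidate corners, an upper bound that is in line with the "several billion corners" in the quote.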

Apologies but I saw this article too late to post within the “free” days allowed by New Scientist.

But I think from Erwin’s post and its long quote from the original article, you can see how the simplex algorithm may be very useful where identity is defined in a multidimensional space.

The literature in this area is vast and it may not offer an appropriate test for all questions of subject identity.

For example, the possessor of a credit card is presumed to be the owner of the card. Other assumptions are possible, but fraud costs are recouped from fees paid by customers, which creates a lack of interest in more stringent identity tests.

On the other hand, if your situation requires multidimensional identity measures, this may be a useful approach.

PS: Be aware that naming confusion, the sort that can be managed (not solved) by topic maps, abounds even in mathematics:

The elements of a polytope are its vertices, edges, faces, cells and so on. The terminology for these is not entirely consistent across different authors. To give just a few examples: Some authors use face to refer to an (n−1)-dimensional element while others use face to denote a 2-face specifically, and others use j-face or k-face to indicate an element of j or k dimensions. Some sources use edge to refer to a ridge, while H. S. M. Coxeter uses cell to denote an (n−1)-dimensional element. (Polytope)

Modern Shape-Shifters

Monday, July 9th, 2012

Someday, in the not too distant future, you will be able to tell your grandchildren about fixed data structures and values. How queries returned the results imagined by the architects of data systems. Back in the old days of “small data.”

Quite different from the scene imagined in Sifting Through a Trillion Electrons:

Because FastQuery is built on the FastBit bitmap indexing technology, Byna notes that researchers can search their data based on an arbitrary range of conditions that is defined by available data values. This essentially means that a researcher can now feasibly search a trillion particle dataset and sift out electrons by their energy values.

Researchers, not data architects, get to decide on the questions to pose.
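For readers who have not met bitmap indexes, here is a deliberately simplified sketch of the idea behind range queries over one. It is not FastBit or FastQuery code; the function names and the sample energy values are my own assumptions, and real systems work on binned and far larger data.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """One bitmap (a Python int) per distinct value; bit i is set when row i holds it."""
    index = defaultdict(int)
    for row, value in enumerate(values):
        index[value] |= 1 << row
    return dict(index)


def range_query(index, low, high):
    """Return the row numbers whose value v satisfies low <= v <= high."""
    hits = 0
    for value, bitmap in index.items():
        if low <= value <= high:
            hits |= bitmap  # OR together the bitmaps of qualifying values
    return [row for row in range(hits.bit_length()) if (hits >> row) & 1]


# Made-up "energy" readings, one per particle (row).
energies = [3, 7, 1, 7, 9, 2]
energy_index = build_bitmap_index(energies)
high_energy_rows = range_query(energy_index, 5, 9)
```

The point for our purposes: the index is built once, and the range (the question) is chosen at query time by whoever is asking.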

Not hard to imagine that “small data” experiments too will be making their data available. In a variety of forms and formats.

Are you ready to consolidate those data sources based on your identification of subjects? Subjects both in content and in formalisms/structure?

To have data that shifts its shape depending upon the demands upon it?

Will you be a master of modern shape-shifters?

PS: Do read the “Trillion Electron” piece. A view of this year’s data processing options. Likely to be succeeded by technology X in the next year or so if the past is any guide.

The observational roots of reference of the semantic web

Sunday, July 1st, 2012

The observational roots of reference of the semantic web by Simon Scheider, Krzysztof Janowicz, and Benjamin Adams.

Abstract:

Shared reference is an essential aspect of meaning. It is also indispensable for the semantic web, since it enables to weave the global graph, i.e., it allows different users to contribute to an identical referent. For example, an essential kind of referent is a geographic place, to which users may contribute observations. We argue for a human-centric, operational approach towards reference, based on respective human competences. These competences encompass perceptual, cognitive as well as technical ones, and together they allow humans to inter-subjectively refer to a phenomenon in their environment. The technology stack of the semantic web should be extended by such operations. This would allow establishing new kinds of observation-based reference systems that help constrain and integrate the semantic web bottom-up.

In arguing for recasting the problem of semantics as one of reference, the authors say:

Reference systems. Solutions to the problem of reference should transgress syntax as well as technology. They cannot solely rely on computers but must also rely on human referential competences. This requirement is met by reference systems [22]. Reference systems are different from ontologies in that they constrain meaning bottom-up [11]. Most importantly, they are not “yet another chimera” invented by ontology engineers, but already exist in various successful variants.

I rather like the “human referential competences….”

After all, useful semantic systems are about references that we recognize.

SkyQuery: …Parallel Probabilistic Join Engine… [When Static Mapping Isn't Enough]

Sunday, July 1st, 2012

SkyQuery: An Implementation of a Parallel Probabilistic Join Engine for Cross-Identification of Multiple Astronomical Databases by László Dobos, Tamás Budavári, Nolan Li, Alexander S. Szalay, and István Csabai.

Abstract:

Multi-wavelength astronomical studies require cross-identification of detections of the same celestial objects in multiple catalogs based on spherical coordinates and other properties. Because of the large data volumes and spherical geometry, the symmetric N-way association of astronomical detections is a computationally intensive problem, even when sophisticated indexing schemes are used to exclude obviously false candidates. Legacy astronomical catalogs already contain detections of more than a hundred million objects while the ongoing and future surveys will produce catalogs of billions of objects with multiple detections of each at different times. The varying statistical error of position measurements, moving and extended objects, and other physical properties make it necessary to perform the cross-identification using a mathematically correct, proper Bayesian probabilistic algorithm, capable of including various priors. One time, pair-wise cross-identification of these large catalogs is not sufficient for many astronomical scenarios. Consequently, a novel system is necessary that can cross-identify multiple catalogs on-demand, efficiently and reliably. In this paper, we present our solution based on a cluster of commodity servers and ordinary relational databases. The cross-identification problems are formulated in a language based on SQL, but extended with special clauses. These special queries are partitioned spatially by coordinate ranges and compiled into a complex workflow of ordinary SQL queries. Workflows are then executed in a parallel framework using a cluster of servers hosting identical mirrors of the same data sets.

Astronomy is a cool area to study and has data out the wazoo, but I was struck by:

One time, pair-wise cross-identification of these large catalogs is not sufficient for many astronomical scenarios.

Is identity with sharp edges, susceptible to pair-wise mapping, the common case?

Or do we just see some identity issues that way?

Commend the paper to you as an example of dynamic merging practice.
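For contrast, here is what identity "with sharp edges" looks like: a naive pair-wise cross-match that declares two detections the same object whenever they fall within a fixed angular radius. This is explicitly not the paper's Bayesian method, just a baseline sketch; the catalogs, identifiers and the one arcsecond radius are invented.

```python
from math import asin, cos, degrees, radians, sin, sqrt


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(radians, (ra1, dec1, ra2, dec2))
    h = sin((dec2 - dec1) / 2) ** 2 + cos(dec1) * cos(dec2) * sin((ra2 - ra1) / 2) ** 2
    return degrees(2 * asin(min(1.0, sqrt(h))))


def cross_match(catalog_a, catalog_b, radius_arcsec=1.0):
    """Pair detections from two catalogs that lie within a fixed radius."""
    radius_deg = radius_arcsec / 3600.0
    return [
        (id_a, id_b)
        for id_a, (ra_a, dec_a) in catalog_a.items()
        for id_b, (ra_b, dec_b) in catalog_b.items()
        if angular_separation_deg(ra_a, dec_a, ra_b, dec_b) <= radius_deg
    ]


# Hypothetical catalogs: identifier -> (right ascension, declination) in degrees.
catalog_a = {"A1": (10.0, 20.0), "A2": (150.0, -30.0)}
catalog_b = {"B1": (10.0001, 20.0), "B2": (200.0, 5.0)}
matches = cross_match(catalog_a, catalog_b)
```

A hard radius ignores the varying measurement errors, moving objects and priors that the abstract lists, which is exactly why the authors argue a one-time mapping of this kind is not sufficient.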

Happy Go Lucky Identification/Merging?

Tuesday, May 22nd, 2012

From the post:

Two years ago, Martin Rinard’s group at MIT’s Computer Science and Artificial Intelligence Laboratory proposed a surprisingly simple way to make some computer procedures more efficient: Just skip a bunch of steps. Although the researchers demonstrated several practical applications of the technique, dubbed loop perforation, they realized it would be a hard sell. “The main impediment to adoption of this technique,” Imperial College London’s Cristian Cadar commented at the time, “is that developers are reluctant to adopt a technique where they don’t exactly understand what it does to the program.”

I like that for making topic maps scale, “…skip a bunch of steps….”

Topic maps, the semantic web and similar semantic ventures are erring on the side of accuracy.

We are often mistaken about facts, faces, identifications in semantic terminology.

Why think we can build programs or machines that can do better?

Let’s stop rolling the identification stone up the hill.

Ask “how accurate does the identification/merging need to be?”

The answer for aiming a missile is probably different from the answer for sorting emails in a discovery process.
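As a toy illustration of “skip a bunch of steps,” here is loop perforation applied to computing a mean. The functions and the stride of 4 are my own choices for the sketch, not code from Rinard’s group.

```python
def mean_exact(values):
    """Visit every element."""
    total = 0.0
    for value in values:
        total += value
    return total / len(values)


def mean_perforated(values, stride=4):
    """Approximate the mean by running only every `stride`-th loop iteration."""
    total = 0.0
    count = 0
    for i in range(0, len(values), stride):  # the perforated loop
        total += values[i]
        count += 1
    return total / count


data = list(range(1000))
exact = mean_exact(data)
approximate = mean_perforated(data)
```

The perforated version does a quarter of the work and lands within a couple of units of the exact answer on this data. Whether that error is acceptable is the “how accurate does it need to be?” question.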

Proving Acceptability Properties of Relaxed Nondeterministic Approximate Programs Michael Carbin, Deokhwan Kim, Sasa Misailovic, and Martin Rinard, Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2012) Beijing, China June 2012

Who Do You Say You Are?

Friday, May 11th, 2012

In Data Governance in Context, Jim Ericson outlines several paths of data governance, or as I put it: Who Do You Say You Are?:

On one path, more enterprises are dead serious about creating and using data they can trust and verify. It’s a simple equation. Data that isn’t properly owned and operated can’t be used for regulatory work, won’t be trusted to make significant business decisions and will never have the value organizations keep wanting to ascribe it on the balance sheet. We now know instinctively that with correct and thorough information, we can jump on opportunities, unite our understanding and steer the business better than before.

On a similar path, we embrace tested data in the marketplace (see Experian, D&B, etc.) that is trusted for a use case even if it does not conform to internal standards. Nothing wrong with that either.

And on yet another path (and areas between) it’s exploration and discovery of data that might engage huge general samples of data with imprecise value.

It’s clear that we cannot and won’t have the same governance standards for all the different data now facing an enterprise.

For starters, crowd sourced and third party data bring a new dimension, because “fitness for purpose” is by definition a relative term. You don’t need or want the same standard for how many thousands or millions of visitors used a website feature or clicked on a bundle in the way you maintain your customer or financial info.

Do mortgage-backed securities fall into the “…huge general samples of data with imprecise value?” I ask because I don’t work in the financial industry. Or do they not practice data governance, except to generate numbers for the auditors?

I mention this because I suspect that subject identity governance would be equally useful for topic map authoring.

Some topic maps, say on drug trials, need to have a high degree of reliability and auditability, as well as precise identification (even if double-blind) of the test subjects.

Or there may be different tests for subject identity, some of which appear to be less precise than others.

For example, merging all the topics entered by a particular operator in a day to look for patterns that may indicate they are not following data entry protocols. (It is hard to be as random as real data.)

As with most issues, there isn’t any hard and fast rule that works for all cases. You do need to document the rules you are following and for how long. It will help you test old rules and to formulate new ones. (“Document” meaning to write down. The vagaries of memory are insufficient.)

20 More Reasons You Need Topic Maps

Thursday, May 3rd, 2012

Well, Ed Lindsey did call his column 20 Common Data Errors and Variation but when you see the PNG of the 20 errors, here, you will agree my title works better (for topic maps anyway).

Not only that, but Ed’s opening paragraphs work for identifying a subject by more than one attribute (although this is “subject” in the police sense of the word):

A good friend of mine’s husband is a sergeant on the Chicago police force. Recently a crime was committed and a witness insisted that the perpetrator was a woman with blond hair about five nine weighing 160 pounds. She was wearing a gray pinstriped business suit with an Armani scarf and carrying a Gucci handbag.

So what does this sergeant have to do? Start looking at the women of Chicago. He only needs the women. Actually, he would start with women with blond hair (but judging from my daughter’s constant change of hair color he might skip that attribute). So he might start with women in a certain height range and in a certain weight group. He would bring those women in to the station for questioning.

As it turns out, when they finally arrested the woman at her son’s soccer game, she had brown hair, was 5’5″ tall and weighed 120 pounds. She was wearing an Oklahoma University sweatshirt, jeans and sneakers. When the original witness saw her she said yes that’s the same woman. It turns out she was wearing four inch heels and the pantsuit made her look bigger.

So what can we learn from this episode that has to do with matching? Well the first thing we need to understand is that each of the attributes of the witness can be used in matching the suspect and then immediately we must also recognize that not all the attributes that the witness gave the sergeant were extremely accurate. So later on when we start talking about matching, will use the term fuzzy matching. This means that when you look at an address, there could be a number of different types of errors in the address from one system that are not identical to an address in another system. Figure 1 shows a number of the common errors that can happen.
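Ed’s address example is easy to try out. Below is a minimal sketch of fuzzy matching using `difflib.SequenceMatcher` from the Python standard library. The normalization step, the 0.8 threshold and the sample addresses are assumptions of mine, not anything from Ed’s column.

```python
from difflib import SequenceMatcher


def normalize(address):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    kept = "".join(ch for ch in address.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 for two normalized addresses."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_fuzzy_match(a, b, threshold=0.8):
    """Treat two addresses as the same when they are similar enough."""
    return similarity(a, b) >= threshold
```

With these settings, `is_fuzzy_match("123 Main St.", "123 main street")` is true, while `"123 Main St."` against `"742 Evergreen Terrace"` is not. Where to put the threshold is a judgment call that depends on what a false merge costs you.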

So, there you have it: 20 more reasons to use topic maps, a lesson on identifying a subject and proof that yes, a pinstriped pantsuit can make you look bigger.

Data Management is Based on Philosophy, Not Science

Tuesday, May 1st, 2012

Data Management is Based on Philosophy, Not Science by Malcolm Chisholm.

From the post:

There’s a joke running around on Twitter that the definition of a data scientist is “a data analyst who lives in California.” I’m sure the good natured folks of the Golden State will not object to me bringing this up to make a point. The point is: Thinking purely in terms of marketing, which is a better title — data scientist or data philosopher?

My instincts tell me there is no contest. The term data scientist conjures up an image of a tense, driven individual, surrounded by complex technology in a laboratory somewhere, wrestling valuable secrets out of the strange substance called data. By contrast, the term data philosopher brings to mind a pipe-smoking elderly gentleman sitting in a winged chair in some dusty recess of academia where he occasionally engages in meaningless word games with like-minded individuals.

These stereotypes are obviously crude, but they are probably what would come into the minds of most executive managers. Yet how true are they? I submit that there is a strong case that data management is much more like applied philosophy than it is like applied science.

Applied philosophy. I like that!

You know where I am going to come out on this issue so I won’t belabor it.

Semantically Diverse Christenings

Sunday, April 29th, 2012

Mark Liberman in Neutral Xi_b^star, Xi(b)^{*0}, Ξb*0, whatever at Language Log reports semantically diverse christenings of the same new subatomic particle.

I count eight or nine distinct names in Liberman’s report.

How many do you see?

This is just days after its discovery at CERN.

Largely in the scientific literature. (It will get far worse if you include non-technical literature. Is non-technical literature/discussion relevant?)

Question for science librarians:

How many names for this new subatomic particle will you use in searches?

Akaros – an open source operating system for manycore architectures

Saturday, April 28th, 2012

Akaros – an open source operating system for manycore architectures

From the post:

If you are interested in future forward OS designs then you might find Akaros worth a look. It’s an operating system designed for many-core architectures and large-scale SMP systems, with the goals of:

• Providing better support for parallel and high-performance applications
• Scaling the operating system to a large number of cores

A more in-depth explanation of the motivation behind Akaros can be found in Improving Per-Node Efficiency in the Datacenter with New OS Abstractions by Barret Rhoden, Kevin Klues, David Zhu, and Eric Brewer.

From the paper abstract:

Traditional operating system abstractions are ill-suited for high performance and parallel applications, especially on large-scale SMP and many-core architectures. We propose four key ideas that help to overcome these limitations. These ideas are built on a philosophy of exposing as much information to applications as possible and giving them the tools necessary to take advantage of that information to run more efficiently. In short, high-performance applications need to be able to peer through layers of virtualization in the software stack to optimize their behavior. We explore abstractions based on these ideas and discuss how we build them in the context of a new operating system called Akaros.

Rather than “layers of virtualization” I would say: “layers of identifiable subjects.” That’s hardly surprising but it has implications for this paper and future successors on the same issue.

Issues of inefficiency aren’t due to a lack of programming talent, as the authors ably demonstrate, but rather the limitations placed upon that talent by the subjects our operating systems identify and permit to be addressed.

The paper is an exercise in identifying different subjects than those identified in contemporary operating systems. That abstraction may assist future researchers in positing different subjects for identification and consequences that flow from identifying different subjects.

Making Search Hard(er)

Friday, April 27th, 2012

Rafael Maia posts:

first R, now Julia… are programmers trying on purpose to come up with names for their languages that make it hard to google for info?

I don’t know that two cases prove that programmers are responsible for all the semantic confusion in the world.

A search for FORTRAN produces FORTRAN Formula Translation/Translator.

But compare COBOL:

COBOL Completely Over and Beyond Obvious Logic
COBOL Compiles Only By Odd Luck
COBOL Completely Obsolete Burdensome Old Language

There may be something to programmers peeing in the semantic pool.

On the other hand, there are examples prior to programming of semantic overloading of strings.

Here is an interesting question:

Does your answer impact how you would build a search engine? Why/why not?

Just what do you mean by “number”?

Wednesday, April 25th, 2012

Just what do you mean by “number”?

John D. Cook writes:

Tom Christiansen gave an awesome answer to the question of how to match a number with a regular expression. He begins by clarifying what the reader means by “number”, then gives answers for each.

An eighteen (18) question subset of all the questions about what is meant by “number.”

The simplicity of identity is a polite veneer over boundless complexity.

Most of the time the complexity can remain hidden. Most of the time.

Other times, we are hobbled if our information systems keep us from peeking.

Peeping Tom takes us on an abbreviated tour around the identity of “number.”
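To get a feel for the problem, here is a small sketch in which each regular expression encodes one possible meaning of “number.” The five patterns are my own illustrative selection, not the list from Tom’s answer.

```python
import re

# Each pattern is one candidate definition of "number".
PATTERNS = {
    "natural": re.compile(r"\d+", re.ASCII),
    "integer": re.compile(r"[+-]?\d+", re.ASCII),
    "decimal": re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)", re.ASCII),
    "float": re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?", re.ASCII),
    # Without re.ASCII, \d matches any Unicode decimal digit.
    "unicode_digits": re.compile(r"\d+"),
}


def meanings(text):
    """Return the names of every pattern that matches the whole string."""
    return [name for name, pattern in PATTERNS.items() if pattern.fullmatch(text)]
```

`meanings("42")` reports all five definitions, `meanings("1e10")` only `float`, and `meanings("\u0663\u0664")` (two Arabic-Indic digits) only `unicode_digits`. The string is the same kind of thing each time; what changes is which identity test we are willing to apply.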

Past, Present and Future – The Quest to be Understood

Friday, April 20th, 2012

Without restricting it to being machine readable, I think we would all agree there are three ages of data:

1. Past data
2. Present data
3. Future data

And we have common goals for data (or parts of it):

1. Past data – To understand past data.
2. Present data – To be understood by others.
3. Future data – For our present data to persist and be understood by its future users.

Common to those ages and goals is the need for management of identifiers for our data. (Where identifiers may be data as well.)

I say “management of identifiers” because we cannot control identifiers used in the past, identifiers used by others in the present, or identifiers that may be used in the future.

You would think in an obviously multi-lingual world that multiple identifier identification would be the default position.

Just a personal observation but hardly a day passes without someone or some group saying the equivalent of:

I know! I will create a list of identifiers that everyone must use! That’s the answer to the confusion (Babel) of identifiers.

Such efforts are always defeated by past identifiers, other identifiers in the present and future identifiers.

Managing tides of identifiers is a partial solution but more workable than trying to stop the tide.
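Here is a minimal sketch of what “managing” rather than dictating identifiers might look like: a registry that records which identifiers name the same subject and treats none of them as the required one. The class and method names are hypothetical, invented for this illustration.

```python
class IdentifierRegistry:
    """Track sets of identifiers that name the same subject."""

    def __init__(self):
        self._groups = {}  # identifier -> shared set of all its aliases

    def declare_same(self, *identifiers):
        """Record that all the given identifiers name one subject."""
        merged = set(identifiers)
        for identifier in identifiers:
            merged |= self._groups.get(identifier, set())
        for identifier in merged:
            self._groups[identifier] = merged

    def aliases(self, identifier):
        """All identifiers known for the same subject, including this one."""
        return set(self._groups.get(identifier, {identifier}))

    def same_subject(self, a, b):
        return b in self.aliases(a)


registry = IdentifierRegistry()
registry.declare_same("salinity", "saltiness")   # two identifiers already in use
registry.declare_same("saltiness", "salinité")   # a later, French addition
```

New identifiers can be added whenever they turn up, from the past, from other people, or in the future, and a lookup by any of them reaches the same subject.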

What do you think?

Standardizing Federal Transparency

Friday, April 20th, 2012

Standardizing Federal Transparency

From the post:

A new federal data transparency coalition is pushing for standardization of government documents and support for legislation on public records disclosures, taxpayer spending and business identification codes.

The Data Transparency Coalition announced its official launch Monday, vowing nonpartisan work with Congress and the Executive Branch on ventures toward digital publishing of government documents in standardized and integrated formats. As part of that effort, the coalition expressed its support of legislative proposals such as: the Digital Accountability and Transparency Act, which would open public spending records published on a single digital format; the Public Information Online Act, which pushes for all records to be released digitally in a machine-readable format; and the Legal Entity Identifier proposal, creating a standard ID code for companies.

The 14 founding members include vendors Microsoft, Teradata, MarkLogic, Rivet Software, Level One Technologies and Synteractive, as well as the Maryland Association of CPAs, financial advisory BrightScope, and data mining and pattern discovery consultancy Elder Research. The coalition board of advisors includes former U.S. Deputy CTO Beth Noveck, data and information services investment firm partner Eric Gillespie and former Recovery Accountability and Transparency Board Chairman Earl E. Devaney.

Data Transparency Coalition Executive Director Hudson Hollister, a former counsel for the House of Representatives and U.S. Securities and Exchange Commission, noted that when the federal government does electronically publish public documents it “often fails to adopt consistent machine-readable identifiers or uniform markup languages.”

Sounds like an opportunity for both the markup and semantic identity communities, topic maps in particular.

My reasoning: not only will there need to be mappings between vocabularies and entities, but also between “uniform markup languages” as they evolve and develop.

New Paper: Linked Data Strategy for Global Identity

Wednesday, April 4th, 2012

New Paper: Linked Data Strategy for Global Identity

Angela Guess writes:

Hugh Glaser and Harry Halpin have published a new PhD thesis for the University of Southampton Research Repository entitled “The Linked Data Strategy for Global Identity” (2012). The paper was published by the IEEE Computer Society. It is available for download here for non-commercial research purposes only. The abstract states, “The Web’s promise for planet-scale data integration depends on solving the thorny problem of identity: given one or more possible identifiers, how can we determine whether they refer to the same or different things? Here, the authors discuss various ways to deal with the identity problem in the context of linked data.”

At first I was hurt that I didn’t see a copy of Harry’s dissertation before it was published. I don’t always agree with him (see below) but I do like keeping up with his writing.

Then I discovered this is a four-page dissertation. I guess Angela never got past the cover page. It is an article in the IEEE zine, IEEE Internet Computing.

Harry fails to mention that the HTTP 303 “trick” was made necessary by Tim Berners-Lee’s failure to understand the necessity to distinguish identifiers from addresses. Rather than admit to or correct that failure, the solution being pushed is to create web traffic overhead in the form of 303 “tricks.” “303” should be renamed “TBL”, so we are reminded with each invocation who made it necessary. (lower middle column, page 3)

I partially agree with:

We’re only just beginning to explore the vast field of identity, and more work is needed before linked data can fulfill its full potential. (on page 5)

The “just beginning” part is true enough. But therein lies the rub. Rather than explore the “…vast field of identity…” which changes from domain to domain, first and then propose a solution, the Linked Data proponents took the other path.

They proposed a solution and, in the face of its failure to work, are now inching towards the “…vast field of identity….” Seems a mite late for that.

Harry concludes:

The entire bet of the linked data enterprise critically rests on using URIs to create identities for everything. Whether this succeeds might very well determine whether information integration will be trapped in centralized proprietary databases or integrated globally in a decentralized manner with open standards. Given the tremendous amount of data being created and the Web’s ubiquitous nature, URIs and equivalence links might be the best chance we have of solving the identity problem, transforming a profoundly difficult philosophical issue into a concrete engineering project.

The first line, “The entire bet….” omits to say that we need the same URIs for everything. That is called the perfect language project, which has a very long history of consistent failure. Recent attempts include Esperanto and Loglan.

The second line, “Whether this succeeds…trapped in centralized proprietary databases…” is fear mongering. “If you don’t support linked data, (insert your nightmare scenario).”

The final line, “…transforming a profoundly difficult philosophical issue into a concrete engineering project” is magical thinking.

Identity is a very troubled philosophical issue but proposing a solution without understanding the problem doesn’t sound like a high percentage shot to me. You?