Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

May 3, 2012

20 More Reasons You Need Topic Maps

Filed under: Identification,Identifiers,Identity,Marketing,Topic Maps — Patrick Durusau @ 6:23 pm

Well, Ed Lindsey did call his column 20 Common Data Errors and Variation, but when you see the PNG of the 20 errors, here, you will agree my title works better (for topic maps anyway).

Not only that, but Ed’s opening paragraphs work for identifying a subject by more than one attribute (although this is “subject” in the police sense of the word):

A good friend of mine’s husband is a sergeant on the Chicago police force. Recently a crime was committed and a witness insisted that the perpetrator was a woman with blond hair, about five nine, weighing 160 pounds. She was wearing a gray pinstriped business suit with an Armani scarf and carrying a Gucci handbag.

So what does this sergeant have to do? Start looking at the women of Chicago. He only needs the women. Actually, he would start with women with blond hair (but judging from my daughter’s constant change of hair color he might skip that attribute). So he might start with women in a certain height range and in a certain weight group. He would bring those women into the station for questioning.

As it turns out, when they finally arrested the woman at her son’s soccer game, she had brown hair, was 5’5″ tall and weighed 120 pounds. She was wearing an Oklahoma University sweatshirt, jeans and sneakers. When the original witness saw her, she said yes, that’s the same woman. It turns out she was wearing four-inch heels and the pantsuit made her look bigger.

So what can we learn from this episode that has to do with matching? Well, the first thing we need to understand is that each of the attributes the witness gave can be used in matching the suspect, and then immediately we must also recognize that not all the attributes the witness gave the sergeant were accurate. So later on, when we start talking about matching, we will use the term fuzzy matching. This means that when you look at an address, there could be a number of different types of errors that make the address in one system not identical to the same address in another system. Figure 1 shows a number of the common errors that can happen.
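
To make “fuzzy matching” concrete, here is a minimal sketch in Python, using only the standard library. The normalization and the 0.8 threshold are my illustrative choices, not anything from Ed’s column:

```python
# A minimal sketch of fuzzy address matching using only the Python
# standard library. Normalization and threshold are illustrative choices.
from difflib import SequenceMatcher

def normalize(address: str) -> str:
    """Lower-case and collapse whitespace so trivial differences don't count."""
    return " ".join(address.lower().split())

def fuzzy_match(addr_a: str, addr_b: str, threshold: float = 0.8) -> bool:
    """True if two address strings are 'close enough' to denote one address."""
    score = SequenceMatcher(None, normalize(addr_a), normalize(addr_b)).ratio()
    return score >= threshold

# Two records that are not identical but probably denote the same address:
print(fuzzy_match("123 N. Main Street, Apt 4", "123 North Main St, Apt 4"))
```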

So, there you have it: 20 more reasons to use topic maps, a lesson on identifying a subject and proof that yes, a pinstriped pantsuit can make you look bigger.

May 1, 2012

Data Management is Based on Philosophy, Not Science

Filed under: Data Management,Identity,Philosophy — Patrick Durusau @ 4:46 pm

Data Management is Based on Philosophy, Not Science by Malcolm Chisholm.

From the post:

There’s a joke running around on Twitter that the definition of a data scientist is “a data analyst who lives in California.” I’m sure the good-natured folks of the Golden State will not object to me bringing this up to make a point. The point is: Thinking purely in terms of marketing, which is a better title — data scientist or data philosopher?

My instincts tell me there is no contest. The term data scientist conjures up an image of a tense, driven individual, surrounded by complex technology in a laboratory somewhere, wrestling valuable secrets out of the strange substance called data. By contrast, the term data philosopher brings to mind a pipe-smoking elderly gentleman sitting in a winged chair in some dusty recess of academia where he occasionally engages in meaningless word games with like-minded individuals.

These stereotypes are obviously crude, but they are probably what would come into the minds of most executive managers. Yet how true are they? I submit that there is a strong case that data management is much more like applied philosophy than it is like applied science.

Applied philosophy. I like that!

You know where I am going to come out on this issue so I won’t belabor it.

Enjoy reading Malcolm’s post!

April 29, 2012

Semantically Diverse Christenings

Filed under: Identity,Names,Semantic Diversity — Patrick Durusau @ 12:09 pm

Mark Liberman, in “Neutral Xi_b^star, Xi(b)^{*0}, Ξb*0, whatever” at Language Log, reports semantically diverse christenings of the same new subatomic particle.

I count eight or nine distinct names in Liberman’s report.

How many do you see?

This is just days after its discovery at CERN, and largely in the scientific literature. (It will get far worse if you include non-technical literature. Is non-technical literature/discussion relevant?)

Question for science librarians:

How many names for this new subatomic particle will you use in searches?
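
One low-tech answer is query expansion: keep a mapping from the subject to every recorded name and search on the union. A minimal sketch (the variant list is illustrative, not a complete inventory of Liberman’s report):

```python
# A sketch of coping with semantically diverse christenings by query
# expansion: one subject, many recorded names, one OR query.
NAME_VARIANTS = {
    "neutral Xi_b^star": ["Xi_b^star", "Xi(b)^{*0}", "Ξb*0", "Xi_b^{*0}"],
}

def expand_query(subject: str) -> str:
    """Build an OR query covering every recorded name for one subject."""
    names = NAME_VARIANTS.get(subject, [subject])
    return " OR ".join(f'"{name}"' for name in names)

print(expand_query("neutral Xi_b^star"))
# "Xi_b^star" OR "Xi(b)^{*0}" OR "Ξb*0" OR "Xi_b^{*0}"
```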

April 28, 2012

Akaros – an open source operating system for manycore architectures

Filed under: Identity,Multi-Core,Parallel Programming — Patrick Durusau @ 6:05 pm

Akaros – an open source operating system for manycore architectures

From the post:

If you are interested in future-forward OS designs then you might find Akaros worth a look. It’s an operating system designed for many-core architectures and large-scale SMP systems, with the goals of:

  • Providing better support for parallel and high-performance applications
  • Scaling the operating system to a large number of cores

A more in-depth explanation of the motivation behind Akaros can be found in Improving Per-Node Efficiency in the Datacenter with New OS Abstractions by Barret Rhoden, Kevin Klues, David Zhu, and Eric Brewer.

From the paper abstract:

Traditional operating system abstractions are ill-suited for high performance and parallel applications, especially on large-scale SMP and many-core architectures. We propose four key ideas that help to overcome these limitations. These ideas are built on a philosophy of exposing as much information to applications as possible and giving them the tools necessary to take advantage of that information to run more efficiently. In short, high-performance applications need to be able to peer through layers of virtualization in the software stack to optimize their behavior. We explore abstractions based on these ideas and discuss how we build them in the context of a new operating system called Akaros.

Rather than “layers of virtualization” I would say: “layers of identifiable subjects.” That’s hardly surprising but it has implications for this paper and future successors on the same issue.

Issues of inefficiency aren’t due to a lack of programming talent, as the authors ably demonstrate, but rather to the limitations placed upon that talent by the subjects our operating systems identify and permit to be addressed.

The paper is an exercise in identifying different subjects than those identified in contemporary operating systems. That abstraction may assist future researchers in positing different subjects for identification and the consequences that flow from those choices.

April 27, 2012

Making Search Hard(er)

Filed under: Identity,Searching,Semantics — Patrick Durusau @ 6:10 pm

Rafael Maia posts:

first R, now Julia… are programmers trying on purpose to come up with names for their languages that make it hard to google for info? 😛

I don’t know that two cases prove that programmers are responsible for all the semantic confusion in the world.

A search for FORTRAN produces a single expansion: FORTRAN – Formula Translation/Translator.

But compare COBOL:

  • COBOL – Common Business-Oriented Language
  • COBOL – Completely Obsolete Business-Oriented Language 🙂
  • COBOL – Completely Over and Beyond Obvious Logic 🙂
  • COBOL – Compiles Only By Odd Luck 🙂
  • COBOL – Completely Obsolete Burdensome Old Language 🙂

There may be something to programmers peeing in the semantic pool.

On the other hand, there are examples prior to programming of semantic overloading of strings.

Here is an interesting question:

Is a string overloaded, semantically speaking, when used or read?

Does your answer impact how you would build a search engine? Why/why not?

April 25, 2012

Just what do you mean by “number”?

Filed under: Humor,Identity — Patrick Durusau @ 6:30 pm

Just what do you mean by “number”?

John D. Cook writes:

Tom Christiansen gave an awesome answer to the question of how to match a number with a regular expression. He begins by clarifying what the reader means by “number”, then gives answers for each.

Tom covers an eighteen (18) question subset of all the questions about what is meant by “number.”
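
To make Tom’s point concrete, here is a sketch where each sense of “number” gets its own pattern. The regexes are deliberate simplifications for illustration, not his actual answers:

```python
# Each sense of "number" needs its own pattern; a token may satisfy
# several senses at once. Patterns are illustrative simplifications.
import re

SENSES = {
    "unsigned integer": re.compile(r"^\d+$"),
    "signed integer":   re.compile(r"^[+-]?\d+$"),
    "decimal":          re.compile(r"^[+-]?(?:\d+\.\d*|\.\d+|\d+)$"),
    "scientific":       re.compile(r"^[+-]?(?:\d+\.?\d*|\.\d+)[eE][+-]?\d+$"),
}

def senses_of(token: str) -> list:
    """Which senses of 'number' does this token satisfy?"""
    return [name for name, pattern in SENSES.items() if pattern.match(token)]

print(senses_of("42"))       # ['unsigned integer', 'signed integer', 'decimal']
print(senses_of("-3.14"))    # ['decimal']
print(senses_of("6.02e23"))  # ['scientific']
```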

The simplicity of identity is a polite veneer over boundless complexity.

Most of the time the complexity can remain hidden. Most of the time.

Other times, we are hobbled if our information systems keep us from peeking.

Peeping Tom takes us on an abbreviated tour around the identity of “number.”

April 20, 2012

Past, Present and Future – The Quest to be Understood

Filed under: Identification,Identifiers,Identity — Patrick Durusau @ 6:27 pm

Without restricting it to being machine-readable, I think we would all agree there are three ages of data:

  1. Past data
  2. Present data
  3. Future data

And we have common goals for data (or parts of it):

  1. Past data – To understand past data.
  2. Present data – To be understood by others.
  3. Future data – For our present data to persist and be understood by future users.

Common to those ages and goals is the need for management of identifiers for our data. (Where identifiers may be data as well.)

I say “management of identifiers” because we cannot control identifiers used in the past, identifiers used by others in the present, or identifiers that may be used in the future.

You would think that in an obviously multi-lingual world, identifying a subject by multiple identifiers would be the default position.

Just a personal observation but hardly a day passes without someone or some group saying the equivalent of:

I know! I will create a list of identifiers that everyone must use! That’s the answer to the confusion (Babel) of identifiers.

Such efforts are always defeated by past identifiers, other identifiers in the present, and future identifiers.

Managing tides of identifiers is a partial solution but more workable than trying to stop the tide.
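
As a sketch of what “management of identifiers” might look like in code (the subject key and identifiers are invented for illustration):

```python
# Managing identifiers rather than controlling them: no identifier is
# forbidden, we simply record which ones co-refer.
from collections import defaultdict

class IdentifierRegistry:
    def __init__(self):
        self._subjects = defaultdict(set)  # subject key -> identifiers
        self._index = {}                   # identifier -> subject key

    def record(self, subject: str, identifier: str) -> None:
        """Add another identifier for a subject; past, present or future."""
        self._subjects[subject].add(identifier)
        self._index[identifier] = subject

    def co_referents(self, identifier: str) -> set:
        """Every identifier known to pick out the same subject."""
        return self._subjects.get(self._index.get(identifier), set())

registry = IdentifierRegistry()
for name in ("H2O", "water", "aqua", "eau"):
    registry.record("water-the-substance", name)

print(registry.co_referents("aqua"))  # all four identifiers, including "aqua"
```

No identifier is forbidden; new ones are simply recorded alongside the old.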

What do you think?

Standardizing Federal Transparency

Filed under: Government Data,Identity,Transparency — Patrick Durusau @ 6:24 pm

Standardizing Federal Transparency

From the post:

A new federal data transparency coalition is pushing for standardization of government documents and support for legislation on public records disclosures, taxpayer spending and business identification codes.

The Data Transparency Coalition announced its official launch Monday, vowing nonpartisan work with Congress and the Executive Branch on ventures toward digital publishing of government documents in standardized and integrated formats. As part of that effort, the coalition expressed its support of legislative proposals such as: the Digital Accountability and Transparency Act, which would publish public spending records in a single digital format; the Public Information Online Act, which pushes for all records to be released digitally in a machine-readable format; and the Legal Entity Identifier proposal, creating a standard ID code for companies.

The 14 founding members include vendors Microsoft, Teradata, MarkLogic, Rivet Software, Level One Technologies and Synteractive, as well as the Maryland Association of CPAs, financial advisory BrightScope, and data mining and pattern discovery consultancy Elder Research. The coalition board of advisors includes former U.S. Deputy CTO Beth Noveck, data and information services investment firm partner Eric Gillespie and former Recovery Accountability and Transparency Board Chairman Earl E. Devaney.

Data Transparency Coalition Executive Director Hudson Hollister, a former counsel for the House of Representatives and U.S. Securities and Exchange Commission, noted that when the federal government does electronically publish public documents it “often fails to adopt consistent machine-readable identifiers or uniform markup languages.”

Sounds like an opportunity for both the markup and semantic identity communities, topic maps in particular.

Reasoning that not only will there need to be mappings between vocabularies and entities, but also between “uniform markup languages” as they evolve and develop.

April 4, 2012

New Paper: Linked Data Strategy for Global Identity

Filed under: Identity,RDF,Semantic Web — Patrick Durusau @ 3:32 pm

New Paper: Linked Data Strategy for Global Identity

Angela Guess writes:

Hugh Glaser and Harry Halpin have published a new PhD thesis for the University of Southampton Research Repository entitled “The Linked Data Strategy for Global Identity” (2012). The paper was published by the IEEE Computer Society. It is available for download here for non-commercial research purposes only. The abstract states, “The Web’s promise for planet-scale data integration depends on solving the thorny problem of identity: given one or more possible identifiers, how can we determine whether they refer to the same or different things? Here, the authors discuss various ways to deal with the identity problem in the context of linked data.”

At first I was hurt that I didn’t see a copy of Harry’s dissertation before it was published. I don’t always agree with him (see below) but I do like keeping up with his writing.

Then I discovered this is a four-page dissertation. I guess Angela never got past the cover page. It is an article in the IEEE zine, IEEE Internet Computing.

Harry fails to mention that the HTTP 303 “trick” was made necessary by Tim Berners-Lee’s failure to understand the necessity of distinguishing identifiers from addresses. Rather than admit to or correct that failure, the solution being pushed is to create web traffic overhead in the form of 303 “tricks.” “303” should be re-named “TBL,” so we are reminded with each invocation who made it necessary. (lower middle column, page 3)

I partially agree with:

We’re only just beginning to explore the vast field of identity, and more work is needed before linked data can fulfill its full potential. (on page 5)

The “just beginning” part is true enough. But therein lies the rub. Rather than first explore the “…vast field of identity…,” which changes from domain to domain, and then propose a solution, the Linked Data proponents took the other path.

They proposed a solution and, in the face of its failure to work, are now inching towards the “…vast field of identity….” Seems a mite late for that.

Harry concludes:

The entire bet of the linked data enterprise critically rests on using URIs to create identities for everything. Whether this succeeds might very well determine whether information integration will be trapped in centralized proprietary databases or integrated globally in a decentralized manner with open standards. Given the tremendous amount of data being created and the Web’s ubiquitous nature, URIs and equivalence links might be the best chance we have of solving the identity problem, transforming a profoundly difficult philosophical issue into a concrete engineering project.

The first line, “The entire bet….”, omits to say that we need the same URIs for everything. That is called the perfect language project, which has a very long history of consistent failure. Recent attempts include Esperanto and Loglan.

The second line, “Whether this succeeds…trapped in centralized proprietary databases…” is fear mongering. “If you don’t support linked data, (insert your nightmare scenario).”

The final line, “…transforming a profoundly difficult philosophical issue into a concrete engineering project” is magical thinking.

Identity is a very troubled philosophical issue but proposing a solution without understanding the problem doesn’t sound like a high percentage shot to me. You?
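
For what it’s worth, the engineering half of the bet is easy to sketch: equivalence links induce a closure that a union-find structure computes. The URIs below are invented for illustration. Notice what the mechanism cannot tell you: whether any of the sameness assertions were justified in the first place.

```python
# Equivalence links reduce, operationally, to computing the closure of
# pairwise sameness assertions. The URIs are invented for illustration.
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

equivalence_links = [
    ("http://example.org/people/pdurusau", "http://dbpedia.example/Patrick_Durusau"),
    ("http://dbpedia.example/Patrick_Durusau", "http://viaf.example/12345"),
]

uf = UnionFind()
for a, b in equivalence_links:
    uf.union(a, b)

# All three URIs now share one representative:
print(uf.find("http://example.org/people/pdurusau") ==
      uf.find("http://viaf.example/12345"))  # True
```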

March 28, 2012

Once Upon A Subject Clearly…

Filed under: Identity,Marketing,Subject Identity — Patrick Durusau @ 4:22 pm

As I was writing up the GWAS Central post, the question occurred to me: does their mapping of identifiers take something away from topic maps?

My answer is no and I would like to say why if you have a couple of minutes. 😉 Seriously! It isn’t going to take that long. However long it has taken me to reach this point.

Every time we talk, write or otherwise communicate about a subject, we have at the same time identified that subject. Makes sense. We want whoever we are talking to, writing to or communicating with to understand what we are talking about. Hard to do if we don’t identify what subject(s) we are talking about.

We do it all day, every day. In public, in private, in semi-public places. 😉 And we use words to do it. To identify the subjects we are talking about.

For the most part, or at least fairly often, we are understood by other people. Not always, but most of the time.

The problem comes in when we start to gather up information from different people who may (or may not) use words differently than we do. So there is a much larger chance that we don’t mean the same thing by the same words. Or we may use different words to mean the same thing.

Words, which were our reliable servants for the most part, become far less reliable.

To counter that unreliability, we can create groups of words, mappings if you like, to keep track of what words go where. But, to do that, we have to use words, again.

Start to see the problem? We always use words to clear up our difficulties with words. And there isn’t any universal stopping place. The Cyc advocates would have us stop there, the SUMO crowd would have us stop over there, the Semantic Web folks yet somewhere else, and of course the topic map mavens, yet one or more places.

For some purposes, any one or more of those mappings may be adequate. A mapping is only as good and for as long as it is useful.

History tells us that every mapping will be replaced with other mappings. We would do well to understand/document the words we are using as part of our mappings, as well as we are able.

But if words are used to map words, where do we stop? My suggestion would be to stop as we always have: wherever is convenient. So long as the mapping suits your present purposes, what more would you ask of it?

I am quite content to have such stopping places because it means we will always have more starting places for the next round of mapping!

Ironic, isn’t it? We create mappings to make sense out of words and our words lay the foundation for others to do the same.

March 15, 2012

Data and Reality

Data and Reality: A Timeless Perspective on Data Management by Steve Hoberman.

I remember William Kent, the original author of “Data and Reality,” from a presentation he made in 2003, entitled “The unsolvable identity problem.”

His abstract there read:

The identity problem is intractable. To shed light on the problem, which currently is a swirl of interlocking problems that tend to get tumbled together in any discussion, we separate out the various issues so they can be rationally addressed one at a time as much as possible. We explore various aspects of the problem, pick one aspect to focus on, pose an idealized theoretical solution, and then explore the factors rendering this solution impractical. The success of this endeavor depends on our agreement that the selected aspect is a good one to focus on, and that the idealized solution represents a desirable target to try to approximate as well as we can. If we achieve consensus here, then we at least have a unifying framework for coordinating the various partial solutions to fragments of the problem.

I haven’t read the “new” version of “Data and Reality” (just ordered a copy) but I don’t recall the original needing much in the way of changes.

The original carried much the same message: that all of our solutions are partial (even within a domain), temporary (chronologically speaking), and at best “useful” for some particular purpose. I rather doubt you will find that degree of uncertainty being confessed by the purveyors of any current semantic solution.

I did pull my second edition off the shelf and, with free shipping (5-8 days), I should have time to go over my notes and highlights before the “new” version appears.

More to follow.

February 19, 2012

Identity – The Philosophical Challenge For the Web

Filed under: Identity,Subject Identifiers,Subject Identity — Patrick Durusau @ 8:35 pm

Identity – The Philosophical Challenge For the Web by Matthew Hurst.

From the post:

I work in local search at Microsoft which means, like all those working in this space, I have to deal with an identity crisis on a daily basis. Currently, most local search products – like Bing’s and Google’s – leverage multiple data sets to derive a digital model of the world that users can then interact with. In creating this digital model, multiple statements have to be conflated to form a unified representation. This can be extremely challenging for two reasons. Firstly, the system has to decide when two records are intended to denote the same real world entity. Secondly, the designers of the system have to determine what real world entities are and how to describe them.

For example, if a business moves, is that the same business or the closure of one and the opening of another? What does it mean to categorize a business? The cafe in Barnes and Noble is branded Starbucks but isn’t actually part of the Starbucks chain – should it surface as a separate entity or is it ‘hidden’ within the bookshop as an attribute (‘has cafe’)?

Thinking through these hard representational problems is as much part of the transformative trends going on in the tech industry as are those characterized by terms like ‘big data’ and ‘data scientist’.

Questions of identity and how to resolve multiple references to the same entity have been debated at least since the time of the Greek philosophers. See Identity (Wikipedia page; follow the references on the various pages).

This “philosophical challenge” has been going on for a very long time and so far I haven’t seen any demonstrations that the Web raises new questions.

You need to read Matthew’s identity example in his post.

The songs in question could be said to be instances of the same subject, and a reference to that subject would be satisfied with any of those instances. From another point of view, the origin of the instances could be said to distinguish them into different subjects, say for proof of licensing purposes. Other viewpoints are possible. Depends upon the purpose of your criteria of identification.
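
Here is a sketch of Matthew’s first problem, deciding when two records denote the same real-world business. The fields, weights and threshold are all illustrative assumptions, one possible policy among many:

```python
# Record conflation by weighted field similarity; everything here is one
# possible policy, not Bing's or Google's actual method.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def same_entity(rec_a: dict, rec_b: dict, threshold: float = 0.75) -> bool:
    """Weighted name/address similarity; the threshold encodes a policy."""
    score = (0.6 * similarity(rec_a["name"], rec_b["name"]) +
             0.4 * similarity(rec_a["address"], rec_b["address"]))
    return score >= threshold

a = {"name": "Starbucks (in Barnes & Noble)", "address": "600 Pine St"}
b = {"name": "Barnes & Noble Cafe", "address": "600 Pine Street"}
print(same_entity(a, b))  # one policy's verdict
```

The code always returns a verdict; deciding what the records should denote in the first place is Matthew’s second, harder problem.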

January 8, 2012

Nice article on predictive analytics in insurance

Filed under: Identity,Insurance,Marketing,Predictive Analytics — Patrick Durusau @ 7:11 pm

Nice article on predictive analytics in insurance

James Taylor writes:

Patrick Sugent wrote a nice article, A Predictive Analytics Arsenal, in Claims magazine recently. The article is worth a read and, if this is a topic that interests you, check out our white paper on next generation claims systems or the series of blog posts on decision management in insurance that I wrote after I did a webinar with Deb Smallwood (an insurance industry expert quoted in the article).

The article is nice but I thought the white paper was better. Particularly this passage:

Next generation claims systems with Decision Management focus on the decisions in the claims process. These decisions are managed as reusable assets and made widely available to all channels, processes and systems via Decision Services. A decision-centric approach enables claims feedback and experience to be integrated into the whole product life cycle and brings the company’s know-how and expertise to bear at every step in the claims process.

At the heart of this new mindset is an approach for replacing decision points with Decision Services and improving business performance by identifying the key decisions that drive value in the business and improving on those decisions by leveraging a company’s expertise, data and existing systems.

Insurers are adopting Decision Management to build next generation claims systems that improve claims processes.

In topic map lingo, “next generation claims systems” are going to treat decisions as subjects that can be identified and re-used to improve the process.

Decisions are made every day in claims processing, but current systems don’t identify them as subjects, so re-use simply isn’t possible.

True enough, the proposal in the white paper does not allow for merging of decisions identified by others, but that doesn’t look like a requirement in their case. They need to be able to identify decisions they make and feed them back into their systems.

The other thing I liked about the white paper was the recognition that hard-coding decision rules by IT is a bad idea. (Full stop.) You can take that one to the bank.

Of course, remember what James says about changes:

Most policies and regulations are written up as requirements and then hard-coded after waiting in the IT queue, making changes slow and costly.

But he omits that hard-coding empowers IT because any changes have to come to IT for implementation.

Making changes possible by someone other than IT, will empower that someone else and diminish IT.

Who knows what, and when they get to know it, is a question of power.

Topic maps and other means of documentation/disclosure, have the potential to shift balances of power in an organization.

May as well say that up front so we can start identifying the players: who will cooperate, who will resist. And experimenting with what might work as incentives to promote cooperation, which can be measured just like you measure other processes in a business.

December 13, 2011

Ontology Matching 2011

Filed under: Identification,Identity,Ontology — Patrick Durusau @ 9:54 pm

Ontology Matching 2011

Proceedings of the 6th International Workshop on Ontology Matching (OM-2011)

From the conference website:

Ontology matching is a key interoperability enabler for the Semantic Web, as well as a useful tactic in some classical data integration tasks dealing with the semantic heterogeneity problem. It takes the ontologies as input and determines as output an alignment, that is, a set of correspondences between the semantically related entities of those ontologies. These correspondences can be used for various tasks, such as ontology merging, data translation, query answering or navigation on the web of data. Thus, matching ontologies enables the knowledge and data expressed in the matched ontologies to interoperate.


The workshop has three goals:

  • To bring together leaders from academia, industry and user institutions to assess how academic advances are addressing real-world requirements. The workshop will strive to improve academic awareness of industrial and final user needs, and therefore direct research towards those needs. Simultaneously, the workshop will serve to inform industry and user representatives about existing research efforts that may meet their requirements. The workshop will also investigate how the ontology matching technology is going to evolve.
  • To conduct an extensive and rigorous evaluation of ontology matching approaches through the OAEI (Ontology Alignment Evaluation Initiative) 2011 campaign. The particular focus of this year’s OAEI campaign is on real-world specific matching tasks involving, e.g., open linked data and biomedical ontologies. Therefore, the ontology matching evaluation initiative itself will provide a solid ground for discussion of how well the current approaches are meeting business needs.
  • To examine similarities and differences from database schema matching, which has received decades of attention but is just beginning to transition to mainstream tools.

An excellent set of papers and posters.

While I was writing this post, I realized that had the papers been described as matching subject identifications by similarity measures, I would have felt completely different about the papers.

Isn’t that odd?

Question: Do you agree/disagree that mapping ontologies is different from mapping subject identifications? Why/why not?
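
To see how thin the line is, here is a toy alignment that pairs entities from two ontologies by character-trigram overlap. The labels and cutoff are illustrative; real OAEI systems use far richer evidence than label strings:

```python
# A toy alignment-by-similarity: correspondences are pairs of labels
# whose character-trigram Jaccard overlap clears a cutoff.
def trigrams(label: str) -> set:
    s = label.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def jaccard(a: str, b: str) -> float:
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def align(onto_a, onto_b, cutoff=0.5):
    """Return correspondences (label_a, label_b, score) above the cutoff."""
    return [(a, b, round(jaccard(a, b), 2))
            for a in onto_a for b in onto_b
            if jaccard(a, b) >= cutoff]

print(align(["Author", "Publication", "JournalArticle"],
            ["author", "published_work", "journal_article"]))
```

Call the labels “ontology entities” and it is ontology matching; call them “subject identifications” and it is something else entirely. Or is it?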

September 2, 2011

Category-Based Routing in Social Networks:…

Filed under: Identity,Networks,Social Networks — Patrick Durusau @ 7:58 pm

Category-Based Routing in Social Networks: Membership Dimension and the Small-World Phenomenon (Short) by David Eppstein, Michael T. Goodrich, Maarten Löffler, Darren Strash, and Lowell Trott.

Abstract:

A classic experiment by Milgram shows that individuals can route messages along short paths in social networks, given only simple categorical information about recipients (such as “he is a prominent lawyer in Boston” or “she is a Freshman sociology major at Harvard”). That is, these networks have very short paths between pairs of nodes (the so-called small-world phenomenon); moreover, participants are able to route messages along these paths even though each person is only aware of a small part of the network topology. Some sociologists conjecture that participants in such scenarios use a greedy routing strategy in which they forward messages to acquaintances that have more categories in common with the recipient than they do, and similar strategies have recently been proposed for routing messages in dynamic ad-hoc networks of mobile devices. In this paper, we introduce a network property called membership dimension, which characterizes the cognitive load required to maintain relationships between participants and categories in a social network. We show that any connected network has a system of categories that will support greedy routing, but that these categories can be made to have small membership dimension if and only if the underlying network exhibits the small-world phenomenon.

So, if identity is a social construct and the result of small-world networks, then we may need a different kind of precision (different from scientific measurement) to identify subjects.
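
The greedy strategy from the abstract is easy to sketch: forward the message to whichever acquaintance shares the most categories with the recipient. The network and category labels below are invented for illustration:

```python
# Category-based greedy routing: at each hop, hand the message to the
# neighbor with the most categories in common with the target.
def route(graph, categories, source, target, max_hops=10):
    """graph: node -> list of neighbors; categories: node -> set of labels."""
    path, current = [source], source
    for _ in range(max_hops):
        if current == target:
            return path
        # Greedy step: most categories in common with the target wins.
        current = max(graph[current],
                      key=lambda n: len(categories[n] & categories[target]))
        path.append(current)
    return path  # possibly incomplete; greedy routing can stall

graph = {"ann": ["bob", "cy"], "bob": ["ann", "dee"],
         "cy": ["ann", "dee"], "dee": ["bob", "cy"]}
categories = {"ann": {"lawyer", "boston"}, "bob": {"lawyer", "harvard"},
              "cy": {"sociology", "boston"}, "dee": {"sociology", "harvard"}}

print(route(graph, categories, "ann", "dee"))  # ['ann', 'bob', 'dee']
```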

Perhaps the reverse of 20 questions: how many questions do we need for a particular subject? Does anyone remember if there was a common number of questions that was sufficient for the 20-questions game?

August 29, 2011

Persistent Data Structures and Managed References

Filed under: Clojure,Identity — Patrick Durusau @ 6:30 pm

Persistent Data Structures and Managed References: Clojure’s approach to Identity and State by Rich Hickey.

From the summary:

Rich Hickey’s presentation is organized around a number of programming concepts: identity, state and values. He explains how to represent composite objects as values and how to deal with change and state, as it is implemented in Clojure.

OK, it’s not recent, circa 2009, but it is quite interesting.

Some tidbits to entice you to watch the presentation:

  • Identity – A logical entity we associate with a series of causally related values (states) over time.
  • Represent objects as composite values.
  • Persistent data structures preserve old values as immutable.
  • Bit-partitioned hash tries (32-bit).
  • Structural sharing – path copying.
  • Persistent data structures provide efficient immutable composite values.
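
Path copying is easier to see in code than in prose. A toy version in Python (Clojure actually uses 32-way bit-partitioned tries; an immutable binary tree is enough to show the idea):

```python
# Path copying: an update returns a NEW root, copying only the nodes on
# the path to the change and sharing every untouched subtree.
class Node:
    __slots__ = ("value", "left", "right")
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def update(node, path, value):
    """path is a string of 'L'/'R' steps from the root to the node to replace."""
    if not path:
        return Node(value, node.left if node else None, node.right if node else None)
    if path[0] == "L":
        return Node(node.value, update(node.left, path[1:], value), node.right)
    return Node(node.value, node.left, update(node.right, path[1:], value))

old = Node(1, Node(2), Node(3))
new = update(old, "L", 99)  # a new version with the left child replaced

print(old.left.value, new.left.value)  # 2 99 -- the old value is preserved
print(old.right is new.right)          # True -- untouched subtree is shared
```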

When I saw the path copying operations that efficiently maintain immutable values I immediately thought of Steve Newcomb and Versavant. 😉
