Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

February 25, 2015

Everyone is an IA [Information Architect]

Filed under: Information Architecture,Subject Identity,TMRM — Patrick Durusau @ 4:32 pm

Everyone is an IA [Information Architect] by Dan Ramsden.

From the post:

This is a post inspired by my talk from World IA Day. On the day I had 20 minutes to fill – I did a magic trick and talked about an imaginary uncle. This post has the benefit of an edit, but recreates the central argument – everyone makes IA.

Information architecture is everywhere, it’s a part of every project, every design includes it. But I think there’s often a perception that because it requires a level of specialization to do the most complicated types of IA, people are nervous about how and when they engage with it – no-one likes to look out of their depth. And some IA requires a depth of thinking that deserves justification and explanation.

Even when you’ve built up trust with teams of other disciplines or clients, I think one of the most regular questions asked of an IA is probably, ‘Is it really that complicated?’ And if we want to be happier in ourselves, and spread happiness by creating meaningful, beautiful, wonderful things – we need to convince people that complex is different from complicated. We need to share our conviction that IA is a real thing and that thinking like an IA is probably one of the most effective ways of contributing to a more meaningful world.

But we have a challenge, IAs are usually the minority. At the BBC we have a team of about 140 in UX&D, and IAs are the minority – we’re not quite 10%. It’s my job to work out how those less than 1 in 10 can be as effective as possible and have the biggest positive impact on the work we do and the experiences we offer to our audiences. I don’t think this is unique. A lot of the time IAs don’t work together, or there’s not enough IAs to work on every project that could benefit from an IA mindset, which is every project.

This is what troubled me. How could I make sure that it is always designed? My solution to this is simple. We become the majority. And because we can’t do that just by recruiting a legion of IAs we do it another way. We turn everyone in the team into an information architect.

Now this is a bit contentious. There’s legitimate certainty that IA is a specialism and that there are dangers of diluting it. But last year I talked about an IA mindset, a way of approaching any design challenge from an IA perspective. My point then was that the way we tend to think and therefore approach design challenges is usually a bit different from other designers. But I don’t believe we’re that special. I think other people can adopt that mindset and think a little bit more like we do. I think if we work hard enough we can find ways to help designers to adopt that IA mindset more regularly.

And we know the benefits on offer when every design starts from the architecture up. Well-architected things work better. They are more efficient, connected, resilient and meaningful – they’re more useful.

Dan goes on to say that information is everywhere, much in the same way that I would say that subjects are everywhere.

Just as users must describe information architectures as they experience them, the same is true for users identifying the subjects that are important to them.

There is never a doubt that more IAs and more subjects exist, but the best anyone can do is to tell you about the ones that are important to them and how they have chosen to identify them.

To no small degree, I think terminology has been used to disenfranchise users from discussing subjects as they understand them.

From my own background, I remember a database project where the head of membership services, who ran reports by rote out of R&R, insisted on saying where data needed to reside in tables during a complete re-write of the database. I kept trying, with little success, to get them to describe what they wanted to store and what capabilities they needed.

In retrospect, I should have allowed membership services to use their terminology to describe the database because whether they understood the underlying data architecture or not wasn’t a design goal. The easier course would have been to provide them with a view that accorded with their idea of the database structure and to run their reports. That other “views” of the data existed would have been neither here nor there to them.

As “experts,” we should listen to the description of information architectures and/or identifications of subjects and their relationships as a voyage of discovery. We are discovering the way someone else views the world, not for our correction to the “right” way but so we can enable their view to be more productive and useful to them.

That approach takes more work on the part of “experts” but think of all the things you will learn along the way.

November 5, 2014

AMR: Not semantics, but close (? maybe ???)

Filed under: Semantics,Subject Identity — Patrick Durusau @ 7:37 pm

AMR: Not semantics, but close (? maybe ???) by Hal Daumé.

From the post:

Okay, necessary warning. I’m not a semanticist. I’m not even a linguist. Last time I took semantics was twelve years ago (sigh.)

Like a lot of people, I’ve been excited about AMR (the “Abstract Meaning Representation”) recently. It’s hard not to get excited. Semantics is all the rage. And there are those crazy people out there who think you can cram meaning of a sentence into a !#$* vector [1], so the part of me that likes Language likes anything that has interesting structure and calls itself “Meaning.” I effluviated about AMR in the context of the (awesome) SemEval panel.

There is an LREC paper this year whose title is where I stole the title of this post from: Not an Interlingua, But Close: A Comparison of English AMRs to Chinese and Czech by Xue, Bojar, Hajič, Palmer, Urešová and Zhang. It’s a great introduction to AMR and you should read it (at least skim).

What I guess I’m interested in discussing is not the question of whether AMR is a good interlingua but whether it’s a semantic representation. Note that it doesn’t claim this: it’s not called ASR. But as semantics is the study of the relationship between signifiers and denotation, [Edit: it’s a reasonable place to look; see Emily Bender’s comment.] it’s probably the closest we have.

Deeply interesting work, particularly given the recent interest in Enhancing open data with identifiers. Be sure to read the comments to the post as well.

Who knew? Semantics are important!

😉

Topic maps take that a step further and capture your semantics, not necessarily the semantics of some expert unfamiliar with your domain.

October 27, 2014

Data Modelling: The Thin Model [Entities with only identifiers]

Filed under: Data Models,Subject Identifiers,Subject Identity,Subject Recognition — Patrick Durusau @ 3:57 pm

Data Modelling: The Thin Model by Mark Needham.

From the post:

About a third of the way through Mastering Data Modeling the authors describe common data modelling mistakes and one in particular resonated with me – ‘Thin LDS, Lost Users‘.

LDS stands for ‘Logical Data Structure’ which is a diagram depicting what kinds of data some person or group wants to remember. In other words, a tool to help derive the conceptual model for our domain.

They describe the problem that a thin model can cause as follows:

[…] within 30 minutes [of the modelling session] the users were lost…we determined that the model was too thin. That is, many entities had just identifying descriptors.

While this is syntactically okay, when we revisited those entities asking, What else is memorable here? the users had lots to say.

When there was flesh on the bones, the uncertainty abated and the session took a positive course.

I found myself making the same mistake a couple of weeks ago during a graph modelling session. I tend to spend the majority of the time focused on the relationships between the bits of data and treat the meta data or attributes almost as an after thought.

A good example of why subjects need multiple attributes, even multiple identifying attributes.

When sketching just a bare data model, the author, having prepared in advance, is conversant with the scant identifiers. The audience, on the other hand, is not. Additional attributes for each entity quickly remind the audience of the entity in question.

Take this as anecdotal evidence that multiple attributes assist users in recognition of entities (aka subjects).

Will that impact how you identify subjects for your users?

September 10, 2014

What’s in a Name?

Filed under: Conferences,Names,Subject Identity — Patrick Durusau @ 10:56 am

What’s in a Name?

From the webpage:

What will be covered? The meeting will focus on the role of chemical nomenclature and terminology in open innovation and communication. A discussion of areas of nomenclature and terminology where there are fundamental issues, how computer software helps and hinders, the need for clarity and unambiguous definitions for application to software systems. How can you contribute? As well as the talks from expert speakers there will be plenty of opportunity for discussion and networking. A record will be made of the meeting, including the discussion, and will be made available initially to those attending the meeting. The detailed programme and names of speakers will be available closer to the date of the meeting.

Date: 21 October 2014

Event Subject(s): Industry & Technology

Venue

The Royal Society of Chemistry
Library
Burlington House
Piccadilly
London
W1J 0BA
United Kingdom

Find this location using Google Map

Contact for Event Information

Name: Prof Jeremy Frey

Address:
Chemistry
University of Southampton
United Kingdom

Email: j.g.frey@soton.ac.uk

Now there’s an event worth the hassle of overseas travel during these paranoid times! Alas, I will have to wait for the conference record to be released to non-attendees. The event is a good example of the work going on at the Royal Society of Chemistry.

I first saw this in a tweet by Open PHACTS.

July 20, 2014

German Record Linkage Center

Filed under: Record Linkage,Subject Identity — Patrick Durusau @ 6:40 pm

German Record Linkage Center

From the webpage:

The German Record Linkage Center (GermanRLC) was established in 2011 to promote research on record linkage and to facilitate practical applications in Germany. The Center will provide several services related to record linkage applications as well as conduct research on central topics of the field. The services of the GermanRLC are open to all academic disciplines.

Wikipedia describes record linkage as:

Record linkage (RL) refers to the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, databases). Record linkage is necessary when joining data sets based on entities that may or may not share a common identifier (e.g., database key, URI, National identification number), as may be the case due to differences in record shape, storage location, and/or curator style or preference. A data set that has undergone RL-oriented reconciliation may be referred to as being cross-linked. Record Linkage is called Data Linkage in many jurisdictions, but is the same process.

While very similar to topic maps, record linkage relies upon the creation of a common record for further processing, as opposed to pointing into an infoverse to identify subjects in their natural surroundings.

Another difference in practice is that the headers, fields, etc. that contain subjects are not themselves treated as subjects with identity. That is to say, how a mapping from an original form was made to the target form is opaque to a subsequent researcher.
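For readers who have not seen record linkage in practice, here is a deliberately tiny sketch of the core matching step, using only the Python standard library. It is my own illustration, not GermanRLC code; real systems add blocking, per-field comparators and a trained decision model:

```python
from difflib import SequenceMatcher

def normalize(record: str) -> str:
    """Crude canonical form: lowercase, strip punctuation, sort the tokens."""
    tokens = "".join(c.lower() if c.isalnum() else " " for c in record).split()
    return " ".join(sorted(tokens))

def candidate_match(rec_a: str, rec_b: str, threshold: float = 0.85) -> bool:
    """Flag two records as referring to the same entity when their
    normalized forms are sufficiently similar."""
    score = SequenceMatcher(None, normalize(rec_a), normalize(rec_b)).ratio()
    return score >= threshold

print(candidate_match("Durusau, Patrick", "Patrick Durusau"))  # True
```

Note that nothing in the sketch records *why* the two records were judged to be the same, which is exactly the opacity complained about above.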

I first saw this in a tweet by Lars Marius Garshol.

July 18, 2014

Build Roads not Stagecoaches

Filed under: Data,Integration,Subject Identity — Patrick Durusau @ 3:40 pm

Build Roads not Stagecoaches by Martin Fenner.

Describing Eric Hysen’s keynote, Martin says:

In his keynote he described how travel from Cambridge to London in the 18th and early 19th century improved mainly as a result of better roads, made possible by changes in how these roads are financed. Translated to today, he urged the audience to think more about the infrastructure and less about the end products:

Ecosystems, not apps

— Eric Hysen

On Tuesday at csv,conf, Nick Stenning – Technical Director of the Open Knowledge Foundation – talked about data packages, an evolving standard to describe data that are passed around between different systems. He used the metaphor of containers, and how they have dramatically changed the transportation of goods in the last 50 years. He argued that the cost of shipping was in large part determined by the cost of loading and unloading, and the container has dramatically changed that equation. We are in a very similar situation with datasets, where most of the time is spent translating between different formats, joining things together that use different names for the same thing [emphasis added], etc.

…different names for the same thing.

Have you heard that before? 😉

But here is the irony:

When I thought more about this I realized that these building blocks are exactly the projects I get most excited about, i.e. projects that develop standards or provide APIs or libraries. Some examples would be

  • ORCID: unique identifiers for scholarly authors

OK, but many authors already have unique identifiers in DBLP, Library of Congress, Twitter, and at places I have not listed.

Nothing against ORCID, but adding yet another identifier isn’t all that helpful.

A mapping between identifiers, so having one means I can leverage the others, now that is what I call infrastructure.

You?
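As a toy illustration of the kind of infrastructure I mean, the sketch below keeps a mapping between identifiers for the same author so that holding any one of them yields the rest. Every identifier value shown is a hypothetical placeholder, not a real ORCID, DBLP or Library of Congress entry:

```python
# All identifier values below are invented for the example.
AUTHOR_IDENTIFIERS = {
    "orcid:0000-0000-0000-0000": {
        "dblp": "homepages/00/0000",
        "loc": "n00000000",
        "twitter": "@example_author",
    },
}

def equivalents(identifier: str) -> set:
    """Return every identifier known to name the same author as `identifier`."""
    for orcid, others in AUTHOR_IDENTIFIERS.items():
        known = {orcid, *others.values()}
        if identifier in known:
            return known
    return {identifier}

print(equivalents("n00000000"))
```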

June 27, 2014

Communicating and resolving entity references

Filed under: Communication,Entity Resolution,Shannon,Subject Identity — Patrick Durusau @ 1:17 pm

Communicating and resolving entity references by R.V. Guha.

Abstract:

Statements about entities occur everywhere, from newspapers and web pages to structured databases. Correlating references to entities across systems that use different identifiers or names for them is a widespread problem. In this paper, we show how shared knowledge between systems can be used to solve this problem. We present “reference by description”, a formal model for resolving references. We provide some results on the conditions under which a randomly chosen entity in one system can, with high probability, be mapped to the same entity in a different system.

An eye appointment is going to prevent me from reading this paper closely today.

From a quick scan, do you think Guha is making a distinction between entities and subjects (in the topic map sense)?

What do you make of literals having no identity beyond their encoding? (page 4, #3)

Redundant descriptions? (page 7) Would you say that defining a set of properties that must match would qualify? (Or even just additional subject indicators?)

Expect to see a lot more comments on this paper.

Enjoy!

I first saw this in a tweet by Stefano Bertolo.

June 15, 2014

What You Thought The Supreme Court…

Filed under: Law,Law - Sources,Legal Informatics,Subject Identity — Patrick Durusau @ 3:45 pm

Clever piece of code exposes hidden changes to Supreme Court opinions by Jeff John Roberts.

From the post:

Supreme Court opinions are the law of the land, and so it’s a problem when the Justices change the words of the decisions without telling anyone. This happens on a regular basis, but fortunately a lawyer in Washington appears to have just found a solution.

The issue, as Adam Liptak explained in the New York Times, is that original statements by the Justices about everything from EPA policy to American Jewish communities, are disappearing from decisions — and being replaced by new language that says something entirely different. As you can imagine, this is a problem for lawyers, scholars, journalists and everyone else who relies on Supreme Court opinions.

Until now, the only way to detect when a decision has been altered is a painstaking comparison of earlier and later copies — provided, of course, that someone knew a decision had been changed in the first place. Thanks to a simple Twitter tool, the process may become much easier.

See Jeff’s post for more details, including a Twitter account to follow for the discovery of changes in the opinions of the Supreme Court of the United States.

In a nutshell, the court issues “slip” opinions in cases they decide and then later, sometimes years later, they provide a small group of publishers of their opinions with changes to be made to those opinions.

Which means the opinion you read as a “slip” opinion or in an advance sheet (paper back issue that is followed by a hard copy volume combining one or more advance sheets), may not be the opinion of record down the road.

Two questions occur to me immediately:

  1. We can distinguish the “slip” opinion version of an opinion from the “final” published opinion, but how do we distinguish a “final” published decision from a later “more final” published decision? Given the stakes at hand in proceedings before the Supreme Court, certainty about the prior opinions of the Court is very important.
  2. While the Supreme Court always gets most of the attention, it occurs to me that the same process of silent correction has been going on for other courts with published opinions, such as the United States Courts of Appeal and the United States District Courts. Perhaps for the last century or more.

    Which makes it only a small step to ask about state supreme courts and their courts of appeal. What is their record on silent correction of opinions?

The mechanical difficulties grow as the records get older, because the “slip” opinions may be lost to history, but in terms of volume this would certainly be a “big data” project for legal informatics: discovering and documenting the behavior of courts over time with regard to silent correction of opinions.

What you thought the Supreme Court said may not be what our current record reflects. Who wins? What you heard or what a silently corrected record reports?

May 25, 2014

Emotion Markup Language 1.0 (No Repeat of RDF Mistake)

Filed under: EmotionML,Subject Identity,Topic Maps,W3C — Patrick Durusau @ 3:19 pm

Emotion Markup Language (EmotionML) 1.0

Abstract:

As the Web is becoming ubiquitous, interactive, and multimodal, technology needs to deal increasingly with human factors, including emotions. The specification of Emotion Markup Language 1.0 aims to strike a balance between practical applicability and scientific well-foundedness. The language is conceived as a “plug-in” language suitable for use in three different areas: (1) manual annotation of data; (2) automatic recognition of emotion-related states from user behavior; and (3) generation of emotion-related system behavior.

I started reading EmotionML with the expectation that the W3C had repeated its one way and one way only for identification mistake from RDF.

Much to my pleasant surprise I found:

1.2 The challenge of defining a generally usable Emotion Markup Language

Any attempt to standardize the description of emotions using a finite set of fixed descriptors is doomed to failure: even scientists cannot agree on the number of relevant emotions, or on the names that should be given to them. Even more basically, the list of emotion-related states that should be distinguished varies depending on the application domain and the aspect of emotions to be focused. Basically, the vocabulary needed depends on the context of use. On the other hand, the basic structure of concepts is less controversial: it is generally agreed that emotions involve triggers, appraisals, feelings, expressive behavior including physiological changes, and action tendencies; emotions in their entirety can be described in terms of categories or a small number of dimensions; emotions have an intensity, and so on. For details, see Scientific Descriptions of Emotions in the Final Report of the Emotion Incubator Group.

Given this lack of agreement on descriptors in the field, the only practical way of defining an EmotionML is the definition of possible structural elements and their valid child elements and attributes, but to allow users to “plug in” vocabularies that they consider appropriate for their work. A separate W3C Working Draft complements this specification to provide a central repository of [Vocabularies for EmotionML] which can serve as a starting point; where the vocabularies listed there seem inappropriate, users can create their custom vocabularies.

An additional challenge lies in the aim to provide a generally usable markup, as the requirements arising from the three different use cases (annotation, recognition, and generation) are rather different. Whereas manual annotation tends to require all the fine-grained distinctions considered in the scientific literature, automatic recognition systems can usually distinguish only a very small number of different states.

For the reasons outlined here, it is clear that there is an inevitable tension between flexibility and interoperability, which need to be weighed in the formulation of an EmotionML. The guiding principle in the following specification has been to provide a choice only where it is needed, and to propose reasonable default options for every choice.

Everything that is said about emotions is equally true for identification, emotions being but one of an infinite set of subjects that you might want to identify.

Had the W3C avoided the one identifier scheme of RDF (and the reliance on a subset of reasoning, logic), RDF could have had plugin “identifier” modules, enabling the use of all extant and future identifiers, not to mention “reasoning” according to the designs of users.

It is good to see the W3C learning from its earlier mistakes and enabling users to express their world views, as opposed to a world view as prescribed by the W3C.

When users declare their emotion vocabularies, those vocabularies are subjects which merit further identification, if only to avoid the problem of us not meaning the same thing by “owl:sameAs” as someone else means by “owl:sameAs.” (See When owl:sameAs isn’t the Same: An Analysis of Identity Links on the Semantic Web by Harry Halpin, Ivan Herman, Patrick J. Hayes.)

Topic maps are a good solution for documenting subject identity and deciding when two or more identifications of subjects are the same subject.

I first saw this in a tweet by Inge Henriksen.

May 9, 2014

…Locality Sensitive Hashing for Unstructured Data

Filed under: Hashing,Jaccard Similarity,Similarity,Subject Identity — Patrick Durusau @ 6:51 pm

Practical Applications of Locality Sensitive Hashing for Unstructured Data by Jake Drew.

From the post:

The purpose of this article is to demonstrate how the practical Data Scientist can implement a Locality Sensitive Hashing system from start to finish in order to drastically reduce the search time typically required in high dimensional spaces when finding similar items. Locality Sensitive Hashing accomplishes this efficiency by exponentially reducing the amount of data required for storage when collecting features for comparison between similar item sets. In other words, Locality Sensitive Hashing successfully reduces a high dimensional feature space while still retaining a random permutation of relevant features which research has shown can be used between data sets to determine an accurate approximation of Jaccard similarity [2,3].

Complete with code and references no less!
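If you want the flavor of the technique before reading Jake’s post, here is a minimal MinHash sketch, the ingredient of LSH that preserves an estimate of Jaccard similarity (a full LSH system would additionally band the signatures into hash buckets). This is my own toy illustration, not code from the post:

```python
import random
from zlib import crc32

PRIME = (1 << 61) - 1

def minhash_signature(tokens, num_hashes=128, seed=42):
    """Summarize a token set with num_hashes minimum hash values.
    The chance that two signatures agree at any one position equals the
    Jaccard similarity of the underlying sets."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
              for _ in range(num_hashes)]
    hashed = [crc32(t.encode("utf-8")) for t in tokens]
    return [min((a * x + b) % PRIME for x in hashed) for a, b in params]

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc1 = set("the quick brown fox jumps over the lazy dog".split())
doc2 = set("a quick brown fox leaps over the lazy dog".split())
print("true:", round(len(doc1 & doc2) / len(doc1 | doc2), 2),
      "estimated:", round(estimated_jaccard(minhash_signature(doc1),
                                            minhash_signature(doc2)), 2))
```

The signatures are a fixed, small size regardless of how large the original feature sets are, which is where the storage and search savings come from.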

How “similar” do two items need to be to count as the same item?

If two libraries own a physical copy of the same book, for some purposes they are distinct items but for annotations/reviews, you could treat them as one item.

If that sounds like a topic map-like question, you’re right!

What measures of similarity are you applying to what subjects?

April 24, 2014

We have no “yellow curved fruit” today

Filed under: Humor,Names,Subject Identity — Patrick Durusau @ 8:18 pm

banana

Tweeted by Olivier Croisier with this comment:

Looks like naming things is hard not only in computer science…

Naming (read identity) problems are everywhere.

Our intellectual cocoons prevent us from noticing such problems very often.

At least until something goes terribly wrong. Then the hunt is on for a scapegoat, not an explanation.

April 17, 2014

Expert vs. Volunteer Semantics

Filed under: Authoring Topic Maps,Crowd Sourcing,Subject Identity,Topic Maps — Patrick Durusau @ 10:47 am

The variability of crater identification among expert and community crater analysts by Stuart J. Robbins, et al.

Abstract:

The identification of impact craters on planetary surfaces provides important information about their geological history. Most studies have relied on individual analysts who map and identify craters and interpret crater statistics. However, little work has been done to determine how the counts vary as a function of technique, terrain, or between researchers. Furthermore, several novel internet-based projects ask volunteers with little to no training to identify craters, and it was unclear how their results compare against the typical professional researcher. To better understand the variation among experts and to compare with volunteers, eight professional researchers have identified impact features in two separate regions of the moon. Small craters (diameters ranging from 10 m to 500 m) were measured on a lunar mare region and larger craters (100s m to a few km in diameter) were measured on both lunar highlands and maria. Volunteer data were collected for the small craters on the mare. Our comparison shows that the level of agreement among experts depends on crater diameter, number of craters per diameter bin, and terrain type, with differences of up to ∼±45. We also found artifacts near the minimum crater diameter that was studied. These results indicate that caution must be used in most cases when interpreting small variations in crater size-frequency distributions and for craters ≤10 pixels across. Because of the natural variability found, projects that emphasize many people identifying craters on the same area and using a consensus result are likely to yield the most consistent and robust information.

The identification of craters on the Moon may seem far removed from your topic map authoring concerns but I would suggest otherwise.

True, the paper is domain specific in some of its concerns (crater age, degradation, etc.), but the most important question was whether volunteers in aggregate could be as useful as experts in the identification of craters.

The authors conclude:

Except near the minimum diameter, volunteers are able to identify craters just as well as the experts (on average) when using the same interface (the Moon Mappers interface), resulting in not only a similar number of craters, but also a similar size distribution. (page 34)

I find that suggestive for mapping semantics because, unlike moon craters, what words mean (and implicitly why) is a daily concern for users, including ones in your enterprise.

You can, of course, employ experts to re-interpret what they have been told by some of your users into the expert’s language and produce semantic integration based on the expert’s understanding or mis-understanding of your domain.

Or, you can use your own staff, with experts to facilitate encoding their understanding of your enterprise semantics, as in a topic map.

Recalling that the semantics for your enterprise aren’t “out there” in the ether but residing within the staff that make up your enterprise.

I still see an important role for experts, but it isn’t as the source of your semantics; rather, it is as hunters who assist in capturing your semantics.

I first saw this in a tweet by astrobites that led me to: Crowd-Sourcing Crater Identification by Brett Deaton.

April 16, 2014

‘immersive intelligence’ [Topic Map-like application]

Filed under: Intelligence,Subject Identity,Topic Maps — Patrick Durusau @ 10:03 am

Long: NGA is moving toward ‘immersive intelligence’ by Sean Lyngaas.

From the post:

Of the 17 U.S. intelligence agencies, the National Geospatial-Intelligence Agency is best suited to turn big data into actionable intelligence, NGA Director Letitia Long said. She told FCW in an April 14 interview that mapping is what her 14,500-person agency does, and every iota of intelligence can be attributed to some physical point on Earth.

“We really are the driver for intelligence integration because everything is somewhere on the Earth at a point in time,” Long said. “So we give that ability for all of us who are describing objects to anchor it to the Map of the World.”

NGA’s Map of the World entails much more minute information than the simple cartography the phrase might suggest. It is a mix of information from top-secret, classified and unclassified networks made available to U.S. government agencies, some of their international partners, commercial users and academic experts. The Map of the World can tap into a vast trove of satellite and social media data, among other sources.

NGA has made steady progress in developing the map, Long said. Nine data layers are online and available now, including those for maritime and aeronautical data. A topography layer will be added in the next two weeks, and two more layers will round out the first operational version of the map in August.

Not surprisingly, the National Geospatial-Intelligence Agency sees geography as the organizing principle for intelligence integration. Or as NGA Director Long says: “…everything is somewhere on the Earth at a point in time.” I can’t argue with the accuracy of that statement, save for extraterrestrial events, satellites, space-based weapons, etc.

On the other hand, you could gather intelligence by point of origin, places referenced, people mentioned (their usual locations), etc. Do that for the languages spoken by more than thirty (30) million people and you would have a sack of intelligence in forty (40) languages. (See the List of languages by number of native speakers.)

When I say “topic map-like” application, I mean that the NGA has chosen geographic locations as the organizing principle for intelligence as opposed to using subjects as the organizing principle for intelligence, of which geographic location is only one type. Noting that with a broader organizing principle, it would be easier to integrate data from other agencies who have their own organizational principles for the intelligence they gather.

I like the idea of “layers” as described in the post. In part because a topic map can exist as an additional layer on top of the current NGA layers to integrate other intelligence data on a subject basis with the geographic location system of the NGA.

Think of topic maps as being “in addition to” and not “instead of” your current integration technology.

What’s your principle for organizing intelligence? Would it be useful to integrate data organized around other principles for organizing intelligence? And still find the way back to the original data?

PS: Do you remember the management book “Who Moved My Cheese?” Moving intelligence from one system to another can result in: “Who Moved My Intelligence?,” when it can no longer be discovered by its originator. Not to mention the intelligence will lack the context of its point of origin.

April 3, 2014

Developing a 21st Century Global Library for Mathematics Research

Filed under: Identification,Identifiers,Identity,Mathematics,Subject Identity — Patrick Durusau @ 8:58 pm

Developing a 21st Century Global Library for Mathematics Research by Committee on Planning a Global Library of the Mathematical Sciences.

Care to guess what one of the major problems facing mathematical research might be?

Currently, there are no satisfactory indexes of many mathematical objects, including symbols and their uses, formulas, equations, theorems, and proofs, and systematically labeling them is challenging and, as of yet, unsolved. In many fields where there are more specialized objects (such as groups, rings, fields), there are community efforts to index these, but they are typically not machine-readable, reusable, or easily integrated with other tools and are often lacking editorial efforts. So, the issue is how to identify existing lists that are useful and valuable and provide some central guidance for further development and maintenance of such lists. (p. 26)

Does that surprise you?

What do you think the odds are of mathematical research slowing down enough for committees to decide on universal identifiers for all the subjects in mathematical publications?

That’s about what I thought.

I have a different solution: Why not ask mathematicians who are submitting articles for publication to identify (specify properties for) what they consider to be the important subjects in their article?

The authors have the knowledge and skill, not to mention the motivation of wanting their research to be easily found by others.

Over time I suspect that particular fields will develop standard identifications (sets of properties per subject) that mathematicians can reuse to save themselves time when publishing.

Mappings across those sets of properties will be needed but that can be the task of journals, researchers and indexers who have an interest and skill in that sort of enterprise.

As opposed to having a “boil the ocean” approach that tries to do more than any one project is capable of doing competently.

Distributed subject identification is one way to think about it. We already do it; this would be a semi-formalization of that process, writing down what each author already knows.
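A toy rendering of the idea, with property names and values invented for illustration: each author supplies a property set for a subject, and descriptions merge whenever an agreed-upon identifying property matches.

```python
# Property names and values are illustrative only.
descriptions = [
    {"label": "Riemann zeta function", "msc": "11M06", "notation": "zeta(s)"},
    {"label": "zeta function (Riemann)", "msc": "11M06", "defined_by": "Dirichlet series"},
    {"label": "Dedekind zeta function", "msc": "11R42"},
]

def merge_on(key, descs):
    """Group descriptions that share a value for one identifying property."""
    groups = {}
    for d in descs:
        value = d.get(key)
        if value is not None:
            groups.setdefault(value, []).append(d)
    return list(groups.values())

for group in merge_on("msc", descriptions):
    print([d["label"] for d in group])
# The first two descriptions merge; the third stays distinct.
```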

Thoughts?

PS: I suspect the condition recited above is true for almost any sufficiently large field of study. A set of 150 million entities sounds large only without context. In the context of science, it is a trivial number of entities.

March 19, 2014

Search Gets Smarter with Identifiers

Filed under: EU,Identifiers,Subject Identifiers,Subject Identity — Patrick Durusau @ 3:36 pm

Search Gets Smarter with Identifiers

From the post:

The future of computing is based on Big Data. The vast collections of information available on the web and in the cloud could help prevent the next financial crisis, or even tell you exactly when your bus is due. The key lies in giving everything – whether it’s a person, business or product – a unique identifier.

Imagine if everything you owned or used had a unique code that you could scan, and that would bring you a wealth of information. Creating a database of billions of unique identifiers could revolutionise the way we think about objects. For example, if every product that you buy can be traced through every step in the supply chain you can check whether your food has really come from an organic farm or whether your car is subject to an emergency recall.

….

The difficulty with using big data is that the person or business named in one database might have a completely different name somewhere else. For example, news reports talk about Barack Obama, The US President, and The White House interchangeably. For a human being, it’s easy to know that these names all refer to the same person, but computers don’t know how to make these connections. To address the problem, Okkam has created a Global Open Naming System: essentially an index of unique entities like people, organisations and products, that lets people share data.

“We provide a very fast and effective way of discovering data about the same entities across a variety of sources. We do it very quickly,” says Paolo Bouquet. “And we do it in a way that it is incremental so you never waste the work you’ve done. Okkam’s entity naming system allows you to share the same identifiers across different projects, different companies, different data sets. You can always build on top of what you have done in the past.”

The benefits of a unique name for everything

http://www.okkam.org/

The community website: http://community.okkam.org/ reports 8.5+ million entities.

When the EU/CORDIS show up late for a party, it’s really late.

A multi-lingual organization like the EU (kudos on their efforts in that direction) should know that uniformity of language or identifiers is found only in dystopian fiction.

I prefer the language and cultural richness of Europe over the sterile uniformity of American fast food chains. Same issue.

You?

I first saw this in a tweet by Stefano Bertolo.

February 17, 2014

Understanding Classic SoundEx Algorithms

Filed under: Algorithms,SoundEx,Subject Identity — Patrick Durusau @ 8:53 pm

Understanding Classic SoundEx Algorithms

From the webpage:

Terms that are often misspelled can be a problem for database designers. Names, for example, are variable length, can have strange spellings, and they are not unique. American names have a diversity of ethnic origins, which give us names pronounced the same way but spelled differently and vice versa.

Words too, can be misspelled or have multiple spellings, especially across different cultures or national sources.

To help solve this problem, we need phonetic algorithms which can find similar sounding terms and names. Just such a family of algorithms exist and have come to be called SoundEx algorithms.

A Soundex search algorithm takes a written word, such as a person’s name, as input, and produces a character string that identifies a set of words that are (roughly) phonetically alike. It is very handy for searching large databases when the user has incomplete data.

The method used by Soundex is based on the six phonetic classifications of human speech sounds (bilabial, labiodental, dental, alveolar, velar, and glottal), which are themselves based on where you put your lips and tongue to make the sounds.

The algorithm itself is fairly straight forward to code and requires no backtracking or multiple passes over the input word. In fact, it is so straight forward, I will start (after a history section) by presenting it as an outline. Further on, I will give C, JavaScript, Perl, and VB code that implements the two standard algorithms used in the American Census as well as an enhanced version, which is described in this article.
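For a quick feel for the method before working through the article’s C, JavaScript, Perl and VB versions, here is a compact Python rendering of the classic American Soundex rules (my sketch, not the article’s code):

```python
def soundex(name: str) -> str:
    """Classic American Soundex: first letter plus three digits."""
    codes = {
        **dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
        **dict.fromkeys("DT", "3"), "L": "4",
        **dict.fromkeys("MN", "5"), "R": "6",
    }
    letters = "".join(ch for ch in name.upper() if ch.isalpha())
    if not letters:
        return ""
    result = [letters[0]]
    prev = codes.get(letters[0], "")      # code of the previous coded letter
    for ch in letters[1:]:
        code = codes.get(ch, "")
        if code and code != prev:         # collapse runs of the same sound
            result.append(code)
        if ch not in "HW":                # H/W do not break a run; vowels do
            prev = code
    return ("".join(result) + "000")[:4]

# Commonly cited reference values:
print(soundex("Robert"), soundex("Rupert"))    # R163 R163
print(soundex("Tymczak"), soundex("Pfister"))  # T522 P236
```

Names that sound alike collapse to the same code, which is the whole point: the code records the likely confusion, not the spelling.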

A timely reminder that knowing what is likely to be confused can be more powerful than the details of any particular confusion.

Even domain level semantics may be too difficult to capture. What if we were to capture only the known cases of confusion?

That would be a much smaller set than the domain in general and easier to maintain. (As well as to distinguish in a solution.)

January 29, 2014

Change Tracking, Excel, and Subjects

Filed under: Spreadsheets,Subject Identity — Patrick Durusau @ 2:21 pm

Change tracking is an active topic of discussion in the OpenDocument TC at OASIS. So much so that a sub-committee was formed to create a change tracking proposal for ODF 1.3. OpenDocument – Advanced Document Collaboration SC

In a recent discussion, the sub-committee was reminded that in MS Excel, change tracking is only engaged when working on a “shared workbook.”

If I am working on a non-shared workbook, any changes I make, of whatever nature, formatting, data in cells, formulas, etc., are not being tracked.

Without change tracking, what are several subjects we can’t talk about in an Excel spreadsheet?

  1. We can’t talk about the author of a particular change.
  2. We can’t talk about the author of a change relative to other people or events (such as emails).
  3. We can’t talk about a prior value or formula “as it was.”
  4. We can’t talk about the origin of a prior value or formula.
  5. We can’t talk about a prior value or formula as compared to a later value or formula.

Transparency is the watchword of government and industry.

Opacity is the watchword of spreadsheet change tracking.

Do you see a conflict there?

Supporting the development of change tracking in OpenDocument (ODF) at the OpenDocument TC could shine a bright light in a very dark place.

January 18, 2014

How to Query the StackExchange Databases

Filed under: Data,Subject Identity,Topic Maps — Patrick Durusau @ 8:29 pm

How to Query the StackExchange Databases by Brent Ozar.

From the post:

During next week’s Watch Brent Tune Queries webcast, I’m using my favorite demo database: Stack Overflow. The Stack Exchange folks are kind enough to make all of their data available via BitTorrent for Creative Commons usage as long as you properly attribute the source.

There’s two ways you can get started writing queries against Stack’s databases – the easy way and the hard way.
….

I’m sure you have never found duplicate questions or answers on StackExchange.

But just in case such a thing existed, detecting and merging the duplicates from StackExchange would be a good exercise in data analysis, subject identification, etc.

😉
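A minimal sketch of what a first pass at that exercise might look like, assuming you have exported question Ids and Titles to a CSV (for example via the Stack Exchange Data Explorer). The column names and threshold are my assumptions, and the pairwise loop would need blocking (MinHash, say) on anything beyond a small sample:

```python
import csv
from difflib import SequenceMatcher

def near_duplicate_titles(path: str, threshold: float = 0.9):
    """Return (id_a, id_b, score) for question titles that look alike."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))      # expects columns: Id, Title
    pairs = []
    for i, a in enumerate(rows):
        for b in rows[i + 1:]:
            score = SequenceMatcher(None, a["Title"].lower(),
                                    b["Title"].lower()).ratio()
            if score >= threshold:
                pairs.append((a["Id"], b["Id"], round(score, 2)))
    return pairs

# print(near_duplicate_titles("questions.csv"))
```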

BTW, Brent’s webinar is 21 January 2014, or next Tuesday (as of this post).

Enjoy!

January 3, 2014

Wikibase DataModel released!

Filed under: Data Models,Identification,Precision,Subject Identity,Wikidata,Wikipedia — Patrick Durusau @ 5:04 pm

Wikibase DataModel released! by Jeroen De Dauw.

From the post:

I’m happy to announce the 0.6 release of Wikibase DataModel. This is the first real release of this component.

DataModel?

Wikibase is the software behind Wikidata.org. At its core, this software is about describing entities. Entities are collections of claims, which can have qualifiers, references and values of various different types. How this all fits together is described in the DataModel document written by Markus and Denny at the start of the project. The Wikibase DataModel component contains (PHP) domain objects representing entities and their various parts, as well as associated domain logic.

I wanted to draw your attention to this discussion of “items:”

Items are Entities that are typically represented by a Wikipage (at least in some Wikipedia languages). They can be viewed as “the thing that a Wikipage is about,” which could be an individual thing (the person Albert Einstein), a general class of things (the class of all Physicists), and any other concept that is the subject of some Wikipedia page (including things like History of Berlin).

The IRI of an Item will typically be closely related to the URL of its page on Wikidata. It is expected that Items store a shorter ID string (for example, as a title string in MediaWiki) that is used in both cases. ID strings might have a standardized technical format such as “wd1234567890” and will usually not be seen by users. The ID of an Item should be stable and not change after it has been created.

The exact meaning of an Item cannot be captured in Wikidata (or any technical system), but is discussed and decided on by the community of editors, just as it is done with the subject of Wikipedia articles now. It is possible that an Item has multiple “aspects” to its meaning. For example, the page Orca describes a species of whales. It can be viewed as a class of all Orca whales, and an individual whale such as Keiko would be an element of this class. On the other hand, the species Orca is also a concept about which we can make individual statements. For example, one could say that the binomial name (a Property) of the Orca species has the Value “Orcinus orca (Linnaeus, 1758).”

However, it is intended that the information stored in Wikidata is generally about the topic of the Item. For example, the Item for History of Berlin should store data about this history (if there is any such data), not about Berlin (the city). It is not intended that data about one subject is distributed across multiple Wikidata Items: each Item fully represents one thing. This also helps for data integration across languages: many languages have no separate article about Berlin’s history, but most have an article about Berlin.

What do you make of the claim:

The exact meaning of an Item cannot be captured in Wikidata (or any technical system), but is discussed and decided on by the community of editors, just as it is done with the subject of Wikipedia articles now. It is possible that an Item has multiple “aspects” to its meaning. For example, the page Orca describes a species of whales. It can be viewed as a class of all Orca whales, and an individual whale such as Keiko would be an element of this class. On the other hand, the species Orca is also a concept about which we can make individual statements. For example, one could say that the binomial name (a Property) of the Orca species has the Value “Orcinus orca (Linnaeus, 1758).”

I may write an information system that fails to distinguish between a species of whales, a class of whales and a particular whale, but that is a design choice, not a foregone conclusion.

In the case of Wikipedia, which relies upon individuals repeating the task of extracting relevant information from loosely gathered data, that approach works quite well.

But there isn’t one degree of precision of identification that works for all cases.

My suspicion is that for more demanding search applications, such as drug interactions, less precise identifications could lead to unfortunate, even fatal, results.

Yes?

November 15, 2013

Thinking, Fast and Slow (Review) [And Subject Identity]

A statistical review of ‘Thinking, Fast and Slow’ by Daniel Kahneman by Patrick Burns.

From the post:

We are good intuitive grammarians — even quite small children intuit language rules. We can see that from mistakes. For example: “I maked it” rather than the irregular “I made it”.

In contrast those of us who have training and decades of experience in statistics often get statistical problems wrong initially.

Why should there be such a difference?

Our brains evolved for survival. We have a mind that is exquisitely tuned for finding things to eat and for avoiding being eaten. It is a horrible instrument for finding truth. If we want to get to the truth, we shouldn’t start from here.

A remarkable aspect of your mental life is that you are rarely stumped. … you often have answers to questions that you do not completely understand, relying on evidence that you can neither explain nor defend.

The review focuses mainly on statistical issues in “Thinking Fast and Slow” but I think you will find it very entertaining.

I deeply appreciate Patrick’s quoting of:

A remarkable aspect of your mental life is that you are rarely stumped. … you often have answers to questions that you do not completely understand, relying on evidence that you can neither explain nor defend.

In particular:

…relying on evidence that you can neither explain nor defend.

which resonates with me on subject identification.

Think about how we search for subjects, which of necessity involves some notion of subject identity.

What if a colleague asks if they should consult the records of the Order of the Garter to find more information on “Lady Gaga?”

Not entirely unreasonable since “Lady” is conferred upon female recipients of the Order of the Garter.

No standard search technique would explain why your colleague should not start with the Order of the Garter records.

Although I think most of us would agree such a search would be far afield. 😉

Every search starts with a searcher relying upon what they “know,” suspect or guess to be facts about a “subject” to search on.

At the end of the search, the characteristics of the subject as found, turn out to be the characteristics we were searching for all along.

I say all that to suggest that we need not bother users to say how in fact to identify the objects of their searches.

Rather the question should be:

What pointers or contexts are the most helpful to you when searching? (May or may not be properties of the search objective.)

Recalling that properties of the search objective are how we explain successful searches, not how we perform them.

Calling upon users to explain or make explicit what they themselves don’t understand seems like a poor strategy for adoption of topic maps.

Capturing what “works” for a user, without further explanation or difficulty seems like the better choice.


PS: Should anyone ask about “Lady Gaga,” you can mention that Glamour magazine featured her on its cover, naming her Woman of the Year (December 2013 issue). I know that only because of a trip to the local drug store for a flu shot.

Promised I would be “in and out” in minutes. Literally true I suppose, it only took 50 minutes with four other people present when I arrived.

I have a different appreciation of “minutes” from the pharmacy staff. 😉

November 8, 2013

Restructuring the Web with Git

Filed under: Git,Github,Subject Identity — Patrick Durusau @ 8:04 pm

Restructuring the Web with Git by Simon St. Laurent.

From the post:

Web designers? Git? Github? Aren’t those for programmers? At Artifact, Christopher Schmitt showed designers how much their peers are already doing with Github, and what more they can do. Github (and the underlying Git toolset) changes the way that all kinds of people work together.

Sharing with Git

As amazing as Linux may be, I keep thinking that Git may prove to be Linus Torvalds’ most important contribution to computing. Most people think of it, if they think of it at all, as a tool for managing source code. It can do far more, though, providing a drastically different (and I think better) set of tools for managing distributed projects, especially those that use text.

Git tackles an unwieldy problem, managing the loosely structured documents that humans produce. Text files are incredibly flexible, letting us store everything from random notes to code of all kinds to tightly structured data. As awesome as text files are—readable, searchable, relatively easy to process—they tend to become a mess when there’s a big pile of them.

Simon makes a good argument for the version control and sharing aspects of Github.

But Github doesn’t offer any features (that I am aware of) to manage the semantics of the data stored at Github.

For example, if I search for “greek,” I am returned results that include the Greek language, Greek mythology, New Testament Greek, etc.

There are only four hundred and sixty-five (465) results as of today but even if I look at all of them, I have no reason to think I have found all the relevant resources.

For example, a search on Greek Mythology would miss:

Myths-and-myth-makers–Old-Tales-and-Superstitions-Interpreted-by-Comparative-Mythology_1061, which has one hundred and four (104) references to Greek gods/mythology.

Moreover, now having discovered this work should be returned on a search for Greek Mythology, how do I impart that knowledge to the system so that future users will find that work?

Github works quite well, but it has a ways to go before it improves on the finding of documents.

September 10, 2013

Clusters and DBScan

Filed under: Clustering,K-Means Clustering,Subject Identity — Patrick Durusau @ 9:46 am

Clusters and DBScan by Jesse Johnson.

From the post:

A few weeks ago, I mentioned the idea of a clustering algorithm, but here’s a recap of the idea: Often, a single data set will be made up of different groups of data points, each of which corresponds to a different type of point or a different phenomenon that generated the points. For example, in the classic iris data set, the coordinates of each data point are measurements taken from an iris flower. There are 150 data points, with 50 from each of three species. As one might expect, these data points form three (mostly) distinct groups, called clusters. For a general data set, if we know how many clusters there are and that each cluster is a simple shape like a Gaussian blob, we could determine the structure of the data set using something like K-means or a mixture model. However, in many cases the clusters that make up a data set do not have a simple structure, or we may not know how many there are. In these situations, we need a more flexible algorithm. (Note that K-means is often thought of as a clustering algorithm, but note I’m going to, since it assumes a particular structure for each cluster.)

Jesse has started a series of posts on clustering that you will find quite useful.

Particularly if you share my view that clustering is the semantic equivalent of “merging” in TMDM terms without the management of item identifiers.

In the final comment in parentheses, “Note that K-means…” is awkwardly worded. From later in the post you learn that Jesse doesn’t consider K-means to be a clustering algorithm at all.

The Wikipedia entry on DBScan reports that scikit-learn includes a Python implementation of DBScan.
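As a quick taste of that implementation, here it is run on the iris measurements mentioned in Jesse’s post; the eps and min_samples values are starting points I chose for the sketch, not tuned recommendations:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

# DBSCAN needs no cluster count up front; points in sparse regions get label -1.
labels = DBSCAN(eps=0.6, min_samples=5).fit_predict(X)
print(sorted(set(labels)))  # e.g. [-1, 0, 1]: noise plus the clusters it found
```

Note that, unlike K-means, nothing above says how many clusters to expect, which is exactly the flexibility Jesse is after.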

August 16, 2013

Dynamic Simplification

Filed under: Graphics,Subject Identity,Topic Maps,Visualization — Patrick Durusau @ 3:18 pm

Dynamic Simplification by Mike Bostock.

From the post:

A combination of the map zooming and dynamic simplification demonstrations: as the map zooms in and out, the simplification area threshold is adjusted so that it is always appropriate to the current scale. Thus, the map looks good and renders quickly at all points during the animation.

While d3.js is the secret sauce here, I am posting this for the notion of “dynamic simplification.”

What if the presentation of a topic map were to use “dynamic simplification?”

Say that I have a topic map with topics for all the tweets on some major event. (Lady Gaga’s latest video (NSFW), for example.)

The number of tweets for some locations would display as a mass of dots. Not terribly informative.

If, on the other hand, from say a country-wide perspective, the tweets were displayed as a solid form and only on zooming in did they become distinguished (looking to see if Dick Cheney tweeted about it), that would be more useful.

Or at least more useful for some use cases.

The Dynamic Simplification demo is part of a large collection of amazing visuals you will find at: http://bl.ocks.org/mbostock.

July 30, 2013

Subject Identity Obfuscation?

Filed under: Cryptography,Encryption,Subject Identity,Topic Maps — Patrick Durusau @ 9:10 am

Computer Scientists Develop ‘Mathematical Jigsaw Puzzles’ to Encrypt Software

From the post:

UCLA computer science professor Amit Sahai and a team of researchers have designed a system to encrypt software so that it only allows someone to use a program as intended while preventing any deciphering of the code behind it. This is known in computer science as “software obfuscation,” and it is the first time it has been accomplished.

It was the line “…and this is the first time it has been accomplished.” that caught my attention.

I could name several popular scripting languages, at the expense of starting a flame war, that would qualify as “software obfuscation.” 😉

Further from the post:

According to Sahai, previously developed techniques for obfuscation presented only a “speed bump,” forcing an attacker to spend some effort, perhaps a few days, trying to reverse-engineer the software. The new system, he said, puts up an “iron wall,” making it impossible for an adversary to reverse-engineer the software without solving mathematical problems that take hundreds of years to work out on today’s computers — a game-change in the field of cryptography.

The researchers said their mathematical obfuscation mechanism can be used to protect intellectual property by preventing the theft of new algorithms and by hiding the vulnerability a software patch is designed to repair when the patch is distributed.

“You write your software in a nice, reasonable, human-understandable way and then feed that software to our system,” Sahai said. “It will output this mathematically transformed piece of software that would be equivalent in functionality, but when you look at it, you would have no idea what it’s doing.”

The key to this successful obfuscation mechanism is a new type of “multilinear jigsaw puzzle.” Through this mechanism, attempts to find out why and how the software works will be thwarted with only a nonsensical jumble of numbers.

The paper has this title: Candidate Indistinguishability Obfuscation and Functional Encryption for all circuits by Sanjam Garg, Craig Gentry, Shai Halevi, Mariana Raykova, Amit Sahai, and Brent Waters.

Abstract:

In this work, we study indistinguishability obfuscation and functional encryption for general circuits:

Indistinguishability obfuscation requires that given any two equivalent circuits C_0 and C_1 of similar size, the obfuscations of C_0 and C_1 should be computationally indistinguishable.

In functional encryption, ciphertexts encrypt inputs x and keys are issued for circuits C. Using the key SK_C to decrypt a ciphertext CT_x = Enc(x), yields the value C(x) but does not reveal anything else about x. Furthermore, no collusion of secret key holders should be able to learn anything more than the union of what they can each learn individually.

We give constructions for indistinguishability obfuscation and functional encryption that supports all polynomial-size circuits. We accomplish this goal in three steps:

  • We describe a candidate construction for indistinguishability obfuscation for NC1 circuits. The security of this construction is based on a new algebraic hardness assumption. The candidate and assumption use a simplified variant of multilinear maps, which we call Multilinear Jigsaw Puzzles.
  • We show how to use indistinguishability obfuscation for NC1 together with Fully Homomorphic Encryption (with decryption in NC1) to achieve indistinguishability obfuscation for all circuits.
  • Finally, we show how to use indistinguishability obfuscation for circuits, public-key encryption, and non-interactive zero knowledge to achieve functional encryption for all circuits. The functional encryption scheme we construct also enjoys succinct ciphertexts, which enables several other applications.

When a paper has a table of contents following the abstract, you know it isn’t a short paper. Forty-three (43) pages counting the supplemental materials. Most of it very heavy sledding.
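
The core indistinguishability requirement quoted in the abstract can, though, be stated in one line. This is the standard textbook formalization rather than notation taken from the paper: for equivalent circuits C_0 and C_1 of equal size and any efficient distinguisher D,

\[
\bigl|\Pr[D(i\mathcal{O}(C_0)) = 1] - \Pr[D(i\mathcal{O}(C_1)) = 1]\bigr| \le \mathrm{negl}(\lambda),
\]

where \(\lambda\) is the security parameter and \(\mathrm{negl}\) is a negligible function.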

I think this paper has important implications for sharing topic map based data.

In general, as with other data, but especially with regard to subject identity and merging rules.

It may well be the case that a subject of interest to you exists in a topic map, but if you cannot access its subject identity in enough detail to trigger merging, it will not exist for you.

One can even imagine that a subject may be accessible for screen display but not for copying to a “Snowden drive.” 😉

BTW, I have downloaded a copy of the paper. Suggest you do the same.

Just in case it goes missing several years from now when government security agencies realize its potential.

May 14, 2013

Information organization and the philosophy of history

Filed under: History,Library,Philosophy,Subject Identity — Patrick Durusau @ 3:54 pm

Information organization and the philosophy of history by Ryan Shaw. (Shaw, R. (2013), Information organization and the philosophy of history. J. Am. Soc. Inf. Sci., 64: 1092–1103. doi: 10.1002/asi.22843)

Abstract:

The philosophy of history can help articulate problems relevant to information organization. One such problem is “aboutness”: How do texts relate to the world? In response to this problem, philosophers of history have developed theories of colligation describing how authors bind together phenomena under organizing concepts. Drawing on these ideas, I present a theory of subject analysis that avoids the problematic illusion of an independent “landscape” of subjects. This theory points to a broad vision of the future of information organization and some specific challenges to be met.

You are unlikely to find this article directly actionable in your next topic map project.

On the other hand, if you enjoy the challenge of thinking about how we think, you will find it a real treat.

Shaw writes:

Different interpretive judgments result in overlapping and potentially contradictory organizing principles. Organizing systems ought to make these overlappings evident and show the contours of differences in perspective that distinguish individual judgments. Far from providing a more “complete” view of a static landscape, organizing systems should multiply and juxtapose views. As Geoffrey Bowker (2005) has argued,

the goal of metadata standards should not be to produce a convergent unity. We need to open a discourse—where there is no effective discourse now—about the varying temporalities, spatialities and materialities that we might represent in our databases, with a view to designing for maximum flexibility and allowing as much as possible for an emergent polyphony and polychrony. (pp. 183–184)

The demand for polyphony and polychrony leads to a second challenge, which is to find ways to open the construction of organizing systems to wider participation. How might academics, librarians, teachers, public historians, curators, archivists, documentary editors, genealogists, and independent scholars all contribute to a shared infrastructure for linking and organizing historical discourse through conceptual models? If this challenge can be addressed, the next generation of organizing systems could provide the infrastructure for new kinds of collaborative scholarship and organizing practice.

Once upon a time, you could argue that physical limitations of cataloging systems meant that a single classification system (convergent unity) was necessary for systems to work at all.

But that was an artifact of the physical medium of the catalog.

The deepest irony of the digital age is the continuation of the single classification system requirement, a requirement past its discard date.

April 2, 2013

Construction of Controlled Vocabularies

Filed under: Identity,Subject Identity,Subject Recognition,Vocabularies — Patrick Durusau @ 2:01 pm

Construction of Controlled Vocabularies: A Primer by Marcia Lei Zeng.

From the “why” page:

Vocabulary control is used to improve the effectiveness of information storage and retrieval systems, Web navigation systems, and other environments that seek to both identify and locate desired content via some sort of description using language. The primary purpose of vocabulary control is to achieve consistency in the description of content objects and to facilitate retrieval.

1.1 Need for Vocabulary Control

The need for vocabulary control arises from two basic features of natural language, namely:

• Two or more words or terms can be used to represent a single concept

Example:
  salinity/saltiness
  VHF/Very High Frequency

• Two or more words that have the same spelling can represent different concepts

Example:
  Mercury (planet)
  Mercury (metal)
  Mercury (automobile)
  Mercury (mythical being)

Great examples not only for vocabulary control but for topic maps as well!

The topic map question is:

What do you know about the subject(s) in either case that would make you say the words name the same subject or different subjects?

If we can capture the information you think makes them represent the same or different subjects, there is a basis for repeating that comparison.

Perhaps even automatically.
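
A minimal Python sketch of what “capturing” might look like, with identifying property names invented for the example; the point is only that once those properties are recorded, the same/different judgment can be re-run by a machine.

# Hypothetical identifying properties recorded for each reading of a word.
IDENTITY_KEYS = {"kind", "symbol", "orbits"}

def same_subject(a, b):
    # Two descriptions count as the same subject only if they share at least
    # one identifying property and agree on every one they share.
    shared = set(a) & set(b) & IDENTITY_KEYS
    return bool(shared) and all(a[k] == b[k] for k in shared)

mercury_planet = {"name": "Mercury", "kind": "planet", "orbits": "Sun"}
mercury_metal  = {"name": "Mercury", "kind": "chemical element", "symbol": "Hg"}
quicksilver    = {"name": "quicksilver", "kind": "chemical element", "symbol": "Hg"}

print(same_subject(mercury_planet, mercury_metal))  # False: same word, different subjects
print(same_subject(mercury_metal, quicksilver))     # True: different words, same subject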

Mary Jane pointed out this resource in a recent comment.

March 16, 2013

The Next 700 Programming Languages
[Essence of Topic Maps]

Filed under: Language Design,Programming,Subject Identity,Topic Maps — Patrick Durusau @ 1:14 pm

The Next 700 Programming Languages by P. J. Landin.

ABSTRACT:

A family of unimplemented computing languages is described that is intended to span differences of application area by a unified framework. This framework dictates the rules about the uses of user-coined names, and the conventions about characterizing functional relationships. Within this framework the design of a specific language splits into two independent parts. One is the choice of written appearances of programs (or more generally, their physical representation). The other is the choice of the abstract entities (such as numbers, character-strings, lists of them, functional relations among them) that can be referred to in the language.

The system is biased towards “expressions” rather than “statements.” It includes a nonprocedural (purely functional) subsystem that aims to expand the class of users’ needs that can be met by a single print-instruction, without sacrificing the important properties that make conventional right-hand-side expressions easy to construct and understand.

The introduction to this paper reminded me of an acronym, SWIM (See What I Mean), that was coined, to my knowledge, by Michel Biezunski several years ago:

Most programming languages are partly a way of expressing things in terms of other things and partly a basic set of given things. The ISWIM (If you See What I Mean) system is a byproduct of an attempt to disentangle these two aspects in some current languages.

This attempt has led the author to think that many linguistic idiosyncracies are concerned with the former rather than the latter, whereas aptitude for a particular class of tasks is essentially determined by the latter rather than the former. The conclusion follows that many language characteristics are irrelevant to the alleged problem orientation.

ISWIM is an attempt at a general purpose system for describing things in terms of other things, that can be problem-oriented by appropriate choice of “primitives.” So it is not a language so much as a family of languages, of which each member is the result of choosing a set of primitives. The possibilities concerning this set and what is needed to specify such a set are discussed below.

The essence of topic maps is captured by:

ISWIM is an attempt at a general purpose system for describing things in terms of other things, that can be problem-oriented by appropriate choice of “primitives.”

Every information system has a set of terms, the meanings of which are known to its designers and/or users.

Data integration issues arise because those descriptions of terms, “in terms of other things,” are known only to designers and users.

The power of topic maps comes from expressing those descriptions of terms, “in terms of other things,” explicitly.

Other designers or users can examine those descriptions to see if they recognize any terms similar to those they know by other descriptions.

If they discover descriptions they consider to be of the same thing, they can then create a mapping of those terms.

Hopefully they will use the descriptions as the basis for that mapping; a mapping of term to term only multiplies the opaqueness of the terms.
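
As a sketch of that difference, here is a toy Python mapping routine; the vocabularies, property names and comparison rule are all hypothetical. The point is that each mapping carries the descriptions that justify it, rather than being a bare term-to-term link.

def map_terms(vocab_a, vocab_b, same_subject):
    # vocab_a / vocab_b: term -> description, where a description is a dict of
    # properties saying what the term means "in terms of other things."
    # same_subject: the comparison rule chosen by whoever wrote the descriptions.
    mappings = []
    for term_a, desc_a in vocab_a.items():
        for term_b, desc_b in vocab_b.items():
            if same_subject(desc_a, desc_b):
                # Keep the descriptions as evidence for the mapping.
                mappings.append((term_a, term_b, (desc_a, desc_b)))
    return mappings

hr      = {"emp_no":   {"means": "employee identifier", "issued_by": "HR"}}
payroll = {"staff_id": {"means": "employee identifier", "issued_by": "HR"}}

print(map_terms(hr, payroll, lambda a, b: a == b))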

For some systems, Social Security Administration databases for example, descriptions of terms “in terms of other things” may not be part of the database itself, but rather descriptions maintained as a “best practice” to facilitate later maintenance and changes.

For other systems, the U.S. intelligence community as another example, still chasing the will-o’-the-wisp* of standard terminology for non-standard terms, even the possibility of interchange depends on developing descriptions of terms “in terms of other things.”

Before you ask: yes, yes, the Topic Maps Data Model (TMDM) and the various Topic Maps syntaxes are terms that can be described “in terms of other things.”

The advantage of the TMDM and the relevant syntaxes is that even if not described “in terms of other things,” standardized terms enable interchange of a class of mappings. The default identification mapping in the TMDM is by IRIs.

Before and since Landin’s article we have been producing terms that could be described “in terms of other things,” in CS and in other areas of human endeavor as well.

Isn’t it about time we started describing our terms rather than clamoring for one set of undescribed terms or another?


* I use the term will-o’-the-wisp quite deliberately.

After decades of failure to create universal information systems with computers, following on centuries of non-computer failures to reach the same goal, following on millennia of semantic and linguistic diversity, someone must know that attempts at universal information systems will leave intelligence agencies not sharing critical data.

Perhaps the method you choose says a great deal about the true goals of your project.

I first saw this in a tweet by CompSciFact.

March 11, 2013

Onomastics 2.0 – The Power of Social Co-Occurrences

Filed under: co-occurrence,Names,Onomastics,Subject Identity — Patrick Durusau @ 6:45 am

Onomastics 2.0 – The Power of Social Co-Occurrences by Folke Mitzlaff, Gerd Stumme.

Abstract:

Onomastics is “the science or study of the origin and forms of proper names of persons or places.” [“Onomastics”. Merriam-Webster.com, 2013. this http URL (11 February 2013)]. Especially personal names play an important role in daily life, as all over the world future parents are facing the task of finding a suitable given name for their child. This choice is influenced by different factors, such as the social context, language, cultural background and, in particular, personal taste.

With the rise of the Social Web and its applications, users more and more interact digitally and participate in the creation of heterogeneous, distributed, collaborative data collections. These sources of data also reflect current and new naming trends as well as new emerging interrelations among names.

The present work shows, how basic approaches from the field of social network analysis and information retrieval can be applied for discovering relations among names, thus extending Onomastics by data mining techniques. The considered approach starts with building co-occurrence graphs relative to data from the Social Web, respectively for given names and city names. As a main result, correlations between semantically grounded similarities among names (e.g., geographical distance for city names) and structural graph based similarities are observed.

The discovered relations among given names are the foundation of “nameling” [this http URL], a search engine and academic research platform for given names which attracted more than 30,000 users within four months, underpinning the relevance of the proposed methodology.

Interesting work on the co-occurrence of names.

Chosen names in this case, but I wonder whether the same would be true for false names.

Are there patterns to false names chosen by actors who are attempting to conceal their identities?
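
The paper’s basic machinery is simple enough to sketch in a few lines of Python; the names, documents and similarity measure below are invented, but they follow the co-occurrence-plus-structural-similarity recipe described in the abstract.

from collections import defaultdict
from itertools import combinations

def cooccurrence_graph(documents):
    # documents: iterable of name lists (tweets, profiles, ...).
    # Returns edge weights: frozenset({a, b}) -> number of co-occurrences.
    edges = defaultdict(int)
    for names in documents:
        for a, b in combinations(sorted(set(names)), 2):
            edges[frozenset((a, b))] += 1
    return edges

def structural_similarity(edges, name_a, name_b):
    # Jaccard overlap of the two names' co-occurrence neighbourhoods.
    neighbours = defaultdict(set)
    for pair in edges:
        x, y = tuple(pair)
        neighbours[x].add(y)
        neighbours[y].add(x)
    na, nb = neighbours[name_a], neighbours[name_b]
    return len(na & nb) / len(na | nb) if na | nb else 0.0

docs = [["Anna", "Lena", "Marie"], ["Anna", "Marie"], ["Lena", "Sophie"]]
graph = cooccurrence_graph(docs)
print(structural_similarity(graph, "Anna", "Lena"))

The same code runs unchanged over aliases instead of given names, which would be one way to start probing the false-name question.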

I first saw this in a tweet by Stefano Bertolo.

March 6, 2013

VIAF: The Virtual International Authority File

Filed under: Authority Record,Library,Library Associations,Merging,Subject Identity — Patrick Durusau @ 11:19 am

VIAF: The Virtual International Authority File

From the webpage:

VIAF, implemented and hosted by OCLC, is a joint project of several national libraries plus selected regional and trans-national library agencies. The project’s goal is to lower the cost and increase the utility of library authority files by matching and linking widely-used authority files and making that information available on the Web.

The “about” link at the bottom of the page is broken (in the English version). A working “about” link for VIAF reports:

At a glance

  • A collaborative effort between national libraries and organizations contributing name authority files, furthering access to information
  • All authority data for a given entity is linked together into a “super” authority record
  • A convenient way for the library community and other agencies to repurpose bibliographic data produced by libraries serving different language communities

The Virtual International Authority File (VIAF) is an international service designed to provide convenient access to the world’s major name authority files. Its creators envision the VIAF as a building block for the Semantic Web to enable switching of the displayed form of names for persons to the preferred language and script of the Web user. VIAF began as a joint project with the Library of Congress (LC), the Deutsche Nationalbibliothek (DNB), the Bibliothèque nationale de France (BNF) and OCLC. It has, over the past decade, become a cooperative effort involving an expanding number of other national libraries and other agencies. At the beginning of 2012, contributors include 20 agencies from 16 countries.

Most large libraries maintain lists of names for people, corporations, conferences, and geographic places, as well as lists to control works and other entities. These lists, or authority files, have been developed and maintained in distinctive ways by individual library communities around the world. The differences in how to approach this work become evident as library data from many communities is combined in shared catalogs such as OCLC’s WorldCat.

VIAF helps to make library authority files less expensive to maintain and more generally useful to the library domain and beyond. To achieve this, VIAF matches and links the authority files of national libraries and groups all authority records for a given entity into a merged “super” authority record that brings together the different names for that entity. By linking disparate names for the same person or organization, VIAF provides a convenient means for a wider community of libraries and other agencies to repurpose bibliographic data produced by libraries serving different language communities.

If you were to substitute the term “topic” for “‘super’ authority record,” you would be part of the way towards a topic map.

Topics gather information about a given entity into a single location.

Topics differ from the authority records you find at VIAF in two very important ways:

  1. First, topics, unlike authority records, have the ability to merge with other topics, creating new topics that have more information than any of the original topics.
  2. Second, authority records are created by, well, authorities. Do you see your name or the name of your organization on the list at VIAF? Topics can be created by anyone and merged with other topics on terms chosen by the possessor of the topic map. You don’t have to wait for an authority to create the topic or approve your merging of it.

There are definite advantages to having authorities and authority records, but there are also advantages to having the freedom to describe your world, in your terms.
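
A minimal Python sketch of point 1 above, assuming merging is triggered by a shared subject identifier (the TMDM’s default rule); the identifiers and properties are invented for illustration.

def merge(topic_a, topic_b):
    # Topics as plain dicts: identifiers (IRIs), names, and other info.
    # The merged topic carries the union of what either topic knew, i.e. it
    # has more information than either of the originals.
    if not topic_a["identifiers"] & topic_b["identifiers"]:
        raise ValueError("no shared subject identifier; the topics stay separate")
    return {
        "identifiers": topic_a["identifiers"] | topic_b["identifiers"],
        "names": topic_a["names"] | topic_b["names"],
        "info": {**topic_a["info"], **topic_b["info"]},
    }

authority_like = {"identifiers": {"http://example.org/authority/twain"},
                  "names": {"Twain, Mark"},
                  "info": {"born": "1835"}}
my_topic       = {"identifiers": {"http://example.org/authority/twain",
                                  "http://example.org/people/clemens"},
                  "names": {"Clemens, Samuel L."},
                  "info": {"occupation": "author"}}

print(merge(authority_like, my_topic))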

March 2, 2013

Hellerstein: Humans are the Bottleneck [Not really]

Filed under: Data,Subject Identity,Topic Maps — Patrick Durusau @ 5:06 pm

Hellerstein: Humans are the Bottleneck by Isaac Lopez.

From the post:

Humans are the bottleneck right now in the data space, commented database systems luminary, Joe Hellerstein during an interview this week at Strata 2013.

“As Moore’s law drives the cost of computing down, and as data becomes more prevalent as a result, what we see is that the remaining bottleneck in computing costs is the human factor,” says Hellerstein, one of the fathers of adaptive query processing and a half dozen other database technologies.

Hellerstein says that recent research studies conducted at Stanford and Berkeley have found that 50-80 percent of a data analyst’s time is being used for the data grunt work (with the rest left for custom coding, analysis, and other duties).

“Data prep, data wrangling, data munging are words you hear over and over,” says Hellerstein. “Even with very highly skilled professionals in the data analysis space, this is where they’re spending their time, and it really is a big bottleneck.”

Just because human effort piles up at a common location, in “data prep, data wrangling, data munging,” doesn’t mean humans “are the bottleneck.”

The question to ask is: Why are people spending so much time at location X in data processing?

Answer: poor data quality, or rather the inability of machines to effectively process data from different origins. That’s the bottleneck.

That is a problem the management of subject identities for data and its containers is uniquely poised to solve.

