Archive for the ‘Subject Identity’ Category

No Properties/No Structure – But, Subject Identity

Thursday, September 8th, 2016

Jack Park has prodded me into following some category theory and data integration papers. More on that to follow but as part of that, I have been watching Bartosz Milewski’s lectures on category theory, reading his blog, etc.

In Category Theory 1.2, Milewski goes to great lengths to emphasize:

Objects are primitives with no properties/structure – a point

Morphisms are primitives with no properties/structure, but do have a start and end point

Late in that lecture, Milewski says categories are the “ultimate in data hiding” (read abstraction).

Despite their lack of properties and structure, both objects and morphisms have subject identity.

Yes?

I think that is more than clever use of language and here’s why:

If I want to talk about objects in category theory as a group subject, what can I say about them? (assuming a scope of category theory)

1. Objects have no properties
2. Objects have no structure
3. Objects mark the start and end of morphisms (distinguishes them from morphisms)
4. Every object has an identity morphism
5. Every pair of objects may have 0, 1, or many morphisms between them
6. Morphisms may go in both directions between a pair of objects
7. An object can have multiple morphisms that start and end at it
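The list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Category`, `Morphism`, `add_object`, `add_morphism` and `hom` are invented here, composition of morphisms is left out entirely, and Python's built-in object identity stands in for "subject identity."

```python
class Morphism:
    """A morphism: nothing but a start point and an end point."""
    __slots__ = ("source", "target")

    def __init__(self, source, target):
        self.source = source
        self.target = target


class Category:
    """Hypothetical container for objects and morphisms (no composition)."""

    def __init__(self):
        self.objects = []
        self.morphisms = []
        self.identity = {}  # object -> its identity morphism

    def add_object(self):
        obj = object()  # an opaque token: no properties, no structure
        self.objects.append(obj)
        ident = Morphism(obj, obj)  # every object has an identity morphism
        self.morphisms.append(ident)
        self.identity[obj] = ident
        return obj

    def add_morphism(self, source, target):
        m = Morphism(source, target)
        self.morphisms.append(m)
        return m

    def hom(self, source, target):
        """All morphisms that start at source and end at target."""
        return [m for m in self.morphisms
                if m.source is source and m.target is target]


cat = Category()
a, b = cat.add_object(), cat.add_object()
f = cat.add_morphism(a, b)
g = cat.add_morphism(a, b)     # a second, distinct morphism from a to b
h = cat.add_morphism(b, a)     # morphisms may run in both directions
loop = cat.add_morphism(a, a)  # starts and ends at the same object
```

Even though `a` and `b` carry no data at all, the program can still tell them apart and count the morphisms between them, which is the point about identity without properties.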

Incomplete and yet a lot of things to say about something that has no properties and no structure. 😉

Bearing in mind, that’s just objects in general.

I can also talk about a specific object at a particular time point in the lecture and screen location, which itself is a subject.

Or an object in a paper or monograph.

We can declare primitives, like objects and morphisms, but we should always bear in mind they are declared to be primitives.

For other purposes, we can declare them to be otherwise.

Flawed Input Validation = Flawed Subject Recognition

Friday, May 13th, 2016

In Vulnerable 7-Zip As Poster Child For Open Source, I covered some of the details of two vulnerabilities in 7-Zip.

Both of those vulnerabilities were summarized by the discoverers:

Sadly, many security vulnerabilities arise from applications which fail to properly validate their input data. Both of these 7-Zip vulnerabilities resulted from flawed input validation. Because data can come from a potentially untrusted source, data input validation is of critical importance to all applications’ security.

The first vulnerability is described as:

An out-of-bounds read vulnerability exists in the way 7-Zip handles Universal Disk Format (UDF) files. The UDF file system was meant to replace the ISO-9660 file format, and was eventually adopted as the official file system for DVD-Video and DVD-Audio.

Central to 7-Zip’s processing of UDF files is the CInArchive::ReadFileItem method. Because volumes can have more than one partition map, their objects are kept in an object vector. To start looking for an item, this method tries to reference the proper object using the partition map’s object vector and the “PartitionRef” field from the Long Allocation Descriptor. Lack of checking whether the “PartitionRef” field is bigger than the available amount of partition map objects causes a read out-of-bounds and can lead, in some circumstances, to arbitrary code execution.

(code in original post omitted)

This vulnerability can be triggered by any entry that contains a malformed Long Allocation Descriptor. As you can see in lines 898-905 from the code above, the program searches for elements on a particular volume, and the file-set starts based on the RootDirICB Long Allocation Descriptor. That record can be purposely malformed for malicious purpose. The vulnerability appears in line 392, when the PartitionRef field exceeds the number of elements in PartitionMaps vector.

I would describe the lack of a check on the “PartitionRef” field in topic maps terms as allowing a subject, here a string, of indeterminate size. That is, there is no constraint on the size of the subject, which here is a string.
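The missing check described in the quoted advisory amounts to comparing the reference against the number of partition map objects before using it. A minimal sketch of that kind of validation follows. It is not 7-Zip's code or its actual fix: `partition_for`, `MalformedArchiveError` and the sample data are hypothetical names chosen for illustration.

```python
class MalformedArchiveError(ValueError):
    """Raised when a field refers outside the data that was parsed."""


def partition_for(partition_maps, partition_ref):
    """Return the partition map that partition_ref points at.

    The reference is rejected unless it falls within the list. Negative
    values are rejected too, because Python would otherwise quietly
    index from the end of the list.
    """
    if not 0 <= partition_ref < len(partition_maps):
        raise MalformedArchiveError(
            f"PartitionRef {partition_ref} is outside the "
            f"{len(partition_maps)} available partition maps"
        )
    return partition_maps[partition_ref]


maps = ["partition-0", "partition-1"]  # stand-in for parsed partition maps
first = partition_for(maps, 0)
```

In a memory-safe language the failure is an exception rather than an out-of-bounds read, but the constraint on the subject is the same: the reference has to name one of the partition maps that actually exist.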

That may seem like an obtuse way of putting it, but consider that a subject, here a string that is longer than the “available amount of partition map objects,” can be in association with other subjects, such as the user (subject) who has invoked (association) the application containing the 7-Zip vulnerability (subject).

Err, you don’t allow users with shell access to suid root, do you?

If you don’t, at least not running a vulnerable program as root may help dodge that bullet.

Or in topic maps terms, knowing the associations between applications and users may be a window on the severity of vulnerabilities.

Lest you think logging suid is an answer, remember they were logging Edward Snowden’s logins as well.

Suid logs may help for next time, but aren’t preventative in nature.

BTW, if you are interested in the details on buffer overflows, Smashing The Stack For Fun And Profit looks like a fun read.

No Label (read “name”) for Medical Error – Fear of Terror

Wednesday, May 4th, 2016

From the post:

Medical error is the third leading cause of death in the US, accounting for 250,000 deaths every year, according to an analysis released on Tuesday.

There is no US system for coding these deaths, but Martin Makary and Michael Daniel, researchers at Johns Hopkins University’s school of medicine, used studies from 1999 onward to find that medical errors account for more than 9.5% of all fatalities in the US.

Only heart disease and cancer are more deadly, according to the Centers for Disease Control and Prevention (CDC).

The analysis, which was published in the British Medical Journal, said that the science behind medical errors would improve if data was shared internationally and nationally “in the same way as clinicians share research and innovation about coronary artery disease, melanoma, and influenza”.

But death by medical error is not captured by government reports because the US system for assigning a code to cause of death, the international classification of disease (ICD), does not have a label for medical error.

In contrast to topic maps, where you can talk about any subject you want, the international classification of disease (ICD), does not have a label for medical error.

Impact? Not having a label conceals approximately 250,000 deaths per year in the United States.

What if Fear of Terror press releases were broadcast but along with “deaths due to medical error to date this year” as contextual information?

Medical errors result in approximately 685 deaths per day.

If you heard the report of the shootings in San Bernardino on December 2, 2015, that 14 people were killed, and the report pointed out that to date that year approximately 230,160 people had died due to medical errors, which one would you judge to be the more serious problem?
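The per-day and to-date figures follow from the 250,000 annual estimate. A short check of the arithmetic, assuming a 365-day year and deaths spread evenly across it (both simplifications):

```python
from datetime import date

deaths_per_year = 250_000

# Average deaths per day, rounded to a whole number.
per_day = round(deaths_per_year / 365)                 # 685

# December 2, 2015 is the 336th day of the year.
day_of_year = date(2015, 12, 2).timetuple().tm_yday    # 336

# Running total on the day of the San Bernardino shootings.
deaths_to_date = per_day * day_of_year                 # 230160
```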

Lacking a label for medical error as cause of death prevents public discussion of the third leading cause of death in the United States.

Contrast that with the public discussion over the largely non-existent problem of terrorism in the United States.

Searching for Subjects: Which Method is Right for You?

Wednesday, April 20th, 2016

Leaving to one side how to avoid re-evaluating the repetitive glut of materials from any search, there is the more fundamental problem of how do you search for a subject?

This is a back-of-the-envelope sketch that I will be expanding, but here goes:

Basic Search

At its most basic, a search consists of a <term> and the search seeks to match strings that match that <term>.

Even allowing for Boolean operators, the matches against <term> are only and forever string matches.
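As a rough illustration of what "only and forever string matches" means, here is a minimal sketch of a basic search. The function name `basic_search` and the sample documents are made up for the example.

```python
def basic_search(term, documents):
    """Return the documents that contain term as a substring, ignoring case."""
    needle = term.lower()
    return [doc for doc in documents if needle in doc.lower()]


docs = [
    "Patient reports toothache after eating",
    "Odontalgia noted in the lower left molar",
    "Black ink was spilled on the report",
]

toothache_hits = basic_search("toothache", docs)
```

The second document is about the same subject but is not returned, because nothing in the string "toothache" matches the string "Odontalgia."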

Basic Search + Synonyms

Of course, as skilled searchers you will try not only one <term>, but several <synonym>s for the term as well.

A good example of that strategy is used at PubMed:

If you enter an entry term for a MeSH term the translation will also include an all fields search for the MeSH term associated with the entry term. For example, a search for odontalgia will translate to: “toothache”[MeSH Terms] OR “toothache”[All Fields] OR “odontalgia”[All Fields] because Odontalgia is an entry term for the MeSH term toothache. [PubMed Help]

The expansion from the entry term Odontalgia to the MeSH term toothache is useful, but how do you maintain it?

A reader can see “toothache” and “Odontalgia” are treated as synonyms, but why remains elusive.

This is the area of owl:sameAs, the mapping of multiple subject identifiers/locators to a single topic, etc. You know that “sameness” exists, but why isn’t clear.
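A synonym expansion of the kind PubMed describes can be sketched as a lookup table applied before the string match. This is an assumption-laden toy, not PubMed's implementation: `ENTRY_TERMS`, `expand_term` and `synonym_search` are hypothetical names, and the table holds a single hand-entered pair.

```python
# Hypothetical entry-term table: entry term -> preferred heading.
# Note what is missing: nothing here records WHY the two are the same.
ENTRY_TERMS = {"odontalgia": "toothache"}


def expand_term(term):
    """Return the preferred heading (if any) followed by the term itself."""
    key = term.lower()
    heading = ENTRY_TERMS.get(key)
    return [heading, key] if heading else [key]


def synonym_search(term, documents):
    """Return documents containing any expansion of term, ignoring case."""
    needles = expand_term(term)
    return [doc for doc in documents
            if any(needle in doc.lower() for needle in needles)]


docs = [
    "Patient reports toothache after eating",
    "Odontalgia noted in the lower left molar",
    "Black ink was spilled on the report",
]

hits = synonym_search("odontalgia", docs)
```

The search now finds both documents, but whoever inherits `ENTRY_TERMS` has only the pairing itself to go on.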

Subject Identity Properties

In order to maintain a PubMed or similar mapping, you need people who either “know” the basis for the mappings or you can have the mappings documented. That is, you can say on what basis the mapping happened and what properties were present.

For example:

toothache

Key                 Value
symptom             pain
general-location    mouth
specific-location   tooth

So if we are mapping terms to other terms and the specific location value reads “tongue,” then we know that isn’t a mapping to “toothache.”
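That test can be written down directly. The sketch below assumes that a candidate term maps to "toothache" only when it carries every documented key with the same value. The rule and the names `maps_to` and `differences` are illustrative choices, not a prescribed topic maps algorithm.

```python
TOOTHACHE = {
    "symptom": "pain",
    "general-location": "mouth",
    "specific-location": "tooth",
}


def maps_to(candidate, target):
    """True when candidate has every key of target with the same value."""
    return all(candidate.get(key) == value for key, value in target.items())


def differences(candidate, target):
    """Keys the two share whose values differ: the documented reason for a 'no'."""
    return {key: (candidate[key], target[key])
            for key in candidate.keys() & target.keys()
            if candidate[key] != target[key]}


odontalgia = {"symptom": "pain", "general-location": "mouth",
              "specific-location": "tooth"}
tongue_pain = {"symptom": "pain", "general-location": "mouth",
               "specific-location": "tongue"}

same = maps_to(odontalgia, TOOTHACHE)
not_same = maps_to(tongue_pain, TOOTHACHE)
why_not = differences(tongue_pain, TOOTHACHE)
```

Unlike a bare synonym table, the result of `differences` tells a later maintainer which property ruled the mapping out.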

How Far Do You Need To Go?

Of course for every term that we use as a key or value, there can be an expansion into key/value pairs, such as for tooth:

tooth

Key                 Value
general-location    mouth
composition         enamel coated bone
use                 biting, chewing

Observations:

Each step towards more precise gathering of information increases your pre-search costs but decreases your post-search cost of casting out irrelevant material.

Moreover, precise gathering of information will help you avoid missing data simply due to data glut returns.

If maintenance of your mapping across generations is a concern, doing more than mapping of synonyms for reason or reasons unknown may be in order.

The point being that your current retrieval or lack thereof of current and correct information has a cost. As does improving your current retrieval.

The question of improved retrieval isn’t ideological but an ROI driven one.

• If you have better mappings will that give you an advantage over N department/agency?
• Will better retrieval slow down (never stop) the time wasted by staff on voluminous search results?
• Will more precision focus your limited resources (always limited) on highly relevant materials?

Formulate your own ROI questions and means of measuring them. Then reach out to topic maps to see how they improve (or not) your ROI.

Properly used, I think you are in for a pleasant surprise with topic maps.

Coeffects: Context-aware programming languages – Subject Identity As Type Checking?

Tuesday, April 12th, 2016

From the webpage:

Coeffects are Tomas Petricek‘s PhD research project. They are a programming language abstraction for understanding how programs access the context or environment in which they execute.

The context may be resources on your mobile phone (battery, GPS location or a network printer), IoT devices in a physical neighborhood or historical stock prices. By understanding the neighborhood or history, a context-aware programming language can catch bugs earlier and run more efficiently.

This page is an interactive tutorial that shows a prototype implementation of coeffects in a browser. You can play with two simple context-aware languages, see how the type checking works and how context-aware programs run.

This page is also an experiment in presenting programming language research. It is a live environment where you can play with the theory using the power of new media, rather than staring at a dead pieces of wood (although we have those too).

(break from summary)

Programming languages evolve to reflect the changes in the computing ecosystem. The next big challenge for programming language designers is building languages that understand the context in which programs run.

This challenge is not easy to see. We are so used to working with context using the current cumbersome methods that we do not even see that there is an issue. We also do not realize that many programming features related to context can be captured by a simple unified abstraction. This is what coeffects do!

What if we extend the idea of context to include the context within which words appear?

For example, writing a police report, the following sentence appeared:

There were 20 or more <proxy string="black" pos="noun" synonym="African" type="race"/>s in the group.

For display purposes, the string value “black” appears in the sentence:

There were 20 or more blacks in the group.

But a search for the color “black” would not return that report because the type = color does not match type = race.

On the other hand, if I searched for African-American, that report would show up because “black” with type = race is recognized as a synonym for people of African extraction.
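A small sketch of how such an inline proxy could drive both display and search. Everything beyond the `proxy` element and its four attributes is an assumption: the wrapping `report` element, the function names, and the exact-match rule are invented for the example. With exact matching, a query for "African" succeeds. Matching "African-American" would need either an additional synonym on the proxy or a looser comparison.

```python
import xml.etree.ElementTree as ET

# The <report> wrapper is hypothetical; the <proxy> element is from the post.
REPORT = (
    '<report>There were 20 or more '
    '<proxy string="black" pos="noun" synonym="African" type="race"/>'
    's in the group.</report>'
)


def display_text(root):
    """Rebuild the sentence a reader sees, using each proxy's string value."""
    parts = [root.text or ""]
    for child in root:
        if child.tag == "proxy":
            parts.append(child.get("string", ""))
        parts.append(child.tail or "")
    return "".join(parts)


def matches(root, term, subject_type):
    """True if a proxy of the given type has term as its string or synonym."""
    wanted = term.lower()
    for proxy in root.iter("proxy"):
        if proxy.get("type") != subject_type:
            continue
        names = {proxy.get("string", "").lower(),
                 proxy.get("synonym", "").lower()}
        if wanted in names:
            return True
    return False


root = ET.fromstring(REPORT)
sentence = display_text(root)
color_hit = matches(root, "black", "color")   # type mismatch, no match
race_hit = matches(root, "African", "race")   # synonym with matching type
```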

Inline proxies are the easiest to illustrate but that is only one way to serialize such a result.

If done in an authoring interface, such an approach would have the distinct advantage of offering the original author the choice of subject properties.

The advantage of involving the original author is that they have an interest in and awareness of the document in question. Quite unlike automated processes that later attempt annotation by rote.

No Perception Without Cartography [Failure To Communicate As Cartographic Failure]

Saturday, April 9th, 2016

Dan Klyn tweeted:

No perception without cartography

with an image of this text (from Self comes to mind: constructing the conscious mind by Antonio R Damasio):

The nonverbal kinds of images are those that help you display mentally the concepts that correspond to words. The feelings that make up the background of each mental instant and that largely signify aspects of the body state are images as well. Perception, in whatever sensory modality, is the result of the brain’s cartographic skill.

Images represent physical properties of entities and their spatial and temporal relationships, as well as their actions. Some images, which probably result from the brain’s making maps of itself making maps, are actually quite abstract. They describe patterns of occurrence of objects in time and space, the spatial relationships and movement of objects in terms of velocity and trajectory, and so forth.

Dan’s tweet spurred me to think that our failures to communicate to others could be described as cartographic failures.

If we use a term that is unknown to the average reader, say “daat,” the reader lacks a mental mapping that enables interpretation of that term.

Even if you know the term, it doesn’t stand in isolation in your mind. It fits into a number of maps, some of which you may be able to articulate and very possibly into other maps, which remain beyond your (and our) ken.

Not that this is a light going off experience for you or me but perhaps the cartographic imagery may be helpful in illustrating both the value and the risks of topic maps.

The value of topic maps is spoken of often but the risks of topic maps rarely get equal press.

How would topic maps be risky?

Felienne Hermans in Spreadsheets: The Ununderstood Dark Matter of IT makes a persuasive case that spreadsheets are on average five years old with little or no documentation.

If those spreadsheets remain undocumented, both users and auditors are equally stymied by their ignorance, a cartographic failure that leaves both wondering what must have been meant by columns and operations in the spreadsheet.

To the extent that a topic map or other disclosure mechanism preserves and/or restores the cartography that enables interpretation of the spreadsheet, suddenly staff are no longer plausibly ignorant of the purpose or consequences of using the spreadsheet.

Facile explanations that change from audit to audit are no longer possible. Auditors are chargeable with consistent auditing from one audit to another.

Does it sound like there is going to be a rush to use topic maps or other mechanisms to make spreadsheets transparent?

Still, transparency that befalls one could well benefit another.

Or to paraphrase King David (2 Samuel 11:25):

Ready to inflict transparency on others?

Graph Encryption: Going Beyond Encrypted Keyword Search [Subject Identity Based Encryption]

Wednesday, March 2nd, 2016

From the post:

Encrypted search has attracted a lot of attention from practitioners and researchers in academia and industry. In previous posts, Seny already described different ways one can search on encrypted data. Here, I would like to discuss search on encrypted graph databases which are gaining a lot of popularity.

1. Graph Databases and Graph Privacy

As today’s data is getting bigger and bigger, traditional relational database management systems (RDBMS) cannot scale to the massive amounts of data generated by end users and organizations. In addition, RDBMSs cannot effectively capture certain data relationships; for example in object-oriented data structures which are used in many applications. Today, NoSQL (Not Only SQL) has emerged as a good alternative to RDBMSs. One of the many advantages of NoSQL systems is that they are capable of storing, processing, and managing large volumes of structured, semi-structured, and even unstructured data. NoSQL databases (e.g., document stores, wide-column stores, key-value (tuple) store, object databases, and graph databases) can provide the scale and availability needed in cloud environments.

In an Internet-connected world, graph database have become an increasingly significant data model among NoSQL technologies. Social networks (e.g., Facebook, Twitter, Snapchat), protein networks, electrical grid, Web, XML documents, networked systems can all be modeled as graphs. One nice thing about graph databases is that they store the relations between entities (objects) in addition to the entities themselves and their properties. This allows the search engine to navigate both the data and their relationships extremely efficiently. Graph databases rely on the node-link-node relationship, where a node can be a profile or an object and the edge can be any relation defined by the application. Usually, we are interested in the structural characteristics of such a graph databases.

What do we mean by the confidentiality of a graph? And how to do we protect it? The problem has been studied by both the security and database communities. For example, in the database and data mining community, many solutions have been proposed based on graph anonymization. The core idea here is to anonymize the nodes and edges in the graph so that re-identification is hard. Although this approach may be efficient, from a security point view it is hard to tell what is achieved. Also, by leveraging auxiliary information, researchers have studied how to attack this kind of approach. On the other hand, cryptographers have some really compelling and provably-secure tools such as ORAM and FHE (mentioned in Seny’s previous posts) that can protect all the information in a graph database. The problem, however, is their performance, which is crucial for databases. In today’s world, efficiency is more than running in polynomial time; we need solutions that run and scale to massive volumes of data. Many real world graph datasets, such as biological networks and social networks, have millions of nodes, some even have billions of nodes and edges. Therefore, besides security, scalability is one of main aspects we have to consider.

2. Graph Encryption

Previous work in encrypted search has focused on how to search encrypted documents, e.g., doing keyword search, conjunctive queries, etc. Graph encryption, on the other hand, focuses on performing graph queries on encrypted graphs rather than keyword search on encrypted documents. In some cases, this makes the problem harder since some graph queries can be extremely complex. Another technical challenge is that the privacy of nodes and edges needs to be protected but also the structure of the graph, which can lead to many interesting research directions.

Graph encryption was introduced by Melissa Chase and Seny in [CK10]. That paper shows how to encrypt graphs so that certain graph queries (e.g., neighborhood, adjacency and focused subgraphs) can be performed (though the paper is more general as it describes structured encryption). Seny and I, together with Kobbi Nissim and George Kollios, followed this up with a paper last year [MKNK15] that showed how to handle more complex graphs queries.

Apologies for the long quote but I thought this topic might be new to some readers. Xianrui goes on to describe a solution for efficient queries over encrypted graphs.

Chase and Kamara remark in Structured Encryption and Controlled Disclosure, CK10:

To address this problem we introduce the notion of structured encryption. A structured encryption scheme encrypts structured data in such a way that it can be queried through the use of a query-specific token that can only be generated with knowledge of the secret key. In addition, the query process reveals no useful information about either the query or the data. An important consideration in this context is the efficiency of the query operation on the server side. In fact, in the context of cloud storage, where one often works with massive datasets, even linear time operations can be infeasible. (emphasis in original)

With just a little nudging, their:

A structured encryption scheme encrypts structured data in such a way that it can be queried through the use of a query-specific token that can only be generated with knowledge of the secret key.

could be re-stated as:

A subject identity encryption scheme leaves out merging data in such a way that the resulting topic map can only be queried with knowledge of the subject identity merging key.

You may have topics that represent diagnoses such as cancer, AIDS, sexual contacts, but if none of those can be associated with individuals who are also topics in the map, there is no more disclosure than census results for a metropolitan area and a list of the citizens therein.

That is you are missing the critical merging data that would link up (associate) any diagnosis with a given individual.

Multi-property subject identities would make the problem even harder, to say nothing of conferring properties on the basis of supplied properties as part of the merging process.

One major benefit of a subject identity based approach is that without the merging key, any data set, however sensitive the information, is just a data set, until you have the basis for solving its subject identity riddle.
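One way to picture a "merging key" is a keyed hash: the map carries only opaque tokens, and only a holder of the key can recompute the token for a given person and so merge the two sides. The sketch below uses HMAC-SHA256 from the Python standard library. It is a swapped-in illustration, not the structured encryption scheme of CK10 and not a vetted design. The names `merge_token`, `build_map` and `merge`, and the sample people, are hypothetical.

```python
import hashlib
import hmac
import secrets


def merge_token(key, identifier):
    """Opaque token for an identifier; it cannot be recomputed without the key."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def build_map(key, people, diagnoses):
    """Return person topics and diagnosis topics with no shared plain identifier.

    people: list of names. diagnoses: dict of name -> diagnosis (one each,
    to keep the sketch short).
    """
    person_topics = [{"name": name} for name in people]
    diagnosis_topics = [{"diagnosis": dx, "token": merge_token(key, name)}
                        for name, dx in diagnoses.items()]
    return person_topics, diagnosis_topics


def merge(key, person_topics, diagnosis_topics):
    """With the key, recompute each person's token and link the diagnosis."""
    by_token = {t["token"]: t["diagnosis"] for t in diagnosis_topics}
    return {p["name"]: by_token.get(merge_token(key, p["name"]))
            for p in person_topics}


key = secrets.token_bytes(32)  # random 256-bit merging key
persons, dx_topics = build_map(
    key,
    ["Alice Example", "Bob Example"],
    {"Alice Example": "diagnosis A"},
)
merged = merge(key, persons, dx_topics)
unmerged = merge(secrets.token_bytes(32), persons, dx_topics)  # wrong key
```

With the right key the diagnosis lands on the right person. With any other key the two lists stay as unconnected as census results and a list of citizens.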

PS: With the usual caveats of not using social security numbers, birth dates and the like as your subject identity properties. At least not in the map proper. I can think of several ways to generate keys for merging that would be resistant to even brute force attacks.

Ping me if you are interested in pursuing that on a data set.

Challenges of Electronic Dictionary Publication

Wednesday, February 17th, 2016

Challenges of Electronic Dictionary Publication

From the webpage:

April 8-9th, 2016

Venue: University of Leipzig, GWZ, Beethovenstr. 15; H1.5.16

This April we will be hosting our first Dictionary Journal workshop. At this workshop we will give an introduction to our vision of „Dictionaria“, introduce our data model and current workflow and will discuss (among others) the following topics:

• Methodology and concept: How are dictionaries of „small“ languages different from those of „big“ languages and what does this mean for our endeavour? (documentary dictionaries vs. standard dictionaries)
• Reviewing process and guidelines: How to review and evaluate a dictionary database of minor languages?
• User-friendliness: What are the different audiences and their needs?
• Submission process and guidelines: reports from us and our first authors on how to submit and what to expect
• Citation: How to cite dictionaries?

If you are interested in attending this event, please send an e-mail to dictionary.journal[AT]uni-leipzig.de.

Workshop program

See the webpage for a list of confirmed participants, some with submitted abstracts.

Any number of topic map related questions arise in a discussion of dictionaries.

• How to represent dictionary models?
• What properties should be used to identify the subjects that represent dictionary models?
• On what basis, if any, should dictionary models be considered the same or different? And for what purposes?
• What data should be captured by dictionaries and how should it be identified?
• etc.

Those are only a few of the questions that could be refined into dozens, if not hundreds, more when you reach the details of constructing a dictionary.

I won’t be attending but await with great anticipation the output from this workshop!

You Can Confirm A Gravity Wave!

Saturday, February 13th, 2016

Unless you have been unconscious since last Wednesday, you have heard about the confirmation of Einstein’s 1916 prediction of gravitational waves.

A very incomplete list of popular reports includes:

For the full monty, see the LIGO Scientific Collaboration itself.

Which brings us to the iPython notebook with the gravitational wave discovery data: Signal Processing with GW150914 Open Data

From the post:

Welcome! This ipython notebook (or associated python script GW150914_tutorial.py ) will go through some typical signal processing tasks on strain time-series data associated with the LIGO GW150914 data release from the LIGO Open Science Center (LOSC):

To begin, download the ipython notebook, readligo.py, and the data files listed below, into a directory / folder, then run it. Or you can run the python script GW150914_tutorial.py. You will need the python packages: numpy, scipy, matplotlib, h5py.

On Windows, or if you prefer, you can use a python development environment such as Anaconda (https://www.continuum.io/why-anaconda) or Enthought Canopy (https://www.enthought.com/products/canopy/).

Questions, comments, suggestions, corrections, etc: email losc@ligo.org

v20160208b

Unlike the toadies at the New England Journal of Medicine (see Parasitic Re-use of Data? Institutionalizing Toadyism and Addressing The Concerns Of The Selfish), the scientists who have labored for decades on the gravitational wave question are giving their data away for free!

Not only giving the data away, but striving to help others learn to use it!

Beyond simply “doing the right thing,” and setting an example for other scientists, this is a great opportunity to learn more about signal processing.

Signal processing is, when you stop to think about it, an important method of “subject identification” in a large number of domains.

Detecting a gravitational wave is beyond your personal means, but with the data freely available, further analysis is a matter of interest and perseverance.

Infinite Dimensional Word Embeddings [Variable Representation, Death to Triples]

Thursday, November 19th, 2015

Abstract:

We describe a method for learning word embeddings with stochastic dimensionality. Our Infinite Skip-Gram (iSG) model specifies an energy-based joint distribution over a word vector, a context vector, and their dimensionality, which can be defined over a countably infinite domain by employing the same techniques used to make the Infinite Restricted Boltzmann Machine (Cote & Larochelle, 2015) tractable. We find that the distribution over embedding dimensionality for a given word is highly interpretable and leads to an elegant probabilistic mechanism for word sense induction. We show qualitatively and quantitatively that the iSG produces parameter-efficient representations that are robust to language’s inherent ambiguity.

Even better from the introduction:

To better capture the semantic variability of words, we propose a novel embedding method that produces vectors with stochastic dimensionality. By employing the same mathematical tools that allow the definition of an Infinite Restricted Boltzmann Machine (Côté & Larochelle, 2015), we describe a log-bilinear energy-based model–called the Infinite Skip-Gram (iSG) model–that defines a joint distribution over a word vector, a context vector, and their dimensionality, which has a countably infinite domain. During training, the iSGM allows word representations to grow naturally based on how well they can predict their context. This behavior enables the vectors of specific words to use few dimensions and the vectors of vague words to elongate as needed. Manual and experimental analysis reveals this dynamic representation elegantly captures specificity, polysemy, and homonymy without explicit definition of such concepts within the model. As far as we are aware, this is the first word embedding method that allows representation dimensionality to be variable and exhibit data-dependent growth.

Imagine a topic map model that “allow[ed] representation dimensionality to be variable and exhibit data-dependent growth.”

Simple subjects, say the sort you find at schema.org, can have simple representations.

More complex subjects, say the notion of “person” in U.S. statutory law (no, I won’t attempt to list them here), can extend their dimensional representations as far as is necessary.

Of course in this case, the dimensions are learned from a corpus but I don’t see any barrier to the intentional creation of dimensions for subjects and/or a combined automatic/directed creation of dimensions.

Or as I put it in the title, Death to All Triples.

More precisely, not just triples but any pre-determined limit on representation.

Knowing the Name of Something vs. Knowing How To Identify Something

Wednesday, November 18th, 2015

Richard Feynman: The Difference Between Knowing the Name of Something and Knowing Something

From the post:

In this short clip (below), Feynman articulates the difference between knowing the name of something and understanding it.

See that bird? It’s a brown-throated thrush, but in Germany it’s called a halzenfugel, and in Chinese they call it a chung ling and even if you know all those names for it, you still know nothing about the bird. You only know something about people; what they call the bird. Now that thrush sings, and teaches its young to fly, and flies so many miles away during the summer across the country, and nobody knows how it finds its way.

Knowing the name of something doesn’t mean you understand it. We talk in fact-deficient, obfuscating generalities to cover up our lack of understanding.

You won’t get to see the Feynman quote live because it has been blocked by BBC Worldwide on copyright grounds. No doubt they make a bag full of money every week off that 179 second clip of Feynman.

The stronger point for Feynman would be to point out that you can’t recognize anything on the basis of knowing a name.

I may be sitting next to Cindy Lou Who on the bus but knowing her name isn’t going to help me to recognize her.

Knowing the name of someone or something isn’t useful unless you know something about the person or thing you associate with a name.

That is, you know when it is appropriate to use the name you have learned and when to say: “Sorry, I don’t know your name or the name of (indicating in some manner).” At which point you will learn a new name and store a new set of properties to know when to use that name, instead of any other name you know.

Everyone does that exercise, learning new names and the properties that establish when it is appropriate to use a particular name. And we do so seamlessly.

So seamlessly that when called upon to make explicit “how” we know which name to use, subject identification in other words, it takes a lot of effort.

It’s enough effort that it should be done only when necessary and when we can show the user an immediate semantic ROI for their effort.

More on this to follow.

What Does Probability Mean in Your Profession? [Divergences in Meaning]

Sunday, September 27th, 2015

What Does Probability Mean in Your Profession? by Ben Orlin.

Impressive drawings that illustrate the divergence in meaning of “probability” for various professions.

I’m not sold on the “actual meaning” drawing because if everyone in a discipline understands “probability” to mean something else, on what basis can you argue for the “actual meaning?”

If I am reading a paper by someone who subscribes to a different meaning than your claimed “actual” one, then I am going to reach erroneous conclusions about their paper. Yes?

That is, in order to understand a paper, I have to understand the words as they are being used by the author. Yes?

If I understand “democracy and freedom” to mean “serves the interest of U.S.-based multinational corporations,” then calls for “democracy and freedom” in other countries aren’t going to impress me all that much.

Enjoy the drawings!

Value of Big Data Depends on Identities in Big Data

Tuesday, September 15th, 2015

Intel Exec: Extracting Value From Big Data Remains Elusive by George Leopold.

From the post:

Intel Corp. is convinced it can sell a lot of server and storage silicon as big data takes off in the datacenter. Still, the chipmaker finds that major barriers to big data adoption remain, most especially what to do with all those zettabytes of data.

“The dirty little secret about big data is no one actually knows what to do with it,” Jason Waxman, general manager of Intel’s Cloud Platforms Group, asserted during a recent company datacenter event. Early adopters “think they know what to do with it, and they know they have to collect it because you have to have a big data strategy, of course. But when it comes to actually deriving the insight, it’s a little harder to go do.”

Put another way, industry analysts rate the difficulty of determining the value of big data as far outweighing considerations like technological complexity, integration, scaling and other infrastructure issues. Nearly two-thirds of respondents to a Gartner survey last year cited by Intel stressed they are still struggling to determine the value of big data.

“Increased investment has not led to an associated increase in organizations reporting deployed big data projects,” Gartner noted in its September 2014 big data survey. “Much of the work today revolves around strategy development and the creation of pilots and experimental projects.”

It may just be me, but “determining value,” “risk and governance,” and “integrating multiple data sources,” the top three barriers to use of big data, all depend on knowing the identities represented in big data.

The trivial data integration demos that share “customer-ID” fields don’t inspire a lot of confidence about data integration when “customer-ID” may be identified in as many ways as there are data sources. And that is a minor example.

It would be very hard to determine the value you can extract from data when you don’t know what the data represents, its accuracy (risk and governance), and what may be necessary to integrate it with other data sources.
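To make the “customer-ID” point concrete, here is a minimal Python sketch. The source names, field names and records are all hypothetical. It covers only the easy case where the values agree across sources and only the field names differ. Even then, someone has to write down which field in each source identifies the customer before any integration can happen.

```python
# Hypothetical field names: each source labels the customer identifier differently.
FIELD_ALIASES = {
    "crm": "customer-ID",
    "billing": "cust_no",
    "support": "AccountRef",
}

def customer_key(source: str, record: dict) -> str:
    """Return the customer identifier from a record, whatever the source calls it."""
    field = FIELD_ALIASES[source]  # KeyError means the source has not been mapped yet
    return str(record[field]).strip()

def merge_by_customer(rows):
    """Group (source, record) pairs by the customer they identify."""
    merged = {}
    for source, record in rows:
        merged.setdefault(customer_key(source, record), []).append((source, record))
    return merged

rows = [
    ("crm", {"customer-ID": "1042", "name": "A. Smith"}),
    ("billing", {"cust_no": 1042, "balance": 12.50}),
    ("support", {"AccountRef": " 1042 ", "tickets": 3}),
]
merged = merge_by_customer(rows)
```

An unmapped source fails loudly with a KeyError, which is the honest outcome when nobody has recorded how that source identifies its customers.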

More processing power from Intel is always welcome but churning poorly understood big data faster isn’t going to create value. Quite the contrary, investment in more powerful hardware isn’t going to be favorably reflected on the bottom line.

Investment in capturing the diverse identities in big data will empower easier valuation of big data, evaluation of its risks and uncovering how to integrate diverse data sources.

Capturing diverse identities won’t be easy, cheap or quick. But not capturing them will leave the value of Big Data unknown, its risks uncertain and integration a crapshoot if it is ever attempted.

Things That Are Clear In Hindsight

Saturday, August 1st, 2015

Sean Gallagher recently tweeted:

Oh look, the Triumphalism Trilogy is now a boxed set.

In case you are unfamiliar with the series, The Tipping Point, Blink, Outliers.

Although entertaining reads, particularly The Tipping Point (IMHO), Gladwell does not describe how to recognize a tipping point in advance of it being a tipping point, nor how to make good decisions without thinking (Blink) or how to recognize human potential before success (Outliers).

Tipping points, good decisions and human potential can be recognized only when they are manifested.

As you can tell from Gladwell’s book sales, selling the hope of knowing the unknowable remains a viable market.

Which Functor Do You Mean?

Monday, July 6th, 2015

Peteris Krumins calls attention to the classic confusion of names that topic maps address in On Functors.

From the post:

It’s interesting how the term “functor” means completely different things in various programming languages. Take C++ for example. Everyone who has mastered C++ knows that you call a class that implements operator() a functor. Now take Standard ML. In ML functors are mappings from structures to structures. Now Haskell. In Haskell functors are just homomorphisms over containers. And in Prolog functor means the atom at the start of a structure. They all are different. Let’s take a closer look at each one.

Peteris has said twice in the first paragraph that each of these “functors” is different. Don’t rush to his 2010 post to point out they are different. That was the point of the post. Yes?

Exercise: All of these uses of functor could be scoped by language. What properties of each “functor” would you use to distinguish them besides their language of origin?
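As a starting point for the exercise, here is a small Python sketch of one name scoped by language. The descriptions come from the quoted paragraph; the dictionary layout and function names are my own assumptions, not anything from the post.

```python
# (name, scope) -> distinguishing property, as described in the quoted paragraph.
SUBJECTS = {
    ("functor", "C++"): "a class that implements operator()",
    ("functor", "Standard ML"): "a mapping from structures to structures",
    ("functor", "Haskell"): "a homomorphism over containers",
    ("functor", "Prolog"): "the atom at the start of a structure",
}

def identify(name: str, scope: str) -> str:
    """Look up what a name means within a given scope (here, a language)."""
    return SUBJECTS[(name.lower(), scope)]

def scopes_for(name: str) -> list:
    """List every scope in which the name is known to mean something."""
    return sorted(scope for (n, scope) in SUBJECTS if n == name.lower())
```

The name alone identifies nothing. The lookup only works once a scope, or some other distinguishing property, is supplied along with it.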

Attribute-Based Access Control with a graph database [Topic Maps at NIST?]

Tuesday, April 14th, 2015

Attribute-Based Access Control with a graph database by Robin Bramley.

From the post:

Traditional access control relies on the identity of a user, their role or their group memberships. This can become awkward to manage, particularly when other factors such as time of day, or network location come into play. These additional factors, or attributes, require a different approach, the US National Institute of Standards and Technology (NIST) have published a draft special paper (NIST 800-162) on Attribute-Based Access Control (ABAC).

This post, and the accompanying Graph Gist, explore the suitability of using a graph database to support policy decisions.

Before we dive into the detail, it’s probably worth mentioning that I saw the recent GraphGist on Entitlements and Access Control Management and that reminded me to publish my Attribute-Based Access Control GraphGist that I’d written some time ago, originally in a local instance having followed Stefan Armbruster’s post about using Docker for that very purpose.

Using a Property Graph, we can model attributes using relationships and/or properties. Fine-grained relationships without qualifier properties make patterns easier to spot in visualisations and are more performant. For the example provided in the gist, the attributes are defined using solely fine-grained relationships.

Graph visualization (and querying) of attribute-based access control.

I found this portion of the NIST draft particularly interesting:

There are characteristics or attributes of a subject such as name, date of birth, home address, training record, and job function that may, either individually or when combined, comprise a unique identity that distinguishes that person from all others. These characteristics are often called subject attributes. The term subject attributes is used consistently throughout this document.

In the course of a person’s life, he or she may work for different organizations, may act in different roles, and may inherit different privileges tied to those roles. The person may establish different personas for each organization or role and amass different attributes related to each persona. For example, an individual may work for Company A as a gate guard during the week and may work for Company B as a shift manager on the weekend. The subject attributes are different for each persona. Although trained and qualified as a Gate Guard for Company A, while operating in her Company B persona as a shift manager on the weekend she does not have the authority to perform as a Gate Guard for Company B.
…(emphasis in the original)

Clearly NIST recognizes that subjects, at least in the sense of people, are identified by a set of “subject attributes” that uniquely identify that subject. It doesn’t seem like much of a leap to recognize that for other subjects, including the attributes used to identify subjects.
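The gate guard example from the NIST draft can be sketched in a few lines of Python. This is only an illustration of the idea that a persona is a set of subject attributes. The class, the attribute strings and the person’s name are all hypothetical, and none of it comes from NIST 800-162 or from Robin’s GraphGist.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Persona:
    """One persona of a person: the subject attributes tied to one organization."""
    person: str
    organization: str
    attributes: frozenset

def may_act_as(persona: Persona, organization: str, required: set) -> bool:
    """Permit only if the persona belongs to the organization and holds every required attribute."""
    return persona.organization == organization and required <= persona.attributes

# Hypothetical personas for the same person, following the Company A / Company B example.
guard_a = Persona("Pat", "Company A", frozenset({"role:gate-guard", "training:gate-guard"}))
manager_b = Persona("Pat", "Company B", frozenset({"role:shift-manager"}))
```

Same person, two personas, and the decision turns on the attributes of the persona in use rather than on the person’s name.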

I don’t know what other US government agencies have similar language but it sounds like a starting point for a robust discussion of topic maps and their advantages.

Yes?

Interactive Intent Modeling: Information Discovery Beyond Search

Wednesday, March 18th, 2015

Interactive Intent Modeling: Information Discovery Beyond Search by Tuukka Ruotsalo, Giulio Jacucci, Petri Myllymäki, Samuel Kaski.

From the post:

Combining intent modeling and visual user interfaces can help users discover novel information and dramatically improve their information-exploration performance.

Current-generation search engines serve billions of requests each day, returning responses to search queries in fractions of a second. They are great tools for checking facts and looking up information for which users can easily create queries (such as “Find the closest restaurants” or “Find reviews of a book”). What search engines are not good at is supporting complex information-exploration and discovery tasks that go beyond simple keyword queries. In information exploration and discovery, often called “exploratory search,” users may have difficulty expressing their information needs, and new search intents may emerge and be discovered only as they learn by reflecting on the acquired information. 8,9,18 This finding roots back to the “vocabulary mismatch problem” 13 that was identified in the 1980s but has remained difficult to tackle in operational information retrieval (IR) systems (see the sidebar “Background”). In essence, the problem refers to human communication behavior in which the humans writing the documents to be retrieved and the humans searching for them are likely to use very different vocabularies to encode and decode their intended meaning. 8,21

Assisting users in the search process is increasingly important, as everyday search behavior ranges from simple look-ups to a spectrum of search tasks 23 in which search behavior is more exploratory and information needs and search intents uncertain and evolving over time.

We introduce interactive intent modeling, an approach promoting resourceful interaction between humans and IR systems to enable information discovery that goes beyond search. It addresses the vocabulary mismatch problem by giving users potential intents to explore, visualizing them as directions in the information space around the user’s present position, and allowing interaction to improve estimates of the user’s search intents.

What!? All those years spent trying to beat users into learning complex search languages were in vain? Say it’s not so!

But, apparently it is so. All of the research on “vocabulary mismatch problem,” “different vocabularies to encode and decode their meaning,” has come back to bite information systems that offer static and author-driven vocabularies.

Users search best, no surprise, through vocabularies they recognize and understand.

I don’t know of any interactive topic maps in the sense used here but that doesn’t mean that someone isn’t working on one.

A shift in this direction could do wonders for the results of searches.

Chemical databases: curation or integration by user-defined equivalence?

Monday, March 16th, 2015

Chemical databases: curation or integration by user-defined equivalence? by Anne Hersey, Jon Chambers, Louisa Bellis, A. Patrícia Bento, Anna Gaulton, John P. Overington.

Abstract:

There is a wealth of valuable chemical information in publicly available databases for use by scientists undertaking drug discovery. However finite curation resource, limitations of chemical structure software and differences in individual database applications mean that exact chemical structure equivalence between databases is unlikely to ever be a reality. The ability to identify compound equivalence has been made significantly easier by the use of the International Chemical Identifier (InChI), a non-proprietary line-notation for describing a chemical structure. More importantly, advances in methods to identify compounds that are the same at various levels of similarity, such as those containing the same parent component or having the same connectivity, are now enabling related compounds to be linked between databases where the structure matches are not exact.

The authors identify a number of reasons why databases of chemical identifications have different structures recorded for the same chemicals. One problem is that there is no authoritative source for chemical structures so upon publication, authors publish those aspects most relevant to their interest. Or publish images and not machine readable representations of a chemical. To say nothing of the usual antics with simple names and their confusions. But there are software limitations, business rules and other sources of a multiplicity of chemical structures.

Suffice it to say that the authors make a strong case for why there are multiple structures for any given chemical now and why that is going to continue.
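The abstract’s idea of compounds being “the same at various levels of similarity” can be illustrated by comparing only some layers of an InChI string. The Python sketch below assumes the usual InChI layout, with layers separated by “/” and the connectivity layer prefixed with “c”. It keeps only the first layer seen for each prefix, so it ignores the repeated layers that isotopic or fixed-hydrogen InChIs can carry. The function names are mine. The two strings are what I believe to be the standard InChIs for L- and D-alanine, so verify them against a database before relying on them. For real work, use a cheminformatics toolkit.

```python
def inchi_layers(inchi: str) -> dict:
    """Split an InChI string into layers keyed by prefix ('c', 'h', 't', 'm', 's', ...)."""
    body = inchi.split("=", 1)[1]
    version, formula, *rest = body.split("/")
    layers = {"version": version, "formula": formula}
    for layer in rest:
        if layer:
            layers.setdefault(layer[0], layer[1:])  # keep the first layer per prefix
    return layers

def same_connectivity(a: str, b: str) -> bool:
    """True if two InChIs share formula and connectivity, ignoring stereo and other layers."""
    la, lb = inchi_layers(a), inchi_layers(b)
    return la["formula"] == lb["formula"] and la.get("c") == lb.get("c")

L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
```

The two strings differ, so an exact match fails, yet they agree at the connectivity level. Whether that counts as “the same compound” depends on who is asking and why.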

The authors openly ask if it is time to ask users for their assistance in mapping this diversity of structures:

Is it now time to accept that however diligent database providers are, there will always be differences in structure representations and indeed some errors in the structures that cannot be fixed with a realistic level of resource? Should we therefore turn our attention to encouraging the use and development of tools that enable the mapping together of related compounds rather than concentrate our efforts on ever more curation?

You know my answer to that question.

What’s yours?

I first saw this in a tweet by John P. Overington.

Fifty Words for Databases

Saturday, March 7th, 2015

From the post:

Almost every human endeavour seems simple from a distance: even database deployment. Reality always comes as a shock, because the closer you get to any real task, the more you come to appreciate the skills that are necessary to accomplish it.

One of the big surprises I have when I attend developer conferences is to be told by experts how easy it is to take a database from development and turn it into a production system, and then implement the processes that allow it to be upgraded safely. Occasionally, I’ve been so puzzled that I’ve drawn the speakers to one side after the presentation to ask them for the details of how to do it so effortlessly, mentioning a few of the tricky aspects I’ve hit. Invariably, it soon becomes apparent from their answers that their experience, from which they’ve extrapolated, is of databases the size of a spreadsheet with no complicated interdependencies, compliance issues, security complications, high-availability mechanisms, agent tasks, alerting systems, complex partitioning, queuing, replication, downstream analysis dependencies and so on about which you, the readers, know more than I. At the vast international enterprise where I once worked in IT, we had a coded insult for such people: ‘They’ve catalogued their CD collection in a database’. Unfair, unkind, but even a huge well-used ‘Big Data’ database dealing in social media is a tame and docile creature compared with a heavily-used OLTP trading system where any downtime or bug means figures for losses where you have to count the trailing zeros. The former has unique problems, of course, but the two types of database are so different.

I wonder if the problem is one of language. Just as the English have fifty ways of describing rainfall, and the Inuit have many ways of describing pack ice, it is about time that we created the language for a variety of databases from a mild drizzle (‘It is a soft morning to be sure’) to a cloud-burst. Until anyone pontificating about the database lifecycle can give their audience an indication of the type of database they’re referring to, we will continue to suffer the sort of misunderstandings that so frustrate the development process. Though I’m totally convinced that the development culture should cross-pollinate far more with the science of IT operations, It will need more than a DevOps group-hug; it will require a change in the technical language so that it can accurately describe the rich variety of databases in operational use and their widely-varying requirements. The current friction is surely due more to misunderstandings on both sides, because it is so difficult to communicate these requirements. Any suggestions for suitable descriptive words for types of database? (emphasis added)

If you have “descriptive words” to suggest to Phil, comment on his post.

With the realization that your “descriptive words” may be different from my “descriptive words” for the same database or mean a different database altogether or have nothing to do with databases at all (when viewed by others).

Yes, I have been thinking about identifiers, again, and will start off the coming week with a new series of posts on subject identification. I hope to include a proposal for a metric of subject identification.

Everyone is an IA [Information Architecture]

Wednesday, February 25th, 2015

Everyone is an IA [Information Architecture] by Dan Ramsden.

From the post:

This is a post inspired by my talk from World IA Day. On the day I had 20 minutes to fill – I did a magic trick and talked about an imaginary uncle. This post has the benefit of an edit, but recreates the central argument – everyone makes IA.

Information architecture is everywhere, it’s a part of every project, every design includes it. But I think there’s often a perception that because it requires a level of specialization to do the most complicated types of IA, people are nervous about how and when they engage with it – no-one like to look out of their depth. And some IA requires a depth of thinking that deserves justification and explanation.

Even when you’ve built up trust with teams of other disciplines or clients, I think one of the most regular questions asked of an IA is probably, ‘Is it really that complicated?’ And if we want to be happier in ourselves, and spread happiness by creating meaningful, beautiful, wonderful things – we need to convince people that complex is different from complicated. We need to share our conviction that IA is a real thing and that thinking like an IA is probably one of the most effective ways of contributing to a more meaningful world.

But we have a challenge, IAs are usually the minority. At the BBC we have a team of about 140 in UX&D, and IAs are the minority – we’re not quite 10%. It’s my job to work out how those less than 1 in 10 can be as effective as possible and have the biggest positive impact on the work we do and the experiences we offer to our audiences. I don’t think this is unique. A lot of the time IAs don’t work together, or there’s not enough IAs to work on every project that could benefit from an IA mindset, which is every project.

This is what troubled me. How could I make sure that it is always designed? My solution to this is simple. We become the majority. And because we can’t do that just by recruiting a legion of IAs we do it another way. We turn everyone in the team into an information architect.

Now this is a bit contentious. There’s legitimate certainty that IA is a specialism and that there are dangers of diluting it. But last year I talked about an IA mindset, a way of approaching any design challenge from an IA perspective. My point then was that the way we tend to think and therefore approach design challenges is usually a bit different from other designers. But I don’t believe we’re that special. I think other people can adopt that mindset and think a little bit more like we do. I think if we work hard enough we can find ways to help designers to adopt that IA mindset more regularly.

And we know the benefits on offer when every design starts from the architecture up. Well-architected things work better. They are more efficient, connected, resilient and meaningful – they’re more useful.

Dan goes on to say that information is everywhere. Much in the same way that I would say that subjects are everywhere.

Just as users must describe information architectures as they experience them, the same is true for users identifying the subjects that are important to them.

There is never a doubt that more IAs and more subjects exist, but the best anyone can do is to tell you about the ones that are important to them and how they have chosen to identify them.

To no small degree, I think terminology has been used to disenfranchise users from discussing subjects as they understand them.

From my own background, I remember a database project where the head of membership services, who ran reports by rote out of R&R, insisted on saying where data needed to reside in tables during a complete re-write of the database. I kept trying, with little success, to get them to describe what they wanted to store and what capabilities they needed.

In retrospect, I should have allowed membership services to use their terminology to describe the database because whether they understood the underlying data architecture or not wasn’t a design goal. The easier course would have been to provide them with a view that accorded with their idea of the database structure and to run their reports. That other “views” of the data existed would have been neither here nor there to them.

As “experts,” we should listen to the description of information architectures and/or identifications of subjects and their relationships as a voyage of discovery. We are discovering the way someone else views the world, not for our correction to the “right” way but so we can enable their view to be more productive and useful to them.

That approach takes more work on the part of “experts” but think of all the things you will learn along the way.

AMR: Not semantics, but close (? maybe ???)

Wednesday, November 5th, 2014

AMR: Not semantics, but close (? maybe ???) by Hal Daumé.

From the post:

Okay, necessary warning. I’m not a semanticist. I’m not even a linguist. Last time I took semantics was twelve years ago (sigh.)

Like a lot of people, I’ve been excited about AMR (the “Abstract Meaning Representation”) recently. It’s hard not to get excited. Semantics is all the rage. And there are those crazy people out there who think you can cram meaning of a sentence into a !#$* vector [1], so the part of me that likes Language likes anything that has interesting structure and calls itself “Meaning.” I effluviated about AMR in the context of the (awesome) SemEval panel.

There is an LREC paper this year whose title is where I stole the title of this post from: Not an Interlingua, But Close: A Comparison of English AMRs to Chinese and Czech by Xue, Bojar, Hajič, Palmer, Urešová and Zhang. It’s a great introduction to AMR and you should read it (at least skim).

What I guess I’m interested in discussing is not the question of whether AMR is a good interlingua but whether it’s a semantic representation. Note that it doesn’t claim this: it’s not called ASR. But as semantics is the study of the relationship between signifiers and denotation, [Edit: it’s a reasonable place to look; see Emily Bender’s comment.] it’s probably the closest we have.

Deeply interesting work, particularly given the recent interest in Enhancing open data with identifiers. Be sure to read the comments to the post as well.

Who knew? Semantics are important!

😉

Topic maps take that a step further and capture your semantics, not necessarily the semantics of some expert unfamiliar with your domain.

Data Modelling: The Thin Model [Entities with only identifiers]

Monday, October 27th, 2014

Data Modelling: The Thin Model by Mark Needham.

From the post:

About a third of the way through Mastering Data Modeling the authors describe common data modelling mistakes and one in particular resonated with me – ‘Thin LDS, Lost Users’.

LDS stands for ‘Logical Data Structure’ which is a diagram depicting what kinds of data some person or group wants to remember. In other words, a tool to help derive the conceptual model for our domain.

They describe the problem that a thin model can cause as follows:

[…] within 30 minutes [of the modelling session] the users were lost…we determined that the model was too thin. That is, many entities had just identifying descriptors.

While this is syntactically okay, when we revisited those entities asking, What else is memorable here? the users had lots to say.

When there was flesh on the bones, the uncertainty abated and the session took a positive course.

I found myself making the same mistake a couple of weeks ago during a graph modelling session. I tend to spend the majority of the time focused on the relationships between the bits of data and treat the meta data or attributes almost as an after thought.

A good example of why subjects need multiple attributes, even multiple identifying attributes.

When sketching just a bare data model, the author, having prepared in advance, is conversant with the scant identifiers. The audience, on the other hand, is not. Additional attributes for each entity quickly remind the audience of the entity in question.

Take this as anecdotal evidence that multiple attributes assist users in recognition of entities (aka subjects).

Will that impact how you identify subjects for your users?

What’s in a Name?

Wednesday, September 10th, 2014

What’s in a Name?

From the webpage:

What will be covered? The meeting will focus on the role of chemical nomenclature and terminology in open innovation and communication. A discussion of areas of nomenclature and terminology where there are fundamental issues, how computer software helps and hinders, the need for clarity and unambiguous definitions for application to software systems. How can you contribute? As well as the talks from expert speakers there will be plenty of opportunity for discussion and networking. A record will be made of the meeting, including the discussion, and will be made available initially to those attending the meeting. The detailed programme and names of speakers will be available closer to the date of the meeting.

Date: 21 October 2014

Event Subject(s): Industry & Technology

Venue

The Royal Society of Chemistry
Library
Burlington House
London
W1J 0BA
United Kingdom

Contact for Event Information

Name: Prof Jeremy Frey

Chemistry
University of Southampton
United Kingdom

Email: j.g.frey@soton.ac.uk

Now there’s an event worth the hassle of overseas travel during these paranoid times! Alas, I will have to wait for the conference record to be released to non-attendees. The event is a good example of the work going on at the Royal Society of Chemistry.

I first saw this in a tweet by Open PHACTS.

Sunday, July 20th, 2014

From the webpage:

The German Record Linkage Center (GermanRLC) was established in 2011 to promote research on record linkage and to facilitate practical applications in Germany. The Center will provide several services related to record linkage applications as well as conduct research on central topics of the field. The services of the GermanRLC are open to all academic disciplines.

Record linkage (RL) refers to the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, databases). Record linkage is necessary when joining data sets based on entities that may or may not share a common identifier (e.g., database key, URI, National identification number), as may be the case due to differences in record shape, storage location, and/or curator style or preference. A data set that has undergone RL-oriented reconciliation may be referred to as being cross-linked. Record Linkage is called Data Linkage in many jurisdictions, but is the same process.

While very similar to topic maps, record linkage relies upon the creation of a common record for further processing, as opposed to pointing into an infoverse to identify subjects in their natural surroundings.

Another difference in practice is that the subjects (headers, fields, etc.) that contain subjects are not themselves treated as subjects with identity. That is to say that how a mapping from an original form was made to the target form is opaque to a subsequent researcher.
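A bare-bones Python sketch of record linkage without a shared identifier follows. It uses exact matching on normalized fields, which is far simpler than the probabilistic methods used in practice, and the records and field names are hypothetical. The one deliberate choice is that each link carries the fields it was matched on, so the basis of the mapping is not opaque to whoever comes next.

```python
def normalize(value: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not block a match."""
    return " ".join(value.lower().split())

def link(left, right, fields):
    """Pair records from two data sets that agree on every listed field after normalization."""
    index = {}
    for rec in right:
        key = tuple(normalize(rec[f]) for f in fields)
        index.setdefault(key, []).append(rec)
    pairs = []
    for rec in left:
        key = tuple(normalize(rec[f]) for f in fields)
        for match in index.get(key, []):
            pairs.append((rec, match, tuple(fields)))  # keep the matching rule with each link
    return pairs

# Hypothetical records from two sources with no common key.
registry = [{"name": "Ada  Lovelace", "born": "1815", "reg_no": "R-1"}]
library = [
    {"name": "ada lovelace", "born": "1815", "shelf": "QA76"},
    {"name": "Ada Lovelace", "born": "1915", "shelf": "PR40"},
]
links = link(registry, library, ["name", "born"])
```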

I first saw this in a tweet by Lars Marius Garshol.

Friday, July 18th, 2014

Build Roads not Stagecoaches by Martin Fenner.

Describing Eric Hysen’s keynote, Martin says:

In his keynote he described how travel from Cambridge to London in the 18th and early 19th century improved mainly as a result of better roads, made possible by changes in how these roads are financed. Translated to today, he urged the audience to think more about the infrastructure and less about the end products:

Ecosystems, not apps

— Eric Hysen

On Tuesday at csv,conf, Nick Stenning – Technical Director of the Open Knowledge Foundation – talked about data packages, an evolving standard to describe data that are passed around between different systems. He used the metaphor of containers, and how they have dramatically changed the transportation of goods in the last 50 years. He argued that the cost of shipping was in large part determined by the cost of loading and unloading, and the container has dramatically changed that equation. We are in a very similar situation with datasets, where most of the time is spent translating between different formats, joining things together that use different names for the same thing [emphasis added], etc.

…different names for the same thing.

Have you heard that before? 😉

But here is the irony:

When I thought more about this I realized that these building blocks are exactly the projects I get most excited about, i.e. projects that develop standards or provide APIs or libraries. Some examples would be

• ORCID: unique identifiers for scholarly authors

OK, but many authors already have unique identifiers in DBLP, Library of Congress, Twitter, and at places I have not listed.

Nothing against ORCID, but adding yet another identifier isn’t all that helpful.

A mapping between identifiers, so having one means I can leverage the others, now that is what I call infrastructure.
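Here is a rough Python sketch of what I mean by a mapping between identifiers. The class and the identifier values are hypothetical placeholders, not real ORCID, DBLP or Twitter records. Each assertion that two identifiers name the same author merges their groups, so holding any one identifier gets you all the others.

```python
class IdentifierMap:
    """Groups identifiers that have been asserted to name the same subject."""

    def __init__(self):
        self._group = {}  # identifier -> set of all identifiers in its group

    def same(self, a, b):
        """Record that identifiers a and b name the same author."""
        merged = self._group.get(a, {a}) | self._group.get(b, {b})
        for ident in merged:
            self._group[ident] = merged

    def others(self, ident):
        """Every other identifier known for the subject that ident names."""
        return self._group.get(ident, {ident}) - {ident}

# Placeholder identifiers: (system, value) pairs invented for illustration.
orcid = ("orcid", "example-0001")
dblp = ("dblp", "example/author")
twitter = ("twitter", "@example_author")

ids = IdentifierMap()
ids.same(orcid, dblp)
ids.same(dblp, twitter)
```

Note that nobody asserted orcid and twitter directly. The link falls out of the two mappings that were recorded.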

You?

Communicating and resolving entity references

Friday, June 27th, 2014

Communicating and resolving entity references by R.V. Guha.

Abstract:

Statements about entities occur everywhere, from newspapers and web pages to structured databases. Correlating references to entities across systems that use different identifiers or names for them is a widespread problem. In this paper, we show how shared knowledge between systems can be used to solve this problem. We present “reference by description”, a formal model for resolving references. We provide some results on the conditions under which a randomly chosen entity in one system can, with high probability, be mapped to the same entity in a different system.

An eye appointment is going to prevent me from reading this paper closely today.

From a quick scan, do you think Guha is making a distinction between entities and subjects (in the topic map sense)?

What do you make of literals having no identity beyond their encoding? (page 4, #3)

Redundant descriptions? (page 7) Would you say that defining a set of properties that must match would qualify? (Or even just additional subject indicators?)
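To give the questions something to push against, here is a toy Python sketch of resolving a reference from a description. It is only loosely inspired by the abstract’s phrase “reference by description” and is not Guha’s formal model. The entities and properties are hypothetical. A description resolves only when exactly one entity matches every property in it.

```python
def resolve(description: dict, entities: dict):
    """Return the id of the single entity matching every property in the description, else None."""
    matches = [
        eid for eid, props in entities.items()
        if all(props.get(key) == value for key, value in description.items())
    ]
    return matches[0] if len(matches) == 1 else None

# Hypothetical knowledge held by the receiving system.
entities = {
    "e1": {"name": "Springfield", "type": "city", "state": "Illinois"},
    "e2": {"name": "Springfield", "type": "city", "state": "Missouri"},
}

ambiguous = resolve({"name": "Springfield"}, entities)
resolved = resolve({"name": "Springfield", "state": "Missouri"}, entities)
```

The name alone is not enough here. Adding one more shared property is what makes the reference resolvable, which is at least in the neighborhood of requiring a set of properties that must match.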

Expect to see a lot more comments on this paper.

Enjoy!

I first saw this in a tweet by Stefano Bertolo.

What You Thought The Supreme Court…

Sunday, June 15th, 2014

Clever piece of code exposes hidden changes to Supreme Court opinions by Jeff John Roberts.

From the post:

Supreme Court opinions are the law of the land, and so it’s a problem when the Justices change the words of the decisions without telling anyone. This happens on a regular basis, but fortunately a lawyer in Washington appears to have just found a solution.

The issue, as Adam Liptak explained in the New York Times, is that original statements by the Justices about everything from EPA policy to American Jewish communities, are disappearing from decisions — and being replaced by new language that says something entirely different. As you can imagine, this is a problem for lawyers, scholars, journalists and everyone else who relies on Supreme Court opinions.

Until now, the only way to detect when a decision has been altered is a pain-staking comparison of earlier and later copies — provided, of course, that someone knew a decision had been changed in the first place. Thanks to a simple Twitter tool, the process may become much easier.

See Jeff’s post for more details, including a Twitter account to follow for newly discovered changes in the opinions of the Supreme Court of the United States.

In a nutshell, the court issues “slip” opinions in cases it decides and then later, sometimes years later, provides a small group of publishers of its opinions with changes to be made to those opinions.

Which means the opinion you read as a “slip” opinion or in an advance sheet (a paperback issue that is followed by a hardbound volume combining one or more advance sheets) may not be the opinion of record down the road.

Two questions occur to me immediately:

1. We can distinguish the “slip” opinion version of an opinion from the “final” published opinion, but how do we distinguish a “final” published decision from a later “more final” published decision? Given the stakes at hand in proceedings before the Supreme Court, certainty about the prior opinions of the Court is very important.
2. While the Supreme Court always gets most of the attention, it occurs to me that the same process of silent correction has been going on for other courts with published opinions, such as the United States Courts of Appeals and the United States District Courts, perhaps for the last century or more.

Which makes it only a small step to ask about state supreme courts and their courts of appeal. What is their record on silent correction of opinions?

The mechanical difficulties grow with the age of the records, because the “slip” opinions may be lost to history. But in terms of volume, discovering and documenting the behavior of courts over time with regard to silent correction of opinions would certainly be a “big data” project for legal informatics.
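Comparing an earlier and a later copy of an opinion is mechanical once both texts are in hand. The following is a minimal sketch using Python's standard `difflib` module; it is not the tool described in Jeff's post, and the two sample texts and the function name `changed_lines` are invented for illustration.

```python
import difflib


def changed_lines(earlier_text, later_text):
    """Return the lines removed ('-') or added ('+') between two versions."""
    diff = list(
        difflib.unified_diff(
            earlier_text.splitlines(),
            later_text.splitlines(),
            fromfile="earlier",
            tofile="later",
            lineterm="",
        )
    )
    # The first two diff lines are the '---' / '+++' file headers; skip them,
    # then keep only removals and additions (not '@@' hunks or context lines).
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


# Hypothetical text, not taken from any actual opinion.
slip = "The court holds X.\nThe agency asked for more authority.\n"
final = "The court holds X.\nThe agency asked for less authority.\n"

for line in changed_lines(slip, final):
    print(line)
```

The same comparison works between a “final” and a later “more final” version, provided each version is kept and labeled with when and where it was obtained.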

What you thought the Supreme Court said may not be what our current record reflects. Who wins? What you heard or what a silently corrected record reports?

Emotion Markup Language 1.0 (No Repeat of RDF Mistake)

Sunday, May 25th, 2014

Emotion Markup Language (EmotionML) 1.0

Abstract:

As the Web is becoming ubiquitous, interactive, and multimodal, technology needs to deal increasingly with human factors, including emotions. The specification of Emotion Markup Language 1.0 aims to strike a balance between practical applicability and scientific well-foundedness. The language is conceived as a “plug-in” language suitable for use in three different areas: (1) manual annotation of data; (2) automatic recognition of emotion-related states from user behavior; and (3) generation of emotion-related system behavior.

I started reading EmotionML expecting that the W3C had repeated its mistake from RDF: one way, and one way only, for identification.

Much to my pleasant surprise I found:

1.2 The challenge of defining a generally usable Emotion Markup Language

Any attempt to standardize the description of emotions using a finite set of fixed descriptors is doomed to failure: even scientists cannot agree on the number of relevant emotions, or on the names that should be given to them. Even more basically, the list of emotion-related states that should be distinguished varies depending on the application domain and the aspect of emotions to be focused. Basically, the vocabulary needed depends on the context of use. On the other hand, the basic structure of concepts is less controversial: it is generally agreed that emotions involve triggers, appraisals, feelings, expressive behavior including physiological changes, and action tendencies; emotions in their entirety can be described in terms of categories or a small number of dimensions; emotions have an intensity, and so on. For details, see Scientific Descriptions of Emotions in the Final Report of the Emotion Incubator Group.

Given this lack of agreement on descriptors in the field, the only practical way of defining an EmotionML is the definition of possible structural elements and their valid child elements and attributes, but to allow users to “plug in” vocabularies that they consider appropriate for their work. A separate W3C Working Draft complements this specification to provide a central repository of [Vocabularies for EmotionML] which can serve as a starting point; where the vocabularies listed there seem inappropriate, users can create their custom vocabularies.

An additional challenge lies in the aim to provide a generally usable markup, as the requirements arising from the three different use cases (annotation, recognition, and generation) are rather different. Whereas manual annotation tends to require all the fine-grained distinctions considered in the scientific literature, automatic recognition systems can usually distinguish only a very small number of different states.

For the reasons outlined here, it is clear that there is an inevitable tension between flexibility and interoperability, which need to be weighed in the formulation of an EmotionML. The guiding principle in the following specification has been to provide a choice only where it is needed, and to propose reasonable default options for every choice.

Everything that is said about emotions is equally true for identification, emotions being one of the infinite set of subjects that you might want to identify.

Had the W3C avoided the one identifier scheme of RDF (and the reliance on a subset of reasoning, logic), RDF could have had plugin “identifier” modules, enabling the use of all extant and future identifiers, not to mention “reasoning” according to the designs of users.

It is good to see the W3C learning from its earlier mistakes and enabling users to express their world views, as opposed to a world view as prescribed by the W3C.

When users declare their emotional vocabularies, those are subjects which merit further identification, to avoid the problem of us not meaning the same thing by “owl:sameAs” as someone else means by “owl:sameAs.” (When owl:sameAs isn’t the Same: An Analysis of Identity Links on the Semantic Web by Harry Halpin, Ivan Herman, Patrick J. Hayes.)

Topic maps are a good solution for documenting subject identity and deciding when two or more identifications of subjects are the same subject.

I first saw this in a tweet by Inge Henriksen.

…Locality Sensitive Hashing for Unstructured Data

Friday, May 9th, 2014

From the post:

The purpose of this article is to demonstrate how the practical Data Scientist can implement a Locality Sensitive Hashing system from start to finish in order to drastically reduce the search time typically required in high dimensional spaces when finding similar items. Locality Sensitive Hashing accomplishes this efficiency by exponentially reducing the amount of data required for storage when collecting features for comparison between similar item sets. In other words, Locality Sensitive Hashing successfully reduces a high dimensional feature space while still retaining a random permutation of relevant features which research has shown can be used between data sets to determine an accurate approximation of Jaccard similarity [2,3].

Complete with code and references no less!

How “similar” do two items need to be to count as the same item?

If two libraries own a physical copy of the same book, for some purposes they are distinct items but for annotations/reviews, you could treat them as one item.

If that sounds like a topic map-like question, you’re right!

What measures of similarity are you applying to what subjects?
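For a sense of what the quoted post is approximating, here is a minimal MinHash sketch in Python. It covers only the signature step that estimates Jaccard similarity; the banding step that makes up Locality Sensitive Hashing proper is omitted. It is not the code from the post. The function names, the choice of SHA-1 as the hash family, and the default of 128 hashes are assumptions made for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Split text into the set of its k-word shingles (never empty)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def jaccard(a, b):
    """Exact Jaccard similarity: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(seed, item):
    """One member of a seeded hash family, built from SHA-1 for determinism."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=128):
    """For each seeded hash, keep the minimum value over a non-empty item set."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of signature positions that agree estimates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


a = shingles("the quick brown fox jumps over the lazy dog")
b = shingles("the quick brown fox leaps over the lazy dog")
print(jaccard(a, b), estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

The estimate says how similar two items are. Whether that is similar enough to treat them as the same subject is a threshold you choose for a particular purpose.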

We have no “yellow curved fruit” today

Thursday, April 24th, 2014

Tweeted by Olivier Croisier with this comment:

Looks like naming things is hard not only in computer science…

Naming (read identity) problems are everywhere.

Our intellectual cocoons often prevent us from noticing such problems.

At least until something goes terribly wrong. Then the hunt is on for a scapegoat, not an explanation.