Archive for the ‘Subject Identity’ Category

Information organization and the philosophy of history

Tuesday, May 14th, 2013

Information organization and the philosophy of history by Ryan Shaw. (Shaw, R. (2013), Information organization and the philosophy of history. J. Am. Soc. Inf. Sci., 64: 1092–1103. doi: 10.1002/asi.22843)

Abstract:

The philosophy of history can help articulate problems relevant to information organization. One such problem is “aboutness”: How do texts relate to the world? In response to this problem, philosophers of history have developed theories of colligation describing how authors bind together phenomena under organizing concepts. Drawing on these ideas, I present a theory of subject analysis that avoids the problematic illusion of an independent “landscape” of subjects. This theory points to a broad vision of the future of information organization and some specific challenges to be met.

You are unlikely to find this article directly actionable in your next topic map project.

On the other hand, if you enjoy the challenge of thinking about how we think, you will find it a real treat.

Shaw writes:

Different interpretive judgments result in overlapping and potentially contradictory organizing principles. Organizing systems ought to make these overlappings evident and show the contours of differences in perspective that distinguish individual judgments. Far from providing a more “complete” view of a static landscape, organizing systems should multiply and juxtapose views. As Geoffrey Bowker (2005) has argued,

the goal of metadata standards should not be to produce a convergent unity. We need to open a discourse—where there is no effective discourse now—about the varying temporalities, spatialities and materialities that we might represent in our databases, with a view to designing for maximum flexibility and allowing as much as possible for an emergent polyphony and polychrony. (pp. 183–184)

The demand for polyphony and polychrony leads to a second challenge, which is to find ways to open the construction of organizing systems to wider participation. How might academics, librarians, teachers, public historians, curators, archivists, documentary editors, genealogists, and independent scholars all contribute to a shared infrastructure for linking and organizing historical discourse through conceptual models? If this challenge can be addressed, the next generation of organizing systems could provide the infrastructure for new kinds of collaborative scholarship and organizing practice.

Once upon a time, you could argue that physical limitations of cataloging systems meant that a single classification system (convergent unity) was necessary for systems to work at all.

But that was an artifact of the physical medium of the catalog.

The deepest irony of the digital age is the continuation of the single classification system requirement, a requirement past its discard date.

Construction of Controlled Vocabularies

Tuesday, April 2nd, 2013

Construction of Controlled Vocabularies: A Primer by Marcia Lei Zeng.

From the “why” page:

Vocabulary control is used to improve the effectiveness of information storage and retrieval systems, Web navigation systems, and other environments that seek to both identify and locate desired content via some sort of description using language. The primary purpose of vocabulary control is to achieve consistency in the description of content objects and to facilitate retrieval.

1.1 Need for Vocabulary Control (1.1)

The need for vocabulary control arises from two basic features of natural language, namely:

• Two or more words or terms can be used to represent a single concept

Example:
  salinity/saltiness
  VHF/Very High Frequency

• Two or more words that have the same spelling can represent different concepts

Example:
  Mercury (planet)
  Mercury (metal)
  Mercury (automobile)
  Mercury (mythical being)

Great examples for vocabulary control but for topic maps as well!

The topic map question is:

What do you know about the subject(s) in either case, that would make you say the words mean the same subject or different subjects?

If we can capture the information you think makes them represent the same or different subjects, there is a basis for repeating that comparison.

Perhaps even automatically.
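
As a rough illustration of what capturing that information might look like, here is a minimal Python sketch. The property names (`kind`, `measures`, `orbits`, `symbol`) and the `same_subject` rule are my own assumptions for the example; nothing here is taken from the primer or from any topic maps standard.

```python
# Hypothetical descriptions: each use of a word carries the properties
# a person relied on when deciding what the word refers to.
uses = [
    {"term": "Mercury", "kind": "planet", "orbits": "Sun"},
    {"term": "Mercury", "kind": "metal", "symbol": "Hg"},
    {"term": "salinity", "kind": "property", "measures": "dissolved salt"},
    {"term": "saltiness", "kind": "property", "measures": "dissolved salt"},
]

def same_subject(a, b, keys=("kind", "measures")):
    """Assumed rule: two uses name one subject when every listed key
    is present in both descriptions and the values agree."""
    return all(k in a and k in b and a[k] == b[k] for k in keys)

def compare(uses):
    """Repeat the comparison for every pair and record the verdicts."""
    verdicts = {}
    for i, a in enumerate(uses):
        for b in uses[i + 1:]:
            key = (a["term"], a["kind"], b["term"], b["kind"])
            verdicts[key] = same_subject(a, b)
    return verdicts
```

The spelling of the term plays no part in the rule. Two identical spellings can come out as different subjects and two different spellings as the same subject, because the decision rests on the recorded properties.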

Mary Jane pointed out this resource in a recent comment.

The Next 700 Programming Languages [Essence of Topic Maps]

Saturday, March 16th, 2013

The Next 700 Programming Languages by P. J. Landin.

ABSTRACT:

A family of unimplemented computing languages is described that is intended to span differences of application area by a unified framework. This framework dictates the rules about the uses of user-coined names, and the conventions about characterizing functional relationships. Within this framework the design of a specific language splits into two independent parts. One is the choice of written appearances of programs (or more generally, their physical representation). The other is the choice of the abstract entities (such as numbers, character-strings, lists of them, functional relations among them) that can be referred to in the language.

The system is biased towards “expressions” rather than “statements.” It includes a nonprocedural (purely functional) subsystem that aims to expand the class of users’ needs that can be met by a single print-instruction, without sacrificing the important properties that make conventional right-hand-side expressions easy to construct and understand.

The introduction to this paper reminded me of an acronym, SWIM (See What I Mean) that was coined to my knowledge by Michel Biezunski several years ago:

Most programming languages are partly a way of expressing things in terms of other things and partly a basic set of given things. The ISWIM (If you See What I Mean) system is a byproduct of an attempt to disentangle these two aspects in some current languages.

This attempt has led the author to think that many linguistic idiosyncracies are concerned with the former rather than the latter, whereas aptitude for a particular class of tasks is essentially determined by the latter rather than the former. The conclusion follows that many language characteristics are irrelevant to the alleged problem orientation.

ISWIM is an attempt at a general purpose system for describing things in terms of other things, that can be problem-oriented by appropriate choice of “primitives.” So it is not a language so much as a family of languages, of which each member is the result of choosing a set of primitives. The possibilities concerning this set and what is needed to specify such a set are discussed below.

The essence of topic maps is captured by:

ISWIM is an attempt at a general purpose system for describing things in terms of other things, that can be problem-oriented by appropriate choice of “primitives.”

Every information system has a set of terms, the meanings of which are known to its designers and/or users.

Data integration issues arise from the description of terms, “in terms of other things,” being known only to designers and users.

The power of topic maps comes from the expression of descriptions “in terms of other things,” for terms.

Other designers or users can examine those descriptions to see if they recognize any terms similar to those they know by other descriptions.

If they discover descriptions they consider to be of the same thing, they can then create a mapping of those terms.

Hopefully using the descriptions as a basis for the mapping. A mapping of term to term only multiplies the opaqueness of the terms.
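
A small Python sketch may help show the difference between mapping on descriptions and mapping term to term. Everything in it is hypothetical: the two systems, their terms, the phrases used as descriptions, and the choice of Jaccard overlap with a 0.6 threshold as the test for "similar enough."

```python
# Two hypothetical systems. Each term is described "in terms of other
# things": here, simply a set of descriptive phrases (an assumption).
system_a = {
    "SSN": {"nine digits", "issued by SSA", "identifies a person"},
    "DOB": {"calendar date", "when a person was born"},
}
system_b = {
    "soc_sec_no": {"nine digits", "issued by SSA", "identifies a person"},
    "hire_date": {"calendar date", "when a person was hired"},
}

def jaccard(x, y):
    """Overlap of two description sets, from 0.0 (disjoint) to 1.0 (equal)."""
    return len(x & y) / len(x | y) if x | y else 0.0

def propose_mappings(a, b, threshold=0.6):
    """Return candidate mappings, each carrying the shared descriptions
    that justify it, so the mapping is not just an opaque term pair."""
    proposals = []
    for term_a, desc_a in a.items():
        for term_b, desc_b in b.items():
            score = jaccard(desc_a, desc_b)
            if score >= threshold:
                shared = sorted(desc_a & desc_b)
                proposals.append((term_a, term_b, score, shared))
    return proposals
```

With this data, `propose_mappings(system_a, system_b)` proposes only the `SSN` to `soc_sec_no` pair, and the proposal carries the three shared phrases as its justification. A later reader can inspect the basis for the mapping instead of taking it on faith.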

For some systems, Social Security Administration databases for example, descriptions of terms “in terms of other things” may not be part of the database itself but may instead be maintained as “best practice” to facilitate later maintenance and changes.

Other systems, the U.S. intelligence community for example, are still chasing the will-o’-the-wisp* of standard terminology for non-standard terms. For them, even the possibility of interchange depends on the development of descriptions of terms “in terms of other things.”

Before you ask, yes, yes the Topic Maps Data Model (TMDM) and the various Topic Maps syntaxes are terms that can be described “in terms of other things.”

The advantage of the TMDM and relevant syntaxes is that even if not described “in terms of other things,” standardized terms enable interchange of a class of mappings. The default identification mapping in the TMDM being by IRIs.

Before and since Landin’s article we have been producing terms that could be described “in terms of other things.” In CS and other areas of human endeavor as well.

Isn’t it about time we started describing our terms rather than clamoring for one set of undescribed terms or another?


* I use the term will-o’-the-wisp quite deliberately.

After decades of failure to create universal information systems with computers, following on centuries of non-computer failures to reach the same goal, following on millennia of semantic and linguistic diversity, someone knows attempts at universal information systems will leave intelligence agencies not sharing critical data.

Perhaps the method you choose says a great deal about the true goals of your project.

I first saw this in a tweet by CompSciFact.

Onomastics 2.0 – The Power of Social Co-Occurrences

Monday, March 11th, 2013

Onomastics 2.0 – The Power of Social Co-Occurrences by Folke Mitzlaff, Gerd Stumme.

Abstract:

Onomastics is “the science or study of the origin and forms of proper names of persons or places.” ["Onomastics". Merriam-Webster.com, 2013. this http URL (11 February 2013)]. Especially personal names play an important role in daily life, as all over the world future parents are facing the task of finding a suitable given name for their child. This choice is influenced by different factors, such as the social context, language, cultural background and, in particular, personal taste.

With the rise of the Social Web and its applications, users more and more interact digitally and participate in the creation of heterogeneous, distributed, collaborative data collections. These sources of data also reflect current and new naming trends as well as new emerging interrelations among names.

The present work shows, how basic approaches from the field of social network analysis and information retrieval can be applied for discovering relations among names, thus extending Onomastics by data mining techniques. The considered approach starts with building co-occurrence graphs relative to data from the Social Web, respectively for given names and city names. As a main result, correlations between semantically grounded similarities among names (e.g., geographical distance for city names) and structural graph based similarities are observed.

The discovered relations among given names are the foundation of “nameling” [this http URL], a search engine and academic research platform for given names which attracted more than 30,000 users within four months, underpinning the relevance of the proposed methodology.

Interesting work on the co-occurrence of names.

Chosen names in this case but I wonder if the same would be true for false names?

Are there patterns to false names chosen by actors who are attempting to conceal their identities?

I first saw this in a tweet by Stefano Bertolo.

VIAF: The Virtual International Authority File

Wednesday, March 6th, 2013

VIAF: The Virtual International Authority File

From the webpage:

VIAF, implemented and hosted by OCLC, is a joint project of several national libraries plus selected regional and trans-national library agencies. The project’s goal is to lower the cost and increase the utility of library authority files by matching and linking widely-used authority files and making that information available on the Web.

The “about” link at the bottom of the page is broken (in the English version). A working “about” link for VIAF reports:

At a glance

  • A collaborative effort between national libraries and organizations contributing name authority files, furthering access to information
  • All authority data for a given entity is linked together into a “super” authority record
  • A convenient way for the library community and other agencies to repurpose bibliographic data produced by libraries serving different language communities

The Virtual International Authority File (VIAF) is an international service designed to provide convenient access to the world’s major name authority files. Its creators envision the VIAF as a building block for the Semantic Web to enable switching of the displayed form of names for persons to the preferred language and script of the Web user. VIAF began as a joint project with the Library of Congress (LC), the Deutsche Nationalbibliothek (DNB), the Bibliothèque nationale de France (BNF) and OCLC. It has, over the past decade, become a cooperative effort involving an expanding number of other national libraries and other agencies. At the beginning of 2012, contributors include 20 agencies from 16 countries.

Most large libraries maintain lists of names for people, corporations, conferences, and geographic places, as well as lists to control works and other entities. These lists, or authority files, have been developed and maintained in distinctive ways by individual library communities around the world. The differences in how to approach this work become evident as library data from many communities is combined in shared catalogs such as OCLC’s WorldCat.

VIAF helps to make library authority files less expensive to maintain and more generally useful to the library domain and beyond. To achieve this, VIAF matches and links the authority files of national libraries and groups all authority records for a given entity into a merged “super” authority record that brings together the different names for that entity. By linking disparate names for the same person or organization, VIAF provides a convenient means for a wider community of libraries and other agencies to repurpose bibliographic data produced by libraries serving different language communities.

If you were to substitute the term “topic” for “‘super’ authority record,” you would be part of the way towards a topic map.

Topics gather information about a given entity into a single location.

Topics differ from the authority records you find at VIAF in two very important ways:

  1. First, topics, unlike authority records, have the ability to merge with other topics, creating new topics that have more information than any of the original topics.
  2. Second, authority records are created by, well, authorities. Do you see your name or the name of your organization on the list at VIAF? Topics can be created by anyone and merged with other topics on terms chosen by the possessor of the topic map. You don’t have to wait for an authority to create the topic or approve your merging of it.
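
To make the first difference concrete, here is a minimal sketch of merging in Python. The `Topic` class, its fields, and the example.org identifiers are all hypothetical. The rule used (merge whenever two topics share an identifier) is only loosely modeled on identification by IRIs and should not be read as an implementation of the TMDM.

```python
from dataclasses import dataclass, field

@dataclass
class Topic:
    identifiers: set = field(default_factory=set)  # IRIs, made up here
    names: set = field(default_factory=set)

def merge(a, b):
    """Return a new topic holding everything either topic knew."""
    return Topic(a.identifiers | b.identifiers, a.names | b.names)

def merge_all(topics):
    """Merge every group of topics that share at least one identifier."""
    merged = []
    for topic in topics:
        # Pull out each already-merged topic that overlaps with this one.
        overlapping = [m for m in merged if m.identifiers & topic.identifiers]
        for m in overlapping:
            merged.remove(m)
            topic = merge(topic, m)
        merged.append(topic)
    return merged

topics = [
    Topic({"http://example.org/id/twain"}, {"Mark Twain"}),
    Topic({"http://example.org/id/clemens"}, {"Samuel Clemens"}),
    Topic({"http://example.org/id/twain", "http://example.org/id/clemens"},
          {"Twain, Mark, 1835-1910"}),
]
result = merge_all(topics)
```

The first two topics share nothing with each other, yet the third overlaps with both, so `result` holds a single topic with all three names. That single topic has more information than any of the originals, and no authority had to approve it.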

There are definite advantages to having authorities and authority records, but there are also advantages to having the freedom to describe your world, in your terms.

Hellerstein: Humans are the Bottleneck [Not really]

Saturday, March 2nd, 2013

Hellerstein: Humans are the Bottleneck by Isaac Lopez.

From the post:

Humans are the bottleneck right now in the data space, commented database systems luminary, Joe Hellerstein during an interview this week at Strata 2013.

“As Moore’s law drives the cost of computing down, and as data becomes more prevalent as a result, what we see is that the remaining bottleneck in computing costs is the human factor,” says Hellerstein, one of the fathers of adaptive query processing and a half dozen other database technologies.

Hellerstein says that recent research studies conducted at Stanford and Berkeley have found that 50-80 percent of a data analyst’s time is being used for the data grunt work (with the rest left for custom coding, analysis, and other duties).

“Data prep, data wrangling, data munging are words you hear over and over,” says Hellerstein. “Even with very highly skilled professionals in the data analysis space, this is where they’re spending their time, and it really is a big bottleneck.”

Just because humans gather at a common location, in “data prep, data wrangling, data munging,” doesn’t mean they “are the bottleneck.”

The question to ask is: Why are people spending so much time at location X in data processing?

Answer: poor data quality, or rather the inability of machines to effectively process data from different origins. That’s the bottleneck.

A problem that management of subject identities for data and its containers is uniquely poised to solve.

G2 | Sensemaking – Two Years Old Today

Sunday, February 3rd, 2013

G2 | Sensemaking – Two Years Old Today by Jeff Jonas.

From the post:

What is G2?

When I speak about Context Accumulation, Data Finds Data and Relevance Finds You, and Sensemaking I am describing various aspects of G2.

In simple terms G2 software is designed to integrate diverse observations (data) as it arrives, in real-time.  G2 does this incrementally, piece by piece, much in the same way you would put a puzzle together at home.  And just like at home, the more puzzle pieces integrated into the puzzle, the more complete the picture.  The more complete the picture, the better the ability to make sense of what has happened in the past, what is happening now, and what may come next.  Users of G2 technology will be more efficient, deliver high quality outcomes, and ultimately will be more competitive.

Early adopters seem to be especially interested in one specific use case: Using G2 to help organizations better direct the attention of its finite workforce.  With the workforce now focusing on the most important things first, G2 is then used to improve the quality of analysis while at the same time reducing the amount of time such analysis takes.  The bigger the organization, the bigger the observation space, the more essential sensemaking is.

About Sensemaking

One of the things G2 can already do pretty darn well – considering she just turned two years old – is “Sensemaking.”  Imagine a system capable of paying very close attention to every observation that comes its way.  Each observation incrementally improving upon the picture and using this emerging picture in real-time to make higher quality business decisions; for example, the selection of the perfect ad for a web page (in sub-200 milliseconds as the user navigates to the page) or raising an alarm to a human for inspection (an alarm sufficiently important to be placed top of the queue).  G2, when used this way, enables Enterprise Intelligence.

Of course there is no magic.  Sensemaking engines are limited by their available observation space.  If a sentient being would be unable to make sense of the situation based on the available observation space, neither would G2.  I am not talking about Fantasy Analytics here.

I would say “subject identity” instead of “sensemaking” and after reading Jeff’s post, consider them to be synonyms.

Read the section General Purpose Context Accumulation very carefully.

As well as “Privacy by Design (PbD).”

BTW, G2 uses Universal Message Format XML for input/output.

Not to argue from authority but Jeff is one of only 77 active IBM Research Fellows.

Someone to listen to, even if we may disagree on some of the finer points.

Making Sense of Others’ Data Structures

Sunday, February 3rd, 2013

Making Sense of Others’ Data Structures by Eruditio Loginquitas.

From the post:

Coming in as an outsider to others’ research always requires an investment of time and patience. After all, how others conceptualize their fields, and how they structure their questions and their probes, and how they collect information, and then how they represent their data all reflect their understandings, their theoretical and analytical approaches, their professional training, and their interests. When professionals collaborate, they will approach a confluence of understandings and move together in a semi-united way. Individual researchers—not so much. But either way, for an outsider, there will have to be some adjustment to understand the research and data. Professional researchers strive to control for error and noise at every stage of the research: the hypothesis, literature review, design, execution, publishing, and presentation.

Coming into a project after the data has been collected and stored in Excel spreadsheets means that the learning curve is high in yet another way: data structures. While the spreadsheet itself seems pretty constrained and defined, there is no foregone conclusion that people will necessarily represent their data a particular way.

Data structures as subjects. What a concept! ;-)

Data structures, contrary to some, are not self-evident or self-documenting.

Not to mention that, like ourselves, they are in a constant state of evolution as our understanding or perception of data changes.

Mine is not the counsel of despair, but of encouragement to consider the costs/benefits of capturing data structure subject identities, just as you would for more traditional subjects.

It may be that costs or other constraints prevent such capture, but you may also miss benefits if you don’t ask.

How much did it cost for each transition in episodic data governance efforts to re-establish data structure subject identities?

Could be that more money spent now would get an enterprise off the perpetual cycle of data governance.

ToxPi GUI [Data Recycling]

Sunday, February 3rd, 2013

ToxPi GUI: an interactive visualization tool for transparent integration of data from diverse sources of evidence by David M. Reif, Myroslav Sypa, Eric F. Lock, Fred A. Wright, Ander Wilson, Tommy Cathey, Richard R. Judson and Ivan Rusyn. (Bioinformatics (2013) 29 (3): 402-403. doi: 10.1093/bioinformatics/bts686)

Abstract:

Motivation: Scientists and regulators are often faced with complex decisions, where use of scarce resources must be prioritized using collections of diverse information. The Toxicological Prioritization Index (ToxPi™) was developed to enable integration of multiple sources of evidence on exposure and/or safety, transformed into transparent visual rankings to facilitate decision making. The rankings and associated graphical profiles can be used to prioritize resources in various decision contexts, such as testing chemical toxicity or assessing similarity of predicted compound bioactivity profiles. The amount and types of information available to decision makers are increasing exponentially, while the complex decisions must rely on specialized domain knowledge across multiple criteria of varying importance. Thus, the ToxPi bridges a gap, combining rigorous aggregation of evidence with ease of communication to stakeholders.

Results: An interactive ToxPi graphical user interface (GUI) application has been implemented to allow straightforward decision support across a variety of decision-making contexts in environmental health. The GUI allows users to easily import and recombine data, then analyze, visualize, highlight, export and communicate ToxPi results. It also provides a statistical metric of stability for both individual ToxPi scores and relative prioritized ranks.

Availability: The ToxPi GUI application, complete user manual and example data files are freely available from http://comptox.unc.edu/toxpi.php.

Contact: reif.david@gmail.com

Very cool!

Although like having a Ford automobile in any color, so long as the color was black, you can integrate any data source, so long as the format is csv. And values are numbers. Subject to other restrictions as well.

That’s an observation, not a criticism.

The application serves a purpose within a domain and does not “integrate” information in the sense of a topic map.

But a topic map could recycle its data to add other identifications and properties. Without having to re-write this application or its data.

Once curated, data should be re-used, not re-created/curated.

Topic maps give you more bang for your data buck.

Music Network Visualization

Tuesday, December 11th, 2012

Music Network Visualization by Dimiter Toshkov.

From the post:

My music interests have always been rather, hmm…, eclectic. Somehow IDM, ambient, darkwave, triphop, acid jazz, bossa nova, qawali, Mali blues and other more or less obscure genres have managed to happily co-exist in my music collection. The sheer diversity always invited the question whether there is some structure to the collection, or each genre is an island of its own. Sounds like a job for network visualization!

Now, there are plenty of music network viz applications on the web. But they don’t show my collection, and just seem unsatisfactory for various reasons. So I decided to craft my own visualization using R and igraph.

Interesting for the visualization but also the use of similarity measures.

The test for identity of a subject, particularly collective subjects, artists “similar” to X, is as unlimited as your imagination.

Detecting Communities in Social Graph [Communities of Representatives?]

Thursday, November 29th, 2012

Detecting Communities in Social Graph by Ricky Ho.

From the post:

In analyzing social network, one common problem is how to detecting communities, such as groups of people who knows or interacting frequently with each other. Community is a subgraph of a graph where the connectivity are unusually dense.

In this blog, I will enumerate some common algorithms on finding communities.

First of all, community detection can be think of graph partitioning problem. In this case, a single node will belong to no more than one community. In other words, community does not overlap with each other.

When you read:

community detection can be think of graph partitioning problem. In this case, a single node will belong to no more than one community.

What does that remind you of?

Does it stand to reason that representatives of the same subject, some with more, some with less information about a subject, would exhibit the same “connectivity” that Ricky calls “unusually dense?”

The TMDM defines a basis for “unusually dense” connectivity but what if we are exploring other representatives of subjects? And trying to detect likely representatives of the same subject?

How would you use graph partitioning to explore such representatives?

That could make a fairly interesting research project for anyone wanting to merge diverse intelligence about some subject or person together.
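
One very simple place to start, sketched in Python below, is to link representatives that share enough property/value pairs and then take the connected components of the resulting graph. To be plain about it: connected components is a stand-in that I have swapped in for brevity. It is far cruder than the community detection algorithms Ricky enumerates, and the representatives, their properties, and the `min_shared` threshold are all invented for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical representatives of subjects, each with whatever
# properties its source happened to record.
reps = {
    "r1": {"name": "J. Smith", "email": "js@example.org", "city": "Leeds"},
    "r2": {"email": "js@example.org", "city": "Leeds", "phone": "555-0100"},
    "r3": {"name": "John Smith", "phone": "555-0100", "city": "Leeds"},
    "r4": {"name": "A. Jones", "city": "York"},
}

def shared(a, b):
    """Count the property/value pairs two representatives have in common."""
    return len(set(a.items()) & set(b.items()))

def partition(reps, min_shared=2):
    """Link representatives sharing at least min_shared pairs, then
    return the connected components of that graph."""
    graph = defaultdict(set)
    for x, y in combinations(reps, 2):
        if shared(reps[x], reps[y]) >= min_shared:
            graph[x].add(y)
            graph[y].add(x)
    seen, groups = set(), []
    for start in reps:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # depth-first walk from this starting node
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(graph[node] - group)
        seen |= group
        groups.append(group)
    return groups
```

With the default threshold, `r1` and `r3` land in the same group even though they share only a city, because `r2` carries enough information to link to each of them. That is the "some with more, some with less information" situation in miniature, and it is also where a real partitioning algorithm would need to decide whether such indirect links should count.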

SunPy [Choosing Specific Subject Identity Issues]

Tuesday, November 27th, 2012

SunPy: A Community Python Library for Solar Physics

From the homepage:

The SunPy project is an effort to create an open-source software library for solar physics using the Python programming language.

As you have seen in your own experience or read about in my other posting on astronomical data, like elsewhere, subject identity issues abound.

This is another area that may spark someone’s interest in using topic maps to mitigate specific subject identity issues.

“Specific subject identity issues” because the act of mitigation always creates more subjects which could be the sources of subject identity issues. It’s not a problem so long as you choose the issues most important to you.

If and when those other potential subject identity issues become relevant, they can be addressed later. The logic approach pretends such issues don’t exist at all. I prefer the former. It’s less fragile.

Psychological Studies of Policy Reasoning

Monday, November 19th, 2012

Psychological Studies of Policy Reasoning by Adam Wyner.

From the post:

The New York Times had an article on the difficulties that the public has to understand complex policy proposals – I’m Right (For Some Reason). The points in the article relate directly to the research I’ve been doing at Liverpool on the IMPACT Project, for we decompose a policy proposal into its constituent parts for examination and improved understanding. See our tool live: Structured Consultation Tool

Policy proposals are often presented in an encapsulated form (a sound bite). And those receiving it presume that they understand it, the illusion of explanatory depth discussed in a recent article by Frank Keil (a psychology professor at Cornell when and where I was a Linguistics PhD student). This is the illusion where people believe they understand a complex phenomena with greater precision, coherence, and depth than they actually do; they overestimate their understanding. To philosophers, this is hardly a new phenomena, but showing it experimentally is a new result.

In research about public policy, the NY Times authors, Sloman and Fernbach, describe experiments where people state a position and then had to justify it. The results showed that participants softened their views as a result, for their efforts to justify it highlighted the limits of their understanding. Rather than statements of policy proposals, they suggest:

An approach to get people to state how they would distinguish, or not, two subjects?

Would it make a difference if the questions were oral or in writing?

Since a topic map is an effort to capture a domain expert’s knowledge, tools to elicit that knowledge are important.

MDM: It’s Not about One Version of the Truth

Wednesday, October 31st, 2012

MDM: It’s Not about One Version of the Truth by Michele Goetz.

From the post:

Here is why I am not a fan of the “single source of truth” mantra. A person is not one-dimensional; they can be a parent, a friend, a colleague and each has different motivations and requirements depending on the environment. A product is as much about the physical aspect as it is the pricing, message, and sales channel it is sold through. Or, it is also faceted by the fact that it is put together from various products and parts from partners. In no way is a master entity unique or has a consistency depending on what is important about the entity in a given situation. What MDM provides are definitions and instructions on the right data to use in the right engagement. Context is a key value of MDM.

When organizations have implemented MDM to create a golden record and single source of truth, domain models are extremely rigid and defined only within a single engagement model for a process or reporting. The challenge is the master entity is global in nature when it should have been localized. This model does not allow enough points of relationship to create the dimensions needed to extend beyond the initial scope. If you want to now extend, you need to rebuild your MDM model. This is essentially starting over or you ignore and build a layer of redundancy and introduce more complexity and management.

The line:

The challenge is the master entity is global in nature when it should have been localized.

stopped me cold.

What if I said:

“The challenge is a subject proxy is global in nature when it should have been localized.”

Would your reaction be the same?

Shouldn’t subject identity always be local?

Or perhaps better, have you ever experienced a subject identification that wasn’t local?

We may talk about a universal notion of subject but even so we are using a localized definition of universal subject.

If a subject proxy is a container for local identifications, thought to be identifications of the same subject, need we be concerned if it doesn’t claim to be a universal representative for some subject? Or is it sufficient that it is a faithful representative of one or more identifications, thought by some collector to identify the same subject?

I am leaning towards the latter because it jettisons the doubtful baggage of universality.

That is, a subject may have more than one collection of local identifications (such collections being subject proxies), none of which is the universal representative for that subject.

Even if we think another collection represents the same subject, merging those collections is a question of your requirements.

You may not want to collect Twitter comments in Hindi about Glee.

Your topic map, your requirements, your call.
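To make that concrete, here is a minimal Python sketch of a subject proxy as a container of local identifications, with merging left to the caller's requirements. The class name, the merge function and every identifier string are my own inventions for illustration; nothing here comes from a topic map standard or from Michele's post.

```python
from dataclasses import dataclass, field


@dataclass
class SubjectProxy:
    """Local identifications that one collector believes identify the
    same subject. It makes no claim to be a universal representative."""
    collector: str
    identifications: set = field(default_factory=set)


def merge_if_wanted(a, b, wanted):
    """Merge proxy `b` into a copy of proxy `a`, keeping only the
    identifications from `b` that satisfy the caller's `wanted` predicate."""
    kept = {i for i in b.identifications if wanted(i)}
    return SubjectProxy(a.collector, a.identifications | kept)


# Hypothetical identifier strings, invented for this example.
mine = SubjectProxy("me", {"en:tv-series:Glee", "example:glee-123"})
theirs = SubjectProxy("them", {"example:glee-123", "hi:twitter:#Glee"})

# My requirements: no Hindi-language Twitter identifications.
merged = merge_if_wanted(
    mine, theirs, lambda i: not i.startswith("hi:twitter:")
)
print(sorted(merged.identifications))
```

Neither proxy is "the" representative of the subject, and the predicate is where your requirements live. Change the predicate and you get a different, equally legitimate, merged proxy.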

PS: You need to read Michele’s original post to discover what could entice management to fund an MDM project. Interoperability of data isn’t it.

Visual Clues: A Brain “feature,” not a “bug”

Saturday, September 29th, 2012

You will read in When Your Eyes Tell Your Hands What to Think: You’re Far Less in Control of Your Brain Than You Think that:

You’ve probably never given much thought to the fact that picking up your cup of morning coffee presents your brain with a set of complex decisions. You need to decide how to aim your hand, grasp the handle and raise the cup to your mouth, all without spilling the contents on your lap.

A new Northwestern University study shows that, not only does your brain handle such complex decisions for you, it also hides information from you about how those decisions are made.

“Our study gives a salient example,” said Yangqing ‘Lucie’ Xu, lead author of the study and a doctoral candidate in psychology at Northwestern. “When you pick up an object, your brain automatically decides how to control your muscles based on what your eyes provide about the object’s shape. When you pick up a mug by the handle with your right hand, you need to add a clockwise twist to your grip to compensate for the extra weight that you see on the left side of the mug.

“We showed that the use of this visual information is so powerful and automatic that we cannot turn it off. When people see an object weighted in one direction, they actually can’t help but ‘feel’ the weight in that direction, even when they know that we’re tricking them,” Xu said. (emphasis added)

I never quite trusted my brain and now I have proof that it is untrustworthy. Hiding stuff indeed! ;-)

But that’s the trick of subject identification/identity isn’t it?

That our brains “recognize” all manner of subjects without any effort on our part.

Another of our brains' effortless features. But it hides from us the information we need to integrate information stores, our own and those of others.

Or rather, it makes digging that information out more work than we are usually willing to do.

When called upon to be “explicit” about subject identification, or even worse, to imagine how other people identify subjects, we prefer to stay at home consuming passive entertainment.

Two quick points:

First, we need to think about how to incorporate this “feature” into delivery interfaces for users.

Second, what subjects would users pay others to mine/collate/identify for them? (Delivery being a separate issue.)

Pushing Parallel Barriers Skyward (Subject Identity at 1EB/year)

Wednesday, September 12th, 2012

Pushing Parallel Barriers Skyward by Ian Armas Foster

From the post:

As much data as there exists on the planet Earth, the stars and the planets that surround them contain astronomically more. As we discussed earlier, Peter Nugent and the Palomar Transient Factory are using a form of parallel processing to identify astronomical phenomena.

Some researchers believe that parallel processing will not be enough to meet the huge data requirements of future massive-scale astronomical surveys. Specifically, several researchers from the Korea Institute of Science and Technology Information including Jaegyoon Hahm along with Yongsei University’s Yong-Ik Byun and the University of Michigan’s Min-Su Shin wrote a paper indicating that the future of astronomical big data research is brighter with cloud computing than parallel processing.

Parallel processing is holding its own at the moment. However, when these sky-mapping and phenomena-chasing projects grow significantly more ambitious by the year 2020, parallel processing will have no hope.

How ambitious are these future projects? According to the paper, the Large Synoptic Survey Telescope (LSST) will generate 75 petabytes of raw plus catalogued data for its ten years of operation, or about 20 terabytes a night. That pales in comparison to the Square Kilometer Array, which is projected to archive in one year 250 times the amount of information that exists on the planet today.

“The total data volume after processing (the LSST) will be several hundred PB, processed using 150 TFlops of computing power. Square Kilometer Array (SKA), which will be the largest in the world radio telescope in 2020, is projected to generate 10-100PB raw data per hour and archive data up to 1EB every year.”
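A rough check of the quoted LSST figures, assuming decimal units (1 PB = 1,000 TB) and observing on every night of the year, neither of which the article states:

```python
TB_PER_PB = 1000   # assumption: decimal units
PB_PER_EB = 1000   # assumption: decimal units

lsst_total_pb = 75          # quoted: raw plus catalogued data
lsst_years = 10             # quoted: ten years of operation
nights = lsst_years * 365   # assumption: observing every night

tb_per_night = lsst_total_pb * TB_PER_PB / nights
print(round(tb_per_night, 1))

# SKA archive volume (quoted: up to 1 EB per year) against
# LSST's average yearly volume.
lsst_pb_per_year = lsst_total_pb / lsst_years
ska_vs_lsst = 1 * PB_PER_EB / lsst_pb_per_year
print(round(ska_vs_lsst))
```

Under those assumptions 75 PB over ten years works out to roughly 20.5 TB a night, in line with the "about 20 terabytes" quoted, and SKA's projected 1 EB a year is on the order of 130 times LSST's average yearly volume.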

Beyond storage/processing requirements, how do you deal with subject identity at 1EB/year?

Changing subject identity that is.

People are as inconstant with subject identity as they are with marital fidelity. If they do that well.

Now spread that over decades or centuries of research.

Does anyone see a problem here?

“how hard can this be?” (Data and Reality)

Saturday, September 8th, 2012

Books that Influenced my Thinking: Kent’s Data and Reality by Thomas Redman.

From the post:

It was the rumor that Steve Hoberman (Technics Publications) planned to reissue Data and Reality by William Kent that led me to use this space to review books that had influenced my thinking about data and data quality. My plan had been to do the review of Data and Reality as soon as it came out. I completely missed the boat – it has been out for some six months.

I first read Data and Reality as we struggled at Bell Labs to develop a definition of data that would prove useful for data quality. While I knew philosophers had debated the merits of various approaches for thousands of years, I still thought “how hard can this be?” About twenty minutes with Kent’s book convinced me. This is really tough.
….

Amazon reports Data and Reality (3rd edition) as 200 pages long.

Looking at a hard copy I see:

  • Prefaces 17-34
  • Chapter 1 Entities 35-54
  • Chapter 2 The Nature of an Information System 55-67
  • Chapter 3 Naming 69-86
  • Chapter 4 Relationships 87-98
  • Chapter 5 Attributes 99-107
  • Chapter 6 Types and Categories and Sets 109-117
  • Chapter 7 Models 119-123
  • Chapter 8 The Record Model 125-137
  • Chapter 9 Philosophy 139-150
  • Bibliography 151-159
  • Index 161-162

Way less than the 200 pages promised by Amazon.

To ask a slightly different question:

“How hard can it be” to teach building data models?

A hard problem with no fixed solution?

Suggestions?

A dynamic data structure for counting subgraphs in sparse graphs

Thursday, September 6th, 2012

A dynamic data structure for counting subgraphs in sparse graphs by Zdenek Dvorak and Vojtech Tuma.

Abstract:

We present a dynamic data structure representing a graph G, which allows addition and removal of edges from G and can determine the number of appearances of a graph of a bounded size as an induced subgraph of G. The queries are answered in constant time. When the data structure is used to represent graphs from a class with bounded expansion (which includes planar graphs and more generally all proper classes closed on topological minors, as well as many other natural classes of graphs with bounded average degree), the amortized time complexity of updates is polylogarithmic.

Work on data structures seems particularly appropriate when discussing graphs.

Subject identity, beyond string equivalence, can be seen as a graph isomorphism or subgraph problem.

Has anyone proposed “bounded” subject identity mechanisms that correspond to the bounds necessary on graphs to make them processable?

We know how to do string equivalence and the “ideal” solution would be unlimited relationships to other subjects, but that is known to be intractable. For one thing we don’t know every relationship for any subject.

I am thinking there may be boundary conditions for constructing subject identities that are more complex than string equivalence but that still result in tractable identifications.
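One way to picture such a bound, as a sketch only: compare two subjects by their names plus their relationships out to a fixed number of hops, and no further. This is my own illustrative heuristic in Python, not the data structure from the Dvorak and Tuma paper and not a true graph isomorphism test. The graph layout, the function names and the sample data are all assumptions.

```python
def signature(graph, names, node, depth):
    """Canonical description of `node` using only relationships that are
    at most `depth` hops away. Depth 0 is plain string equivalence."""
    if depth == 0:
        return (names[node],)
    neighbors = sorted(
        (label, signature(graph, names, target, depth - 1))
        for label, target in graph.get(node, [])
    )
    return (names[node], tuple(neighbors))


def same_subject(graph, names, a, b, depth=1):
    """Bounded identity test: equal signatures up to `depth` hops."""
    return signature(graph, names, a, depth) == signature(graph, names, b, depth)


# Hypothetical data: three nodes that all carry the name "Paris".
names = {"p1": "Paris", "p2": "Paris", "p3": "Paris",
         "fr1": "France", "fr2": "France", "tx": "Texas"}
graph = {
    "p1": [("capital-of", "fr1")],
    "p2": [("located-in", "tx")],
    "p3": [("capital-of", "fr2")],
}

print(same_subject(graph, names, "p1", "p2", depth=0))  # True: names match
print(same_subject(graph, names, "p1", "p2", depth=1))  # False
print(same_subject(graph, names, "p1", "p3", depth=1))  # True
```

The depth limit is the "bound": the recursion stops after a fixed number of hops even when the graph has cycles, so the cost depends on the depth and the local degree rather than on the whole graph.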

Suggestions?

Genetic algorithms: a simple R example

Saturday, August 4th, 2012

Genetic algorithms: a simple R example by Bart Smeets.

From the post:

Genetic algorithm is a search heuristic. GAs can generate a vast number of possible model solutions and use these to evolve towards an approximation of the best solution of the model. Hereby it mimics evolution in nature.

GA generates a population, the individuals in this population (often called chromosomes) have a given state. Once the population is generated, the state of these individuals is evaluated and graded on their value. The best individuals are then taken and crossed-over – in order to hopefully generate ‘better’ offspring – to form the new population. In some cases the best individuals in the population are preserved in order to guarantee ‘good individuals’ in the new generation (this is called elitism).

The GA site by Marek Obitko has a great tutorial for people with no previous knowledge on the subject.

As the size of data stores increases, so will the cost of personal judgement on each subject identity test. Genetic algorithms may be one way of creating subject identity tests in such situations.

In any event, it won’t harm anyone to be aware of the basic contours of the technique.
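For the curious, a toy sketch of the idea in Python (the original post uses R). It evolves a bit mask saying which fields of two records must match for them to count as the same subject, scored against a handful of labelled pairs. The records, field names and parameters are all invented for illustration, and this is not Bart Smeets' example.

```python
import random

FIELDS = ["name", "city", "phone", "category"]

# Hypothetical labelled pairs: (record_a, record_b, same subject?)
PAIRS = [
    ({"name": "Joe's Cafe", "city": "Austin", "phone": "555-0100", "category": "cafe"},
     {"name": "Joes Cafe", "city": "Austin", "phone": "555-0100", "category": "coffee"},
     True),
    ({"name": "Joe's Cafe", "city": "Austin", "phone": "555-0100", "category": "cafe"},
     {"name": "Joe's Cafe", "city": "Dallas", "phone": "555-0199", "category": "cafe"},
     False),
    ({"name": "Book Nook", "city": "Austin", "phone": "555-0111", "category": "books"},
     {"name": "Book Nook", "city": "Austin", "phone": "555-0111", "category": "books"},
     True),
    ({"name": "Book Nook", "city": "Austin", "phone": "555-0111", "category": "books"},
     {"name": "Page Turner", "city": "Austin", "phone": "555-0122", "category": "books"},
     False),
]


def predict(mask, a, b):
    """Identity test encoded by a chromosome: every selected field must match."""
    chosen = [f for f, bit in zip(FIELDS, mask) if bit]
    return bool(chosen) and all(a[f] == b[f] for f in chosen)


def fitness(mask):
    """Number of labelled pairs the test classifies correctly."""
    return sum(predict(mask, a, b) == same for a, b, same in PAIRS)


def evolve(generations=20, size=8, mutation=0.1, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in FIELDS] for _ in range(size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        elite = population[:2]  # elitism: the best two survive unchanged
        children = []
        while len(children) < size - len(elite):
            mom, dad = rng.sample(population[: size // 2], 2)
            cut = rng.randrange(1, len(FIELDS))           # one-point crossover
            child = mom[:cut] + dad[cut:]
            child = [bit ^ (rng.random() < mutation) for bit in child]
            children.append(child)
        population = elite + children
    population.sort(key=fitness, reverse=True)
    history.append(fitness(population[0]))
    return population[0], history


best, history = evolve()
print(dict(zip(FIELDS, best)), fitness(best), "of", len(PAIRS))
```

With only four fields you could simply try all sixteen masks. The point is the shape of the technique: a population of candidate identity tests, a fitness score from labelled examples, crossover, mutation and elitism.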

I first saw this at R-Bloggers.

Optimal simultaneous superpositioning of multiple structures with missing data

Friday, July 20th, 2012

Optimal simultaneous superpositioning of multiple structures with missing data (Douglas L. Theobald and Phillip A. Steindel Optimal simultaneous superpositioning of multiple structures with missing data Bioinformatics 2012 28: 1972-1979. )

Abstract:

Motivation: Superpositioning is an essential technique in structural biology that facilitates the comparison and analysis of conformational differences among topologically similar structures. Performing a superposition requires a one-to-one correspondence, or alignment, of the point sets in the different structures. However, in practice, some points are usually ‘missing’ from several structures, for example, when the alignment contains gaps. Current superposition methods deal with missing data simply by superpositioning a subset of points that are shared among all the structures. This practice is inefficient, as it ignores important data, and it fails to satisfy the common least-squares criterion. In the extreme, disregarding missing positions prohibits the calculation of a superposition altogether.

Results: Here, we present a general solution for determining an optimal superposition when some of the data are missing. We use the expectation–maximization algorithm, a classic statistical technique for dealing with incomplete data, to find both maximum-likelihood solutions and the optimal least-squares solution as a special case.

Availability and implementation: The methods presented here are implemented in THESEUS 2.0, a program for superpositioning macromolecular structures. ANSI C source code and selected compiled binaries for various computing platforms are freely available under the GNU open source license from http://www.theseus3d.org.

Contact: dtheobald@brandeis.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

From the introduction:

How should we properly compare and contrast the 3D conformations of similar structures? This fundamental problem in structural biology is commonly addressed by performing a superposition, which removes arbitrary differences in translation and rotation so that a set of structures is oriented in a common reference frame (Flower, 1999). For instance, the conventional solution to the superpositioning problem uses the least-squares optimality criterion, which orients the structures in space so as to minimize the sum of the squared distances between all corresponding points in the different structures. Superpositioning problems, also known as Procrustes problems, arise frequently in many scientific fields, including anthropology, archaeology, astronomy, computer vision, economics, evolutionary biology, geology, image analysis, medicine, morphometrics, paleontology, psychology and molecular biology (Dryden and Mardia, 1998; Gower and Dijksterhuis, 2004; Lele and Richtsmeier, 2001). A particular case we consider here is the superpositioning of multiple 3D macromolecular coordinate sets, where the points to be superpositioned correspond to atoms. Although our analysis specifically concerns the conformations of macromolecules, the methods developed herein are generally applicable to any entity that can be represented as a set of Cartesian points in a multidimensional space, whether the particular structures under study are proteins, skulls, MRI scans or geological strata.

We draw an important distinction here between a structural ‘alignment’ and a ‘superposition.’ An alignment is a discrete mapping between the residues of two or more structures. One of the most common ways to represent an alignment is using the familiar row and column matrix format of sequence alignments using the single letter abbreviations for residues (Fig. 1). An alignment may be based on sequence information or on structural information (or on both). A superposition, on the other hand, is a particular orientation of structures in 3D space. [emphasis added]

I have deep reservations about the representations of semantics using Cartesian metrics but in fact that happens quite frequently. And allegedly, usefully.

Leaving my doubts to one side, this superpositioning technique could prove to be a useful exploration technique.

If you experiment with this technique, a report of your experiences would be appreciated.
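If you want a feel for the conventional least-squares criterion before trying THESEUS, here is a minimal Python sketch for the simplest case I could think of: two sets of paired 2D points with nothing missing. It is not the expectation-maximization method of the paper and it does not handle 3D or gaps. The function names are my own.

```python
import math


def centroid(points):
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def superpose(mobile, reference):
    """Rotate and translate `mobile` onto `reference`, minimizing the sum
    of squared distances between corresponding points (2D, paired points)."""
    cmx, cmy = centroid(mobile)
    crx, cry = centroid(reference)
    # Remove the arbitrary translation by centering both sets.
    m = [(x - cmx, y - cmy) for x, y in mobile]
    r = [(x - crx, y - cry) for x, y in reference]
    # In 2D the optimal rotation angle has a closed form.
    num = sum(mx * ry - my * rx for (mx, my), (rx, ry) in zip(m, r))
    den = sum(mx * rx + my * ry for (mx, my), (rx, ry) in zip(m, r))
    theta = math.atan2(num, den)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + crx, s * x + c * y + cry) for x, y in m]


reference = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0)]
# The same triangle, rotated by 90 degrees and shifted by (5, 5).
mobile = [(5.0, 5.0), (5.0, 7.0), (4.0, 5.0)]

fitted = superpose(mobile, reference)
print([(round(x, 6), round(y, 6)) for x, y in fitted])
```

Note that the pairing of points (the "alignment" in the authors' terms) is given as input here. The superposition only chooses the orientation.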

Elements of Software Construction [MIT 6.005]

Saturday, June 23rd, 2012

Elements of Software Construction

Description:

This course introduces fundamental principles and techniques of software development. Students learn how to write software that is safe from bugs, easy to understand, and ready for change.

Topics include specifications and invariants; testing, test-case generation, and coverage; state machines; abstract data types and representation independence; design patterns for object-oriented programming; concurrent programming, including message passing and shared concurrency, and defending against races and deadlock; and functional programming with immutable data and higher-order functions.

From the MIT OpenCourseware site.

Of interest to anyone writing topic map software.

It should also be of interest to anyone evaluating how software shapes what subjects we can talk about and how we can talk about them. Data structures have the same implications.

Not necessary to undertake such investigations in all cases. There are many routine uses for common topic map software.

Being able to see when the edges of a domain don’t quite fit, or when there may be gaps in coverage for an information system, is a necessary skill for non-routine cases.

Why Your Brain Isn’t A Computer

Sunday, May 6th, 2012

Why Your Brain Isn’t A Computer by Alex Knapp.

Alex writes:

“If the human brain were so simple that we could understand it, we would be so simple that we couldn’t.”
- Emerson M. Pugh

Earlier this week, i09 featured a primer, of sorts, by George Dvorsky regarding how an artificial human brain could be built. It’s worth reading, because it provides a nice overview of the philosophy that underlies some artificial intelligence research, while simultaneously – albeit unwittingly – demonstrating the some of the fundamental flaws underlying artificial intelligence research based on the computational theory of mind.

The computational theory of mind, in essence, says that your brain works like a computer. That is, it takes input from the outside world, then performs algorithms to produce output in the form of mental state or action. In other words, it claims that the brain is an information processor where your mind is “software” that runs on the “hardware” of the brain.

Dvorsky explicitly invokes the computational theory of mind by stating “if brain activity is regarded as a function that is physically computed by brains, then it should be possible to compute it on a Turing machine, namely a computer.” He then sets up a false dichotomy by stating that “if you believe that there’s something mystical or vital about human cognition you’re probably not going to put too much credence” into the methods of developing artificial brains that he describes.

I don’t normally read Forbes but I made an exception in this case and am glad I did.

Not that I particularly care about which side of the AI debate you come out on.

I do think that the notion of “emergent” properties is an important one for judging subject identities. Whether those subjects occur in text messages, intercepted phone calls, signal “intell” of any sort.

Properties that identify subjects “emerge” from a person who speaks the language in question, who has social/intellectual/cultural experiences that give them a grasp of the matters under discussion and perhaps the underlying intent of the parties to the conversation.

A computer program can be trained to mindlessly sort through large amounts of data. It can even be trained to acceptable levels of mis-reading, mis-interpretation.

What will our evaluation be when it misses the one conversation prior to another 9/11? Because the context or language was not anticipated? Because the connection would only emerge out of a living understanding of cultural context?

Computers are deeply useful, but not when emergent properties, the sort that identify subjects, targets and the like, are at issue.

A new framework for innovation in journalism: How a computer scientist would do it

Tuesday, April 10th, 2012

A new framework for innovation in journalism: How a computer scientist would do it

Andrew Phelps writes:

What if journalism were invented today? How would a computer scientist go about building it, improving it, iterating it?

He might start by mapping out some fundamental questions: What are the project’s values and goals? What consumer needs would it satisfy? How much should be automated, how much human-powered? How could it be designed to be as efficient as possible?

Computer science Ph.D. Nick Diakopoulos has attempted to create a new framework for innovation in journalism. His new white paper, commissioned by CUNY’s Tow-Knight Center for Entrepreneurial Journalism, does not provide answers so much as a different way to come up with questions.

Diakopolous identified 27 computing concepts that could apply to journalism — think natural language processing, machine learning, game engines, virtual reality, information visualization — and pored over thousands of research papers to determine which topics get the most (and least) attention. (There are untapped opportunities in robotics, augmented reality, and motion capture, it turns out.)

He thinks computer science and journalism have a lot in common, actually. They are both fundamentally concerned with information. Acquiring it, storing it, modifying it, presenting it.

Suggest you read his paper in full: Cultivating the Landscape of Innovation in Computational Journalism.

Intrigued by the idea of gauging the opportunities along a continuum of activities. Could be a stunning visual of how subject identity is handled across activities and/or technologies.

Interested?

Once Upon A Subject Clearly…

Wednesday, March 28th, 2012

As I was writing up the GWAS Central post, the question occurred to me: does their mapping of identifiers take something away from topic maps?

My answer is no and I would like to say why if you have a couple of minutes. ;-) Seriously! It isn’t going to take that long. However long it has taken me to reach this point.

Every time we talk, write or otherwise communicate about a subject, we at the same time have identified that subject. Makes sense. We want whoever we are talking, writing to or communicating with, to understand what we are talking about. Hard to do if we don’t identify what subject(s) we are talking about.

We do it all day, every day. In public, in private, in semi-public places. ;-) And we use words to do it. To identify the subjects we are talking about.

For the most part, or at least fairly often, we are understood by other people. Not always, but most of the time.

The problem comes in when we start to gather up information from different people who may (or may not) use words differently than we do. So there is a much larger chance that we don’t mean the same thing by the same words. Or we may use different words to mean the same thing.

Words, which were our reliable servants for the most part, become far less reliable.

To counter that unreliability, we can create groups of words, mappings if you like, to keep track of what words go where. But, to do that, we have to use words, again.

Start to see the problem? We always use words, to clear up our difficulties with words. And there isn’t any universal stopping place. The Cyc advocates would have us stop there and the SUMO crowd would have us stop over there and the Semantic Web folks yet somewhere else and of course the topic map mavens, yet one or more places.

For some purposes, any one or more of those mappings may be adequate. A mapping is only as good and for as long as it is useful.

History tells us that every mapping will be replaced with other mappings. We would do well to understand/document the words we are using as part of our mappings, as well as we are able.

But if words are used to map words, where do we stop? My suggestion would be to stop as we always have, wherever looks convenient. So long as the mapping suits your present purposes, what more would you ask of it?

I am quite content to have such stopping places because it means we will always have more starting places for the next round of mapping!

Ironic isn’t it? We create mappings to make sense out of words and our words lay the foundation for others to do the same.

Data and Reality

Thursday, March 15th, 2012

Data and Reality: A Timeless Perspective on Data Management by Steve Hoberman.

I remember William Kent, the original author of “Data and Reality” from a presentation he made in 2003, entitled: “The unsolvable identity problem.”

His abstract there read:

The identity problem is intractable. To shed light on the problem, which currently is a swirl of interlocking problems that tend to get tumbled together in any discussion, we separate out the various issues so they can be rationally addressed one at a time as much as possible. We explore various aspects of the problem, pick one aspect to focus on, pose an idealized theoretical solution, and then explore the factors rendering this solution impractical. The success of this endeavor depends on our agreement that the selected aspect is a good one to focus on, and that the idealized solution represents a desirable target to try to approximate as well as we can. If we achieve consensus here, then we at least have a unifying framework for coordinating the various partial solutions to fragments of the problem.

I haven’t read the “new” version of “Data and Reality” (just ordered a copy) but I don’t recall the original needing much in the way of changes.

The original carried much the same message, that all of our solutions are partial even within a domain, temporary, chronologically speaking, and at best “useful” for some particular purpose. I rather doubt you will find that degree of uncertainty being confessed by the purveyors of any current semantic solution.

I did pull my second edition off the shelf and with free shipping (5-8 days), I should have time to go over my notes and highlights before the “new” version appears.

More to follow.

Then BI and Data Science Thinking Are Flawed, Too

Tuesday, March 13th, 2012

Then BI and Data Science Thinking Are Flawed, Too

Steve Miller writes:

I just finished an informative read entitled “Everything is Obvious: *Once You Know the Answer – How Common Sense Fails Us,” by social scientist Duncan Watts.

Regular readers of Open Thoughts on Analytics won’t be surprised I found a book with a title like this noteworthy. I’ve written quite a bit over the years on challenges we face trying to be the rational, objective, non-biased actors and decision-makers we think we are.

So why is a book outlining the weaknesses of day-to-day, common sense thinking important for business intelligence and data science? Because both BI and DS are driven from a science of business framework that formulates and tests hypotheses on the causes and effects of business operations. If the thinking that produces that testable understanding is flawed, then so will be the resulting BI and DS.

According to Watts, common sense is “exquisitely adapted to handling the kind of complexity that arises in everyday situations … But ‘situations’ involving corporations, cultures, markets, nation-states, and global institutions exhibit a very different kind of complexity from everyday situations. And under these circumstances, common sense turns out to suffer from a number of errors that systematically mislead us. Yet because of the way we learn from experience … the failings of commonsense reasoning are rarely apparent to us … The paradox of common sense, therefore, is that even as it helps us make sense of the world, it can actively undermine our ability to understand it.”

The author argues that common sense explanations to complex behavior fail in three ways. The first error is that the mental model of individual behavior is systematically flawed. The second centers on explanations for collective behavior that are even worse, often missing the “emergence” – one plus one equals three – of social behavior. And finally, “we learn less from history than we think we do, and that misperception skews our perception of the future.”

Reminds me of Thinking, Fast and Slow by Daniel Kahneman.

Not that two books with a similar “take” prove anything but you should put them on your reading list.

I wonder when/where our perceptions of CS practices have been skewed?

Or where that has played a role in our decision making about information systems?

Identity – The Philosophical Challenge For the Web

Sunday, February 19th, 2012

Identity – The Philosophical Challenge For the Web by Matthew Hurst.

From the post:

I work in local search at Microsoft which means, like all those working in this space, I have to deal with an identity crisis on a daily basis. Currently, most local search products – like Bing’s and Google’s – leverage multiple data sets to derive a digital model of the world that users can then interact with. In creating this digital model, multiple statements have to be conflated to form a unified representation. This can be extremely challenging for two reasons. Firstly, the system has to decided when two records are intended to denote the same real world entity. Secondly, the designers of the system have to determine what real world entities are and how to describe them.

For example, if a business moves is that the same business or the closure of one and the opening of another? What does it mean to categorize a business? The cafe in Barnes and Noble is branded Starbucks but isn’t actually part of the Starbucks chain – should is surface as a separate entity or is it ‘hidden’ within the bookshop as an attribute (‘has cafe’)?

Thinking through these hard representational problems is as much part of the transformative trends going on in the tech industry as are those characterized by terms like ‘big data’ and ‘data scientist’.

Questions of identity and how to resolve different multiple references to the same entity have been debated at least since the time of Greek philosophers. Identity (Wikipedia page, see references on the various pages.)

This “philosophical challenge” has been going on for a very long time and so far I haven’t seen any demonstrations that the Web raises new questions.

You need to read Matthew’s identity example in his post.

The songs in question could be said to be instances of the same subject and a reference to that subject would be satisfied with any of those instances. From another point of view, the origin of the instances could be said to distinguish them into different subjects, say for proof of licensing purposes. Other viewpoints are possible. It depends upon the purpose of your criteria of identification.
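A small Python sketch of the point, using made-up records rather than the songs from Matthew's post: the same records fall into different groups of "same subject" depending on which criterion of identification you apply. The field names and values are hypothetical.

```python
from collections import defaultdict

# Hypothetical records, invented for illustration.
recordings = [
    {"title": "Example Song", "artist": "Example Band", "source": "store-a"},
    {"title": "Example Song", "artist": "Example Band", "source": "store-b"},
    {"title": "Other Song", "artist": "Example Band", "source": "store-a"},
]


def group_by(records, key):
    """Group records that a given identity criterion treats as one subject."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    return dict(groups)


# Purpose 1: any instance of the song satisfies the reference.
by_work = group_by(recordings, lambda r: (r["title"], r["artist"]))

# Purpose 2: proof of licensing, where origin distinguishes instances.
by_origin = group_by(
    recordings, lambda r: (r["title"], r["artist"], r["source"])
)

print(len(by_work), len(by_origin))  # 2 3
```

Neither grouping is wrong. Each answers to a different purpose.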

Fractals in Science, Engineering and Finance (Roughness and Beauty)

Saturday, January 7th, 2012

Fractals in Science, Engineering and Finance (Roughness and Beauty) by Benoit B. Mandelbrot.

About the lecture:

Roughness is ubiquitous and a major sensory input of Man. The first step to measure and simulate it was provided by fractal geometry. Illustrative examples will be drawn from the sciences, engineering (the internet) and (more extensively) the variation of financial prices. The beauty of fractals, an unanticipated “premium,” helps in teaching and bridges some chasms between different aspects of knowing and feeling.

Mandelbrot summarizes his career as the pursuit of a theory of roughness.

Discusses the use of the eye as well as the ear in discovery (which I would call identification) of phenomena.

Have you listened to one of your subject identifications lately?

Are subject identifications rough? Or are they the smoothing of roughness?

Do your subjects have self-similarity?

Definitely worth your time.

First seen at: Benoît B. Mandelbrot: Fractals in Science, Engineering and Finance (Roughness and Beauty) over at Computational Legal Studies.

Thinking, Fast and Slow

Tuesday, December 27th, 2011

Thinking, Fast and Slow by Daniel Kahneman, Farrar, Straus and Giroux, New York, 2011.

I got a copy of “Thinking, Fast and Slow” for Christmas and it has already proven to be an enjoyable read.

Kahneman says early on (page 28):

The premise of this book is that it is easier to recognize other people’s mistakes than our own.

I thought about that line when I read a note from a friend that topic maps needed more than my:

tagging everything with “Topic Maps….”

Which means I haven’t been clear about the reasons for the breadth of materials I have covered and will be covering in this blog.

One premise of this blog is that the use and recognition of identifiers is essential for communication.

Another premise of this blog is that it is easier for us to study the use and recognition of identifiers by others, for much the same reasons we can recognize the mistakes of others more easily.

The use and recognition of identifiers by others aren’t mistakes but they may be different from those we would make. In cases where they differ from ours, we have a unique opportunity to study the choices made and the impacts of those choices. And we may learn patterns in those choices that we can eventually see in our own choices.

Understanding the use and recognition of identifiers in a particular circumstance, and the requirements for their use and recognition, is the first step towards deciding whether, and in what way, topic maps would be useful in that circumstance.

For example, when processing social security records in the United States, anything other than “bare” identifiers like a social security number may be unnecessary and add load with no corresponding benefit. When aligning social security records with bank records, you might need to reconsider the judgement to use only social security numbers. (Some information sharing is “against the law.” But as the Sheriff in “O Brother, Where Art Thou?” says: “The law is a man made thing.” Laws change, or you can commission absurdist interpretations of them.)

Topic maps aren’t everywhere but identifiers and recognition of identifiers are.

Understanding identifiers and their recognition will help you choose the most appropriate solution to a problem.

Factual Resolve

Friday, October 28th, 2011

Factual Resolve

Factual has a new API – Resolve:

From the post:

The Internet is awash with data. Where ten years ago developers had difficulty finding data to power applications, today’s difficulty lies in making sense of its abundance, identifying signal amidst the noise, and understanding its contextual relevance. To address these problems Factual is today launching Resolve — an entity resolution API that makes partial records complete, matches one entity against another, and assists in de-duping and normalizing datasets.

The idea behind Resolve is very straightforward: you tell us what you know about an entity, and we, in turn, tell you everything we know about it. Because data is so commonly fractured and heterogeneous, we accept fragments of an entity and return the matching entity in its entirety. Resolve allows you to do a number of things that will make your data engineering tasks easier:

  • enrich records by populating missing attributes, including category, lat/long, and address
  • de-dupe your own place database
  • convert multiple daily deal and coupon feeds into a single normalized, georeferenced feed
  • identify entities unequivocally by their attributes

For example: you may be integrating data from an app that provides only the name of a place and an imprecise location. Pass what you know to Factual Resolve via a GET request, with the attributes included as JSON-encoded key/value pairs:

I particularly like the line:

identify entities unequivocally by their attributes

I don’t know about the “unequivocally” part but the rest of it rings true. At least in my experience.
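To give a feel for "attributes included as JSON-encoded key/value pairs" in a GET request, here is a Python sketch that only builds such a URL. The base URL and the `values` parameter name are placeholders of my own, not Factual's documented endpoint or parameters, and no request is actually sent. Check Factual's documentation for the real thing.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder only: NOT Factual's real endpoint.
BASE_URL = "https://api.example.com/resolve"


def build_resolve_url(known):
    """Encode what we know about an entity as one JSON-valued query
    parameter. The parameter name `values` is an assumption."""
    query = urlencode({"values": json.dumps(known, sort_keys=True)})
    return f"{BASE_URL}?{query}"


# Hypothetical fragment of a place record: a name and an imprecise location.
known = {"name": "Example Cafe", "latitude": 34.06, "longitude": -118.40}
url = build_resolve_url(known)
print(url)

# Round trip: decoding the query string recovers the original attributes.
decoded = json.loads(parse_qs(urlsplit(url).query)["values"][0])
print(decoded == known)  # True
```

Whatever comes back, it is a match on the attributes you supplied, which is where the “unequivocally” starts to look optimistic.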