Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

July 25, 2010

Lost In Translation – Article

Filed under: Semantic Diversity,Semantics,Subject Identifiers,Topic Maps — Patrick Durusau @ 3:23 pm

Lost In Translation is a summary of recent research on language and its impact on our thinking by Lera Boroditsky (Professor of psychology at Stanford University and editor in chief of Frontiers in Cultural Psychology).

Read the article for the details, but concepts such as causality and space aren’t as fixed as you may have thought.

Another teaser:

It turns out that if you change how people talk, that changes how they think. If people learn another language, they inadvertently also learn a new way of looking at the world. When bilingual people switch from one language to another, they start thinking differently, too.

Topic maps show different ways to identify the same subject. Put enough alternative identifications together and you will learn to think in another language.

Question: Should topic maps come with the following warning?

Caution: Topic Map – You May Start Thinking Differently

July 23, 2010

Topic Maps, Health Care and Interoperability

Filed under: Marketing,Semantic Diversity — Patrick Durusau @ 6:23 am

making the ehealth connection* by W. Ed Hammond, Ph.D., is a good summary of interoperability issues that health care IT solutions must address.

Interoperability issues in health care:

  • Semantic
  • Technical
  • Human/Computer
  • Communications
  • Functional
  • Data Transport
  • Decision Support Standards
  • EHR Functional Standards
  • Business
  • Security and Privacy
  • Legal, Ethical and Societal
  • Stakeholder
  • Environmental

Topic maps can address semantic interoperability, but how does your application handle the other twelve (12) types of interoperability?

******

* I disagree with some of his comments on mapping solutions but I will save those for another post.

July 14, 2010

Are simplified hadoop interfaces the next web cash cow? – Post

Filed under: Hadoop,Legends,MapReduce,Semantic Diversity,Subject Identity — Patrick Durusau @ 12:06 pm

Are simplified hadoop interfaces the next web cash cow? is a question that Brian Breslin is asking these days.

It isn’t hard to imagine not only Hadoop interfaces becoming cash cows, but also canned analyses of public data sets that can be incorporated into those interfaces.

But then the semantics question comes back up when you want to join that canned analysis to your own. What did they mean by X? Or Y? Or for that matter, what are the semantics of the data set?

But we can solve that issue by explicit subject identification! Did I hear someone say topic maps? 😉 So our identifications of subjects in public data sets will themselves become a commodity. There could be competing set-similarity analyses of public data sets.
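A minimal sketch of that joining step (Python; the data, field names, and identifier URLs are all invented for illustration): join a vendor’s canned analysis to local records on explicit subject identifiers, not on whatever column labels each side happens to use.

# Sketch: join two data sets on explicit subject identifiers instead of
# column names. All data, field names, and URLs here are invented.

canned = [  # hypothetical vendor analysis, keyed by subject identifier
    {"sid": "http://example.org/subject/acme-corp", "risk_score": 0.87},
]
local = [   # hypothetical local records, in their own vocabulary
    {"sid": "http://example.org/subject/acme-corp", "ticker": "ACME"},
]

by_sid = {row["sid"]: row for row in canned}
joined = [{**row, **by_sid[row["sid"]]}  # merge records about the same subject
          for row in local if row["sid"] in by_sid]
print(joined)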

If a simplified Hadoop interface is the next cash cow, we need to be ready to stuff it with data mapped to subject identifications to make it grow even larger. A large cash cow is a good thing, a larger cash cow is better and a BP-sized cash cow is just about right.

July 13, 2010

The FLAMINGO Project on Data Cleaning – Site

The FLAMINGO Project on Data Cleaning is the other project that has influenced the set-similarity work with MapReduce.

From the project description:

Supporting fuzzy queries is becoming increasingly more important in applications that need to deal with a variety of data inconsistencies in structures, representations, or semantics. Many existing algorithms require an offline analysis of data sets to construct an efficient index structure to support online query processing. Fuzzy join queries of data sets are more time consuming due to the computational complexity. The PI is studying three research problems: (1) constructing high-quality inverted lists for fuzzy search queries using Hadoop; (2) supporting fuzzy joins of large data sets using Hadoop; and (3) using the developed techniques to improve data quality of large collections of documents.

See the project webpage to learn more about their work on “us[ing] limited programming primitives in the cloud to implement index structures and search algorithms.”
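As a toy illustration of the inverted-list idea (my own sketch in Python, not the project’s algorithms), a 3-gram index that finds candidate fuzzy matches even for a misspelled query:

# Toy 3-gram inverted index for fuzzy lookup; invented data, not FLAMINGO code.
from collections import defaultdict

def grams(s, n=3):
    s = f"##{s.lower()}##"                 # pad so word edges contribute grams
    return {s[i:i + n] for i in range(len(s) - n + 1)}

index = defaultdict(set)
for name in ["Schwarzenegger", "Schwartz", "Swartz"]:
    for g in grams(name):
        index[g].add(name)                 # inverted list: gram -> names

def candidates(query, min_shared=3):
    counts = defaultdict(int)
    for g in grams(query):
        for name in index[g]:
            counts[name] += 1              # count grams shared with each name
    return sorted(n for n, c in counts.items() if c >= min_shared)

print(candidates("Schwarzenger"))          # ['Schwartz', 'Schwarzenegger']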

The relationship between “dirty” data and the increase in data overall is at least linear, but probably worse. Far worse. Whether data is “dirty” depends on your perspective. The more data that appears in “***” format (fill in the one you like the least), the dirtier the universe of data has become. “Dirty” data will be with you always.

July 11, 2010

Efficient Parallel Set-Similarity Joins Using MapReduce

Efficient Parallel Set-Similarity Joins Using MapReduce by Rares Vernica, Michael J. Carey, and Chen Li, Department of Computer Science, University of California, Irvine, used Citeseer (1.3M publications) and DBLP (1.2M publications) and “…increased their sizes as needed.”

The contributions of this paper are:

  • “We describe efficient ways to partition a large dataset across nodes in order to balance the workload and minimize the need for replication. Compared to the equi-join case, the set-similarity joins case requires “partitioning” the data based on set contents.
  • We describe efficient solutions that exploit the MapReduce framework. We show how to efficiently deal with problems such as partitioning, replication, and multiple inputs by manipulating the keys used to route the data in the framework.
  • We present methods for controlling the amount of data kept in memory during a join by exploiting the properties of the data that needs to be joined.
  • We provide algorithms for answering set-similarity self-join queries end-to-end, where we start from records containing more than just the join attribute and end with actual pairs of joined records.
  • We show how our set-similarity self-join algorithms can be extended to answer set-similarity R-S join queries.
  • We present strategies for exceptional situations where, even if we use the finest-granularity partitioning method, the data that needs to be held in the main memory of one node is too large to fit.”
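For readers who have not seen one, here is a single-machine sketch (invented records) of what a set-similarity self-join computes; the paper’s contribution is doing this efficiently in MapReduce rather than by brute force:

# Brute-force set-similarity self-join on Jaccard similarity; invented data.
from itertools import combinations

def jaccard(a, b):
    return len(a & b) / len(a | b)

records = {
    1: {"efficient", "parallel", "joins"},
    2: {"efficient", "parallel", "set", "joins"},
    3: {"topic", "maps"},
}

threshold = 0.6
pairs = [(i, j) for i, j in combinations(records, 2)
         if jaccard(records[i], records[j]) >= threshold]
print(pairs)   # [(1, 2)]: the only pair similar enough to join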

There are a number of lessons and insights relevant to topic maps in this paper.

It makes me think of domain-specific (as well as possibly one or more “general”) set-similarity join interchange languages! What are you thinking of?

July 7, 2010

Second Verse, Same As The First

Filed under: Marketing,RDF,Semantic Diversity,Semantic Web,Semantics — Patrick Durusau @ 2:44 pm

Unraveling Algol: US, Europe, and the Creation of a Programming Language by David Nofre, University of Amsterdam, is an interesting account of the early history of Algol.

The conventional wisdom that what evolved was Algol vs. Fortran is deeply questionable.

The underlying difficulty, a familiar one in semantic integration circles, was a universal programming language versus a diversity of programming languages.

Can you guess who won?

Can you guess where I would put my money in a repeat of a universal solution vs. diverse solutions?

Where is your money riding?

July 2, 2010

Rough Fuzzies, and Beyond?

Filed under: Fuzzy Sets,Rough Sets,Semantic Diversity,Subject Identity — Patrick Durusau @ 8:18 pm

I was reading Rough Sets: Theoretical Aspects of Reasoning about Data by Zdzislaw Pawlak when I ran across this comparison of rough versus fuzzy sets:

Rough sets has often been compared to fuzzy sets, sometimes with a view to introduce them as competing models of imperfect knowledge. Such a comparison is unfounded. Indiscernibility and vagueness are distinct facets of imperfect knowledge. Indiscernibility refers to the granularity of knowledge, that affects the definition of universes of discourse. Vagueness is due to the fact that categories of natural language are often gradual notions, and refer to sets with smooth boundaries. Borrowing an example from image processing, rough set theory is about the size of pixels, fuzzy set theory is about the existence of more than two levels of grey. (pp. ix-x)

It occurred to me that the precision of our identifications, or perhaps better, the fixed precision of our identifications, is a real barrier to semantic integration. The precision I need for semantic integration is going to vary from subject to subject, depending upon what I already know, what I need to know, and for what purpose. Very coarse identification may be acceptable for some purposes but not others.

I don’t know what it would look like to have varying degrees of precision in subject identification or even how that would be represented. But I suspect solving those problems will be involved in any successful approach to semantic integration.
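To make the “granularity of knowledge” point concrete, a small sketch (invented data): objects indiscernible under the available attributes fall into the same block, and a target set can only be approximated by whole blocks, so coarser attributes yield coarser identifications.

# Rough-set lower/upper approximations over invented data.
from collections import defaultdict

objects = {"a": ("red",), "b": ("red",), "c": ("blue",), "d": ("blue",)}
target = {"a", "c", "d"}                  # the set we want to identify

blocks = defaultdict(set)
for obj, attrs in objects.items():
    blocks[attrs].add(obj)                # partition by indiscernibility

lower = {o for b in blocks.values() if b <= target for o in b}
upper = {o for b in blocks.values() if b & target for o in b}

print(lower)   # {'c', 'd'}: certainly in the target, given what we can discern
print(upper)   # {'a', 'b', 'c', 'd'}: possibly in the target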

June 26, 2010

Semantic Compression

Filed under: Cataloging,Indexing,Semantic Diversity — Patrick Durusau @ 12:55 pm

It isn’t difficult to find indexing terms to represent documents.

But, whatever indexing terms are used, a large portion of relevant documents will go unfound. As much as 80% of the relevant documents. See Size Really Does Matter… (A study of full text searching but the underlying problem is the same: “What term was used?”)

You read a document, are familiar with its author, concepts, literature it cites, the relationships of that literature to the document and the relationships between the ideas in the document. Now you have to choose one or more terms to represent all the semantics and semantic relationships in the document. The exercise you are engaged in is compressing the semantics in a document into one or more terms.

Unlike data compression, a la Shannon, the semantic compression algorithm used by any user is unknown. We know it isn’t possible to decompress an indexing term to recover all the semantics of a document it purports to represent. Since a term is used to represent several documents, the problem is even worse. We would have to decompress the term to recover the semantics of all the documents it represents.

Even without the algorithm used to assign indexing (or tagging) terms, investigation of semantic compression could be useful. For example, one could encode the semantics of a set of documents (to a set depth) and then ask groups of users to assign those documents indexing or tagging terms. By varying the semantics in the documents, it may, emphasis on may, be possible to experimentally derive partial semantic decompression for some terms and classes of users.

June 21, 2010

Looking for the stranger next door – Report

Filed under: Semantic Diversity,Usability — Patrick Durusau @ 6:02 pm

In Looking for the stranger next door Bernard Vatant states what is probably a universal user requirement: Show me what I don’t know about subject X.

Bernard has some interesting ideas on how a system might try to meet that challenge. But for the details, see his post.

June 20, 2010

“What Is I.B.M’s Watson?” – Review

Filed under: Data Mining,Semantic Diversity,Subject Identity — Patrick Durusau @ 7:34 pm

What Is I.B.M.’s Watson? appeared in the New York Times Magazine on 20 June 2010. IBM, or more precisely David Ferrucci and his team at IBM, has made serious progress towards a useful question-answering machine. (On Ferrucci, see Ferrucci – DBLP, Ferrucci – Scientific Commons)

It won’t spoil the article to say that raw computing horsepower (BlueGene servers) plays a role in the success of the Watson project. But, there is another aspect of the project that makes it relevant to topic maps.

Rather than relying on a few algorithms to analyze questions, Watson uses more than a hundred. As summarized by the article:

Another set of algorithms ranks these answers according to plausibility; for example, if dozens of algorithms working in different directions all arrive at the same answer, it’s more likely to be the right one. In essence, Watson thinks in probabilities. It produces not one single “right” answer, but an enormous number of possibilities, then ranks them by assessing how likely each one is to answer the question.

Transpose that into a topic maps setting and imagine that you are using probabilistic merging algorithms that are applied interactively by a user in real time.

Suddenly we are not talking about a technology for hand-curated information resources but an assistive technology that would enable human users to go deep knowledge diving into the sea of information resources, while generating buoys and markers for others to follow.

Our ability to do that will depend on processing power, creative use and development of “probabilistic merging” algorithms and a Topic Maps Query Language that supports querying of non-topic map data and creation of content based on the results of those queries.
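A minimal sketch of what interactive “probabilistic merging” might look like (the heuristics, weights, and topics are invented for illustration): several scorers vote on whether two topics identify the same subject, and the combined score is offered to the user rather than acted on automatically.

# Sketch: invented heuristics score whether two topics name the same subject.
def same_identifier(t1, t2):
    return 1.0 if t1["ids"] & t2["ids"] else 0.0

def name_overlap(t1, t2):
    n1, n2 = set(t1["names"]), set(t2["names"])
    return len(n1 & n2) / len(n1 | n2)

HEURISTICS = [(same_identifier, 0.7), (name_overlap, 0.3)]

def merge_score(t1, t2):
    # weighted combination of the individual heuristic scores
    return sum(w * h(t1, t2) for h, w in HEURISTICS)

t1 = {"ids": {"http://example.org/mark-twain"}, "names": ["Mark Twain"]}
t2 = {"ids": {"http://example.org/s-clemens"},
      "names": ["Mark Twain", "Samuel Clemens"]}

print(merge_score(t1, t2))   # 0.15: a candidate merge, left to the user to decide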

****

PS: For more information on the Watson project, see: What Is Watson?, part of IBM’s DeepQA project.

June 19, 2010

Demonstrating The Need For Topic Maps

Individual Differences in the Interpretation of Text: Implications for Information Science by Jane Morris demonstrates that different readers have different perceptions of lexical cohesion in a text. About 40% worth of difference. That is a difference in the meaning of the text.

Many tasks in library and information science (e.g., indexing, abstracting, classification, and text analysis techniques such as discourse and content analysis) require text meaning interpretation, and, therefore, any individual differences in interpretation are relevant and should be considered, especially for applications in which these tasks are done automatically. This article investigates individual differences in the interpretation of one aspect of text meaning that is commonly used in such automatic applications: lexical cohesion and lexical semantic relations. Experiments with 26 participants indicate an approximately 40% difference in interpretation. In total, 79, 83, and 89 lexical chains (groups of semantically related words) were analyzed in 3 texts, respectively. A major implication of this result is the possibility of modeling individual differences for individual users. Further research is suggested for different types of texts and readers than those used here, as well as similar research for different aspects of text meaning.

I won’t belabor what a 40% difference in interpretation implies for the “one interpretation of data” crowd. At least for those who prefer an evidence versus ideology approach to IR.

What is worth belaboring is how to use Morris’ technique to demonstrate such differences in interpretation to potential topic map customers. As a community we could develop texts for use with particular market segments: business, government, legal, finance, etc. We could build an interface to replace the colored pencils used to mark all words belonging to a particular group, and automate some of the calculations and other operations on the resulting data.
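One such calculation, sketched with invented chains and words: measure agreement between two readers by the word pairs each groups into the same lexical chain.

# Sketch: agreement between two readers' lexical chains; invented data.
from itertools import combinations

def grouped_pairs(chains):
    # every unordered word pair a reader placed in the same chain
    return {frozenset(p) for chain in chains for p in combinations(chain, 2)}

reader_a = [["bank", "money", "loan"], ["river", "water"]]
reader_b = [["bank", "river", "water"], ["money", "loan"]]

pa, pb = grouped_pairs(reader_a), grouped_pairs(reader_b)
agreement = len(pa & pb) / len(pa | pb)
print(f"{agreement:.0%} agreement on word pairings")   # 33% agreement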

Sensing that interpretations of texts vary is one thing. Having an actual demonstration, possibly using texts from a potential client, is quite another.

This is a tool we should build. I am willing to help. Who else is interested?

June 8, 2010

Semantic Overlay Networks

GridVine: Building Internet-Scale Semantic Overlay Networks sounds like they are dealing with topic-map-like issues to me. You be the judge:

This paper addresses the problem of building scalable semantic overlay networks. Our approach follows the principle of data independence by separating a logical layer, the semantic overlay for managing and mapping data and metadata schemas, from a physical layer consisting of a structured peer-to-peer overlay network for efficient routing of messages. The physical layer is used to implement various functions at the logical layer, including attribute-based search, schema management and schema mapping management. The separation of a physical from a logical layer allows us to process logical operations in the semantic overlay using different physical execution strategies. In particular we identify iterative and recursive strategies for the traversal of semantic overlay networks as two important alternatives. At the logical layer we support semantic interoperability through schema inheritance and semantic gossiping. Thus our system provides a complete solution to the implementation of semantic overlay networks supporting both scalability and interoperability.

The concept of “semantic gossiping” enables semantic similarity to be established through the combination of local mappings, that is, by adding the mappings together. (Similar to the set behavior of subject identifiers/locators in the TMDM. That is to say, if you merge two topic maps, any additional subject identifiers, previously unknown to the first topic map, will enable those topics to merge with topics in later merges where previously they may not have.)
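A minimal sketch of that set behavior (my own illustration, not GridVine code): topics merge when their identifier sets overlap, and each merge can enable merges that were previously impossible.

# Sketch: merge topics whenever their identifier sets share a member.
def merge_topics(topics):
    merged = []
    for ids in topics:
        ids = set(ids)
        keep = []
        for m in merged:
            if m & ids:
                ids |= m          # absorb any topic sharing an identifier
            else:
                keep.append(m)
        merged = keep + [ids]
    return merged

print(merge_topics([{"idA"}, {"idB"}, {"idA", "idB"}]))
# [{'idA', 'idB'}]: the third topic's identifiers bridge the first two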

Open Question: If everyone concedes that:

  • we live in a heterogeneous world
  • we have stored vast amounts of heterogeneous data
  • we are going to continue to create/store even vaster amounts of heterogeneous data
  • we keep maintaining and creating more heterogeneous data structures to store our heterogeneous data

If every starting point is heterogeneous, shouldn’t heterogeneous solutions be the goal?

Such as supporting heterogeneous mapping technologies? (Granting there will also be a limit to those supported at any one time but it should be possible to extend to embrace others.)

Author Bibliographies:

Karl Aberer

Philippe Cudré-Mauroux

Manfred Hauswirth

Tim Van Pelt

June 6, 2010

Citation Indexing – Semantic Diversity – Exercise

Filed under: Citation Indexing,Exercises,Indexing,Semantic Diversity — Patrick Durusau @ 10:48 am

In A Conceptual View of Citation Indexing, which is chapter 1 of Citation Indexing — Its Theory and Application in Science, Technology, and Humanities (1979), Garfield says of the problem of changing terminology and semantics:

Citations, used as indexing statements, provide these lost measures of search simplicity, productivity, and efficiency by avoiding the semantics problems. For example, suppose you want information on the physics of simple fluids. The simple citation “Fisher, M.E., Math. Phys., 5,944, 1964” would lead the searcher directly to a list of papers that have cited this important paper on the subject. Experience has shown that a significant percentage of the citing papers are likely to be relevant. There is no need for the searcher to decide which subject terms an indexer would be most likely to use to describe the relevant papers. The language habits of the searcher would not affect the search results, nor would any changes in scientific terminology that took place since the Fisher paper was published.

In other words, the citation is a precise, unambiguous representation of a subject that requires no interpretation and is immune to changes in terminology. In addition, the citation will retain its precision over time. It also can be used in documents written in different languages. The importance of this semantic stability and precision to the search process is best demonstrated by a series of examples.

Question: What subject does a citation represent?

Question: What “precision” does the citation retain over time?

Exercise: Select any article that interests you with more than twenty (20) non-self citations. Identify ten (10) ideas in the article and examine at least twenty (20) citing articles. Why was your article cited? Was your article cited for an idea you identified? Was your article cited for an idea you did not identify? (Either one is correct. This is not a test of guessing why an article will be cited. It is exploration of a problem space. Your fact finding is important.)

Extra credit: Did you notice any evidence to support or contradict the notion that citation indexing avoids the issue of semantic diversity? If your article has been cited for more than ten (10) years, try one or two citations per year for every year it is cited. Again, your factual observations are important.

Citation Indexing

Eugene Garfield’s homepage may not be familiar to topic map fans but it should be.

Garfield invented citation indexing in the late 1950s/early 1960s.

Among the treasures you will find there: the full text of Citation Indexing — Its Theory and Application in Science, Technology, and Humanities (1979), quoted in the exercise above.

April 29, 2010

Second Class Citizens/Subjects

Filed under: Semantic Diversity,Subject Identity,TMRM — Patrick Durusau @ 6:39 pm

One of the difficulties that topic maps solve is the question of second class citizens (or subjects) in information systems.

The difficulty is one that Marijane raises when she quotes Michael Sperberg-McQueen wondering how topic maps differ from SQL databases, Prolog, or colloquial XML.

One doesn’t have to read far to find that SQL databases, colloquial XML (and other information technologies) talk about real-world subjects.*

The real-world view leaves the subjects that comprise information systems out of the picture.

That creates an underclass of subjects that appear in information systems, but can never be identified or be declared to have more than one identification.

Mapping strategies, like topic maps, enable users to identify any subject. Any subject can have multiple identifiers. Users can declare what properties must be present to identify a subject. Including the subjects that make up information systems.
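A minimal sketch (property names invented) of a user-declared identity rule, applied to a subject from inside an information system, in this case a database column:

# Sketch: two references identify the same subject only if every required
# property is present on both sides and the values agree. Names invented.
def same_subject(a, b, required=("table", "column")):
    return all(k in a and k in b and a[k] == b[k] for k in required)

ref1 = {"table": "patients", "column": "dob", "label": "Date of Birth"}
ref2 = {"table": "patients", "column": "dob", "label": "birth_date"}

print(same_subject(ref1, ref2))   # True: same column, despite different labels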

*Note my omission of Prolog. Some programming languages may be more map-friendly than others, but I am unaware of any that cannot attribute properties to parts of a data structure (or its contents) for the purposes of mapping and declaring a mapping.

April 26, 2010

Are Topic Maps News?

Filed under: Semantic Diversity,Subject Identity,Topic Maps — Patrick Durusau @ 4:24 pm

Jack Park, co-editor of XML Topic Maps, likes to tell me: “topic maps are not news.” I respond with a variety of explanations/defenses.

Today I wrote the following:

Topic maps are a representation of what people have been doing for as long as they have been able to communicate and had different ways to identify things they wanted to talk about.

Some people were able to recognize the same subjects were being identified differently, so they created a mental mapping of the different identifiers. When we reached the age of recorded information, that mental mapping enabled them to find information recorded under different identifications for the same subject.

Topic maps, like thesauri and indexes before them, enable people to write down their mappings. And say on what basis those mappings were done. The first act enables people to use mappings done by others, like thesauri and indexes. The second act, recording the reason for the mapping (subject identity), enables the re-use of a mapping.

So, no news. Saving time, money, resources, enabling auditability/transparency, preserving institutional memory, re-use of mappings (reliably), making more information available to more people, but alas, no news.

April 19, 2010

Why Semantic Technologies Remain Orphans (Lack of Adoption)

Filed under: Data Silos,Heterogeneous Data,Mapping,Semantic Diversity,Topic Maps — Patrick Durusau @ 6:54 pm

In the debate over Data 3.0 (a Manifesto for Platform Agnostic Structured Data) Update 1, Kingsley Idehen has noted the lack of widespread adoption of semantic technologies.

Everyone prefers their own world view. We see some bright, shiny future if everyone else, at their expense, would adopt our view of the world. That hasn’t been persuasive.

And why should it be? What motivation do I have to change how I process/encode my data, in the hopes that if everyone else in my field does the same thing, then at some unknown future point, I will have some unquantifiable advantage over how I process data now?

I am not advocating that everyone adopt XTM syntax or the TMDM as a data model. Just as there are an infinite number of semantics there are an infinite number of ways to map and combine those semantics. I am advocating a disclosed mapping strategy that enables others to make meaningful use of the resulting maps.

Let’s take a concrete case.

The Christmas Day “attack” by a terrorist who set his pants on fire (Christmas Day Attack Highlights US Intelligence Failures) illustrates a failure to share intelligence data.

One strategy, the one most likely to fail, is the development of a common data model for sharing intelligence data. The Guide to Sources of Information for Intelligence Officers, Analysts, and Investigators, Updated gives you a feel for the scope of such a project. (100+ pages listing sources of intelligence data)

A disclosed mapping strategy for the integration of intelligence data would enable agencies to keep their present systems, data structures, interfaces, etc.

Disclosing the basis for mapping, whatever the target (such as RDF), will mean that users can combine the resulting map with other data. Or not. But it will be a meaningful choice. A far saner (and more cost effective) strategy than a common data model.

Semantic diversity is our strength. So why not play to our strength, rather than against it?

Zero-Sum Games and Semantic Technologies

Filed under: Mapping,Maps,Semantic Diversity,Semantic Web,Topic Maps — Patrick Durusau @ 12:39 pm

Kingsley Idehen asked why debates over semantic technologies are always zero-sum games.

I understood him to be asking about RDF vs. Topic Maps but the question could be applied to any two semantic technologies, including RDF vs. his Data 3.0 (a Manifesto for Platform Agnostic Structured Data) Update 1.

This isn’t a new problem but in fact is a very old one.

To take Kingsley’s OR seriously means a user may choose a semantic technology other than mine. Which means it may not work as well, or at all, with my software. (vendor interest) More importantly, given the lack of commercial interest in semantic technologies, it is a different way of viewing the world. That is, it is different from my way of viewing the world.

That is the linchpin that explains the zero-sum nature of the debates, from upper ontologies to the actual application of semantic technologies.

We prefer our view of the world to that of others.

Note that I said we. Not some of us, not part of the time, not some particular group or class, or any other possible distinction. Everyone, all the time.

That fact, everyone’s preference for their view of the world, underlies the semantic, cultural, linguistic diversity that we encounter day to day. It is a diversity that has persisted, as far as is known, throughout recorded history. There are no known periods without that diversity.

To advocate that anyone adopt another view of the world, a view other than their own, even only Kingsley’s OR, means they have a different view than before. That is, by definition, a zero-sum game. Either the previous view prevails, or it doesn’t.

I prefer mapping strategies (note I did not say a particular mapping strategy) because they enable diverse views to continue as is and put the burden of mapping on those who wish to have additional views.

April 13, 2010

Federated Search Blog

Filed under: Federated Search,Searching,Semantic Diversity — Patrick Durusau @ 7:00 am

Topic mappers need to read the Federated Search Blog on a regular basis.

First, “federated search” is how a significant part of the web community talks about gathering up diverse information resources.

Think of it as learning to say basic phrases in a foreign language. It may not be easy but your host will be impressed that you made the effort. Same lesson here.

Second, it has a high percentage of extremely useful resources. Two examples I found while looking at the site this morning made the point.

Third, we need to avoid being too narrowly focused. Semantic integration needs vary from navigation of known information resources to federation of information resources to integration based on probes of document sets too large for verification (those exist, to be covered in a future post).

Topic maps have something unique to offer those efforts but only if we understand the needs of others in their own terms.

April 12, 2010

Topic Maps and the “Vocabulary Problem”

To situate topic maps in a traditional area of IR (information retrieval), try the “vocabulary problem.”

Furnas describes the “vocabulary problem” as follows:

Many functions of most large systems depend on users typing in the right words. New or intermittent users often use the wrong words and fail to get the actions or information they want. This is the vocabulary problem. It is a troublesome impediment in computer interactions both simple (file access and command entry) and complex (database query and natural language dialog).

In what follows we report evidence on the extent of the vocabulary problem, and propose both a diagnosis and a cure. The fundamental observation is that people use a surprisingly great variety of words to refer to the same thing. In fact, the data show that no single access word, however well chosen, can be expected to cover more than a small proportion of user’s attempts. Designers have almost always underestimated the problem and, by assigning far too few alternate entries to databases or services, created an unnecessary barrier to effective use. Simulations and direct experimental tests of several alternative solutions show that rich, probabilistically weighted indexes or alias lists can improve success rates by factors of three to five.

The Vocabulary Problem in Human-System Communication (1987)

Substitute topic maps for probabilistically weighted indexes or alias lists. (Techniques we are going to talk about in connection with topic maps authoring.)
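A toy version of the weighted alias list Furnas describes (terms and weights are invented; real weights would come from observed user vocabulary):

# Toy probabilistically weighted alias list; invented terms and weights.
aliases = {
    "remove": [("delete_file", 0.6), ("uninstall", 0.3), ("clear_cache", 0.1)],
    "erase":  [("delete_file", 0.8), ("clear_cache", 0.2)],
}

def resolve(user_term):
    # candidate system actions for a user's word, most probable first
    return sorted(aliases.get(user_term, []), key=lambda t: -t[1])

print(resolve("erase"))   # [('delete_file', 0.8), ('clear_cache', 0.2)]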

Three to five times greater success is an incentive to use topic maps.

Marketing Department Summary

Customers can’t buy what they can’t find. Topic maps help customers find purchases, which increases sales. (Be sure to track pre- and post-topic-map sales results, so marketing can’t successfully claim the increases are due to their efforts.)

April 6, 2010

Building Multilingual Topic Maps

Filed under: Conferences,Heterogeneous Data,Semantic Diversity — Patrick Durusau @ 8:42 pm

The one article of faith shared by all topic map enthusiasts is: topic maps can express anything! But having said that, “when the rubber hits the road” (an Americanism: when things get real and action must be taken), the question is how to build a topic map, particularly a multilingual one.

We are all familiar with the ability of topic maps to place a “scope” on a name so that its language can be indicated. But that is only one aspect of what is expected of a modern multilingual system.
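A minimal sketch of language-scoped names (the identifier and scope labels are invented):

# Sketch: one topic, many base names, each scoped by language. Invented data.
topic = {
    "subject": "http://example.org/subject/germany",
    "names": [
        {"value": "Germany",     "scope": {"lang:en"}},
        {"value": "Deutschland", "scope": {"lang:de"}},
        {"value": "Allemagne",   "scope": {"lang:fr"}},
    ],
}

def names_in_scope(topic, scope):
    return [n["value"] for n in topic["names"] if scope in n["scope"]]

print(names_in_scope(topic, "lang:de"))   # ['Deutschland']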

Fortunately, topic map fans don’t have to re-invent multilingual information retrieval techniques!

Bookmark and use the resources found at the Cross Language Evaluation Forum. CLEF is sponsored by TrebleCLEF, an activity of the European Commission.

CLEF has almost a decade of annual proceedings, and both sites offer link collections to other multilingual resources. I am going to start mining those proceedings and other documents for suggestions and tips on constructing topic maps.

Suggestions, comments, tips, etc., that you have found useful would be appreciated.

(PS: I am sure all this is old hat to European topic map folks but realize there are, ahem, parts of the world where multilingualism isn’t valued. I suspect many of the same techniques will work for multiple identifications in single languages.)

April 3, 2010

Source of Heterogeneous Data?

Filed under: Heterogeneous Data,Semantic Diversity — Patrick Durusau @ 7:19 pm

Topic maps are designed to deal with heterogeneous data. The question I have never heard asked (or answered) is: “Where does all this heterogeneous data come from?” Heterogeneous data is the topic of conversation in digital IT and pre-digital IT literature.

You would think that question would have been asked and answered. I went out looking for it, since email is slow today. (Holy Saturday 2010)

If I can find a time when there wasn’t any heterogeneous data, then someone may have commented, “look, there’s heterogeneous data.” I could then track the cause forward. Sounds simple enough.

I have a number of specialized works on languages of the Ancient Near East but it turns out the Unicode standard has the information we need.

Chapter 14, Archaic Scripts, has entries for both Egyptian hieroglyphics and Sumero-Akkadian. Both arose at about the same time, somewhere from the middle to near the end of the fourth millennium BCE. That’s recorded heterogeneous data, isn’t it?

For somewhere between 5,000 and 5,500 years we have had heterogeneous data. It appears to be universal, geographically speaking.

The source of heterogeneous data? That would be us. What we need is a solution that works with us and not against us. That would be topic maps.

April 2, 2010

Re-Inventing Natural Language

Filed under: Heterogeneous Data,Ontology,Semantic Diversity — Patrick Durusau @ 8:29 pm

What happens when users use ontologies? That is, when ontologies leave the rarefied air of campuses, turgid dissertations, and the clutches of armchair ontologists?

Would you believe that users simply take terms from ontologies and use them as they wish? In other words, after decades of research, ontologists have re-invented natural language! With all of its inconsistent usage, etc.

I would send a fruit basket if I had their address.

For the full details, take a look at: The perceived utility of standard ontologies in document management for specialized domains. From the conclusion:

…rather than being locked into conforming to the standard, users will be free to use all or small fragments of the ontology as best suits their purpose; that is, these communities will be able to very flexibly import ontologies and make selective use of ontology resources. Their selective use and the extra terms they add will provide useful feedback on how the external ontologies could be evolved. A new ontology will emerge as the result and this itself may become a new standard ontology.

I would amend the final two sentences to read: “Their selective use and the extra terms they add will provide useful feedback on how their language is evolving. A new language will emerge as the result and this may itself become a new standard language.”

Imagine, all that effort and we are back where we started. Users using language (terms from an ontology) to mean what they want it to mean and not what was meant by the ontology.

The armchair ontologists have written down what they mean. Why don’t we ask ordinary users the same thing, and write that down?

March 18, 2010

What The World Needs Now…

Filed under: Searching,Semantic Diversity,Topic Maps — Patrick Durusau @ 11:58 am

With apologies to Jackie DeShannon I would say: topic maps!

The music is a bit over the top but e-Discovery: Did You Know? makes the case for topic maps now!

My favorite line: “At our current rate of data expansion, by just 2011 there will be 2 zettabytes of ESI [Electronically Stored Information] (2 thousand exabytes), which is as many bytes of information as there are … STARS IN THE UNIVERSE”

My takeaway — the amount of mappable territory continues to expand. We can each find our own way, or we can join forces to create and share maps of what we have discovered and places we have been.

As Steve Newcomb foresaw years ago, there is a real economic opportunity in building maps into information territories. That is, searchers can monetize their explorations of information resources as topic maps.

You can buy reports from Gartner but with a topic map of an information area, you can merge it with your data and reach your own conclusions.

A killer topic map application would pair itself with data exploration tools for easy creation of topic maps that can be refined as part of a topic map creation process. (The tedium of matching up obscure musicians might appeal to ministry-of-culture types, but insights into stock/bond trading (cf. The Big Short), legal discovery, and medical research are more likely to attract important users (the paying kind).)

March 16, 2010

Size Really Does Matter…

Filed under: Information Retrieval,Recall,Searching,Semantic Diversity — Patrick Durusau @ 7:20 pm

…when you are evaluating the effectiveness of full-text searching. Twenty-five years ago, Blair and Maron, An evaluation of retrieval effectiveness for a full-text document-retrieval system, established that size affects the predicted usefulness of full-text searching.

Blair and Maron used a then state-of-the-art litigation-support database containing 40,000 documents, for a total of approximately 350,000 pages. Their results differ significantly from earlier, optimistic reports concerning full-text retrieval. The earlier reports were based on sets of fewer than 750 documents.

The lawyers using the system thought they were obtaining, at a minimum, 75% of the relevant documents. They were astonished to learn they were recovering only 20% of the relevant documents.

One of the reasons cited by Blair and Maron merits quoting:

The belief in the predictability of words and phrases that may be used to discuss a particular subject is a difficult prejudice to overcome….Stated succinctly, it is impossibly difficult for users to predict the exact word, word combinations, and phrases that are used by all (or most) relevant documents and only (or primarily) by those documents….(emphasis in original, page 295)

That sounds to me like users using different ways to talk about the same subjects.

Topic maps won’t help users to predict the “exact word, word combinations, and phrases.” However, they can be used to record mappings into document collections that collect up the “exact word, word combinations, and phrases” used in relevant documents.

Topic maps can be used like the maps of early explorers, which became more precise with each new expedition.

March 14, 2010

Semantic Diversity – The Default Case

Filed under: Semantic Diversity — Patrick Durusau @ 6:29 pm

While constructing the food analogy to semantic diversity, it occurred to me that semantic diversity is the default case.

Despite language suppression, advocates of universal languages (Esperanto, LogLang), and those who would police existing languages (L’Académie française), semantic diversity remains the default case.

There is semantic diversity in the methods to overcome semantic diversity. Even within particular approaches to overcoming semantic diversity. You can observe diversity in ontologies at Swoogle.

I think semantic diversity continues in part because we as human beings are creative, even when addressing issues like semantic diversity. It is part of who we are to be these bubbling fountains of semantic diversity as it were.

Shouldn’t the first question to anyone hawking the latest search widget be: “Can it search effectively using my terms?” Simple enough question.

The first startup that can answer that question in the affirmative will go far.
