Archive for May, 2010

Authoritative Identifications?

Monday, May 31st, 2010

Sam Hunting reminded me that if a method of identification becomes authoritative, that can lead to massive loss of data (prior methods of identification). We were discussing the Semantic Web Challenge. That assumes systems that do not support multiple “authoritative” and alternative identifications.

While I can understand the concern, I think it is largely unwarranted.

Natural language and consequently identification have been taking care of themselves in the face of “planned” language proposals for centuries. According to Klaus Schubert in the introduction to: Interlinguistics: Aspects of the Science of Planned Languages, Berlin: Mouton de Gruyter, 1989, there are almost 1,000 such projects, most since the second half of the 19th century. I suspect the count was too low by the time it was published.

The welter of identifications has continued merrily along for more than the last 20 years so I don’t feel like we are in any imminent danger of uniformity.

And, as a practical matter, more than a billion speakers of Chinese, Japanese and Korean are bringing their concerns and identifications of subjects to the WWW in a way that will be hard to ignore. (Nor should they be.)

Systems that support multiple authoritative and alternative identifications will be the future of the WWW.

PS: The use of owl:sameAs is a pale glimmer of what needs to be possible for reliable mappings of identifications. With owl:sameAs, the reason for any mapping remains unstated.

Semantic Web Challenge

Monday, May 31st, 2010

The Semantic Web Challenge 2010 details landed in my inbox this morning. My first reaction was to refine my spam filter. 😉 Just teasing. My second and more considered reaction was to think about the “challenge” in terms of topic maps.

Particularly because a posting from the Ontology Alignment Evaluation Initiative arrived the same day, in response to a posting from sameas.org.

I freely grant that URIs that cannot distinguish between identifiers and resources without 303 overhead are poor design. But the fact remains that there are many data sets, representing large numbers of subjects that have even poorer subject identification practices. And there are no known approaches that are going to result in the conversion of those data sets.

Personally I am unwilling to wait until some new “perfect” language for data sweeps the planet and results in all data being converted into the “perfect” format. Anyone who thinks that is going to happen needs to stand with the end-of-the-world-in-2012 crowd. They have a lot in common. Magical thinking being one common trait.

The question for topic mappers to answer is how do we attribute to whatever data language we are confronting, characteristics that will enable us to reliably merge information about subjects in that format either with other information in the same or another data language? Understanding that the necessary characteristics may vary from data language to data language.

Take the lack of a distinction between identifier and resource in the Semantic Web for instance. One easy step towards making use of such data would be to attribute to each URI the status of either being an identifier or a resource. I suspect, but cannot say, that the authors/users of those URIs know the answer to that question. It seems even possible that some sets of such URIs are all identifiers and if so marked/indicated in some fashion, they automatically become useful as just that, identifiers (without 303 overhead).

As identifiers they may lack the resolution that topic maps provide to the human user, which enables them to better understand what subject is being identified. But, since topic maps can map additional identifiers together, when you encounter a deficient identifier, simply create another one for the same subject and map them together.
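A minimal sketch of those two steps in Python, under assumptions of my own: the URIs are made up, the "identifier"/"resource" status is assigned by hand, and a small union-find stands in for whatever merging a real topic map engine performs.

```python
# Hypothetical URIs; the "identifier"/"resource" status is attributed by us,
# not discovered from the URI itself.
uri_status = {
    "http://example.org/id/paris": "identifier",
    "http://example.org/doc/paris.html": "resource",
    "http://example.com/place/75000": "identifier",
}

parent = {}  # union-find forest over identifiers


def find(x):
    """Return the representative identifier for the subject x belongs to."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def map_together(a, b):
    """Record that identifiers a and b identify the same subject."""
    for uri in (a, b):
        if uri_status.get(uri) != "identifier":
            raise ValueError(f"{uri} is not marked as an identifier")
    parent[find(a)] = find(b)


def same_subject(a, b):
    """True if a and b have been mapped, directly or indirectly."""
    return find(a) == find(b)


# A deficient identifier gets a better one created for the same subject.
uri_status["http://example.net/subject/paris-france"] = "identifier"
map_together("http://example.org/id/paris",
             "http://example.net/subject/paris-france")
map_together("http://example.com/place/75000",
             "http://example.net/subject/paris-france")

print(same_subject("http://example.org/id/paris",
                   "http://example.com/place/75000"))  # True
```

The two original identifiers were never mapped to each other directly. They end up identifying the same subject because both were mapped to the new one.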

I think we need to view the Semantic Web data sets as opportunities to demonstrate how understanding subject identity, however that is indicated, is the linchpin to meaningful integration of data about subjects.

Bearing in mind that all our identifications, Semantic Web, topic map or otherwise, are always local, provisional and subject to improvement, in the eye of another.

Encyclopedia of Database Systems

Sunday, May 30th, 2010

Encyclopedia of Database Systems is a massive reference work on database systems, numbering some 3752 pages and being 8 inches thick (20.3 centimeters).

My first impression was favorable, particularly since entries included synonyms for entries, historical background materials, cross-references and recommended reading. All the things that I appreciate in a reference work.

The entry for record linkage was disappointing in several respects.

It focuses on statistical disclosure control (SDC), a current use of record linkage, but hardly the range of record linkage uses. For a more accurate account of record linkage see William Winkler’s Overview of Record Linkage and Current Research Directions.

Only two synonyms were given for record linkage, Record Matching and Re-identification. No mention of entity heterogeneity, list washing, entity reconciliation, co-reference resolution, etc.

The “synonyms” under Record Matching (the main article for record linkage) point back to the article Record Matching. Multiple terms that point to the main entry are useful. But to have the main entry point to terms that only point back to it wastes a reader’s time.

There was a quality control problem in terms of currency of cited research. For William Winkler, one of the leading researchers on record linkage, the most recent citation under Record Matching dates from 1999. Which omits Record Linkage References (Winkler, 2008), Overview of Record Linkage for Name Matching (Winkler, 2008), and, Overview of Record Linkage and Current Research Directions (Winkler, 2006).

My question becomes: What is missing from entries where I lack the familiarity to notice the loss?

Resources that are online should have hyperlinks. Under record linkage, Winkler’s 1999 The state of record linkage and current research problems is listed but without any link to the online version. Most of the cited resources are available either from commercial publishers (like the publisher of this tome) or freely online. Hyperlinks would be a value-add to readers.

The 10,696 bibliographic entries are scattered across 3752 pages. In addition to listing the bibliographic entries with each entry (as hyperlinks when possible), there should be a comprehensive bibliography for the work. Such hyperlinks could be the basis for a cited-by value-add feature.

With clever use of the subject listings and more complete synonym lists, another value-add would be to provide readers with a dynamic “latest” research on each subject listing.

This review was of the electronic version, which was delivered as a series of separate PDF files. Which quite naturally means that hyperlinks between entries that occur in different sections do not work. Defeats part of the utility of having an electronic version, at least in my view.

To their credit, Springer has made the subject listing for this work available in XML. Perhaps some enterprising graduate student will use that as a basis for a “latest” research listing.

I will be doing a more systematic review but stumbled across the entry for the W3C. The synonym for W3C is not World Wide Web consortium. Note the lowercase “consortium.” Rather, World Wide Web Consortium. And “Recommended Reading” for that entry, “W3C. Available at: http://www.w3.org” reinforces my point about quality control on references.

This is a very expensive work but I have no objection to commercial publishing, even expensive commercial publishing. I do have an expectation that I will find quality, innovation and value-add as the result of commercial publishing. So far, that expectation has been disappointed in this case.

PS: Every time an author’s name appears either for an entry or a cited work, there should be a hyperlink to the author’s entry in DBLP. That gives a reader access to a constantly updated bibliography of the author’s publications. Another value-add.

Association Rules

Saturday, May 29th, 2010

Apologies for posting on association rules in Private Mining of Association Rules, a term of art that might be confusing to topic map advocates, without defining it.

When we buy an item online, most retailers suggest that other buyers also purchased … some list of items. The “association” of those items together can be represented by a Boolean vector, composed of values for the presence or absence of an item. To form an association rule, such a vector is accompanied by support and confidence values.

The support value indicates the percentage of a data set where the association occurs. That is, the items in question appear together.

The confidence value indicates the percentage of the records containing one item (or set of items) that also contain another.

Minimums of these values are known as the minimum support threshold and the minimum confidence threshold and typically appear together.
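To make the two measures concrete, here is a small Python sketch. The baskets, the item names and the two threshold values are invented for illustration; only the definitions of support and confidence come from the description above.

```python
# Hypothetical shopping baskets; each set is one transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "jam"},
]


def support(itemset, data):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in data) / len(data)


def confidence(antecedent, consequent, data):
    """Fraction of transactions containing antecedent that also contain consequent."""
    both = support(set(antecedent) | set(consequent), data)
    return both / support(antecedent, data)


MIN_SUPPORT = 0.5      # assumed thresholds, chosen for illustration
MIN_CONFIDENCE = 0.7

s = support({"bread", "milk"}, transactions)
c = confidence({"bread"}, {"milk"}, transactions)
print(f"bread => milk: support={s:.2f}, confidence={c:.2f}")
print("rule kept:", s >= MIN_SUPPORT and c >= MIN_CONFIDENCE)
```

Bread and milk appear together in three of five baskets (support 0.60), and three of the four baskets with bread also hold milk (confidence 0.75), so the rule clears both assumed thresholds.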

For more information on “association rules,” see Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber, at page 229. (The publication date for the second edition in WorldCat (the link on the title) is wrong. Should be 2006.)

Supplemental Materials for Data Mining. I am checking on the status of the apparent 3rd edition so you might want to wait on buying a copy. Would make a great text for an advanced topic maps course that focused on populating a topic map.

Private Mining of Association Rules

Friday, May 28th, 2010

Private Mining of Association Rules (2005) examines how parties can share association rules for data mining, without sharing data.

The authors develop a secure collaborative association rule mining protocol based upon a homomorphic encryption scheme.
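For readers who have not met homomorphic encryption, the toy sketch below shows the property such protocols lean on: counts can be added while still encrypted. It uses a textbook Paillier scheme with tiny primes of my choosing. It is not the protocol from the paper, and the counts and function names are made up.

```python
import math
import random

# Toy Paillier keypair. These tiny primes are for illustration only and
# offer no security; real keys use primes of 1024 bits or more.
p, q = 293, 433
n = p * q
n_sq = n * n
g = n + 1
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
mu = pow(lam, -1, n)  # modular inverse; this shortcut holds because g = n + 1


def encrypt(m):
    """Encrypt an integer 0 <= m < n under the public key (n, g)."""
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq


def decrypt(c):
    """Decrypt with the private key (lam, mu)."""
    return ((pow(c, lam, n_sq) - 1) // n * mu) % n


def add_encrypted(c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % n_sq


# Two parties each hold a private count of baskets containing some itemset.
count_a, count_b = 120, 75
total = add_encrypted(encrypt(count_a), encrypt(count_b))
print(decrypt(total))  # 195
```

Only the holder of the private key learns the combined count, and neither party's individual count is revealed to the other in the clear.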

Developing a similar approach for topic maps would be a nice doctoral project. Association rule data mining and the associated privacy concerns are well known. Combining those in a topic map context would be an interesting piece of work.

Author Bibliographies:

Justin Z. Zhan

Stan Matwin

LiWu Chang

Blast From The Past

Thursday, May 27th, 2010

Can you place the following quote?

…my invention uses reason in its entirety and it, in addition, a judge of controversies, an interpreter of notions, a balance of probabilities, a compass which will guide us over the ocean of experiences, an inventory of things, a table of thoughts, a microscope for scrutinizing present things, a telescope for predicting distant things, a general calculus, an innocent magic, a non-chimerical cabal, a script which all will read in their own language; and even a language which one will be able to learn in a few weeks, and which will be soon accepted amidst the world. And which will lead the way for the true religion everywhere it goes.

I have to admit when I first read the part about “…one will be able to learn in a few weeks,…” I was thinking about John Sowa and one of his various proposals (some say perversions) of natural language.

Then I got to the part about “…the way for the true religion…” and realized that this was probably either a fundamentalist quote (you pick the tradition) or from an earlier time.

Curious? It was Leibniz, Letter to Duke of Hanover, 1679. Quoted in The Search For The Perfect Language by Umberto Eco. More on the book in later posts.

OrientDB

Thursday, May 27th, 2010

OrientDB is a NoSQL database.

The performance and scaling numbers are nothing short of amazing.

A couple of early comments:

Getting Oriented with OrientDB.

OrientDB: A new Open Source NoSQL DBMS (google.com)

Caution: While I favor exploration of new data structures and technologies, only a limited amount of data will ever be available in any one structure. Even the reputed 102 billion items in the Amazon servers represent only part of the information available about those items.

I remain a fan of Barta’s virtual topic maps that are composed from disparate data sources.

Heterogeneous Collaborations

Thursday, May 27th, 2010

Knowledge sharing in heterogeneous collaborations – a longitudinal investigation of a cross-cultural research collaboration in nanoscience by Steffen Kanzler researches the impact of culture on sharing of knowledge.

The take away from this research project for topic maps is that “knowledge sharing” is far more complex than simply saying “share knowledge.”

The technical side of integrating multiple heterogeneous representatives of the same subjects is a worthy research goal.

However, the best topic map engine in the world isn’t very useful if people aren’t motivated to use it.

Terrorism Resources

Wednesday, May 26th, 2010

Terrorism Informatics Resources is a resource listing for an area where topic maps can make a difference.

Ontological Emptiness

Wednesday, May 26th, 2010

Bernard Vatant’s ontological emptiness comment on mapping of identifiers continues to haunt me.

I am tempted to say that if the identifiers are unambiguous, then an ontologically empty mapping is sufficient. What more is there to say than each of two or more identifiers do in fact identify the same subject?

That begs the identification question, doesn’t it? To say that two or more identifiers identify the same subject presumes a judgment on some basis that the identifiers do in fact represent the same subject. Bernard is asserting that a mapping in the absence of a basis for mapping is sufficient.

When put that way, “mapping in the absence of a basis for mapping,” then Bernard’s proposal seems deeply problematic, at least for human users.

For computers a mapping is always just a mapping. There may be reasons to include or exclude the basis for a mapping, but at the end of the day, the result is a mapping. (There may be values that trigger mappings but that isn’t the same as a “reason” for a mapping.)

For the human user, on the other hand, the information “behind” each identifier, is what they use to form a judgment about the subject an identifier represents. That enables them to form a judgment about the mapping of identifiers. And whether they wish to follow the same mapping.

Perhaps we should separate the question of how to communicate to a user why a mapping has occurred from the simple fact of mapping in an information system? The information system is incapable of caring by definition and perhaps the basis for mapping is simply clutter from its perspective. The human user, on the other hand, needs the information that is meaningless to the information system.
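One way to picture that separation, as a rough Python sketch: the mapping carries an optional basis that the system ignores and the human reads. The class, the identifiers and the wording of the basis are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mapping:
    """Two identifiers asserted to identify the same subject."""
    left: str
    right: str
    basis: Optional[str] = None  # why the mapping was made; may be absent


# Hypothetical identifiers and bases, for illustration only.
mappings = [
    Mapping("geo:paris-1", "place:75000",
            basis="same name, country and coordinates within 1 km"),
    Mapping("geo:paris-2", "place:75460"),  # "ontologically empty"
]


def system_view(ms):
    """What the information system needs: the bare pairs."""
    return {(m.left, m.right) for m in ms}


def human_view(m):
    """What a person needs in order to decide whether to follow the mapping."""
    reason = m.basis if m.basis is not None else "no basis recorded"
    return f"{m.left} = {m.right} ({reason})"


for m in mappings:
    print(human_view(m))
```

Both mappings look identical in system_view. Only human_view exposes that the second one offers a reader nothing to judge it by.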

NoSQL Summer

Wednesday, May 26th, 2010

NoSQL Summer

If you enjoyed summer reading club at the library as a child, this is the summer reading program for you!

Nine cities are already forming reading clubs for papers that range from “Access Path Selection in an RDBMS” by P. Griffiths Selinger & al., to “Google’s BigTable” by Fay Chang & al.

A Mapmaker’s Manifesto

Tuesday, May 25th, 2010

Search Patterns by Peter Morville and Jeffery Callender should be on your must read list. Their “Mapmaker’s Manifesto” will give you an idea of why I like the book:

  1. Search is a problem too big to ignore.
  2. Browsing doesn’t scale, even on an iPhone.
  3. Size matters. Linear growth compels a step change in design.
  4. Simple, fast, and relevant are table stakes.
  5. One size won’t fit all. Search must adapt to context.
  6. Search is iterative, social, and multisensory.
  7. Increments aren’t enough. Even Google must innovate or die.
  8. It’s not just about findability. It’s not just about the Web.
  9. The challenge is radically multidisciplinary.
  10. We must engage engineers and executives in design.
  11. We can learn from the past. Library science is still relevant.
  12. We can learn from behavior. Interaction design affords actionable results.
  13. We can learn from one user. Analytics is enriched by ethnography.
  14. Some patterns, we should study and reuse.
  15. Some patterns, we should break like a bad habit.
  16. Search is a complex adaptive system.
  17. Emergence, cocreation, and self-organization are in play.
  18. To discover the seeds of change, go outside.
  19. In science, fiction, and search, the map invents the territory.
  20. The future isn’t just unwritten—it’s unsearched.

I also like Search Patterns because the authors concede there are vast unknowns as opposed to saying: “If you just use our (insert paradigm/syntax/ontology/language) then all those nasty problems go away.”

I think we need to accept their invitation to face the vast unknowns head on.

Knowledge Is Power

Monday, May 24th, 2010

Sir Francis Bacon originated the aphorism “Knowledge is power.” (Actually he said, “nam et ipsa scientia potestas est”….)

How powerful?

The 9/11 Report points out:

Agencies uphold a “need-to-know” culture of information protection rather than promoting a “need-to-share” culture of integration. (page 417)

Fast forward seven years and we find:

[Information Sharing Environment – ISE] Gaps exist in….(3) determining the results to be achieved by the ISE (that is, how information sharing is improved) along with associated milestones, performance measures, and the individual projects. (Information Sharing [2008])

Seven years later and there are gaps in “how information sharing is improved…..”?

Not sharing knowledge confers enough power to maintain data silos even in the face of national peril.

Topic maps can help you breach any silo you can access. Make that access meaningful and effective.

Not just national security data silos. Take mapping data silos of a regulated industry, say financial institutions. A mapping that grows with every audit/investigation.

Your choices are: 1) Wait for someone to relinquish power, or 2) Increase your power by breaching their data silo. Which one is for you?

New York Times – Developer Network

Monday, May 24th, 2010

New York Times – Developer Network

APIs from one of the largest news organizations in the world to access articles, campaign finance, congressional votes, best sellers and other information.

Think you can use topic maps to one-up the interfaces at the New York Times? Here’s your chance!

*****
I wrote to tech support requesting this link (could not find it on the homepage). Reply was they had no idea what I was talking about but forwarded it to the “appropriate” department. Actual human response, no auto-reply. I found it with a search engine. The New York Times help desk needs a topic map! 😉

QuaaxTM Topic Maps Engine Update!

Sunday, May 23rd, 2010

QuaaxTM Topic Map Engine version 0.51 has been released.

Small topic map engine written in PHP5. Uses MySQL with InnoDB enabled for storage.

Author: Johannes Schmidt

The Clio Project

Sunday, May 23rd, 2010

The Clio Project is a collaboration between University of Toronto and the IBM Almaden Research Center to build tools to simplify conversion from one data format to another, or in the words of the project “…Clio can automatically generate either a view, to reformulate queries against one schema into queries on another for data integration, or code, to transform data from one representation to the other for data exchange.”

The text with some screen shots at IBM says:

The Schema Viewer allows users to draw arrows between source and target schema elements. Such arrows may cross nesting levels, combine multiple elements, split and merge tables, etc. Clio incrementally interprets these arrows as mappings and generates a query accordingly.

Is that the same as Bernard Vatant’s ontological emptiness? That is, all we know is that the user drew an arrow from one schema element to another?
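To see how little an arrow by itself records, here is a toy Python sketch that turns a list of arrows into a query. It is not Clio's algorithm, which handles nesting, splits and merges. The table and column names and the arrows_to_query function are invented.

```python
# Hypothetical arrows drawn from source columns to target columns.
# Each arrow is (source_table, source_column, target_column).
arrows = [
    ("employee", "full_name", "name"),
    ("employee", "dept_id", "department"),
]


def arrows_to_query(arrows, target_table):
    """Turn a list of arrows over one source table into an INSERT ... SELECT."""
    sources = {table for table, _, _ in arrows}
    if len(sources) != 1:
        raise ValueError("this sketch handles a single source table only")
    (source_table,) = sources
    target_cols = ", ".join(t for _, _, t in arrows)
    select_cols = ", ".join(f"{s} AS {t}" for _, s, t in arrows)
    return (f"INSERT INTO {target_table} ({target_cols}) "
            f"SELECT {select_cols} FROM {source_table}")


print(arrows_to_query(arrows, "staff"))
```

The generated query works, yet nothing in the arrows says why full_name and name were judged to be the same thing.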

The project has produced a number of papers but no software that is openly available. I will request a copy and report back.

Peter McBrien

Saturday, May 22nd, 2010

Peter McBrien focuses on data modeling and integration.

Part of the AutoMed project on database integration. Recent work includes temporal constraints and P2P exchange of heterogeneous data.

Publications (dblp).

Homepage

Databases: Tools and Data for Teaching and Research: Useful collection of datasets and other materials on databases, data modeling and integration.

I first encountered Peter’s research in Comparing and Transforming Between Data Models via an Intermediate Hypergraph Data Model.

From a topic map perspective, the authors assumed the identities of the subjects to which their transformation rules were applied. Someone less familiar with the schema languages could have made other choices.

That’s the hard question isn’t it? How to have reliable integration without presuming a common perspective/interpretation of the schema languages?

*****
PS: This is the first of many posts on researchers working in areas of interest to the topic maps community.

CoReference Service

Saturday, May 22nd, 2010

Coreference as Service by Bernard Vatant says the ontological emptiness of an identifier mapping service determines its usefulness.

I wonder how to know when that will be true?

That is, I can imagine use cases where empty mapping of identifiers is good enough for some purpose.

In the case Bernard is talking about, the identifiers are of geographic locations. Perhaps there is a common enough frame of reference for that to work.

On the other hand, I can imagine coreference services with mappings based upon “attributes” associated with identifiers.

How to judge which one to use seems like an open question to me.

Balisage 2010 Contest – Wikis: Tower-of-Babel

Friday, May 21st, 2010

I no sooner point out that the Balisage conference lacks topic maps papers than a challenge lands in my inbox.

A challenge I could not tailor more for topic maps.

Coincidence? You decide.

As part of the Balisage 2010 Conference, MarkLogic has put forth a challenge in the form of a contest. The goal of the contest is to encourage markup experts to review and to research the current state of wiki markup languages and to generate a proposal that serves to de-babelize the current state of affairs for the long haul.

Wikis: tower-of-babel Solve the modern tower of babel

Contest Description: In the past few decades, as a planet, we’ve succeeded tremendously in standardizing a number of technologies (yay us!). Wiki technology (other than its underlying use of web technologies as a platform) is not solidly in this list. There is a lot of content available today in a variety of wiki syntaces. This syntax is not standardized. Some argue it shouldn’t be. Go beyond the existing debates, diatribes, and arguments. Put us on a practical track to fixing this and ensuring we will have access to this content for the long haul.

To enter, you must propose a set of concrete steps (organizational, social, and/or technological) that will enable wiki content interchange, a real WYSIWYG editor, and/or wiki syntax standardization.

Entries will be evaluated based on criteria that includes:

* How well does the entry understand the current state of the art?
* How well does the entry identify key stake holders and actors
(including history, motivation, and so on)
* Is the entry clear on its objectives? (The summary allows for
some variance here).
* Is the approach/vision elegant, clever, or mind-changing?
* Are the set of steps actionable and implementable?

Guidelines, rules, and prize:

1. Please no more than 10000 words.
2. Entries should be submitted by July 15th to:
balisage-2010-contest at marklogic dot com
3. Author(s) retains copyright and grants MarkLogic a non-exclusive
license to publish the winning entry.
4. The winner will be announced on August 3rd at the conference and
will take home a choice of
* Apple 15″ (i5) MacBook Pro
* Apple MacBook Air or
* USD $2000
5. The winner will be strongly encouraged (but not required) to give a
brief summary (~10 minutes) of their winning entry at the conference
on August 3rd.
6. Employees of MarkLogic are not eligible.
7. Judges decision is final.
8. Contest-related questions may also be submitted to:
balisage-2010-contest at marklogic dot com.

Are you ready to take the challenge?

Call to Arms! (err, Conference)

Thursday, May 20th, 2010

Balisage presentation schedule, Montreal, 3-6 August 2010, has been posted.

I see XQuery, C. S. Peirce’s type/token, XForms, polyhierarchical markup, parallel processing of XML, XSLT, Java, hey….!

No topic maps!

Here’s your chance! There are five (5) late-breaking news slots. Slots on the program for the very latest, cutting edge technical excellence (or boners). Can’t ever tell which one.

I am sure there are new ideas, applications, or analysis in topic maps that merit presentation at Balisage.

A bit about Balisage. It is the place to meet the people who are shaping the future of markup. It is like going to your local record store and getting to spend time with Lady GaGa. Well, maybe not quite like that but similar to that. Well, maybe not even similar to that, although I bet we could find someone who thinks Michael is cute.

Seriously, if you have seen the A-list for research and publication on markup and related issues, you have seen the lineup for Balisage.

Don’t let this Balisage pass without a strong paper or two on topic maps!

Balisage Call for Late-Breaking News

PS: Even if you don’t submit a paper, please try to attend. Simply the best markup conference of the year.

Context of Data?

Wednesday, May 19th, 2010

Cristiana Bolchini and others in And What Can Context Do For Data? have started down an interesting path for exploration.

That all data exists in some context is an unremarkable observation until one considers how often that context can be stated, attributed to data, to say nothing of being used to filter or access that data.

Bolchini introduces the notion of a context dimension tree (CDT) which “models context in terms of a set of context dimensions, each capturing a different characteristic of the context.” (CACM, Nov. 2009, page 137) Note that dimensions can be decomposed into sub-trees for further analysis. Further operations combine these dimensions into the “context” of the data that is used to produce a particular view of the data.
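A rough Python sketch of the idea as I read it, not the authors' formalism: a tree of dimensions whose values can carry sub-dimensions, and a chosen context that selects a view of tagged data. The dimension names, records and function names are invented, and the validity check is deliberately loose (it does not verify that a sub-dimension is used under the right parent value).

```python
# A toy context dimension tree: dimension -> value -> sub-dimensions.
# Dimension and value names are invented for illustration.
cdt = {
    "role": {
        "customer": {},
        "agent": {"region": {"north": {}, "south": {}}},
    },
    "time": {"today": {}, "this_week": {}},
}


def is_valid(context, tree):
    """Check that every (dimension, value) pair in context appears in the tree."""
    def pairs(t):
        for dim, values in t.items():
            for val, sub in values.items():
                yield (dim, val)
                yield from pairs(sub)
    return set(context) <= set(pairs(tree))


# Records tagged with the context elements under which they are relevant.
records = [
    {"id": 1, "tags": {("role", "agent"), ("region", "north")}},
    {"id": 2, "tags": {("role", "customer")}},
    {"id": 3, "tags": {("role", "agent"), ("region", "south")}},
]


def view(context, data):
    """Keep the records whose tags are all satisfied by the chosen context."""
    chosen = set(context)
    return [r["id"] for r in data if r["tags"] <= chosen]


context = [("role", "agent"), ("region", "north"), ("time", "today")]
print(is_valid(context, cdt))
print(view(context, records))  # [1]
```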

Not quite what is meant by scope in topic maps but something a bit more nuanced and subtle. I would argue (no surprise) that the context of a subject is part and parcel of its identity. And how much of that context we choose to represent will vary from project to project.

Further reading:

Bolchini, C., Curino, C. A., Quintarelli, E., Tanca, L. and Schreiber, F. A. A data-oriented survey of context models. SIGMOD Record, 2007.

Bolchini, C., Quintarelli, E. and Rossato, R. Relational data tailoring through view composition. In Proc. Intl. Conf. on Conceptual Modeling (ER’2007). LNCS. Nov. 2007

Context-ADDICT (it’s an acronym, I swear!) Website for the project developing this line of research. Prototype software available.

eBooks and Topic Maps

Tuesday, May 18th, 2010

Opportunities for topic maps as stand alone information products.

The Kobo eReader has 1 GB of storage standard and holds up to 1,000 titles. Topic maps could serve either for content navigation in general or for particular books. While a topic map of Jane Austen’s “Pride and Prejudice” might excite one of my college English professors, I don’t think it would be a real “hot” number in terms of sales. (Austen’s work is the default on the advertising I get at Borders. Your display may be different.) For further information, see the Kobo Developer Program.

Kindle (Amazon product) is another option. I would put in a link to their developer resources but all the strings have tracking information embedded in them. Just go to Amazon and follow the links to the Kindle resources. (A simple link to developer resources would be nice, just in case you know someone at Amazon.)

Or consider Lulu, a traditional print-on-demand/ebook publisher, which has released Lulu for Developers. The Lulu company profile points out that in 2008, there were 276,489 books traditionally published in the United States. Lulu alone published 400,000 titles last year. Perhaps not every title merits a topic map but what if you created a topic map for a group of titles? That would promote sales of the titles as a group and be a value-add to users.

Suppose I should also mention iPad Apps. Since I don’t have a cell phone, much less an iPhone, this one would be a really steep learning curve for me. Please post pointers to anyone developing topic maps for the iPad.

I haven’t tried one of these eformats with topic maps (yet) but suspect that once a book is “in” any of the formats, reliable pointing into them will be possible.

Imagine the “truth squads” who would want to sell their “version” alongside popular books. And then responses, using your topic map to reply to the first response.

The Story of Blow

Tuesday, May 18th, 2010

Thomas Neidhart’s comments made me realize I had been too brief on the issue of subject identifiers. I want to correct that by telling “The Story of Blow.”

If you are reading this post you are likely online so please open up another browser window to: Merriam-Webster and type in the search box the word “blow.”

Working from my post What Makes Subject Identifiers Different?, let’s go down my four points for “blow.”

1) Quite clearly “blow” identifies a lot of different subjects. So it is a “subject identifier” in the non-topic map sense.

2) And just as clearly, “blow” can be, has the capacity to, lead us to additional information. That is, it can be resolved.

Doesn’t mean it will be resolved, only that resolution is possible.

3) The additional information point is illustrated by the Merriam-Webster entry. As a transitive verb, it lists some 14 separate meanings. All of which involve additional information to know which one is meant.

But the dictionary is just a common example.

Another is the information that speakers of English carry around about the meanings of “blow.”

Which means our resolutions of “blow” can differ from that of others. (The “vocabulary problem.”)

4) The additional information in a dictionary is explicit. That is, you and I can both examine the same information.

That is in contrast to each of us hearing the term “blow” in conversation or over the radio/TV and deciding privately what was meant. We go through the first three steps but not to the fourth.

I could say: “That was good blow.” and leave you wondering what possible meaning I have assigned to the term “blow.” I’m surprised the dictionary omits this one; in another lifetime I would have understood it to be a reference to cocaine. So if I wanted that usage to be understood by others, I had better mark it with a Subject Identifier so as to make that meaning explicit.

I can think of several other missing definitions for “blow.” Can you?

PS: I was amused at the example given for the sense of “blow” as to spend extravagantly, “I will blow you to a steak.” Since Google reports no “hits” on that string I suspect it was inserted to catch anyone copying their definitions.

Precision versus Recall

Monday, May 17th, 2010

High precision means resources are missed.

High recall means sifting garbage.

Q: Based on what assumption?

A: No assumption, observed behavior of texts and search engines.

Q: Based on what texts?

A: All texts, yes, all texts.

Q: Texts where same subjects have different words/phrases and same words/phrases mean different subjects?

A: Yes, those texts!

Q: If the subjects were identified in those texts, we could have high precision and high recall?

A: Yes, but not possible, too many texts!

Q: If the authors of new texts were to identify….

A: Sorry, no time, have to search now. Good-bye!
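For anyone who wants the trade-off in numbers, a small Python sketch follows. The document ids and the three result sets are invented. Precision and recall use their usual definitions (hits over retrieved, hits over relevant).

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical corpus: documents 1-4 are about the subject we care about.
relevant = {1, 2, 3, 4}

# Narrow string query: finds only documents using one exact phrase.
narrow = {1, 2}
# Broad string query: matches every document that uses the word at all.
broad = {1, 2, 3, 4, 5, 6, 7, 8}
# Query by an explicit subject identifier attached to the documents.
by_subject = {1, 2, 3, 4}

for name, result in [("narrow", narrow), ("broad", broad),
                     ("by subject", by_subject)]:
    p, r = precision_recall(result, relevant)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

The narrow query misses half the resources, the broad one returns half garbage, and only the query against identified subjects scores 1.00 on both, which is the point the impatient searcher has no time to hear.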

What Makes Subject Identifiers Different?

Sunday, May 16th, 2010

What makes Subject Identifiers (topic maps sense) different from subject identifiers (non-topic maps sense)?

Summary of the argument/answer for the impatient:

Property                        subject identifier   Subject Identifier
Identifies Subject              Yes                  Yes
Resolvable                      Yes                  Yes
Resolution = More Information   Yes                  Yes
Explicit Information            No                   Yes

Identifies Subject

All “subject identifiers” and “Subject Identifiers” identify subjects.

Words, for example, as “subject identifiers,” identify subjects.

Resolvable

All “subject identifiers” and “Subject Identifiers” are resolvable. That is, they can lead to more information.

Resolution = More Information

The resolution of a “subject identifier” or “Subject Identifier” leads to information that identifies the subject it represents.

Explicit Information

Resolving a “subject identifier” does not lead to explicit information. Known only to the listener.

Resolving a “Subject Identifier” does lead to explicit information. Known to anyone who looks.

Conclusion: Resolution of “Subject Identifiers” leads to explicit information others can use to understand what subject it represents.
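The difference in the last row can be sketched in a few lines of Python. Everything here is hypothetical: the listeners, the example.org URIs and the descriptions are stand-ins, and a dictionary lookup stands in for whatever resolution actually involves.

```python
# Private resolution: each listener carries their own sense of the word,
# and nobody else can inspect it.
private_senses = {
    "alice": {"blow": "a sudden hard stroke"},
    "bob": {"blow": "to set air in motion"},
}

# Explicit resolution: a Subject Identifier (hypothetical URIs) resolves to
# a published description that anyone can examine.
published = {
    "http://example.org/psi/blow-strike": "blow: a sudden hard stroke",
    "http://example.org/psi/blow-air": "blow: to set air in motion",
}


def resolve_word(listener, word):
    """Resolve a word the way one particular listener privately would."""
    return private_senses[listener][word]


def resolve_subject_identifier(uri):
    """Resolve a Subject Identifier; the answer does not depend on who asks."""
    return published[uri]


print(resolve_word("alice", "blow") == resolve_word("bob", "blow"))
print(resolve_subject_identifier("http://example.org/psi/blow-strike"))
```

The word resolves differently for each listener, while the Subject Identifier resolves to the same description for everyone who looks.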

Index of Relationships

Saturday, May 15th, 2010

Index of Relationships

Documentation on relationships in Hibernate.

Understanding how others model relationships can influence our modeling of relationships.

(Pages are not dated. Suggestion on version(s) of Hibernate covered?)

Semantic Indexing

Saturday, May 15th, 2010

Semantic indexing and searching using a Hopfield net

Automatic creation of thesauri as a means of dealing with the “vocabulary problem.”

Another topic map construction tool.

A bit dated, 1997, but will run this line of research forward and report back.

With explicit subject identity, machine generated thesauri could be reliably interchanged.

And improved upon by human users.

Society of Indexers

Friday, May 14th, 2010

Society of Indexers.

British and Irish professional body for indexing.

Some highlights:

James Lamb’s “Human or computer produced indexes?” (Reads like a one page brief for topic maps.)

Valerie A. Elliston’s “Indexing children’s books”

How to become an indexer Resources, pointers, distance-learning courses, etc.

The Indexer

Friday, May 14th, 2010

The Indexer: The International Journal of Indexing

Subscription (with The American Society for Indexing membership) is an affordable £ 28.

Public access to Indexer back issues dating from 2006 – 1958!

Subject recognition lies at the heart of indexing. The same can be said for topic maps, with the addition of making the subjects recognized explicit for use and reuse of others. This is an opportunity to study with experts on subject recognition.

Not a dull publication! Indexes Reviewed April 2005 has “Indexes praised,” “Indexes censured” (highly amusing) and comments on the index to “My Life” by Bill Clinton.

The American Society for Indexing

Friday, May 14th, 2010

The American Society for Indexing

Topic maps arose from the ashes of an indexing project.

What better place to look for resources than an indexing society?

Journal: Key Words. The sampling of articles Key Words: Sample Articles has me reaching for my plastic to get a membership!

Special Interest Groups (Sam Hunting: Note the Culinary SIG)

Publications, resources, etc. Check it out!

(Thanks to Christopher Courington, a former student from UIUC, for reminding me of their journal and site.)