Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

April 5, 2011

The role of Natural Language Processing in Information Retrieval: Searching for Meaning in Text

Filed under: Information Retrieval,Natural Language Processing — Patrick Durusau @ 4:29 pm

The role of Natural Language Processing in Information Retrieval: Searching for Meaning in Text by Tony Russell-Rose.

Abstract:

Here are the slides from the talk I gave at City University last week, as a guest lecture to their Information Science MSc students. It’s based on the chapter of the same name which I co-authored with Mark Stevenson of Sheffield University and appears in the book called “Information Retrieval: Searching in the 21st Century”. The session was scheduled for 3 hours, and to my amazement, required all of that (thanks largely to an enthusiastic group who asked lots of questions). And no, I didn’t present 3 hours of Powerpoint – the material was punctuated with practical exercises and demos to illustrate the learning points and allow people to explore the key concepts for themselves. These exercises aren’t included in the Slideshare version, but I am happy to make them available to folks who want to enjoy the full experience.

If you don’t look at another presentation slide deck this week, do yourself a favor and look at this one. Very well done.

I’m going to write to ask for the exercises. Comments to follow.

March 29, 2011

Reverted Indexing

Filed under: Indexing,Information Retrieval,Query Expansion — Patrick Durusau @ 12:47 pm

Reverted Indexing

From the website:

Traditional interactive information retrieval systems function by creating inverted lists, or term indexes. For every term in the vocabulary, a list is created that contains the documents in which that term occurs and its frequency within each document. Retrieval algorithms then use these term frequencies alongside other collection statistics to identify matching documents for a query.

Term-based search, however, is just one example of interactive information seeking. Other examples include offering suggestions of documents similar to ones already found, or identifying effective query expansion terms that the user might wish to use. More generally, these fall into several categories: query term suggestion, relevance feedback, and pseudo-relevance feedback.

We can combine the inverted index with the notion of retrievability to create an efficient query expansion algorithm that is useful for a number of applications, such as query expansion and relevance (and pseudo-relevance) feedback. We call this kind of index a reverted index because rather than mapping terms onto documents, it maps document ids onto queries that retrieved the associated documents.

As to its performance:

…the short answer is that our query expansion technique outperforms PL2 and Bose-Einstein algorithms (as implemented in Terrier) by 15-20% on several TREC collections. This is just a first stab at implementing and evaluating this indexing, but we are quite excited by the results.

An interesting example of innovative thinking about indexing.

With a useful result.
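To make the distinction concrete, here is a minimal sketch of the two index directions in Python. It is my own illustration (single-term "basis queries", raw term frequencies as scores), not the authors' implementation.

    from collections import defaultdict

    docs = {
        1: "graph algorithms for text ranking",
        2: "ranking documents with inverted indexes",
        3: "query expansion using relevance feedback",
    }

    # Inverted index: term -> {doc_id: term frequency}
    inverted = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.split():
            inverted[term][doc_id] = inverted[term].get(doc_id, 0) + 1

    # Reverted index: doc_id -> {basis query: score}. Each single-term query
    # contributes a score for the documents it retrieves; a real system would
    # store the retrieval engine's score rather than raw term frequency.
    reverted = defaultdict(dict)
    for term, postings in inverted.items():
        for doc_id, tf in postings.items():
            reverted[doc_id][term] = tf

    # Candidate expansion terms for a document judged relevant:
    print(sorted(reverted[2], key=reverted[2].get, reverse=True))

Given one or more relevant documents, their reverted-index entries can be ranked and reused as expansion terms, which is the use case the authors evaluate.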

March 21, 2011

EuroHCIR 2011: The 1st European Workshop on Human-Computer Interaction and Information Retrieval

Filed under: Conferences,Information Retrieval,Interface Research/Design — Patrick Durusau @ 8:49 am

EuroHCIR 2011: The 1st European Workshop on Human-Computer Interaction and Information Retrieval

From the website:

HCIR, or Human-Computer Information Retrieval, was a phrase coined by Gary Marchionini in 2005 and is representative of the growing interest in uniting both those who are interested in how information systems are built (the Information Retrieval community) and those who are interested in how humans search for information (the Human-Computer Interaction and Information Seeking communities). Four increasingly popular workshops and an NSF funded event have brought focus to this multi-disciplinary issue in the USA, and the aim of EuroHCIR 2011 is to focus the European community in the same way.

Consequently, the EuroHCIR workshop has four main goals:

  • Present and discuss novel HCIR designs, systems, and findings.
  • Identify and unite European researchers and industry professionals working in this area.
  • Facilitate and encourage collaboration and joint academic and industry ventures.
  • Define and coordinate a vision for the community for future EuroHCIR events.

The topics for the workshop look quite interesting:

  • Novel interaction techniques for information retrieval.
  • Modelling and evaluation of interactive information retrieval.
  • Exploratory search and information discovery.
  • Information visualization and visual analytics.
  • Applications of HCI techniques to information retrieval needs in specific domains.
  • Ethnography and user studies relevant to information retrieval and access.
  • Scale and efficiency considerations for interactive information retrieval systems.
  • Relevance feedback and active learning approaches for information retrieval.

Important dates:

Submissions: 1st May 2011

Notifications: 20th May 2011

Camera Ready: 2nd June 2011

Workshop: 4th July 2011

March 14, 2011

Graph-based Algorithms….

Filed under: Graphs,Information Retrieval,Natural Language Processing — Patrick Durusau @ 7:50 am

Graph-based Algorithms for Information Retrieval and Natural Language Processing

Tutorial at HLT/NAACL 2006 (June 4, 2006)

Rada Mihalcea and Dragomir Radev

From the slides:

  • Motivation
    • Graph-theory is a well studied discipline
    • So are the fields of Information Retrieval (IR) and Natural Language Processing (NLP)
    • Often perceived as completely different disciplines
  • Goal of the tutorial: provide an overview of methods and applications in IR and NLP that rely on graph-based algorithms, e.g.
    • Graph-based algorithms: graph traversal, min-cut algorithms, random walks
    • Applied to: Web search, text understanding, text summarization, keyword extraction, text clustering

Nice introduction to graph-theory and why we should care. A lot.
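As a minimal illustration of the flavor of these methods (my own sketch in the spirit of TextRank, not taken from the tutorial), keywords can be scored by running PageRank over a word co-occurrence graph. This assumes the networkx library is available:

    import networkx as nx

    text = ("graph based ranking algorithms score words by their position "
            "in a word cooccurrence graph built from the text")
    words = text.split()

    # Link words that co-occur within a sliding window of three tokens.
    G = nx.Graph()
    for i, w in enumerate(words):
        for v in words[i + 1:i + 3]:
            if v != w:
                G.add_edge(w, v)

    # PageRank scores are the stationary distribution of a random walk on the graph.
    scores = nx.pagerank(G)
    for w, s in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
        print(f"{w}: {s:.3f}")

The same pattern (build a graph over textual units, then run a graph algorithm such as a random walk or min-cut) underlies the summarization and clustering applications the tutorial surveys.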

March 4, 2011

Metaoptimize Q+A

Metaoptimize Q+A is one of the Q/A sites I just stumbled across.

From the website:

A community of scientists interested in machine learning, natural language processing, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization, as well as adjacent topics.

Looks like an interesting place to hang out.

January 21, 2011

Workshop on Human-Computer Interaction and Information Retrieval

Workshop on Human-Computer Interaction and Information Retrieval

From the website:

Human-computer Information Retrieval (HCIR) combines research from the fields of human-computer interaction (HCI) and information retrieval (IR), placing an emphasis on human involvement in search activities.

The HCIR workshop has run annually since 2007. The workshop unites academic researchers and industrial practitioners working at the intersection of HCI and IR to develop more sophisticated models, tools, and evaluation metrics to support activities such as interactive information retrieval and exploratory search. It provides an opportunity for attendees to informally share ideas via posters, small group discussions and selected short talks.

Workshop participants present interfaces (including mockups, prototypes, and other early-stage designs), research results from user studies of interfaces, and system demonstrations related to the intersection of Human Computer Interaction (HCI) and Information Retrieval (IR). The intent of the workshop is not archival publication, but rather to provide a forum to build community and to stimulate discussion, new insight, and experimentation on search interface design.

Proceedings from 2007 to date are available.

I would point to the workshops separately or even some of the papers but the site helpfully returns its base URL for all resources.

Good weekend or even weekday reading!

January 20, 2011

IMMM 2011: The First International Conference on Advances in Information Mining and Management

Filed under: Conferences,Data Mining,Information Retrieval,Searching — Patrick Durusau @ 7:40 pm

IMMM 2011: The First International Conference on Advances in Information Mining and Management.

July 17-22, 2011 – Bournemouth, UK

See the Call for Papers for details but general areas include:

  • Mining mechanisms and methods
  • Mining support
  • Type of information mining
  • Pervasive information retrieval
  • Automated retrieval and mining
  • Mining features
  • Information mining and management
  • Mining from specific sources
  • Data management in special environments
  • Mining evaluation
  • Mining tools and applications

Important deadlines:
Submission (full paper) March 1, 2011
Notification April 10, 2011
Registration April 25, 2011
Camera ready April 28, 2011

January 16, 2011

Informer

Filed under: Authoring Topic Maps,Information Retrieval,Searching — Patrick Durusau @ 2:29 pm

The Informer is the newsletter of the BCS Information Retrieval Specialist Group (IRSG).

There is a single issue in 1994, although that is volume 3, which implies there were earlier issues.

A useful source of information on IR.

It would be more useful if there were an index.

Let’s turn that lack of an index into a topic map exercise:

  1. Select one issue of the Informer.
  2. Create a traditional index for that issue.
  3. Using one or more search engines, create a machine index for that issue.
  4. Create a topic map for that issue.

One purpose of the exercise is to give you a feel for the labor/benefit/delivery characteristics of each method.

The Changing Face of Search – Post

Filed under: Information Retrieval,Search Interface,Searching — Patrick Durusau @ 11:26 am

The Changing Face of Search: Tony Russell-Rose, along with Udo Kruschwitz and Andy MacFarlane, has penned a post about changes they see coming to search.

The entire article is worth your time but one part stood out for me:

… Personalisation does not mean that users will be required to explicitly declare their interests (this is exactly what most users do not want to do!); instead, the search engine tries to infer users’ interests from implicit cues, e.g. time spent viewing a document, the fact that a document has been selected in preference to another ranked higher in the results list, and so on. Personalised search results can be tailored to individual searchers and also to groups of similar users (“social networks”). (emphasis in original)

[Users don’t want] to explicitly declare their interests.

This has implications for topic map authoring.

Similar to users resisting building explicit document models and/or writing in markup. (Are you listening RDF/RDFa fans?)

Complaining that users don’t want to learn markup, explicitly declare their subjects, or use pre-written RDF vocabularies, is not a solution.

Effective topic map (or other semantic) authoring solutions are going to infer subjects and assist users in correcting their inferences.
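A minimal sketch of what "inferring interests from implicit cues" might look like. This is my own illustration; the weighting scheme and the 30-second dwell normalization are invented for the example:

    from collections import Counter

    # (document terms, seconds spent viewing, selected over a higher-ranked result?)
    interactions = [
        (["topic", "maps", "merging"], 95, True),
        (["football", "scores"],        3, False),
        (["topic", "maps", "tmql"],    60, True),
    ]

    profile = Counter()
    for terms, dwell, skipped_higher_rank in interactions:
        # Longer dwell and "skipped a higher-ranked hit" both count as implicit interest.
        weight = min(dwell / 30.0, 3.0) + (1.0 if skipped_higher_rank else 0.0)
        for t in terms:
            profile[t] += weight

    print(profile.most_common(3))  # inferred interests, strongest first

An authoring tool could surface such an inferred profile and let the user correct it, rather than asking for declarations up front.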

January 10, 2011

Efficient set intersection for inverted indexing

Filed under: Data Structures,Information Retrieval,Sets — Patrick Durusau @ 4:08 pm

Efficient set intersection for inverted indexing

Authors: J. Shane Culpepper, Alistair Moffat

Keywords: Compact data structures, information retrieval, set intersection, set representation, bitvector, byte-code

Abstract:

Conjunctive Boolean queries are a key component of modern information retrieval systems, especially when Web-scale repositories are being searched. A conjunctive query q is equivalent to a |q|-way intersection over ordered sets of integers, where each set represents the documents containing one of the terms, and each integer in each set is an ordinal document identifier. As is the case with many computing applications, there is tension between the way in which the data is represented, and the ways in which it is to be manipulated. In particular, the sets representing index data for typical document collections are highly compressible, but are processed using random access techniques, meaning that methods for carrying out set intersections must be alert to issues to do with access patterns and data representation. Our purpose in this article is to explore these trade-offs, by investigating intersection techniques that make use of both uncompressed “integer” representations, as well as compressed arrangements. We also propose a simple hybrid method that provides both compact storage, and also faster intersection computations for conjunctive querying than is possible even with uncompressed representations.

The treatment of set intersection caught my attention.

Unlike document sets, topic maps have restricted sets of properties or property values that will form the basis for set intersection (merging in topic maps lingo).

Topic maps also differ in that identity bearing properties are never ignored, whereas in searching a reverse index, terms can be included in the index that are ignored in a particular query.

What impact those characteristics will have on set intersection for topic maps remains a research question.
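For concreteness, here is a minimal sketch of the intersection step itself over uncompressed, sorted document-id lists. It is my own illustration using binary search; the paper's contribution lies in the compressed, bitvector and hybrid representations, which are not shown:

    from bisect import bisect_left

    def contains(posting, doc_id):
        """Binary search in a sorted list of document ids."""
        i = bisect_left(posting, doc_id)
        return i < len(posting) and posting[i] == doc_id

    def intersect(postings):
        """|q|-way intersection: probe the longer lists with each candidate
        from the shortest (rarest-term) list."""
        postings = sorted(postings, key=len)
        return [d for d in postings[0]
                if all(contains(p, d) for p in postings[1:])]

    print(intersect([[1, 4, 9, 30, 42], [4, 5, 9, 42, 77], [2, 4, 9, 42]]))  # [4, 9, 42]

For topic maps, the lists being intersected would hold topics sharing a property value rather than documents containing a term, but the access-pattern trade-offs the paper studies apply either way.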

December 25, 2010

The Wavelet Tutorial

Filed under: Information Retrieval,Wavelet Transforms — Patrick Durusau @ 6:46 am

The Wavelet Tutorial

As its name implies, a tutorial on wavelet transformation.

It’s not often that you see engineers implied to be non-math types but this tutorial was written from an engineering perspective and not for “math people.” (The author’s term, not mine.)

More accessible than many of the wavelet transformation explanations I have seen so I mention it here.
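For readers who want a toy example before diving in, one level of the discrete Haar transform splits a signal into coarse averages and detail coefficients. A minimal sketch, not taken from the tutorial:

    def haar_step(signal):
        """One level of the Haar wavelet transform (signal length must be even)."""
        averages = [(a + b) / 2 for a, b in zip(signal[::2], signal[1::2])]
        details  = [(a - b) / 2 for a, b in zip(signal[::2], signal[1::2])]
        return averages, details

    signal = [4, 6, 10, 12, 8, 6, 5, 5]
    avg, det = haar_step(signal)
    print(avg)  # [5.0, 11.0, 7.0, 5.0]  coarse approximation
    print(det)  # [-1.0, -1.0, 1.0, 0.0] detail (high-frequency) part

Applying the step recursively to the averages gives the multi-resolution view the tutorial builds up to.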

Questions:

  1. What improvements would you suggest for this tutorial? (1-2 pages, no citations)
  2. What examples would you add to make it more relevant to information retrieval? (1-2 pages, no citations)
  3. Other wavelet tutorials that you have found helpful? (1-2 pages, citations/links)

Spectral Based Information Retrieval

Filed under: Information Retrieval,Retrieval,TREC,Vectors,Wavelet Transforms — Patrick Durusau @ 6:10 am

Spectral Based Information Retrieval

Author: Laurence A. F. Park (2003)

Every now and again I run into a dissertation that is an interesting and useful survey of a field and an original contribution to the literature.

Not often but it does happen.

It happened in this case with Park’s dissertation.

The beginning of an interesting thread of research that treats terms in a document as a spectrum and then applies spectral transformations to the retrieval problem.

The technique has been developed and extended since the appearance of Park’s work.

Highly recommended, particularly if you are interested in tracing the development of this technique in information retrieval.

My interest is in the use of spectral representations of text in information retrieval as part of topic map authoring and its potential as a subject identity criterion.

Actually I should broaden that to include retrieval of images and other data as well.

Questions:

  1. Prepare an annotated bibliography of ten (10) recent papers using spectral analysis for information retrieval.
  2. Spectral analysis helps retrieve documents but what if you are searching for ideas? Does spectral analysis offer any help?
  3. How would you extend the current state of spectral based information retrieval? (5-10 pages, project proposal, citations)

December 20, 2010

The Sensitivity of Latent Dirichlet Allocation for Information Retrieval

The Sensitivity of Latent Dirichlet Allocation for Information Retrieval

Author: Laurence A. F. Park, The University of Melbourne (slides)

Abstract:

It has been shown that the use of topic models for Information retrieval provides an increase in precision when used in the appropriate form. Latent Dirichlet Allocation (LDA) is a generative topic model that allows us to model documents using a Dirichlet prior. Using this topic model, we are able to obtain a fitted Dirichlet parameter that provides the maximum likelihood for the document set. In this article, we examine the sensitivity of LDA with respect to the Dirichlet parameter when used for Information retrieval. We compare the topic model computation times, storage requirements and retrieval precision of fitted LDA to LDA with a uniform Dirichlet prior. The results show that there is no significant benefit of using fitted LDA over the LDA with a constant Dirichlet parameter, hence showing that LDA is insensitive with respect to the Dirichlet parameter when used for Information retrieval.

Note that topic is used in semantic analysis (of various kinds) to mean highly probable words and not in the technical sense of the TMDM or XTM.

Extraction of highly probable words from documents can be useful in the construction of topic maps for those documents.
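A minimal sketch of that extraction step, assuming scikit-learn is available. Note that doc_topic_prior here is the Dirichlet parameter the paper examines, simply held constant (uniform) rather than fitted:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "topic maps merge subjects across vocabularies",
        "information retrieval ranks documents for a query",
        "latent dirichlet allocation models documents as topic mixtures",
    ]

    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)

    # Constant (uniform) Dirichlet prior rather than a fitted one.
    lda = LatentDirichletAllocation(n_components=2, doc_topic_prior=0.5,
                                    random_state=0).fit(X)

    terms = vectorizer.get_feature_names_out()
    for k, weights in enumerate(lda.components_):
        top = [terms[i] for i in weights.argsort()[::-1][:5]]
        print(f"topic {k}: {top}")

The top words per LDA topic are candidates for topics (in the TMDM sense) or occurrences when drafting a topic map over the collection.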

December 7, 2010

Bobo: Fast Faceted Search With Lucene

Filed under: Facets,Information Retrieval,Lucene,Navigation,Subject Identity — Patrick Durusau @ 8:52 pm

Bobo: Fast Faceted Search With Lucene

From the website:

Bobo is a Faceted Search implementation written purely in Java, an extension of Apache Lucene.

While Lucene is good with unstructured data, Bobo fills in the missing piece to handle semi-structured and structured data.

Bobo Browse is an information retrieval technology that provides navigational browsing into a semi-structured dataset. Beyond the result set from queries and selections, Bobo Browse also provides the facets from this point of browsing.

Features:

  • No need for cache warm-up for the system to perform
  • multi value sort – sort documents on fields that have multiple values per doc, e.g. tokenized fields
  • fast field value retrieval – over 30x faster than IndexReader.document(int docid)
  • facet count distribution analysis
  • stable and small memory footprint
  • support for runtime faceting
  • result merge library for distributed facet search

I had to go look up the definition of facet. Merriam-Webster (I remember when it was just Webster) says:

any of the definable aspects that make up a subject (as of contemplation) or an object (as of consideration)

So a faceted search could search/browse, in theory at any rate, based on any property of a subject, even those I don’t recognize.

Different languages being the easiest example.

I could have aspects of a hotel room described in both German and Korean, both describing the same facets of the room.
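A minimal sketch of the facet-counting idea, in Python as my own illustration (Bobo itself is a Java library with its own API): each hit carries key-value facets, and the browser reports the value distribution for every facet so the user can drill down.

    from collections import Counter, defaultdict

    hits = [
        {"city": "Berlin", "stars": "4", "language": "de"},
        {"city": "Seoul",  "stars": "4", "language": "ko"},
        {"city": "Berlin", "stars": "3", "language": "de"},
    ]

    facet_counts = defaultdict(Counter)
    for hit in hits:
        for facet, value in hit.items():
            facet_counts[facet][value] += 1

    for facet, counts in facet_counts.items():
        print(facet, dict(counts))
    # city {'Berlin': 2, 'Seoul': 1}  stars {'4': 2, '3': 1}  language {'de': 2, 'ko': 1}

Nothing in the counting step cares which language the facet values are written in, which is the point of the hotel-room example above.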

Questions:

  1. How would you choose the facets for a subject to be included in faceted browsing? (3-5 pages, no citations)
  2. How would you design and test the presentation of facets to users? (3-5 pages, no citations)
  3. Compare the current TMQL proposal (post-Barta) with the query language for facet searching. If a topic map were treated (post-merging) as faceted subjects, which one would you prefer and why? (3-5 pages, no citations)

December 3, 2010

Dynamic Indexes?

I was writing the post about the New York Times graphics presentation when it occurred to me how close we are to dynamic indexes.

After all, gaming consoles are export restricted.

What we now consider to be “runs,” static indexes and the like are computational artifacts.

They follow how we created indexes when they were done by hand.

What happens when the properties of what is being indexed, its identifications and merging rules, can change on the fly, and the index can re-present itself to the user for further manipulation?

I don’t think the fundamental issues of index construction get any easier with dynamic indexes but how we answer them will determine how quickly we can make effective use of such indexes.

Whether crossing the line first to dynamic indexes will be a competitive advantage, only time will tell.

I would like for some VC to be interested in finding out.

Caveat to VCs. If someone pitches this as a way to build indexes more quickly, that isn’t the point. “Quick” and “dynamic” aren’t the same thing. Related but different. Keep both hands on your wallet.

S4

S4

From the website:

S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.

Just in case you were wondering if topic maps are limited to being bounded objects composed of syntax. No.
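A minimal sketch of the style of processing involved: consuming an unbounded stream and maintaining state incrementally, rather than indexing a fixed corpus. This is my own illustration; S4 itself distributes such processing elements across a cluster.

    import itertools
    from collections import Counter

    def event_stream():
        """Stand-in for an unbounded source (status updates, sensor feed, ...)."""
        for i in itertools.count():
            yield {"user": f"u{i % 3}", "term": "topicmaps" if i % 2 else "search"}

    counts = Counter()
    # islice only so the example terminates; a real stream never does.
    for event in itertools.islice(event_stream(), 1000):
        counts[(event["user"], event["term"])] += 1  # incremental, per-event update

    print(counts.most_common(3))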

Questions:

  1. Specify three sources of unbounded streams of data. (3 pages, citations)
  2. What subjects would you want to identify and on what basis in any one of them? (3-5 pages, citations)
  3. What other information about those subjects would you want to bind to the information in #2? What subject identity tests are used for those subjects in other sources? (5-10 pages, citations)

November 25, 2010

Sig.ma – Live views on the Web of Data

Filed under: Indexing,Information Retrieval,Lucene,Mapping,RDF,Search Engines,Semantic Web — Patrick Durusau @ 10:27 am

Sig.ma – Live views on the Web of Data

From the website:

In Sig.ma, elements such as large scale semantic web indexing, logic reasoning, data aggregation heuristics, pragmatic ontology alignments and, last but not least, user interaction and refinement, all play together to provide entity descriptions which become live, embeddable data mash ups.

Read one of various versions of an article on Sig.ma for the technical details.

From the Web Technologies article cited on the homepage:

Sig.ma revolves around the creation of Entity Profiles. An entity profile – which in the Sig.ma dataflow is represented by the “data cache” storage (Fig. 3) – is a summary of an entity that is presented to the user in a visual interface, or which can be returned by the API as a rich JSON object or a RDF document. Entity profiles usually include information that is aggregated from more than one source. The basic structure of an entity profile is a set of key-value pairs that describe the entity. Entity profiles often refer to other entities, for example the profile of a person might refer to their publications.

No, this isn’t an implementation of the TMRM.

This is an implementation of one way to view entities for a particular type of data. A very exciting one but still limited to a particular data set.

This is a big step forward.

For example, it isn’t hard to imagine entity profiles against particular websites or data sets. Entity profiles that are maintained and leased for use with search engines like Sig.ma.

Or going a bit further and declaring a basis for identification of subjects, such as the existence of properties a…n in an RDF graph.
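A minimal sketch of an entity profile in the article's sense: key-value pairs aggregated from several sources, with provenance kept per value. My own illustration; the source names are invented:

    from collections import defaultdict

    sources = {
        "dblp.example.org":     {"name": "L. A. F. Park", "publication": "Spectral Based IR"},
        "homepage.example.org": {"name": "Laurence Park", "affiliation": "U. of Melbourne"},
    }

    profile = defaultdict(list)
    for source, pairs in sources.items():
        for key, value in pairs.items():
            profile[key].append((value, source))  # keep provenance with each value

    for key, values in profile.items():
        print(key, values)

Keeping the source alongside each value is what lets the user interaction and refinement step remove a bad source and have the profile update live.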

Questions:

  1. Spend a couple of hours with Sig.ma researching library related questions. (Discussion)
  2. What did you like, dislike or find surprising about Sig.ma? (3-5 pages, no citations)
  3. Entity profiles for library science (Class project)

Sig.ma: Live Views on the web of data – bibliography issues

I normally start with a DOI here so you can see the article in question.

Not here.

Here’s why:

Sig.ma: Live views on the Web of Data Journal of Web Semantics. (10 pages)

Sig.ma: Live Views on the Web of Data WWW ’10 Proceedings(demo, 4 pages)

Sig.ma: Live Views on the Web of Data (8 pages) http://richard.cyganiak.de/2008/papers/sigma-semwebchallenge2009.pdf

Sig.ma: Live Views on the Web of Data (4 pages) http://richard.cyganiak.de/2008/papers/sigma-demo-www2010.pdf

Sig.ma: Live Views on the Web of Data (25 pages) http://fooshed.net/paper/JWS2010.pdf

Before saying anything ugly, ;-), this is some of the most exciting research I have seen in a long time. I will cover that part of it in a following post. But, to the matter at hand, bibliographic control.

Five (5) different articles, two published in recognized journals, all with the same name? (The demo articles are the same but have different headers/footers and page numbers, and so would likely be indexed as different articles.)

I will be able to resolve any confusion by obtaining the article in question.

But that isn’t an excuse.

I, along with everyone else interested in this research, will waste a small part of our time resolving the confusion. Confusion that could have been avoided for everyone.

Not unlike everyone who does the same search having to tread the same google glut.

With no way to pass on what we have resolved, for the benefit of others.
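A topic-map-flavoured way to pass the resolution on, once made, is to treat the paper as a single subject with each published version as an occurrence. A minimal sketch; the URLs are the ones listed above, the structure is my own illustration:

    # One subject (the Sig.ma work), many occurrences (the published versions).
    paper = {
        "subject": "Sig.ma: Live Views on the Web of Data",
        "occurrences": [
            {"venue": "Journal of Web Semantics", "pages": 10},
            {"venue": "WWW '10 Proceedings (demo)", "pages": 4},
            {"pages": 8,  "url": "http://richard.cyganiak.de/2008/papers/sigma-semwebchallenge2009.pdf"},
            {"pages": 4,  "url": "http://richard.cyganiak.de/2008/papers/sigma-demo-www2010.pdf"},
            {"pages": 25, "url": "http://fooshed.net/paper/JWS2010.pdf"},
        ],
    }

    # Anyone who finds one version can be pointed at all the others.
    for occ in paper["occurrences"]:
        print(occ.get("venue", occ.get("url")), "-", occ["pages"], "pages")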

Questions:

  1. Help these authors out. How would you suggest they avoid this in the future? Use of the name is important. (3-5 pages, no citations)
  2. Help the library out. How will you deal with multiple papers with the same title, authors, pub year? (this isn’t uncommon) (3-5 pages, citations optional)
  3. How would you use topic maps to resolve this issue? (3-5 pages, no citations)


November 22, 2010

TxtAtlas

Filed under: Information Retrieval,Interface Research/Design,Mapping — Patrick Durusau @ 7:08 am

TxtAtlas

First noticed on Alex Popescu’s blog.

Text a phone number and the message appears as an entry on a map.

I have an uneasy feeling this may be important.

Not this particular application but the ease of putting content from dispersed correspondents together into a single map.

I wonder if instead of distance the correspondents could be dispersed over time? Say as users of a document archive?*

Questions:

  1. How would you apply these techniques to a document archive? (3-5 pages, no citations)
  2. How would you adapt the mapping of a document archive based on user response? (3-5 pages, no citations)
  3. Design an application of this technique for a document archive. (Project)

*Or for those seeking more real-time applications, imagine GPS coordinates + status updates from cellphones on a more detailed map. Useful for any number of purposes.

A Term Association Inference Model for Single Documents:….

Filed under: Data Mining,Document Classification,Information Retrieval,Summarization — Patrick Durusau @ 6:36 am

A Term Association Inference Model for Single Documents: A Stepping Stone for Investigation through Information Extraction

Author(s): Sukanya Manna and Tom Gedeon

Keywords: Information retrieval, investigation, Gain of Words, Gain of Sentences, term significance, summarization

Abstract:

In this paper, we propose a term association model which extracts significant terms as well as the important regions from a single document. This model is a basis for a systematic form of subjective data analysis which captures the notion of relatedness of different discourse structures considered in the document, without having a predefined knowledge-base. This is a paving stone for investigation or security purposes, where possible patterns need to be figured out from a witness statement or a few witness statements. This is unlikely to be possible in predictive data mining where the system can not work efficiently in the absence of existing patterns or large amount of data. This model overcomes the basic drawback of existing language models for choosing significant terms in single documents. We used a text summarization method to validate a part of this work and compare our term significance with a modified version of Salton’s [1].

Excellent work that illustrates how re-thinking of fundamental assumptions of data mining can lead to useful results.
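A minimal illustration of the general idea of scoring terms by how widely they co-occur across the sentences of a single document, with no external knowledge base. This is my own sketch, not the paper's Gain of Words / Gain of Sentences measures:

    from collections import defaultdict

    sentences = [
        "the witness saw the car near the bank",
        "the car left the bank at speed",
        "a second witness described the car",
    ]

    cooc = defaultdict(set)
    for sent in sentences:
        terms = set(sent.split()) - {"the", "a", "at", "of"}
        for t in terms:
            cooc[t] |= terms - {t}  # record every term this one co-occurs with

    # Significance here is simply the number of distinct co-occurring partners.
    significance = {t: len(partners) for t, partners in cooc.items()}
    print(sorted(significance.items(), key=lambda kv: -kv[1])[:3])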

Questions:

  1. Create an annotated bibliography of citations to this article.
  2. Citations of items in the bibliography since this paper (2008)? List and annotate.
  3. How would you use this approach with a document archive project? (3-5 pages, no citations)

November 20, 2010

Subjective Logic = Effective Logic?

Capture of Evidence for Summarization: An Application of Enhanced Subjective Logic

Author(s): Sukanya Manna, B. Sumudu U. Mendis, Tom Gedeon

Keywords: subjective logic, opinions, evidence, events, summarization, information extraction

Abstract:

In this paper, we present a method to generate an extractive summary from a single document using subjective logic. The idea behind our approach is to consider words and their co-occurrences between sentences in a document as evidence of their relatedness to the contextual meaning of the document. Our aim is to formulate a measure to find out ‘opinion’ about a proposition (which is a sentence in this case) using subjective logic in a closed environment (as in a document). Stronger opinion about a sentence represents its importance and are hence considered to summarize a document. Summaries generated by our method when evaluated with human generated summaries, show that they are more similar than baseline summaries.

The authors justify their use of “subjective” logic by saying:

…pointed out that a given piece of text is interpreted by different person in a different fashion especially in the way how they understand and interpret the context. Thus we see that human understanding and reasoning is subjective in nature unlike propositional logic which deals with either truth or falsity of a statement. So, to deal with this kind of situation we used subjective logic to find out sentences which are significant in the context and can be used to summarize a document.

“Subjective” logic means we are more likely to reach the same result as a person reading the text.

Search results as used and evaluated by people.

That sounds like effective logic to me.
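For readers new to subjective logic, the standard mapping from positive and negative evidence counts to an opinion (belief, disbelief, uncertainty), following Jøsang, is a one-liner. A minimal sketch:

    def opinion(r, s, W=2.0):
        """Map r positive and s negative observations to a subjective-logic opinion.
        W is the non-informative prior weight (conventionally 2)."""
        belief = r / (r + s + W)
        disbelief = s / (r + s + W)
        uncertainty = W / (r + s + W)
        return belief, disbelief, uncertainty

    # e.g. a sentence whose terms co-occur often with the rest of the document
    print(opinion(r=8, s=1))   # high belief, low uncertainty
    print(opinion(r=0, s=0))   # no evidence: uncertainty = 1.0

In the paper, word and co-occurrence evidence plays the role of r and s, and sentences with the strongest resulting opinions are selected for the summary.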

Questions:

  1. Read Audun Jøsang’s article Artificial Reasoning with Subjective Logic.
  2. Summarize three (3) applications (besides the article above) of “subjective” logic. (3-5 pages, citations)
  3. How do you think “subjective” logic should be modeled in topic maps? (3-5 pages, citations optional)

November 4, 2010

Indiana University – Bioinformatics

Filed under: Bioinformatics,Biomedical,Information Retrieval — Patrick Durusau @ 10:37 am

Indiana University – Bioinformatics

The Research & Projects Page offers a sampling of the work underway.

October 31, 2010

OpenII

Filed under: Data Structures,Heterogeneous Data,Information Retrieval,Software — Patrick Durusau @ 7:20 pm

OpenII

From the website:

OpenII is a collaborative effort spearheaded by The MITRE Corporation and Google to create a suite of open-source tools for information integration. The project is leveraging the latest developments in research on information integration to create a platform on which integration applications can be built and further research can be conducted.

The motivation for OpenII is that although a significant amount of research has been conducted on information integration, and several commercial systems have been deployed, many information integration applications are still hard to build. In research, we often innovate on a specific aspect of information integration, but then spend much of our time building (and rebuilding) other components that we need in order to validate our contributions. As a result, the research prototypes that have been built are generally not reusable and do not inter-operate with each other. On the applications side, information integration comes in many flavors, and therefore it is hard for commercial products to serve all the needs. Our goal is to create tools that can be applied in a variety of architectural contexts and can easily be tailored to the needs of particular domains.

OpenII tools include, among others, wrappers for common data sources, tools for creating matches and mappings between disparate schemas, a tool for searching collections of schemas and extending schemas, and run-time tools for processing queries over heterogeneous data sources.

The M3 metamodel:

The fundamental building block in M3 is the entity. An entity represents information about a set of related real-world objects. Associated with each entity is a set of attributes that indicate what information is captured about each entity. For simplicity, we assume that at most one value can be associated with each attribute of an entity.

The project could benefit from a strong injection of subject identity based thinking and design.

October 30, 2010

8 Keys to Findability

8 Keys to Findability mentions in closing:

The average number of search terms is about 1.7 words, which is not a lot when searching across millions of documents. Therefore, a conversation type of experience where users can get feedback from the results and refine their search makes for the most effective search results.

I have a different take on that factoid.

The average user needs only 1.7 words to identify a subject of interest to them.

Why the gap between 1.7 words and the number of words required for “effective search results?”

Why ask?

Returning millions of “hits” on 1.7 words is meaningless.

Returning the ten most relevant “hits” on 1.7 words is a G***** killer.

October 29, 2010

VoxPopuLII – Blog

Filed under: Cataloging,Classification,FRBR,Information Retrieval,Legal Informatics — Patrick Durusau @ 5:46 am

VoxPopuLII.

From the blog:

VoxPopuLII is a guest-blogging project sponsored by the Legal Information Institute at the Cornell Law School. It presents the insights of a very diverse group of people working on legal informatics issues and government information, all around the world. It emphasizes new voices and big ideas.

Not your average blog.

I first encountered: LexML Brazil Project

Questions (about LexML):

  1. What do you think about the strategy to deal with semantic diversity? Pluses? Minuses?
  2. The project says they are following: “Ranganathan’s ‘stratification planes’ classification system…” Your evaluation?
  3. Identify 3 instances of equivalents to the “stratification planes” classification system.
  4. How would you map those 3 instances to Ranganathan’s “stratification planes?”

TMQL Notes from Leipzig

Filed under: Information Retrieval,TMQL,Topic Maps — Patrick Durusau @ 4:48 am

The TMQL language proposal – apart from the Path Language – has been posted to the SC 34 document repository for your review and comments!

Deeply appreciate Lars Marius Garshol leading the discussion.

Now is the time for your comments and suggestions.

Even better, trial implementations of present and requested features.

One of the best ways to argue for a feature is to show it in working code.

Or even better, when applied to show results not otherwise available.

October 28, 2010

19th ACM International Conference on Information and Knowledge Management

Filed under: Conferences,Information Retrieval,Knowledge Management,Software — Patrick Durusau @ 5:50 am

The front matter for the 19th ACM International Conference on Information and Knowledge Management is a great argument for ACM membership + Digital Library.

There are 126 papers, any one of which would make for a pleasant afternoon.

I will be mining these for those particularly relevant to topic maps but your suggestions would be appreciated.

  1. What conferences do you follow?
  2. What journals do you follow?
  3. What blogs/websites do you follow?

*****
Visit the ACM main site or its membership page ACM Membership

October 25, 2010

An Evaluation of TS13298 in the Scope of MoReq2

An Evaluation of TS13298 in the Scope of MoReq2

Authors: Gülten Alır, Thomas Sødring and İrem Soydal

Keywords: TS13298, MoReq2, electronic records management standards.

Abstract:

TS13298 is the first Turkish standard developed for electronic records management. It was published in 2007 and is particularly important when developing e-government services. MoReq2, which was published in 2008 as an initiative of the European Union countries, is an international “de facto” standard within the field of electronic records management. This paper compares and evaluates the content and presentation of the TS13298 and MoReq2 standards, and similarities and differences between the two standards are described. Moreover, the question of how MoReq2 can be used as a reference when updating TS13298 is also dealt with. The method of hermeneutics is used for the evaluation, and the texts of TS13298 and MoReq2 were compared and reviewed. These texts were evaluated in terms of terminology, access control and security, retention and disposition, capture and declaring, search, retrieval, presentation and metadata scheme. We discovered that TS13298 and MoReq2 have some “requirements” in common. However, the MoReq2 requirements, particularly in terms of control and security, retention and disposition, capture and declaration, search and presentation, are both vaster and more detailed than those of TS13298. As a conclusion it is emphasized that it would be convenient to update TS13298 by considering these requirements. Moreover, it would be useful to update and improve TS13298 by evaluating MoReq2 in terms of terminology and metadata scheme.

This article could form the basis for a topic map of these standards, to facilitate their convergence.

It also illustrates how a title search on “electronic records” would miss an article of interest.

The Short Comings of Full-Text Searching

The Short Comings of Full-Text Searching by Jeffrey Beall from the University of Colorado Denver.

  1. The synonym problem.
  2. Obsolete terms.
  3. The homonym problem.
  4. Spamming.
  5. Inability to narrow searches by facets.
  6. Inability to sort search results.
  7. The aboutness problem.
  8. Figurative language.
  9. Search words not in web page.
  10. Abstract topics.
  11. Paired topics.
  12. Word lists.
  13. The Dark Web.
  14. Non-textual things.
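As a small illustration of the first item, even a crude synonym expansion at query time changes what full-text search can find. A minimal sketch; the synonym table is invented for the example:

    synonyms = {"car": {"automobile", "auto"}, "film": {"movie", "motion picture"}}

    def expand(query_terms):
        """Add controlled synonyms so documents using a different surface form still match."""
        expanded = set(query_terms)
        for t in query_terms:
            expanded |= synonyms.get(t, set())
        return expanded

    print(expand(["car", "chase", "film"]))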

Questions:

  1. Watch the slide presentation.
  2. Can you give three examples of each short coming? (excluding #5 and #6, which strike me as interface issues, not searching issues)
  3. How would you “solve” the word list issue? (Don’t assume quantum computing, etc. There are simpler answers.)
  4. Is metadata the only approach for “non-textual things?” Can you cite 3 papers offering other approaches?
