Archive for the ‘Information Retrieval’ Category

HCIR [Human-Computer Information Retrieval] site gets publication page

Saturday, March 30th, 2013

HCIR site gets publication page by Gene Golovchinsky.

From the post:

Over the past six years of the HCIR series of meetings, we’ve accumulated a number of publications. We’ve had a series of reports about the meetings, papers published in the ACM Digital Library, and an up-coming Special Issue of IP&M. In the run-up to this year’s event (stay tuned!), I decided it might be useful to consolidate these publications in one place. Hence, we now have the HCIR Publications page.

Human-Computer Information Retrieval (HCIR) if the lingo is unfamiliar.

Will ease access to a great set of papers, at least in one respect.

One small improvement:

Do not rely upon the ACM Digital Library as the sole repository for these papers.

Access isn’t an issue for me but I suspect it may be for a number of others.

Hiding information behind a paywall diminishes its impact.

Informer

Wednesday, February 6th, 2013

Informer Newsletter of the BCS Information Retrieval Specialist Group.

The Winter 2013 issue of the Informer has been published!

You will find:

Prior issues are also available.

Sky Survey Data Lacks Standardization [Heterogeneous Big Data]

Tuesday, November 27th, 2012

Sky Survey Data Lacks Standardization by Ian Armas Foster.

From the post:

The Sloan Digital Sky Survey is at the forefront of astronomical research, compiling data from observatories around the world in an effort to truly pinpoint where we lie on the universal map. In order to do that, they must aggregate data from several observatories across the world, an intensive data operation.

According to a report written by researchers at UCLA, even though the SDSS is a data intensive astronomical mapping survey, it has yet to lay down a standardized foundation for retrieving and storing scientific data.

Per sdss.org, the first two projects were responsible for observing “a quarter of the sky” and picking out nearly a million galaxies and over 100,000 quasars. The project started at the Apache Point observatory in New Mexico and has since grown to include 25 observatories across the globe. The SDSS gained recognition in 2009 with the Nobel Prize in physics awarded to the advancement of optical fibers and digital imaging detectors (or CCDs) that allowed the project to grow in scale.

The point is that the datasets that the scientists used seemed to be scattered. Some would come about through informal social contacts such as email while others would simply search for necessary datasets on Google. Further, once these datasets were found, there was even an inconsistency in how they were stored before they could be used. However, this may have had to do with the varying sizes of the sets and how quickly the researchers wished to use the data. The entire SDSS dataset consists of over 130 TB, according to the report, and that volume can be slightly unwieldy.

“Large sky surveys, including the SDSS, have significantly shaped research practices in the field of astronomy,” the report concluded. “However, these large data sources have not served to homogenize information retrieval in the field. There is no single, standardized method for discovering, locating, retrieving, and storing astronomy data.”

So, big data isn’t going to be homogeneous big data but heterogeneous big data.

That sounds like an opportunity for topic maps to me.

You?

Will Data Storage Make Us Dumber?

Wednesday, October 10th, 2012

Coming to a data center and then a desktop near you:

Case Western Reserve University researchers have developed technology aimed at making an optical disc that holds 1 to 2 terabytes of data – the equivalent of 1,000 to 2,000 copies of Encyclopedia Britannica. The entire print collection of the Library of Congress could fit on five to 10 discs.

Only a matter of time before you have the Library of Congress on a single disk on your local computer. All of it.

Questions:

  • Can you find useful information about a subject?
  • If you find it once, can you find it again?
  • If you can find it again, how much work does it take?
  • Can you share your trail of discovery or “bread crumbs” with others?

If TB data storage means you can’t find information, doesn’t that mean you are getting dumber, one TB at a time?

Storage density isn’t going to slow down so we had better start working on search/IR.

See: Making computer data storage cheaper and easier

Information Retrieval and Search Engines [Committers Needed!]

Wednesday, October 10th, 2012

Information Retrieval and Search Engines

A proposal is pending to create a Q&A site for people interested in information retrieval and search engines.

But it needs people to commit to using it and answering questions!

That could be you!

There’s a lot of action left in information retrieval and search engines.

Don’t have to believe me. Have you tried one lately? ;-)

Using information retrieval technology for a corpus analysis platform

Wednesday, September 26th, 2012

Using information retrieval technology for a corpus analysis platform by Carsten Schnober.

Abstract:

This paper describes a practical approach to use the information retrieval engine Lucene for the corpus analysis platform KorAP, currently being developed at the Institut für Deutsche Sprache (IDS Mannheim). It presents a method to use Lucene’s indexing technique and to exploit it for linguistically annotated data, allowing full flexibility to handle multiple annotation layers. It uses multiple indexes and MapReduce techniques in order to keep KorAP scalable.

The support for multiple annotation layers is of particular interest to me because the “subjects” of interest in a text may vary from one reader to another.

Being mindful that for topic maps, the annotation layers and annotations themselves may be subjects for some purposes.
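To make "multiple annotation layers" a little more concrete, here is a minimal sketch of an index keyed by (layer, value) pairs, so that a query can combine constraints from different layers at the same token position. This is not KorAP's or Lucene's actual design; the token data, `build_index`, and `match` are all hypothetical names invented for illustration.

```python
from collections import defaultdict

# Toy sentence: each token position carries one value per annotation layer.
# The layers and values are made up for illustration.
tokens = [
    {"word": "She", "lemma": "she", "pos": "PRON"},
    {"word": "runs", "lemma": "run", "pos": "VERB"},
    {"word": "long", "lemma": "long", "pos": "ADJ"},
    {"word": "runs", "lemma": "run", "pos": "NOUN"},
]

def build_index(tokens):
    """Map each (layer, value) pair to the set of token positions where it occurs."""
    index = defaultdict(set)
    for position, layers in enumerate(tokens):
        for layer, value in layers.items():
            index[(layer, value)].add(position)
    return index

def match(index, **constraints):
    """Return the positions where every layer=value constraint holds."""
    postings = [index.get(item, set()) for item in constraints.items()]
    return sorted(set.intersection(*postings)) if postings else []

index = build_index(tokens)
verbs_run = match(index, lemma="run", pos="VERB")  # "run" used as a verb
any_run = match(index, lemma="run")                # "run" in any part of speech
```

Because every layer lives in the same index, a reader interested in lemmas and a reader interested in parts of speech can query the same text without either layer being privileged.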

Center for Intelligent Information Retrieval (CIIR) [University of Massachusetts Amherst]

Tuesday, August 28th, 2012

Center for Intelligent Information Retrieval (CIIR)

From the webpage:

The Center for Intelligent Information Retrieval (CIIR) is one of the leading research groups working in the areas of information retrieval and information extraction. The CIIR studies and develops tools that provide effective and efficient access to large networks of heterogeneous, multimedia information.

CIIR accomplishments include significant research advances in the areas of retrieval models, distributed information retrieval, information filtering, information extraction, topic models, social network analysis, multimedia indexing and retrieval, document image processing, search engine architecture, text mining, structured data retrieval, summarization, evaluation, novelty detection, resource discovery, interfaces and visualization, digital libraries, computational social science, and cross-lingual information retrieval.

The CIIR has published more than 900 papers on these areas, and has worked with over 90 government and industry partners on research and technology transfer. Open source software supported by the Center is being used worldwide.

Please contact us to talk about potential new projects, collaborations, membership, or joining us as a graduate student or visiting researcher.

To get an idea of the range of their activities, visit the publications page and just browse.

SIGIR 2013 : ACM International Conference on Information Retrieval

Monday, August 27th, 2012

SIGIR 2013 : ACM International Conference on Information Retrieval

21 January 2013: Abstracts for full research papers due
28 January 2013: Full research paper due
4 February 2013: Workshop proposals due
18 February 2013: Posters, demonstration, and tutorial proposals due
11 March 2013: Notification of workshop acceptances
11 March 2013: Doctoral consortium proposals due
15 April 2013: All other acceptance notifications
28 July 2013: Conference Begins

From the webpage:

We are delighted to welcome SIGIR 2013 to Dublin, Ireland. SIGIR was last held in Dublin almost 20 years ago in 1994. The intervening years have seen huge growth in the field of information retrieval and we look forward to receiving submissions to help us build an exciting programme reporting latest developments in information retrieval.

Updates to follow but thought you might want extra time to plan for Dublin.

OAIR 2013 : Open Research Areas in Information Retrieval

Monday, August 27th, 2012

OAIR 2013 : Open Research Areas in Information Retrieval

When: May 22, 2013 – May 24, 2013
Where: Lisbon, Portugal
Submission Deadline: Dec 10, 2012
Notification Due: Feb 4, 2013

From the homepage:

Welcome to OAIR 2013 (the 10th International Conference in the RIAO series), taking place in Lisbon, Portugal from May 22 to 24, 2013.

The World Wide Web is the largest source of openly accessible data, and the most common means to connect people and share resources.

However, exploiting these interconnected Webs to obtain information is still an unsolved problem. This conference calls for papers describing recent research in Information Retrieval concerning the integration between a Web of Data and a Web of People, to transform pure data into information, and information into usable knowledge.

The Open research Areas in Information Retrieval (OAIR) conference is a triennial conference, addressing research topics related to the design of robust and large-scale scientific and industrial solutions to information processing.

OAIR 2013 conference is an opportunity to show main research activities, to share knowledge among IR scientific community and to get updates on new scientific work developed by IR community.

This conference is connected to the main IR personalities (see Steering Committee list) and a considerable number of attendances are expected.

We look forward to seeing you in Europe’s westernmost and sunniest capital, LISBON!

Topics of interest include:

  • Adapting search to Users
  • Advertising and ad targeting
  • Aggregation of Results
  • Community and Context Aware Search
  • Community-based Filtering and Recommender Systems
  • Community-based IR Theory
  • Community-oriented Content Representation
  • Evaluation of Social IR
  • Improving Web via Social Media
  • Including Crowdsourcing in Search
  • Merging Heterogeneous Web Data
  • Modeling the web of people
  • Personal semantics search
  • Query log analysis
  • Search over Social Networks
  • Sentiment analysis
  • Social Multimedia and Multimodal IR
  • Social Topic detection
  • Structuring Unstructured Data
  • System Architectures for Social IR
  • User Interfaces and Interactive IR

Having connections to data, assuming anyone knows its whereabouts, isn’t quite the same as making use of it.

Natural Language Processing | Hub

Saturday, July 7th, 2012

Natural Language Processing | Hub

From the “about” page:

NLP|Hub is an aggregator of news about Natural Language Processing and other related topics, such as Text Mining, Information Retrieval, Linguistics or Machine Learning.

NLP|Hub finds, collects and arranges related news from different sites, from academic webs to company blogs.

NLP|Hub is a product of Cilenis, a company specialized in Natural Language Processing.

If you have interesting posts for NLP|Hub, or if you do not want NLP|Hub indexing your text, please contact us at info@cilenis.com

Definitely going on my short list of sites to check!

Negation for Document Re-ranking in Ad-hoc Retrieval

Tuesday, June 5th, 2012

Negation for Document Re-ranking in Ad-hoc Retrieval by Pierpaolo Basile, Annalina Caputo and Giovanni Semeraro.

Interesting slide deck that was pointed out to me by Jack Park.

On the “negation” aspects, I found it helpful to review Word Vectors and Quantum Logic Experiments with negation and disjunction by Dominic Widdows and Stanley Peters (cited as an inspiration by the slide authors).

Depending upon your definition of subject identity and subject sameness, you may find negation/disjunction useful for topic map processing.
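For readers who have not met vector negation before, the core idea in the Widdows and Peters paper is that "a NOT b" can be modeled as the part of vector a that is orthogonal to vector b. A minimal sketch, assuming toy three-dimensional vectors with made-up values (the words and numbers are hypothetical, not taken from the paper or the slides):

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def negate(a, b):
    """Return a NOT b: a minus its projection onto b, so the result is orthogonal to b."""
    scale = dot(a, b) / dot(b, b)
    return [x - scale * y for x, y in zip(a, b)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

# Hypothetical word vectors, for illustration only.
suit = [0.8, 0.6, 0.1]
lawsuit = [0.0, 1.0, 0.0]
clothing = [1.0, 0.0, 0.2]

suit_not_lawsuit = negate(suit, lawsuit)
before = cosine(suit, clothing)
after = cosine(suit_not_lawsuit, clothing)
```

After negation the query vector has no component along "lawsuit", and in this toy setup it sits closer to "clothing" than the original did.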

Geometric and Quantum Methods for Information Retrieval

Tuesday, June 5th, 2012

Geometric and Quantum Methods for Information Retrieval by Yaoyong Li and Hamish Cunningham.

Abstract:

This paper reviews the recent developments in applying geometric and quantum mechanics methods for information retrieval and natural language processing. It discusses the interesting analogies between components of information retrieval and quantum mechanics. It then describes some quantum mechanics phenomena found in the conventional data analysis and in the psychological experiments for word association. It also presents the applications of the concepts and methods in quantum mechanics such as quantum logic and tensor product to document retrieval and meaning of composite words, respectively. The purpose of the paper is to give the state of the art on and to draw attention of the IR community to the geometric and quantum methods and their potential applications in IR and NLP.

More complex models can (may?) lead to better IR methods, but:

Moreover, as Hilbert space is the mathematical foundation for quantum mechanics (QM), basing IR on Hilbert space creates an analogy between IR and QM and may usefully bring some concepts and methods from QM into IR. (p.24)

is a dubious claim at best.

The “analogy” between QM and IR makes the point:

QM | IR
a quantum system | a collection of object for retrieval
complex Hilbert space | information space
state vector | objects in collection
observable | query
measurement | search
eigenvalues | relevant or not for one object
probability of getting one eigenvalue | relevance degree of object to query

The authors are comparing apples and oranges. For example, “complex Hilbert space” and “information space.”

A “complex Hilbert space” is a model that has been found useful with another model, one called quantum mechanics.

An “information space,” on the other hand, encompasses models known to use “complex Hilbert spaces” and more. Depends on the information space of interest.

Or the notion of “observable” being paired with “query.”

Complex Hilbert spaces may be quite useful for IR, but tying IR to quantum mechanics isn’t required to make use of it.
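As a rough illustration of that last point, a relevance score between 0 and 1 can be computed from an inner product alone, with no appeal to physics. The sketch below uses real-valued vectors for simplicity; the documents, the query, and the `relevance` function are hypothetical and are not taken from the paper.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def relevance(doc, query):
    """Squared inner product of the unit vectors: a score between 0 and 1."""
    d, q = normalize(doc), normalize(query)
    return sum(x * y for x, y in zip(d, q)) ** 2

# Hypothetical term-weight vectors over a three-term vocabulary.
query = [1.0, 1.0, 0.0]
docs = {
    "doc_a": [2.0, 2.0, 0.0],  # same direction as the query
    "doc_b": [0.0, 0.0, 3.0],  # orthogonal to the query
    "doc_c": [1.0, 0.0, 1.0],  # partial overlap
}

ranked = sorted(docs, key=lambda name: relevance(docs[name], query), reverse=True)
```

The arithmetic is the same whether you call the score a "probability" or a "relevance degree"; the vector space does the work, not the analogy.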

Information Filtering and Retrieval: Novel Distributed Systems and Applications – DART 2012

Tuesday, June 5th, 2012

6th International Workshop on Information Filtering and Retrieval: Novel Distributed Systems and Applications – DART 2012

Paper Submission: June 21, 2012
Authors Notification: July 10, 2012
Final Paper Submission and Registration: July 24, 2012

In conjunction with International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management – IC3K 2012 – 04 – 07 October, 2012 – Barcelona, Spain.

Scope

Nowadays users are more and more interested in information rather than in mere raw data. The huge amount of accessible data sources is growing rapidly. This calls for novel systems providing effective means of searching and retrieving information with the fundamental goal of making it exploitable by humans and machines.
DART focuses on researching and studying new challenges in distributed information filtering and retrieval. In particular, DART aims to investigate novel systems and tools to distributed scenarios and environments. DART will contribute to discuss and compare suitable novel solutions based on intelligent techniques and applied in real-world applications.
Information Retrieval attempts to address similar filtering and ranking problems for pieces of information such as links, pages, and documents. Information Retrieval systems generally focus on the development of global retrieval techniques, often neglecting individual user needs and preferences.
Information Filtering has drastically changed the way information seekers find what they are searching for. In fact, they effectively prune large information spaces and help users in selecting items that best meet their needs, interests, preferences, and tastes. These systems rely strongly on the use of various machine learning tools and algorithms for learning how to rank items and predict user evaluation.

Topics of Interest

Topics of interest will include (but are not limited to):

  • Web Information Filtering and Retrieval
  • Web Personalization and Recommendation
  • Web Advertising
  • Web Agents
  • Web of Data
  • Semantic Web
  • Linked Data
  • Semantics and Ontology Engineering
  • Search for Social Networks and Social Media
  • Natural Language and Information Retrieval in the Social Web
  • Real-time Search
  • Text categorization

If you are interested and have the time (or graduate students with the time), abstracts from prior conferences are here. Would be a useful exercise to search out publicly available copies. (As far as I can tell, no abstracts from DART.)

Are visual dictionaries generalizable?

Sunday, May 13th, 2012

Are visual dictionaries generalizable? by Otavio A. B. Penatti, Eduardo Valle, and Ricardo da S. Torres

Abstract:

Mid-level features based on visual dictionaries are today a cornerstone of systems for classification and retrieval of images. Those state-of-the-art representations depend crucially on the choice of a codebook (visual dictionary), which is usually derived from the dataset. In general-purpose, dynamic image collections (e.g., the Web), one cannot have the entire collection in order to extract a representative dictionary. However, based on the hypothesis that the dictionary reflects only the diversity of low-level appearances and does not capture semantics, we argue that a dictionary based on a small subset of the data, or even on an entirely different dataset, is able to produce a good representation, provided that the chosen images span a diverse enough portion of the low-level feature space. Our experiments confirm that hypothesis, opening the opportunity to greatly alleviate the burden in generating the codebook, and confirming the feasibility of employing visual dictionaries in large-scale dynamic environments.

The authors use the Caltech-101 image set because of its “diversity.” Odd because they cite the Caltech-256 image set, which was created to answer concerns about the lack of diversity in the Caltech-101 image set.

Not sure this paper answers the issues it raises about visual dictionaries.

Wanted to bring it to your attention because representative dictionaries (as opposed to comprehensive ones) may be lurking just beyond the semantic horizon.

Saving the Old IR Literature: a new batch

Saturday, April 21st, 2012

Saving the Old IR Literature: a new batch

Saw a retweet of a tweet from @djoerd on this new release.

Thanks ACM SIGIR! (Special Interest Group on Information Retrieval)

Just the titles should get you interested:

  • Natural Language in Information Retrieval – Donald E. Walker, Hans Karlgren, Martin Kay – Skriptor AB, Stockholm, 1977
  • Annual Report: Automatic Informative Abstracting and Extracting – L. L. Earl – Lockheed Missiles and Space Company, 1972
  • Free Text Retrieval Evaluation – Pauline Atherton, Kenneth H. Cook, Jeffrey Katzer – Syracuse University, 1972
  • Information Storage and Retrieval: Scientific Report No. ISR-7 – Gerard Salton – The National Science Foundation, 1964
  • Information Storage and Retrieval: Scientific Report No. ISR-8 – Gerard Salton – The National Science Foundation, 1964
  • Information Storage and Retrieval: Scientific Report No. ISR-9 – Gerard Salton – The National Science Foundation, 1965
  • Information Storage and Retrieval: Scientific Report No. ISR-14 – Gerard Salton – The National Science Foundation, 1968
  • Information Storage and Retrieval: Scientific Report No. ISR-16 – Gerard Salton – The National Science Foundation, 1969
  • Automatic Indexing: A State of the Art Review – Karen Sparck Jones – Computer Laboratory, University of Cambridge, British Library Research and Development Report No. 5193, 1974
  • Final Report on International Research Forum in Information Science: The Theoretical Basis of Information Science – B.C. Vickery, S.E. Robertson, N.J. Belkin – British Library Research and Development Report No. 5262, 1975
  • Report on the Need for and Provision for an ‘IDEAL’ Information Retrieval Test Collection – K. Sparck Jones, C.J. Van Rijsbergen – Computer Laboratory, University of Cambridge, British Library Research and Development Report No. 5266, 1975
  • Report on a Design Study for the ‘IDEAL’ Information Retrieval Test Collection – K. Sparck Jones, R.G. Bates – Computer Laboratory, University of Cambridge, British Library Research and Development Report No. 5428, 1977
  • Research on Automatic Indexing 1974-1976, Volume 1: Text – K. Sparck Jones, R.G. Bates – Computer Laboratory, University of Cambridge, British Library Research and Development Report No. 5464, 1977
  • Statistical Bases of Relevance Assessment for the ‘IDEAL’ Information Retrieval Test Collection – H. Gilbert, K. Sparck Jones – Computer Laboratory, University of Cambridge, British Library Research and Development Report No. 5481, 1979
  • Design Study for an Anomalous State of Knowledge Based Information Retrieval System – N.J. Belkin, R.N. Oddy – University of Aston, Computer Centre, British Library Research and Development Report No. 5547, 1979
  • Research on Relevance Weighting, 1976-1979 – K. Sparck Jones, C.A. Webster – Computer Laboratory, University of Cambridge, British Library Research and Development Report No. 5553, 1980
  • New Models in Probabilistic Information Retrieval – C.J. van Rijsbergen, S.E. Robertson, M.F. Porter – Computer Laboratory, University of Cambridge, British Library Research and Development Report No. 5587, 1980
  • Statistical problems in the application of probabilistic models to information retrieval – S.E. Robertson, J.D. Bovey – Centre for Information Science, City University, British Library Research and Development Report No. 5739, 1982
  • A front-end for IR experiments – S.E. Robertson, J.D. Bovey – Centre for Information Science, City University, British Library Research and Development Report No. 5807, 1983
  • An operational evaluation of weighting, ranking and relevance feedback via a front-end system – S.E. Robertson, C.L. Thompson – Centre for Information Science, City University, British Library Research and Development Report No. 5549, 1987
  • Okapi at City: An evaluation facility for interactive – Stephen Walker, Micheline Hancock-Beaulieu – Centre for Information Science, City University, British Library Research and Development Report No. 6056, 1991
  • Improving Subject Retrieval in Online Catalogues: Stemming, automatic spelling correction and cross-reference tables – Stephen Walker, Richard M Jones – The Polytechnic of Central London, British Library Research Paper No. 24, 1987
  • Designing an Online Public Access Catalogue: Okapi, a catalogue on a local area network – Nathalie Nadia Mitev, Gillian M Venner, Stephen Walker – The Polytechnic of Central London, Library and Information Research Report 39, 1985
  • Improving Subject Retrieval in Online Catalogues: Relevance feedback and query expansion – Stephen Walker, Rachel De Vere – The Polytechnic of Central London, British Library Research Paper No. 72, 1989
  • Evaluation of Online Catalogues: an assessment of methods – Micheline Hancock-Beaulieu, Stephen Robertson, Colin Neilson – Centre for Information Science, City University, British Library Research Paper No. 78, 1990

Information Retrieval: Berkeley School of Information

Sunday, April 15th, 2012

Information Retrieval: Berkeley School of Information

The PDFs are password protected (on the outline) but the course slides are available.

Good slides by the way. Particularly the illustrations.

The course used one of the mini-TREC data sets.

If you are not familiar with TREC, you should be.

History of Information Organization (Infographic)

Thursday, March 8th, 2012

From Cartography to Card Catalogs [Infographic]: History of Information Organization

Mindjet has posted an infographic and blog post about the history of information organization. I have embedded the graphic below.

Let me preface my remarks by saying I have known people at Mindjet and it is a fairly remarkable organization. And to be fair, the history of information organization is of interest to me, although I am far from being a specialist in the field.

However, when a graphic jumps from “850 CE The First Byzantine Encyclopedia,” to “1276 CE Oldest Continuously Functioning Library” and informs the reader, on the edge in between, that this was “3,000 years ago,” it seems to be lacking in precision or proofing, perhaps both.

Although information has to be summarized for such a presentation, I thought the rise of writing in Egypt/Sumeria would have merited a note, perhaps the library of Ashurbanipal (first library of the ancient Middle East) or the Library of Alexandria, just to name two. Noting you would have to go before Ashurbanipal to get 3,000 years ago. And there were written texts and collections of such texts for anywhere from 2,000 to 3,000 years before that.

I do appreciate that Mindjet doesn’t think information issues arose with the digital computer. I am hopeful that they will encourage a re-examination of older methods and solutions in hopes of finding clues to new solutions.

A Survey of Automatic Query Expansion in Information Retrieval

Saturday, February 25th, 2012

A Survey of Automatic Query Expansion in Information Retrieval by Claudio Carpineto, Giovanni Romano.

Abstract:

The relative ineffectiveness of information retrieval systems is largely caused by the inaccuracy with which a query formed by a few keywords models the actual user information need. One well known method to overcome this limitation is automatic query expansion (AQE), whereby the user’s original query is augmented by new features with a similar meaning. AQE has a long history in the information retrieval community but it is only in the last years that it has reached a level of scientific and experimental maturity, especially in laboratory settings such as TREC. This survey presents a unified view of a large number of recent approaches to AQE that leverage various data sources and employ very different principles and techniques. The following questions are addressed. Why is query expansion so important to improve search effectiveness? What are the main steps involved in the design and implementation of an AQE component? What approaches to AQE are available and how do they compare? Which issues must still be resolved before AQE becomes a standard component of large operational information retrieval systems (e.g., search engines)?

Have you heard topic maps described as being the solution to the following problem?

The most critical language issue for retrieval effectiveness is the term mismatch problem: the indexers and the users do often not use the same words. This is known as the vocabulary problem Furnas et al. [1987], compounded by synonymy (different words with the same or similar meanings, such as “tv” and “television”) and polysemy (same word with different meanings, such as “java”). Synonymy, together with word inflections (such as with plural forms, “television” versus “televisions”), may result in a failure to retrieve relevant documents, with a decrease in recall (the ability of the system to retrieve all relevant documents). Polysemy may cause retrieval of erroneous or irrelevant documents, thus implying a decrease in precision (the ability of the system to retrieve only relevant documents).

That sounds like the XWindows index merging problem doesn’t it? (Different terms being used by *nix vendors who wanted to use a common set of XWindows documentation.)
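The recall and precision effects in the quoted passage are easy to see on a toy collection. A minimal sketch, assuming a hypothetical four-document collection and a naive exact-term matcher (none of this comes from the survey): a synonym the query misses costs recall, and a word with two senses costs precision.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a list of retrieved ids against the relevant ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical collection, for illustration only.
docs = {
    1: "new tv models reviewed",
    2: "television buying guide",
    3: "java coffee from indonesia",
    4: "java programming tutorial",
}

def search(term):
    """Naive retrieval: ids of documents containing the exact term."""
    return [doc_id for doc_id, text in docs.items() if term in text.split()]

# Synonymy: the user wants documents 1 and 2, but "tv" misses "television".
tv_precision, tv_recall = precision_recall(search("tv"), relevant=[1, 2])

# Polysemy: the user wants the programming document, but "java" also hits coffee.
java_precision, java_recall = precision_recall(search("java"), relevant=[4])
```

Here the "tv" query has perfect precision but half the recall, while the "java" query has perfect recall but half the precision.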

The authors describe the amount of data on the web searched with only one, two or three terms:

In this situation, the vocabulary problem has become even more serious because the paucity of query terms reduces the possibility of handling synonymy while the heterogeneity and size of data make the effects of polysemy more severe.

But the size of the data isn’t a given. What if a topic map with scoped names were used to delimit the sites searched using a particular identifier?

For example, a topic could have the name: “TRIM19” and a scope of: “http://www.ncbi.nlm.nih.gov/gene.” If you try a search with “TRIM19” at the scoping site, you get a very different result than if you use “TRIM19” with say “http://www.google.com.”

Try it, I’ll wait.

Now, imagine that your scoping topic on “TRIM19” isn’t just that one site but a topic that represents all the gene database sites known to you. I don’t know the number but it can’t be very large, at least when compared to the WWW.

That simple act of delimiting the range of your searches makes them far less subject to polysemy.

Not to mention that a topic map could be used to supply terms for use in automated query expansion.
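A rough sketch of both ideas together, scoped names delimiting where a search runs and a topic's other names supplying expansion terms. Everything here is an assumption made for illustration: the `TOPICS` and `SCOPES` structures are not a real topic map API, “PML” is included as an assumed alternate name, `genes.example.org` is a placeholder site, and the `site:` operator is assumed to be supported by whatever search engine receives the query.

```python
# Hypothetical mini topic map: each topic has names and the scope in which they apply.
TOPICS = {
    "trim19-gene": {
        "names": ["TRIM19", "PML"],  # "PML" is an assumed alias, for illustration
        "scope": "gene-databases",
    },
}

# Hypothetical scoping topic: the sites that make up each scope.
SCOPES = {
    "gene-databases": ["www.ncbi.nlm.nih.gov/gene", "genes.example.org"],
}

def expand(term):
    """Return (names, scope) for the topic carrying `term`, with the original term first."""
    for topic in TOPICS.values():
        if term in topic["names"]:
            others = [name for name in topic["names"] if name != term]
            return [term] + others, topic["scope"]
    return [term], None

def scoped_queries(term):
    """Build one expanded query per site in the term's scope (unscoped if none is known)."""
    names, scope = expand(term)
    query = " OR ".join(names)
    sites = SCOPES.get(scope, [])
    if not sites:
        return [query]
    return [f"({query}) site:{site}" for site in sites]

queries = scoped_queries("TRIM19")
```

A term with no topic falls through unchanged, so the sketch degrades to an ordinary search rather than failing.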

BTW, the survey is quite interesting and deserves a slow read with follow up on the cited references.

Attention-enhancing information retrieval

Monday, February 20th, 2012

Attention-enhancing information retrieval

William Webber writes:

Last week I was at SWIRL, the occasional talkshop on the future of information retrieval. To me the most important of the presentations was Diane Kelly’s “Rage against the Machine Learning”, in which she observed the way information retrieval currently works has changed the way people think. In particular, she proposed that the combination of short query with snippet response has reworked people’s plastic brains to focus on working memory, and forgo the processing of information required for it to lay its tracks down in our long term memory. In short, it makes us transactionally adept, but stops us from learning.

This is as important as Bret Victor’s presentation.

I particularly liked the line:

Various fanciful scenarios were given, but the ultimate end-point of such a research direction is that you walk into the shopping mall, and then your mobile phone leads you round telling you what to buy.

Reminds me of a line I remember imperfectly: judging from advertising, we are all “…insecure, sex-starved neurotics with 15-second attention spans.”

I always thought that was being generous on the attention span but opinions differ on that point. ;-)

How do you envision your users? Serious question but not one you have to answer here. Ask yourself.

KDIR 2012 : International Conference on Knowledge Discovery and Information Retrieval

Wednesday, February 15th, 2012

KDIR 2012 : International Conference on Knowledge Discovery and Information Retrieval

Regular Paper Submission: April 17, 2012
Authors Notification (regular papers): June 12, 2012
Final Regular Paper Submission and Registration: July 4, 2012

From the call for papers:

Knowledge Discovery is an interdisciplinary area focusing upon methodologies for identifying valid, novel, potentially useful and meaningful patterns from data, often based on underlying large data sets. A major aspect of Knowledge Discovery is data mining, i.e. applying data analysis and discovery algorithms that produce a particular enumeration of patterns (or models) over the data. Knowledge Discovery also includes the evaluation of patterns and identification of which add to knowledge. This has proven to be a promising approach for enhancing the intelligence of software systems and services. The ongoing rapid growth of online data due to the Internet and the widespread use of large databases have created an important need for knowledge discovery methodologies. The challenge of extracting knowledge from data draws upon research in a large number of disciplines including statistics, databases, pattern recognition, machine learning, data visualization, optimization, and high-performance computing, to deliver advanced business intelligence and web discovery solutions.

Information retrieval (IR) is concerned with gathering relevant information from unstructured and semantically fuzzy data in texts and other media, searching for information within documents and for metadata about documents, as well as searching relational databases and the Web. Automation of information retrieval enables the reduction of what has been called “information overload”.

Information retrieval can be combined with knowledge discovery to create software tools that empower users of decision support systems to better understand and use the knowledge underlying large data sets.

Part of IC3K 2012 – International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management.

Scienceography: the study of how science is written

Tuesday, February 14th, 2012

Scienceography: the study of how science is written by Graham Cormode, S. Muthukrishnan and Jinyun Yun.

Abstract:

Scientific literature has itself been the subject of much scientific study, for a variety of reasons: understanding how results are communicated, how ideas spread, and assessing the influence of areas or individuals. However, most prior work has focused on extracting and analyzing citation and stylistic patterns. In this work, we introduce the notion of ‘scienceography’, which focuses on the writing of science. We provide a first large scale study using data derived from the arXiv e-print repository. Crucially, our data includes the “source code” of scientific papers – the LaTeX source – which enables us to study features not present in the “final product”, such as the tools used and private comments between authors. Our study identifies broad patterns and trends in two example areas – computer science and mathematics – as well as highlighting key differences in the way that science is written in these fields. Finally, we outline future directions to extend the new topic of scienceography.

What content are you searching/indexing in a scientific context?

The authors discover what many of us have overlooked: the “source” of scientific papers. A source that can reflect a richer history than the final product.

Some questions:

Will searching the source give us finer grained access to the content? That is, can we separate portions of text that recite history, related research, and background from new insights/conclusions, so that we access the other material only if needed? (Every graph paper starts off with nodes and edges, complete with citations. Anyone reading a graph paper is likely to know those terms.)

Other disciplines use LaTeX. Do those LaTeX files differ from the ones reported here? If so, in what way?

ISO 25964-1 Thesauri for information retrieval

Friday, January 20th, 2012

Information and documentation – Thesauri and interoperability with other vocabularies – Part 1: Thesauri for information retrieval

Actually that is the homepage for Networked Knowledge Organization Systems/Services – N K O S but the lead announcement item is for ISO 25964-1, etc.

From that webpage:

New international thesaurus standard published

ISO 25964-1 is the new international standard for thesauri, replacing ISO 2788 and ISO 5964. The full title is Information and documentation – Thesauri and interoperability with other vocabularies – Part 1: Thesauri for information retrieval. As well as covering monolingual and multilingual thesauri, it addresses 21st century needs for data sharing, networking and interoperability.

Content includes:

  • construction of mono- and multi-lingual thesauri;
  • clarification of the distinction between terms and concepts, and their inter-relationships;
  • guidance on facet analysis and layout;
  • guidance on the use of thesauri in computerized and networked systems;
  • best practice for the management and maintenance of thesaurus development;
  • guidelines for thesaurus management software;
  • a data model for monolingual and multilingual thesauri;
  • brief recommendations for exchange formats and protocols.

An XML schema for data exchange has been derived from the data model, and is available free of charge at http://www.niso.org/schemas/iso25964/ .

Coming next

ISO 25964-1 is the first of two publications. Part 2: Interoperability with other vocabularies is in the public review stage and will be available by the end of 2012.

Find out how you can obtain a copy from the news release.

Let me help you there, the correct number is: ISO 25964-1:2011 and the list price for a PDF copy is CHF 238,00, or in US currency (today), $257.66 (for 152 pages).

Shows what I know about semantic interoperability.

If you want semantic interoperability, you charge people $1.69 per page (152 pages) for access to the principles of thesauri to be used for information retrieval.

ISO/IEC and JTC 1 are all parts of a system of viable international (read non-vendor dominated) organizations for information/data standards. They are the natural homes for the management of data integration standards that transcend temporal, organizational, governmental and even national boundaries.

But those roles will not fall to them by default. They must seize the initiative and those roles. Clinging to old-style publishing models for support makes them appear timid in the face of current challenges.

Even vendors recognize their inability to create level playing fields for technology/information standards. And the benefits that come to vendors from de jure as well as non-de jure standards organizations.

ISO/IEC/JTC1, provided they take the initiative, can provide an international, de jure home for standards that form the basis for information retrieval and integration.

The first step to take is to make ISO/IEC/JTC1 information standards publicly available by default.

The second step is to call upon all members and beneficiaries, both direct and indirect, of ISO/IEC/JTC 1 work, to assist in the creation of mechanisms to support the vital roles played by ISO/IEC/JTC 1 as de jure standards bodies.

We can all learn something from ISO 25964-1, but how many of us will at that sticker price?

IR – Foundation?

Tuesday, December 6th, 2011

I find the following statement troubling. See if you can see what’s missing from:

In terms of research, the area may be studied from two rather distinct and complementary points of view: a computer-centered one and a human-centered one. In the computer-centered view, IR consists mainly of building up efficient indexes, processing user queries with high performance, and developing ranking algorithms to improve the results. In the human-centered view, IR consists mainly of studying the behavior of the user, understanding their main needs, and of determining how such understanding affects the organization and operation of the retrieval system. In this book, we focus mainly on the computer-centered view of IR, which is dominant in academia and in the market place. (page 1, Modern Information Retrieval, 2nd ed., Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Pearson 2011)

I am not challenging the accuracy of the statement. Although I might explain some of it differently from the authors.

The terminology by which computer-centered IR is described is one clue: “….efficient…, ….high performance, ….improve the results.” That is, computer-centered IR is mostly concerned with measurable results. Things to which we can put numbers and rank one as higher than others. Nothing wrong with that. Personally I have a great deal of interest in such approaches.

Human-centered IR is described as: “….behavior…, ….needs, ….understanding….organization and operation….” Human-centered IR is mostly concerned with how users perform IR. Not as measurable but just as important as computer-centered IR. As the authors point out, computer-centered IR dominates in academia and in the market place. I suspect because what can be easily measured is more attractive.

Do you notice something missing yet?

I thought it was quite remarkable that semantics weren’t mentioned. That is, whatever computer- or human-centered approach you take, its efficacy is going to vary with the semantics of the language on which IR is being performed. If that seems like an odd claim, consider the utility of an IR system that does not properly sort European, much less Asian, words, whether written in their own scripts or in transliteration.

True enough, we can make an IR system that is very fast that simply ignores the correct sort orders for such languages and in the past have taught readers of such languages to accept what the IR system was providing. So the behavior of the users was adapted to the systems. Human-centered I suppose but not the way I usually think about it.
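To make the sort order point concrete, here is a minimal Python sketch. It assumes nothing beyond the standard library, and the helper name `accent_folding_key` is hypothetical. It is not real collation: it simply strips accents before comparing, which is roughly acceptable for some languages and wrong for others.

```python
import unicodedata

def accent_folding_key(word):
    """Sort key that ignores accents and case (a crude stand-in for collation)."""
    # NFD splits a letter such as e-acute into "e" plus a combining accent,
    # and the combining accent is then dropped.
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The original word is kept as a tie-breaker so the order is deterministic.
    return (base.casefold(), word)

# Escapes are used so each accented letter is a single precomposed code point.
words = ["zebra", "\u00e9tude", "apple", "\u00c4rger"]  # zebra, étude, apple, Ärger

naive = sorted(words)                           # plain code point order
folded = sorted(words, key=accent_folding_key)  # accents ignored when sorting

print(naive)   # accented words land after "zebra"
print(folded)  # accented words are interleaved with the rest
```

The fast system is the first call; the second is only marginally better. Swedish, for example, sorts ä after z, so a system that wants correct order per language needs locale-aware collation rather than a single folding rule like this one.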

And, after all, semantics are the reason we want to do IR in the first place. If the contents we were searching had no semantics, it is very unlikely we would want to search them at all. No matter how efficient or well organized a system might be.

My real concern is that semantics are being assumed as a matter of course. We all “know” the semantics. Hardly worth discussing. But that is why search results so seldom meet our expectations. We didn’t discuss the semantics up front. Everyone from system architect, programmer, UI designer, content author, all the way to and including the searcher, “knew” the semantics.

Trouble is, the semantics they “know,” are often different.

Of course the authors are free to include or exclude any content they wish, and to fully cover semantic issues in general would require a volume at least as long as this one. (A little over 900 pages with the index.)

I would start with something like:

to make the point that we always start with languages and semantics and that data/texts are recorded in systems using languages and semantics. Our data structures are not neutral bystanders. They determine as much of what we can find as they determine how we will interpret it.

Try running a modern genealogy for someone and when you find an arrest record of a close relative for being a war criminal or child molester, see if the family wants that included. Suddenly that will be more important than other prizes or honors they have won. Still the same person but the label on the data, arrest record, makes us suspect the worst. Had it read: “False Arrests, a record of false charges during the regime of XXX,” we are likely to react differently.

I am going to use Baeza-Yates and Ribeiro-Neto as one of the required texts in the next topic maps class. So we can cover some of the mining techniques that will help populate topic maps.

But I will also cover the issue of languages/semantics as well as data/texts (in how they are stored and the semantics of the same).

Does anyone have a favorite single volume on languages/semantics? I would lean towards Doing What Comes Naturally by Stanley Fish but I am sure there are other volumes equally good.

A volume on data/text formats and their semantics is likely to be harder to come by. I don’t know of anything off hand that is focused on that in monograph length treatment. Suggestions?

PS: I know I got the image wrong but I am about to post. I will post a slightly amended image tomorrow when I have thought about it some more.

Don’t let that deter you from posting criticisms of the current image in the meantime.

Template-Based Information Extraction without the Templates

Monday, November 28th, 2011

Template-Based Information Extraction without the Templates by Nathanael Chambers and Dan Jurafsky.

Abstract:

Standard algorithms for template-based information extraction (IE) require predefined template schemas, and often labeled data, to learn to extract their slot fillers (e.g., an embassy is the Target of a Bombing template). This paper describes an approach to template-based IE that removes this requirement and performs extraction without knowing the template structure in advance. Our algorithm instead learns the template structure automatically from raw text, inducing template schemas as sets of linked events (e.g., bombings include detonate, set off, and destroy events) associated with semantic roles. We also solve the standard IE task, using the induced syntactic patterns to extract role fillers from specific documents. We evaluate on the MUC-4 terrorism dataset and show that we induce template structure very similar to hand-created gold structure, and we extract role fillers with an F1 score of .40, approaching the performance of algorithms that require full knowledge of the templates.

Can you say association?

Definitely points towards a pipeline approach to topic map authoring. To abuse the term, perhaps a “dashboard” that allows selection of data sources followed by the construction of workflows with preliminary analysis being displayed at “breakpoints” in the processing. No particular reason why stages have to be wired together other than tradition.

Just looking a little bit into the future, imagine that some entities weren’t being recognized at a high enough rate. So you shift that part of the data to several thousand human entity processors, take the average of their results (higher than what you were getting) and feed that back into the system. Could have knowledge workers who work full time but shift from job to job performing tasks too difficult to program effectively.

Information retrieval model based on graph comparison

Friday, November 18th, 2011

Information retrieval model based on graph comparison (pdf) by Quoc-Dinh Truong, Taoufiq Dkaki, Josiane Mothe, Pierre-Jean Charrel.

We propose a new method for Information Retrieval (IR) based on graph vertices comparison. The main goal of this method is to enhance the core IR-process of finding relevant documents in a collection of documents according to a user’s needs. The method we propose is based on graph comparison and involves recursive computation of similarity. In the framework of this approach, documents, queries and indexing terms are viewed as vertices of a bipartite graph where edges go from a document or a query – first node type- to an indexing term – second node type-. Edges reflect the link that exists between documents or queries on the one hand and indexing terms on the other hand. In our model, graph edge settings reflect the tf-idf paradigm. The proposed similarity measure instantiates and extends this principle, stipulating that the resemblance of two items or objects can be computed using the similarities of the items to which they are related. Our method also takes into account the concept of similarity propagation over graph edges.

Experiments conducted using four small sized IR test collections (TREC 2004 Novelty Track, CISI, Cranfield & Medline) demonstrate the effectiveness of our approach and its feasibility as long as the graph size does not exceed a few thousand nodes. The experiment’s results show that our method outperforms the vector-based model. Our method actually highly outperforms the vector-based cosine model, sometimes by more than doubling the precision, up to the top sixty returned documents. The computational complexity issue is resituated in the context of MAC-FAC approaches – many are called but few are chosen. More precisely, we suggest that our method can be successfully used as a FAC stage combined with a fast and computationally cheap method used as a MAC stage.

Very interesting article. Perhaps more so because searches of DBLP and Citeseer show no other publications by this author. A singularity that appears in 2008. I haven’t taken the time to look more deeply but commend the paper to your attention.

If you have pointers to later (earlier?) work by the same author, email or comments would be appreciated.
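The recursive similarity idea in the abstract (two items resemble each other if the items they link to resemble each other) can be sketched in a few lines of Python. This is my own minimal illustration, not the authors’ algorithm: the function names are hypothetical, edge weights are raw term counts rather than tf-idf, documents are assumed to be non-empty, and the normalization is a plain cosine-style one.

```python
from collections import Counter
from math import sqrt

def normalise(raw, keys):
    """Scale scores so every item has similarity 1.0 with itself."""
    return {(x, y): raw[x, y] / sqrt(raw[x, x] * raw[y, y])
            for x in keys for y in keys}

def propagate_similarity(docs, rounds=3):
    """Alternate between document and term similarity on a bipartite graph."""
    # Edge weights: how often each term occurs in each document.
    weights = [Counter(text.lower().split()) for text in docs]
    terms = sorted({t for w in weights for t in w})
    n = len(docs)
    # Start from identity: a term is similar only to itself.
    term_sim = {(a, b): float(a == b) for a in terms for b in terms}
    doc_sim = {}
    for _ in range(rounds):
        # Documents are similar if their terms are similar.
        raw_docs = {(i, j): sum(wa * wb * term_sim[a, b]
                                for a, wa in weights[i].items()
                                for b, wb in weights[j].items())
                    for i in range(n) for j in range(n)}
        doc_sim = normalise(raw_docs, range(n))
        # Terms are similar if the documents they occur in are similar.
        raw_terms = {(a, b): sum(weights[i][a] * weights[j][b] * doc_sim[i, j]
                                 for i in range(n) for j in range(n))
                     for a in terms for b in terms}
        term_sim = normalise(raw_terms, terms)
    return doc_sim

docs = ["graph retrieval", "graph search", "cooking pasta"]
one_round = propagate_similarity(docs, rounds=1)
three_rounds = propagate_similarity(docs, rounds=3)
print(one_round[0, 1], three_rounds[0, 1], three_rounds[0, 2])
```

After one round the first two documents are related only through the shared word “graph”. Further rounds raise their score because “retrieval” and “search” become similar by association, while the third document, which shares no path with them, stays at zero. The nested sums also make the cost obvious, which fits the authors’ suggestion to run something like this only on a small candidate set.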

Stephen Robertson on Why Recall Matters

Monday, November 14th, 2011

Stephen Robertson on Why Recall Matters November 14th, 2011 by Daniel Tunkelang.

Daniel has the slides and an extensive summary of the presentation. Just to give you a taste of what awaits at Daniel’s post:

Stephen started by reminding us of ancient times (i.e., before the web), when at least some IR researchers thought in terms of set retrieval rather than ranked retrieval. He reminded us of the precision and recall “devices” that he’d described in his Salton Award Lecture — an idea he attributed to the late Cranfield pioneer Cyril Cleverdon. He noted that, while set retrieval uses distinct precision and recall devices, ranking conflates both into a decision of where to truncate a ranked result list. He also pointed out an interesting asymmetry in the conventional notion of precision-recall tradeoff: while returning more results can only increase recall, there is no certainty that the additional results will decrease precision. Rather, this decrease is a hypothesis that we associate with systems designed to implement the probability ranking principle, returning results in decreasing order of probability of relevance.

Interested? There’s more where that came from; see the link to Daniel’s post above.
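The asymmetry in the quoted passage is easy to check with a toy ranking. The sketch below uses made-up document ids and relevance judgments (both are assumptions for illustration) and computes precision and recall at each cutoff of a ranked list.

```python
def precision_recall_at(ranked, relevant, k):
    """Precision and recall when the ranked list is truncated after k results."""
    retrieved = ranked[:k]
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ranking in which two relevant documents sit near the bottom.
ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d4", "d5"}

for k in range(1, len(ranked) + 1):
    p, r = precision_recall_at(ranked, relevant, k)
    print(f"k={k}  precision={p:.2f}  recall={r:.2f}")
```

Recall never goes down as the cutoff moves deeper, but precision falls through the third result and then rises again at the fourth and fifth. A ranker that followed the probability ranking principle perfectly would not have ordered the list this way, which is exactly why the decrease in precision is a hypothesis about the system and not a law.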

HCIR 2011 keynote

Saturday, November 12th, 2011

HCIR 2011 keynote by Gene Golovchinsky

From the post:

HCIR 2011 took place almost three weeks ago, but I am just getting caught up after a week at CIKM 2011 and an actual almost-no-internet-access vacation. I wanted to start off my reflections on HCIR with a summary of Gary Marchionini’s keynote, titled “HCIR: Now the Tricky Part.” Gary coined the term “HCIR” and has been a persuasive advocate of the concepts represented by the term. The talk used three case studies of HCIR projects as a lens to focus the audience’s attention on one of the main challenges of HCIR: how to evaluate the systems we build.

The projects reviewed are themselves worthy of separate treatments, at length.

Gene’s summary makes one wish for video of the keynote. Perhaps I have overlooked it? If so, please post the link.

A Taxonomy of Enterprise Search and Discovery

Friday, November 4th, 2011

A Taxonomy of Enterprise Search and Discovery by Tony Russell-Rose.

Abstract:

Classic IR (information retrieval) is predicated on the notion of users searching for information in order to satisfy a particular “information need”. However, it is now accepted that much of what we recognize as search behaviour is often not informational per se. Broder (2002) has shown that the need underlying a given web search could in fact be navigational (e.g. to find a particular site) or transactional (e.g. through online shopping, social media, etc.). Similarly, Rose & Levinson (2004) have identified the consumption of online resources as a further common category of search behaviour.

In this paper, we extend this work to the enterprise context, examining the needs and behaviours of individuals across a range of search and discovery scenarios within various types of enterprise. We present an initial taxonomy of “discovery modes”, and discuss some initial implications for the design of more effective search and discovery platforms and tools.

If you are flogging software/interfaces for search/discovery in an enterprise context, you really need to read this paper. In part because of their initial findings but in part to establish the legitimacy of evaluating how users search before designing an interface for them to search with. They may not be able to articulate all their search behaviors, which means you will have to do some observation to establish which elements make the difference between a successful interface and one that is less so. (No one wants to be the next Virtual Case Management project at the FBI.)

Read the various types of searching as rough guides to what you may find true for your users. When in doubt, trust your observations of and feedback from your users. Otherwise you will have an interface that fits an abstract description in a paper but not your users. I leave it for you to judge which one results in repeat business.

Don’t take that as a criticism of the paper; I think it is one of the best I have read lately. My concern is that the evaluation of user needs/behaviour be an ongoing process and not prematurely fixed or obscured by categories or typologies of how users “ought” to act.

The paper is also available in PDF format.

Information Literacy 2.0

Friday, November 4th, 2011

Information Literacy 2.0 by Meredith Farkas.

From the post:

Critical inquiry in the age of social media

Ideas about information literacy have always adapted to changes in the information environment. The birth of the web made it necessary for librarians to shift more towards teaching search strategies and evaluation of sources. The tool-focused “bibliographic instruction” approach was later replaced by the skill-focused “information literacy” approach. Now, with the growth of Web 2.0 technologies, we need to start shifting towards providing instruction that will enable our patrons to be successful information seekers in the Web 2.0 environment, where the process of evaluation is quite a bit more nuanced.

Critical inquiry skills are among the most important in a world in which the half-life of information is rapidly shrinking. These days, what you know is almost less important than what you can find out. And finding out today requires a set of skills that are very different from what most libraries focus on. In addition to academic sources, a huge wealth of content is being produced by people every day in knowledgebases like Wikipedia, review sites like Trip Advisor, and in blogs. Some of this content is legitimate and valuable—but some of it isn’t.

While I agree with Meredith that evaluation of information is a critical skill, I am less convinced that it is a new one. Research, even pre-Internet, was never about simply finding resources for the purpose of citation. There always was an evaluative aspect with regard to sources.

I was able to take a doctoral seminar in research methods for Old Testament students that taught critical evaluation of resources. I don’t remember the text off hand but we were reading a transcription of a cuneiform text which had a suggested “emendation” (think added characters) for a broken place in the text. The professor asked whether we should accept the “emendation” or not and on what basis we would make that judgement. The article was by a known scholar so of course we argued about the “emendation” but never asked one critical question: What about the original text? The source the scholar was relying upon.

The theology library had a publication with an image of the text that we reviewed for the next class. Even though it was only a photograph, it was clear that you might get one, maybe two characters in the broken space of the text, but there was no way you would have the five or six required by the “emendation.”

We were told to never rely upon quotations, transcriptions of texts, etc., unless there was simply no way to verify the source. Not that many of us do that in practice but that is the ideal. There is even less excuse for relying on quotations and other secondary materials now that so many primary materials are easy to access online and more are coming online every day.

I think the lesson of information literacy 2.0 should be critical evaluation of information but, as part of that evaluation, to seek out the sources of the information. You would be surprised how many times what an author said is not what they are quoted as saying, when read in the context of the original.

HCIR 2011

Thursday, October 27th, 2011

HCIR 2011 Papers

From the homepage:

The Fifth Workshop on Human-Computer Interaction and Information Retrieval took place all day on Thursday, October 20th, 2011, at Google’s main campus in Mountain View, California. There was a reception on Wednesday evening before the workshop, which attracted about a hundred participants.

By my count fourteen (14) papers and twenty-eight (28) posters.

Quite a gold mine of material and I look forward to a long weekend with them!

Enjoy!

PS: Interesting that papers from prior conferences only become available starting in 2010.