Archive for the ‘Information Retrieval’ Category

Stanford CoreNLP v3.7.0 beta is out! [Time is short, comments, bug reports, now!]

Thursday, November 3rd, 2016

Stanford CoreNLP v3.7.0 beta

The tweets I saw from the Stanford NLP Group read:

Stanford CoreNLP v3.7.0 beta is out—improved coreference, dep parsing—KBP relation annotator—Arabic pipeline #NLProc

We’re doing an official CoreNLP beta release this time, so bugs, comments, and fixes especially appreciated over the next two weeks!

OK, so, what are you waiting for? 😉

Oh, the standard blurb for your boss on why Stanford CoreNLP should be taking up your time:

Stanford CoreNLP provides a set of natural language analysis tools. It can give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, and mark up the structure of sentences in terms of phrases and word dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract open-class relations between mentions, etc.

Choose Stanford CoreNLP if you need:

  • An integrated toolkit with a good range of grammatical analysis tools
  • Fast, reliable analysis of arbitrary texts
  • The overall highest quality text analytics
  • Support for a number of major (human) languages
  • Interfaces available for various major modern programming languages
  • Ability to run as a simple web service

Stanford CoreNLP is an integrated framework. Its goal is to make it very easy to apply a bunch of linguistic analysis tools to a piece of text. A CoreNLP tool pipeline can be run on a piece of plain text with just two lines of code. It is designed to be highly flexible and extensible. With a single option you can change which tools should be enabled and which should be disabled. Stanford CoreNLP integrates many of Stanford’s NLP tools, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, the coreference resolution system, sentiment analysis, bootstrapped pattern learning, and the open information extraction tools. Its analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications.
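The “two lines of code” refer to CoreNLP’s Java API (construct a StanfordCoreNLP object from a Properties listing the annotators, then call annotate). As a language-neutral illustration of that pipeline design, rather than the actual CoreNLP API, here is a hypothetical Python sketch in which a single option (the annotator list) controls which tools run:

```python
# Illustrative sketch (NOT the CoreNLP API): a pipeline applies a
# configurable sequence of annotators to a document, each one adding
# a layer of analysis on top of the previous layers.

class Document:
    def __init__(self, text):
        self.text = text
        self.annotations = {}

def tokenize(doc):
    # Whitespace tokenization as a stand-in for a real tokenizer.
    doc.annotations["tokens"] = doc.text.split()

def lemma(doc):
    # Stand-in for real lemmatization: lowercase and strip punctuation.
    doc.annotations["lemmas"] = [t.lower().strip(".,")
                                 for t in doc.annotations["tokens"]]

ANNOTATORS = {"tokenize": tokenize, "lemma": lemma}

def run_pipeline(text, annotators):
    # The annotator list is the single option that enables/disables tools.
    doc = Document(text)
    for name in annotators:
        ANNOTATORS[name](doc)
    return doc

doc = run_pipeline("Stanford CoreNLP analyzes text.", ["tokenize", "lemma"])
```

Swapping the annotator list changes what analyses are produced, with no other code changes, which is the property the blurb is describing.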

Using the standard blurb about Stanford CoreNLP has these advantages:

  • It’s copy-n-paste; you didn’t have to write it
  • It’s an appeal to authority (Stanford)
  • It’s truthful

The truthful point is a throw-away these days, but I thought I should mention it. 😉

SIGIR 2015 Technical Track

Monday, May 4th, 2015

SIGIR 2015 Technical Track

The list of accepted papers for the SIGIR 2015 Technical Track has been published!

As if you need any further justification to attend the conference in Santiago, Chile, August 9-13, 2015.

I’m curious: would anyone be interested in a program listing that links the authors to their DBLP listings? Just in case you want to catch up on their recent publications before the conference?


Deep Learning: Methods and Applications

Tuesday, January 13th, 2015

Deep Learning: Methods and Applications by Li Deng and Dong Yu. (Li Deng and Dong Yu (2014), “Deep Learning: Methods and Applications”, Foundations and Trends® in Signal Processing: Vol. 7: No. 3–4, pp. 197–387.)


This monograph provides an overview of general deep learning methodology and its applications to a variety of signal and information processing tasks. The application areas are chosen with the following three criteria in mind: (1) expertise or knowledge of the authors; (2) the application areas that have already been transformed by the successful use of deep learning technology, such as speech recognition and computer vision; and (3) the application areas that have the potential to be impacted significantly by deep learning and that have been experiencing research growth, including natural language and text processing, information retrieval, and multimodal information processing empowered by multi-task deep learning.


Keywords: Deep learning, Machine learning, Artificial intelligence, Neural networks, Deep neural networks, Deep stacking networks, Autoencoders, Supervised learning, Unsupervised learning, Hybrid deep networks, Object recognition, Computer vision, Natural language processing, Language models, Multi-task learning, Multi-modal processing

If you are looking for another rich review of the area of deep learning, you have found the right place. Resources, conferences, primary materials, etc. abound.

Don’t be thrown off by the pagination. This is issues 3 and 4 of the periodical Foundations and Trends® in Signal Processing. You are looking at the complete text.

Be sure to read Selected Applications in Information Retrieval (Section 9, pages 308–319), where Section 9.2 begins:

Here we discuss the “semantic hashing” approach for the application of deep autoencoders to document indexing and retrieval as published in [159, 314]. It is shown that the hidden variables in the final layer of a DBN not only are easy to infer after using an approximation based on feed-forward propagation, but they also give a better representation of each document, based on the word-count features, than the widely used latent semantic analysis and the traditional TF-IDF approach for information retrieval. Using the compact code produced by deep autoencoders, documents are mapped to memory addresses in such a way that semantically similar text documents are located at nearby addresses to facilitate rapid document retrieval. The mapping from a word-count vector to its compact code is highly efficient, requiring only a matrix multiplication and a subsequent sigmoid function evaluation for each hidden layer in the encoder part of the network.
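A toy sketch of that encoder step, with hand-picked weights rather than a trained DBN (the vocabulary size, layer size, and weights below are made up for illustration): each encoder layer is a matrix multiplication followed by an elementwise sigmoid, and thresholding the top-layer activations yields a compact binary code that serves as a hash-table address, so documents with similar word-count vectors land in the same bucket.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def encode(word_counts, layers):
    # Each encoder layer: matrix multiply, then elementwise sigmoid.
    v = word_counts
    for W in layers:
        v = [sigmoid(sum(w * x for w, x in zip(row, v))) for row in W]
    # Threshold the top-layer activations to get the compact binary code.
    return tuple(1 if a > 0.5 else 0 for a in v)

# Toy 4-word vocabulary, one 2-unit layer, hand-picked (not learned) weights:
# the first unit fires for "topic A" words, the second for "topic B" words.
W1 = [[ 2.0,  2.0, -2.0, -2.0],
      [-2.0, -2.0,  2.0,  2.0]]

doc_a = [3, 1, 0, 0]   # word counts, mostly topic A
doc_b = [2, 2, 0, 0]   # also topic A
doc_c = [0, 0, 1, 4]   # topic B

# The binary code is the "memory address": similar documents share a bucket.
index = {}
for name, counts in [("a", doc_a), ("b", doc_b), ("c", doc_c)]:
    index.setdefault(encode(counts, [W1]), []).append(name)
```

Retrieval then amounts to computing the query’s code and reading out the documents stored at that address (and nearby ones), which is why the paper stresses that encoding costs only a matrix multiply and a sigmoid per layer.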

That is only one of the applications detailed in this work. I do wonder whether this will be the approach that breaks the “document” model of information retrieval (as in this work, for example). If I am searching for “deep learning” and “information retrieval,” a search result that returns these pages would be a great improvement over the entire document. (At the user’s option.)

Before the literature on deep learning gets much more out of hand, now would be a good time to start building not only a corpus of the literature but a sub-document level topic map to ideas and motifs as they develop. That would be particularly useful as patents start to appear for applications of deep learning. (Not a volunteer or charitable venture.)

I first saw this in a tweet by StatFact.

This is your Brain on Big Data: A Review of “The Organized Mind”

Monday, November 17th, 2014

This is your Brain on Big Data: A Review of “The Organized Mind” by Stephen Few.

From the post:

In the past few years, several fine books have been written by neuroscientists. In this blog I’ve reviewed those that are most useful and placed Daniel Kahneman’s Thinking, Fast & Slow at the top of the heap. I’ve now found its worthy companion: The Organized Mind: Thinking Straight in the Age of Information Overload.


This new book by Daniel J. Levitin explains how our brains have evolved to process information and he applies this knowledge to several of the most important realms of life: our homes, our social connections, our time, our businesses, our decisions, and the education of our children. Knowing how our minds manage attention and memory, especially their limitations and the ways that we can offload and organize information to work around these limitations, is essential for anyone who works with data.

See Stephen’s review for an excerpt from the introduction and summary comments on the work as a whole.

I am particularly looking forward to reading Levitin’s take on the transfer of information tasks to us and the resulting cognitive overload.

I don’t have the volume yet, but it occurs to me that the shift from indexes (the Readers’ Guide to Periodical Literature and the like) and librarians to full text search engines is yet another example of the transfer of information tasks to us.

Indexers and librarians do a better job of finding information than we do because discovery of information is a difficult intellectual task. Well, perhaps, discovering relevant and useful information is a difficult task. Almost without exception, every search produces a result on major search engines. Perhaps not a useful result, but a result nonetheless.

Using indexers and librarians will produce a line item in someone’s budget. What is needed is research on the differential between the results from indexer/librarians versus us and what that translates to as a line item in enterprise budgets.

That type of research could influence university, government and corporate budgets as the information age moves into high gear.

The Organized Mind by Daniel J. Levitin is a must have for the holiday wish list!

Extended Artificial Memory:…

Monday, October 27th, 2014

Extended Artificial Memory: Toward an Integral Cognitive Theory of Memory and Technology by Lars Ludwig. (PDF) (Or you can contribute to the cause by purchasing a printed or Kindle copy of: Information Technology Rethought as Memory Extension: Toward an integral cognitive theory of memory and technology.)

Conventional bookselling wisdom is that a title should provoke people to pick up the book. First step towards a sale. Must be the thinking behind this title. Just screams “Read ME!”


Seriously, I have read some of the PDF version and this is going on my holiday wish list as a hard copy request.


This thesis introduces extended artificial memory, an integral cognitive theory of memory and technology. It combines cross-scientific analysis and synthesis for the design of a general system of essential knowledge-technological processes on a sound theoretical basis. The elaboration of this theory was accompanied by a long-term experiment for understanding [Erkenntnisexperiment]. This experiment included the agile development of a software prototype (Artificial Memory) for personal knowledge management.

In the introductory chapter 1.1 (Scientific Challenges of Memory Research), the negative effects of terminological ambiguity and isolated theorizing on memory research are discussed.

Chapter 2 focuses on technology. The traditional idea of technology is questioned. Technology is reinterpreted as a cognitive actuation process structured in correspondence with a substitution process. The origin of technological capacities is found in the evolution of eusociality. In chapter 2.2, a cognitive-technological model is sketched. In this thesis, the focus is on content technology rather than functional technology. Chapter 2.3 deals with different types of media. Chapter 2.4 introduces the technological role of language-artifacts from different perspectives, combining numerous philosophical and historical considerations. The ideas of chapter 2.5 go beyond traditional linguistics and knowledge management, stressing individual constraints of language and limits of artificial intelligence. Chapter 2.6 develops an improved semantic network model, considering closely associated theories.

Chapter 3 gives a detailed description of the universal memory process enabling all cognitive technological processes. The memory theory of Richard Semon is revitalized, elaborated and revised, taking into account important newer results of memory research.

Chapter 4 combines the insights on the technology process and the memory process into a coherent theoretical framework. Chapter 4.3.5 describes four fundamental computer-assisted memory technologies for personally and socially extended artificial memory. They all tackle basic problems of the memory-process (4.3.3). In chapter 4.3.7, the findings are summarized and, in chapter 4.4, extended into a philosophical consideration of knowledge.

Chapter 5 provides insight into the relevant system landscape (5.1) and the software prototype (5.2). After an introduction into basic system functionality, three exemplary, closely interrelated technological innovations are introduced: virtual synsets, semantic tagging, and Linear Unit tagging.

The common memory capture (of two or more speakers) imagery is quite powerful. It highlights a critical aspect of topic maps.

Be forewarned: this is European-style scholarship, where the reader is assumed to be comfortable with philosophy, linguistics, etc., in addition to the narrower aspects of computer science.

To see these ideas in practice:

Slides on What is Artificial Memory.

I first saw this in a note from Jack Park, the source of many interesting and useful links, papers and projects.

Understanding Information Retrieval by Using Apache Lucene and Tika

Saturday, October 25th, 2014

Understanding Information Retrieval by Using Apache Lucene and Tika, Part 1

Understanding Information Retrieval by Using Apache Lucene and Tika, Part 2

Understanding Information Retrieval by Using Apache Lucene and Tika, Part 3

by Ana-Maria Mihalceanu.

From part 1:

In this tutorial, the Apache Lucene and Apache Tika frameworks will be explained through their core concepts (e.g. parsing, mime detection, content analysis, indexing, scoring, boosting) via illustrative examples that should be applicable not only to seasoned software developers but to beginners to content analysis and programming as well. We assume you have a working knowledge of the Java™ programming language and plenty of content to analyze.

Throughout this tutorial, you will learn:

  • how to use Apache Tika’s API and its most relevant functions
  • how to develop code with Apache Lucene API and its most important modules
  • how to integrate Apache Lucene and Apache Tika in order to build your own piece of software that stores and retrieves information efficiently. (project code is available for download)

Part 1 introduces you to Apache Lucene and Apache Tika and concludes by covering automatic extraction of metadata from files with Apache Tika.

Part 2 covers extracting/indexing of content, along with stemming, boosting and scoring. (If any of that sounds unfamiliar, this isn’t the best tutorial for you.)

Part 3 details the highlighting of fragments when they match a search query.
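On the mime-detection point from part 1: Tika detects a file’s type largely by matching “magic” byte patterns at the start of its content, falling back to extension and encoding heuristics. A minimal sketch of the content-based idea (a toy registry of three patterns, nothing like Tika’s full MIME database):

```python
# Toy content-based type detection in the spirit of Tika's detector:
# match known "magic" byte prefixes, fall back to a generic type.
MAGIC = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"PK\x03\x04", "application/zip"),  # also docx/xlsx/jar containers
]

def detect(data: bytes) -> str:
    for magic, mime in MAGIC:
        if data.startswith(magic):
            return mime
    return "application/octet-stream"
```

Detecting from content rather than file extension is what lets a tool like Tika handle arbitrary, mislabeled, or extension-less inputs before handing them to a parser.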

A good tutorial on Apache Lucene and Apache Tika, or at least on the parts of them it covers, but there is no real coverage of information retrieval itself. For example, part 3 talks about increasing search “efficiency” without any consideration of what “efficiency” might mean in a particular search context.

Illuminating issues in information retrieval using Apache Lucene and Tika, as opposed to coding up an indexing/searching application with no discussion of the potential choices and tradeoffs, would make a much better tutorial.

Train online with EMBL-EBI

Saturday, July 12th, 2014

Train online with EMBL-EBI

From the webpage:

Train online provides free courses on Europe’s most widely used data resources, created by experts at EMBL-EBI and collaborating institutes. You do not need to have any previous experience of bioinformatics to benefit from this training. We want to help you to be a highly competent user of our data resources; we are not trying to train you to become a bioinformatician.

You can use Train online to learn in your own time and at your own pace. You can repeat the courses as many times as you like, or just complete part of a course if you want to brush up on how to perform a specific task.

An interesting collection of training materials on bioinformatics resources.

As the webpage says, it won’t train you to be a bioinformatician but it can make you a more effective user of the resource covered.

Keep in mind if you are working in a bioinformatics project or are interested in how other domains organize their information.

I first saw this in a tweet by Neil Saunders which pointed to: Scaling up bioinformatics training online by Ewan Birney.


mtx: a swiss-army knife for information retrieval

Saturday, March 29th, 2014

mtx: a swiss-army knife for information retrieval

From the webpage:

mtx is a command-line tool for rapidly trying new ideas in Information Retrieval and Machine Learning.

mtx is the right tool if you secretly wish you could:

  • play with Wikipedia-sized datasets on your laptop
  • do it interactively, like the boys whose data fits in Matlab
  • quickly test that too-good-to-be-true algorithm you see at SIGIR
  • try ungodly concoctions, like BM25-weighted PageRank over ratings
  • cache all intermediate results, so you never have to re-run a month-long job
  • use awk/perl to hack internal data structures half-way through a computation

mtx is made for Unix hackers. It is a shell tool, not a library or an application. It’s designed for interactive use and relies on your shell’s tab-completion and history features. For scripting it, I highly recommend this.
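The “BM25-weighted” bullet refers to the Okapi BM25 ranking function. For reference, here is a sketch of one common BM25 variant (mtx’s exact weighting and parameter defaults may differ):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each doc (a list of tokens) against the query terms.

    One common Okapi BM25 variant: idf uses the +1 smoothing form,
    tf is saturated by k1 and length-normalized by b.
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in df:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [["a", "b", "a"], ["b", "c"], ["c", "c"]]
scores = bm25_scores(["a"], docs)  # only the first doc contains "a"
```

The appeal of the formula for a tool like mtx is that it is a per-(term, document) weight, so it composes with other matrix-style operations (the “BM25-weighted PageRank” concoction in the list).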

What do you have on your bootable USB stick? 😉

Rule-based deduplication…

Friday, January 17th, 2014

Rule-based deduplication of article records from bibliographic databases by Yu Jiang et al.


We recently designed and deployed a metasearch engine, Metta, that sends queries and retrieves search results from five leading biomedical databases: PubMed, EMBASE, CINAHL, PsycINFO and the Cochrane Central Register of Controlled Trials. Because many articles are indexed in more than one of these databases, it is desirable to deduplicate the retrieved article records. This is not a trivial problem because data fields contain a lot of missing and erroneous entries, and because certain types of information are recorded differently (and inconsistently) in the different databases. The present report describes our rule-based method for deduplicating article records across databases and includes an open-source script module that can be deployed freely. Metta was designed to satisfy the particular needs of people who are writing systematic reviews in evidence-based medicine. These users want the highest possible recall in retrieval, so it is important to err on the side of not deduplicating any records that refer to distinct articles, and it is important to perform deduplication online in real time. Our deduplication module is designed with these constraints in mind. Articles that share the same publication year are compared sequentially on parameters including PubMed ID number, digital object identifier, journal name, article title and author list, using text approximation techniques. In a review of Metta searches carried out by public users, we found that the deduplication module was more effective at identifying duplicates than EndNote without making any erroneous assignments.
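The matching logic the abstract describes can be sketched schematically (the field names, threshold, and rule ordering below are illustrative assumptions; the paper’s actual rules are more elaborate): shared identifiers decide immediately, and otherwise records must share a year and have near-identical titles, erring on the side of not merging.

```python
from difflib import SequenceMatcher

def same_article(a, b, threshold=0.9):
    """Decide whether two article records (dicts) refer to one article.

    Hypothetical rule set in the spirit of the paper's module:
    identifiers first, then conservative text approximation.
    """
    # Rule 1: a shared identifier (PubMed ID or DOI) decides immediately.
    for key in ("pmid", "doi"):
        if a.get(key) and b.get(key):
            return a[key] == b[key]
    # Rule 2: otherwise require the same publication year plus
    # near-identical titles; err on the side of NOT merging, since
    # systematic-review users want the highest possible recall.
    if a.get("year") != b.get("year"):
        return False
    ratio = SequenceMatcher(None, a["title"].lower(),
                            b["title"].lower()).ratio()
    return ratio >= threshold
```

Comparing only records that share a publication year (as the abstract says Metta does) also keeps the pairwise comparison cheap enough to run online in real time.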

I found this report encouraging, particularly when read alongside Rule-based Information Extraction is Dead!…, with regard to merging rules authored by human editors.

Both reports indicate a pressing need for more complex rules than matching a URI for purposes of deduplication (merging in topic maps terminology).

I assume such rules would need to be easier for average users to declare than TMCL.

Rule-based Information Extraction is Dead!…

Sunday, January 5th, 2014

Rule-based Information Extraction is Dead! Long Live Rule-based Information Extraction Systems! by Laura Chiticariu, Yunyao Li, and Frederick R. Reiss.


The rise of “Big Data” analytics over unstructured text has led to renewed interest in information extraction (IE). We surveyed the landscape of IE technologies and identified a major disconnect between industry and academia: while rule-based IE dominates the commercial world, it is widely regarded as dead-end technology by the academia. We believe the disconnect stems from the way in which the two communities measure the benefits and costs of IE, as well as academia’s perception that rule-based IE is devoid of research challenges. We make a case for the importance of rule-based IE to industry practitioners. We then lay out a research agenda in advancing the state-of-the-art in rule-based IE systems which we believe has the potential to bridge the gap between academic research and industry practice.

After demonstrating the disconnect between industry (rule-based) and academia (ML) approaches to information extraction, the authors propose:

Define standard IE rule language and data model.

If research on rule-based IE is to move forward in a principled way, the community needs a standard way to express rules. We believe that the NLP community can replicate the success of the SQL language in connecting data management research and practice. SQL has been successful largely due to: (1) expressivity: the language provides all primitives required for performing basic manipulation of structured data, (2) extensibility: the language can be extended with new features without fundamental changes to the language, (3) declarativity: the language allows the specification of computation logic without describing its control flow, thus allowing developers to code what the program should accomplish, rather than how to accomplish it.

On the contrary, both industry and academia would be better served by domain specific declarative languages (DSDLs).

I say “domain specific” because each domain has its own terms and semantics that are embedded in those terms. If we don’t want to repeat the chaos of owl:sameAs, we had better enable users to define and document the semantics they attach to terms, either as operators or as data.

A host of research problems open up when semantic domains are enabled to document the semantics of their data structures and data. How do semantic understandings evolve over time within a community? Rather difficult to answer if its semantics are never documented. What are the best ways to map between the documented semantics of different communities? Again, difficult to answer without pools of documented semantics of different communities.

Not to mention the development of IE and mapping languages, which share a common core value of documenting semantics and extracting information but have specific features for particular domains. There is no reason to expect or hope that a language designed for genomic research will have all the features needed for monetary arbitrage analysis.

Rather than seeking an “Ur” language for documenting semantics/extracting data, industry can demonstrate ROI and academia progress, with targeted, declarative languages that are familiar to members of individual domains.
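To make the declarative point concrete, here is a toy sketch (entirely hypothetical rule names and patterns) of what a small domain-specific rule set might look like: each rule declares what to extract, and a generic engine owns the control flow.

```python
import re

# Each rule declares WHAT to extract (a named pattern), not HOW;
# the engine below supplies the control flow. A real rule language
# would add types, context conditions, and documented semantics
# for each field, per domain.
RULES = {
    "phone": re.compile(r"\b\d{3}-\d{4}\b"),
    "money": re.compile(r"\$\d+(?:\.\d{2})?"),
}

def extract(text, rules=RULES):
    """Apply every declared rule to the text; return matches per rule."""
    return {name: pat.findall(text) for name, pat in rules.items()}

out = extract("Call 555-1234 about the $99.50 invoice.")
```

A genomics rule set and an arbitrage rule set would share this engine and declarative shape while declaring entirely different patterns and semantics, which is the ROI argument above.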

I first saw this in a tweet by Kyle Wade Grove.

Advances in Neural Information Processing Systems 26

Sunday, December 8th, 2013

Advances in Neural Information Processing Systems 26

The NIPS 2013 conference ended today.

All of the NIPS 2013 papers were posted today.

I count three hundred and sixty (360) papers.

From the NIPS Foundation homepage:

The Foundation: The Neural Information Processing Systems (NIPS) Foundation is a non-profit corporation whose purpose is to foster the exchange of research on neural information processing systems in their biological, technological, mathematical, and theoretical aspects. Neural information processing is a field which benefits from a combined view of biological, physical, mathematical, and computational sciences.

The primary focus of the NIPS Foundation is the presentation of a continuing series of professional meetings known as the Neural Information Processing Systems Conference, held over the years at various locations in the United States, Canada and Spain.

Enjoy the proceedings collection!

I first saw this in a tweet by Benoit Maison.

Introduction to Information Retrieval

Wednesday, November 6th, 2013

Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze.

A bit dated now (2008) but the underlying principles of information retrieval remain the same.

I have a hard copy but the additional materials and ability to cut-n-paste will make this a welcome resource!

We’d be pleased to get feedback about how this book works out as a textbook, what is missing, or covered in too much detail, or what is simply wrong. Please send any feedback or comments to: informationretrieval (at) yahoogroups (dot) com

Online resources

Apart from small differences (mainly concerning copy editing and figures), the online editions should have the same content as the print edition.

The following materials are available online. The date of last update is given in parentheses.

Information retrieval resources

A list of information retrieval resources is also available.

Introduction to Information Retrieval: Table of Contents

Each chapter is available online in both PDF and HTML.

  • Front matter (incl. table of notations)
  • 01 Boolean retrieval
  • 02 The term vocabulary & postings lists
  • 03 Dictionaries and tolerant retrieval
  • 04 Index construction
  • 05 Index compression
  • 06 Scoring, term weighting & the vector space model
  • 07 Computing scores in a complete search system
  • 08 Evaluation in information retrieval
  • 09 Relevance feedback & query expansion
  • 10 XML retrieval
  • 11 Probabilistic information retrieval
  • 12 Language models for information retrieval
  • 13 Text classification & Naive Bayes
  • 14 Vector space classification
  • 15 Support vector machines & machine learning on documents
  • 16 Flat clustering
  • 17 Hierarchical clustering
  • 18 Matrix decompositions & latent semantic indexing
  • 19 Web search basics
  • 20 Web crawling and indexes
  • 21 Link analysis
  • Bibliography & Index (plus BibTeX file)

Information Extraction from the Internet

Saturday, August 24th, 2013

Information Extraction from the Internet by Nan Tang.

From the description at Amazon ($116.22):

As the Internet continues to become part of our lives, there now exists an overabundance of reliable information sources on this medium. The temporal and cognitive resources of human beings, however, do not change. “Information Extraction from the Internet” provides methods and tools for Web information extraction and retrieval. Success in this area will greatly enhance business processes and provide information seekers new tools that allow them to reduce their searching time and cost involvement. This book focuses on the latest approaches for Web content extraction, and analyzes the limitations of existing technology and solutions. “Information Extraction from the Internet” includes several interesting and popular topics that are being widely discussed in the area of information extraction: data sparsity and field-associated knowledge (Chapters 1–2), Web agent design and mining components (Chapters 3–4), extraction skills on various documents (Chapters 5–7), duplicate detection for music documents (Chapter 8), name disambiguation in digital libraries using Web information (Chapter 9), Web personalization and user-behavior issues (Chapters 10–11), and information retrieval case studies (Chapters 12–14). “Information Extraction from the Internet” is suitable for advanced undergraduate students and postgraduate students. It takes a practical approach rather than a conceptual approach. Moreover, it offers a truly reader-friendly way to get to the subject related to information extraction, making it the ideal resource for any student new to this subject, and providing a definitive guide to anyone in this vibrant and evolving discipline. This book is an invaluable companion for students, from their first encounter with the subject to more advanced studies, while the full-color artworks are designed to present the key concepts with simplicity, clarity, and consistency.

I discovered this volume while searching for the publisher of: On-demand Synonym Extraction Using Suffix Arrays.

As you can see from the description, a wide ranging coverage of information extraction interests.

All of the chapters are free for downloading at the publisher’s site.

iConcepts Press has a number of books and periodicals you may find interesting.

Fundamentals of Information Retrieval: Illustration with Apache Lucene

Wednesday, June 19th, 2013

Fundamentals of Information Retrieval: Illustration with Apache Lucene by Majirus FANSI.

From the description:

Information Retrieval is becoming the principal means of access to information. It is now common for web applications to provide an interface for free text search. In this talk we start by describing the scientific underpinnings of information retrieval. We review the main models on which the main search tools are based, i.e. the Boolean model and the Vector Space Model. We illustrate our talk with a web application based on Lucene. We show that Lucene combines both the Boolean and vector space models.

The presentation will give an overview of what Lucene is, where and how it can be used. We will cover the basic Lucene concepts (index, directory, document, field, term), text analysis (tokenizing, token filtering, stop words), indexing (how to create an index, how to index documents), and searching (how to run keyword, phrase, Boolean and other queries). We’ll inspect Lucene indices with Luke.

After this talk, the attendee will get the fundamentals of IR as well as how to apply them to build a search application with Lucene.
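A toy illustration of the two models the talk pairs, and of how a Lucene-style engine combines them (Boolean filtering selects candidate documents, vector-space scoring ranks them); the tf-idf weighting here is deliberately bare-bones, not Lucene’s actual scoring formula:

```python
import math
from collections import Counter, defaultdict

DOCS = {
    1: "information retrieval with lucene",
    2: "lucene scores documents with the vector space model",
    3: "boolean queries filter documents",
}

# Boolean model's data structure: an inverted index, term -> doc-id set.
index = defaultdict(set)
for doc_id, text in DOCS.items():
    for term in text.split():
        index[term].add(doc_id)

def boolean_and(*terms):
    # Boolean model: intersect the posting lists of all query terms.
    return set.intersection(*(index[t] for t in terms))

def tfidf_score(query, doc_id):
    # Vector space model: a minimal tf-idf dot product.
    n = len(DOCS)
    tf = Counter(DOCS[doc_id].split())
    return sum(tf[t] * math.log(n / len(index[t]))
               for t in query.split() if t in index)

# Lucene-style flow: Boolean filtering picks candidates,
# vector-space scoring ranks them.
candidates = boolean_and("lucene", "documents")
ranked = sorted(candidates,
                key=lambda d: tfidf_score("lucene documents", d),
                reverse=True)
```

Seeing both stages in one place makes the talk’s claim concrete: the Boolean model answers “which documents match?”, the vector space model answers “in what order?”.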

I am assuming that the random lines in the background of the slides are an artifact of the recording. Quite annoying.

Otherwise, a great presentation!

HCIR [Human-Computer Information Retrieval] site gets publication page

Saturday, March 30th, 2013

HCIR site gets publication page by Gene Golovchinsky.

From the post:

Over the past six years of the HCIR series of meetings, we’ve accumulated a number of publications. We’ve had a series of reports about the meetings, papers published in the ACM Digital Library, and an up-coming Special Issue of IP&M. In the run-up to this year’s event (stay tuned!), I decided it might be useful to consolidate these publications in one place. Hence, we now have the HCIR Publications page.

Human-Computer Information Retrieval (HCIR), in case the lingo is unfamiliar.

Will ease access to a great set of papers, at least in one respect.

One small improvement:

Do not rely upon the ACM Digital Library as the sole repository for these papers.

Access isn’t an issue for me but I suspect it may be for a number of others.

Hiding information behind a paywall diminishes its impact.


Informer Newsletter

Wednesday, February 6th, 2013

Informer, the newsletter of the BCS Information Retrieval Specialist Group.

The Winter 2013 issue of the Informer has been published!


Prior issues are also available.

Sky Survey Data Lacks Standardization [Heterogeneous Big Data]

Tuesday, November 27th, 2012

Sky Survey Data Lacks Standardization by Ian Armas Foster.

From the post:

The Sloan Digital Sky Survey is at the forefront of astronomical research, compiling data from observatories around the world in an effort to truly pinpoint where we lie on the universal map. In order to do that, they must aggregate data from several observatories across the world, an intensive data operation.

According to a report written by researchers at UCLA, even though the SDSS is a data intensive astronomical mapping survey, it has yet to lay down a standardized foundation for retrieving and storing scientific data.

Per the report, the first two projects were responsible for observing “a quarter of the sky” and picking out nearly a million galaxies and over 100,000 quasars. The project started at the Apache Point observatory in New Mexico and has since grown to include 25 observatories across the globe. The SDSS gained recognition in 2009 with the Nobel Prize in Physics awarded for the advancement of optical fibers and digital imaging detectors (or CCDs) that allowed the project to grow in scale.

The point is that the datasets that the scientists used seemed to be scattered. Some would come about through informal social contacts such as email while others would simply search for necessary datasets on Google. Further, once these datasets were found, there was even an inconsistency in how they were stored before they could be used. However, this may have had to do with the varying sizes of the sets and how quickly the researchers wished to use the data. The entire SDSS dataset consists of over 130 TB, according to the report, and that volume can be slightly unwieldy.

“Large sky surveys, including the SDSS, have significantly shaped research practices in the field of astronomy,” the report concluded. “However, these large data sources have not served to homogenize information retrieval in the field. There is no single, standardized method for discovering, locating, retrieving, and storing astronomy data.”

So, big data isn’t going to be homogeneous big data but heterogeneous big data.

That sounds like an opportunity for topic maps to me.


Will Data Storage Make Us Dumber?

Wednesday, October 10th, 2012

Coming to a data center and then a desktop near you:

Case Western Reserve University researchers have developed technology aimed at making an optical disc that holds 1 to 2 terabytes of data – the equivalent of 1,000 to 2,000 copies of Encyclopedia Britannica. The entire print collection of the Library of Congress could fit on five to 10 discs.

Only a matter of time before you have the Library of Congress on a single disc on your local computer. All of it. But then:


  • Can you find useful information about a subject?
  • If you find it once, can you find it again?
  • If you can find it again, how much work does it take?
  • Can you share your trail of discovery or “bread crumbs” with others?

If TB data storage means you can’t find information, doesn’t that mean you are getting dumber, one TB at a time?

Storage density isn’t going to slow down so we had better start working on search/IR.

See: Making computer data storage cheaper and easier

Information Retrieval and Search Engines [Committers Needed!]

Wednesday, October 10th, 2012

Information Retrieval and Search Engines

A proposal is pending to create a Q&A site for people interested in information retrieval and search engines.

But it needs people to commit to using it and answering questions!

That could be you!

There’s a lot of action left in information retrieval and search engines.

Don’t have to believe me. Have you tried one lately? 😉

Using information retrieval technology for a corpus analysis platform

Wednesday, September 26th, 2012

Using information retrieval technology for a corpus analysis platform by Carsten Schnober.

Abstract:
This paper describes a practical approach to use the information retrieval engine Lucene for the corpus analysis platform KorAP, currently being developed at the Institut für Deutsche Sprache (IDS Mannheim). It presents a method to use Lucene’s indexing technique and to exploit it for linguistically annotated data, allowing full flexibility to handle multiple annotation layers. It uses multiple indexes and MapReduce techniques in order to keep KorAP scalable.

The support for multiple annotation layers is of particular interest to me because the “subjects” of interest in a text may vary from one reader to another.

Being mindful that for topic maps, the annotation layers and annotations themselves may be subjects for some purposes.
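The multi-layer idea can be sketched as one inverted index per annotation layer over shared token positions, so a query can mix surface forms, lemmas, and part-of-speech tags freely. This is a hypothetical toy sketch, not KorAP’s actual Lucene implementation; all names here are invented for illustration:

```python
from collections import defaultdict

class MultiLayerIndex:
    """One inverted index per annotation layer, keyed on token positions."""

    def __init__(self):
        # layer -> term -> set of (doc_id, position)
        self.postings = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id, tokens):
        """tokens: list of dicts, one per position, mapping layer -> value."""
        for pos, layers in enumerate(tokens):
            for layer, value in layers.items():
                self.postings[layer][value].add((doc_id, pos))

    def query(self, layer, term):
        return self.postings[layer][term]

idx = MultiLayerIndex()
idx.add("doc1", [
    {"surface": "Dogs", "lemma": "dog", "pos": "NOUN"},
    {"surface": "bark", "lemma": "bark", "pos": "VERB"},
])
print(idx.query("lemma", "dog"))   # {('doc1', 0)}
print(idx.query("pos", "VERB"))    # {('doc1', 1)}
```

Because each layer has its own postings, adding a new annotation layer never forces re-tokenizing the others, which is the flexibility the paper is after.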

Center for Intelligent Information Retrieval (CIIR) [University of Massachusetts Amherst]

Tuesday, August 28th, 2012

Center for Intelligent Information Retrieval (CIIR)

From the webpage:

The Center for Intelligent Information Retrieval (CIIR) is one of the leading research groups working in the areas of information retrieval and information extraction. The CIIR studies and develops tools that provide effective and efficient access to large networks of heterogeneous, multimedia information.

CIIR accomplishments include significant research advances in the areas of retrieval models, distributed information retrieval, information filtering, information extraction, topic models, social network analysis, multimedia indexing and retrieval, document image processing, search engine architecture, text mining, structured data retrieval, summarization, evaluation, novelty detection, resource discovery, interfaces and visualization, digital libraries, computational social science, and cross-lingual information retrieval.

The CIIR has published more than 900 papers on these areas, and has worked with over 90 government and industry partners on research and technology transfer. Open source software supported by the Center is being used worldwide.

Please contact us to talk about potential new projects, collaborations, membership, or joining us as a graduate student or visiting researcher.

To get an idea of the range of their activities, visit the publications page and just browse.

SIGIR 2013 : ACM International Conference on Information Retrieval

Monday, August 27th, 2012

SIGIR 2013 : ACM International Conference on Information Retrieval

21 January 2013: Abstracts for full research papers due
28 January 2013: Full research paper due
4 February 2013: Workshop proposals due
18 February 2013: Posters, demonstration, and tutorial proposals due
11 March 2013: Notification of workshop acceptances
11 March 2013: Doctoral consortium proposals due
15 April 2013: All other acceptance notifications
28 July 2013: Conference Begins

From the webpage:

We are delighted to welcome SIGIR 2013 to Dublin, Ireland. SIGIR was last held in Dublin almost 20 years ago in 1994. The intervening years have seen huge growth in the field of information retrieval and we look forward to receiving submissions to help us build an exciting programme reporting latest developments in information retrieval.

Updates to follow but thought you might want extra time to plan for Dublin.

OAIR 2013 : Open Research Areas in Information Retrieval

Monday, August 27th, 2012

OAIR 2013 : Open Research Areas in Information Retrieval

When: May 22, 2013 – May 24, 2013
Where: Lisbon, Portugal
Submission Deadline: Dec 10, 2012
Notification Due: Feb 4, 2013

From the homepage:

Welcome to OAIR 2013 (the 10th International Conference in the RIAO series), taking place in Lisbon, Portugal from May 22 to 24, 2013.

The World Wide Web is the largest source of openly accessible data, and the most common means to connect people and share resources.

However, exploiting these interconnected Webs to obtain information is still an unsolved problem. This conference calls for papers describing recent research in Information Retrieval concerning the integration between a Web of Data and a Web of People, to transform pure data into information, and information into usable knowledge.

The Open research Areas in Information Retrieval (OAIR) conference is a triennial conference, addressing research topics related to the design of robust and large-scale scientific and industrial solutions to information processing.

The OAIR 2013 conference is an opportunity to present current research activities, to share knowledge within the IR scientific community, and to get updates on new scientific work developed by the IR community.

This conference is connected to the main IR personalities (see the Steering Committee list) and a considerable number of attendees are expected.

We look forward to seeing you in Europe’s westernmost and sunniest capital, LISBON!

Topics of interest include:

  • Adapting search to Users
  • Advertising and ad targeting
  • Aggregation of Results
  • Community and Context Aware Search
  • Community-based Filtering and Recommender Systems
  • Community-based IR Theory
  • Community-oriented Content Representation
  • Evaluation of Social IR
  • Improving Web via Social Media
  • Including Crowdsourcing in Search
  • Merging Heterogeneous Web Data
  • Modeling the web of people
  • Personal semantics search
  • Query log analysis
  • Search over Social Networks
  • Sentiment analysis
  • Social Multimedia and Multimodal IR
  • Social Topic detection
  • Structuring Unstructured Data
  • System Architectures for Social IR
  • User Interfaces and Interactive IR

Having connections to data, assuming anyone knows its whereabouts, isn’t quite the same as making use of it.

Natural Language Processing | Hub

Saturday, July 7th, 2012

Natural Language Processing | Hub

From the “about” page:

NLP|Hub is an aggregator of news about Natural Language Processing and other related topics, such as Text Mining, Information Retrieval, Linguistics or Machine Learning.

NLP|Hub finds, collects and arranges related news from different sites, from academic webs to company blogs.

NLP|Hub is a product of Cilenis, a company specialized in Natural Language Processing.

If you have interesting posts for NLP|Hub, or if you do not want NLP|Hub indexing your text, please contact us at

Definitely going on my short list of sites to check!

Negation for Document Re-ranking in Ad-hoc Retrieval

Tuesday, June 5th, 2012

Negation for Document Re-ranking in Ad-hoc Retrieval by Pierpaolo Basile, Annalina Caputo and Giovanni Semeraro.

Interesting slide deck that was pointed out to me by Jack Park.

On the “negation” aspects, I found it helpful to review Word Vectors and Quantum Logic Experiments with negation and disjunction by Dominic Widdows and Stanley Peters (cited as an inspiration by the slide authors).

Depending upon your definition of subject identity and subject sameness, you may find negation/disjunction useful for topic map processing.
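Widdows and Peters model “a NOT b” as projecting a onto the subspace orthogonal to b, so the result has zero similarity with the negated term. A minimal sketch with toy two-dimensional vectors (the vectors and the ‘suit’/‘lawsuit’ senses are invented for illustration):

```python
# Quantum-style negation on word vectors (Widdows & Peters):
# "a NOT b" is the component of a orthogonal to b.

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def negate(a, b):
    """Project a onto the subspace orthogonal to b: a - (a.b / b.b) * b."""
    scale = dot(a, b) / dot(b, b)
    return [x - scale * y for x, y in zip(a, b)]

# Toy vectors: 'suit' mixes a clothing sense and a legal sense.
suit = [0.8, 0.6]      # (clothing axis, legal axis)
lawsuit = [0.0, 1.0]

clothing_suit = negate(suit, lawsuit)
print(clothing_suit)                 # [0.8, 0.0]
print(dot(clothing_suit, lawsuit))   # 0.0 -- orthogonal to the negated sense
```

The orthogonality is exact, which is what distinguishes this negation from simply subtracting a fixed multiple of the unwanted vector.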

Geometric and Quantum Methods for Information Retrieval

Tuesday, June 5th, 2012

Geometric and Quantum Methods for Information Retrieval by Yaoyong Li and Hamish Cunningham.

Abstract:
This paper reviews the recent developments in applying geometric and quantum mechanics methods for information retrieval and natural language processing. It discusses the interesting analogies between components of information retrieval and quantum mechanics. It then describes some quantum mechanics phenomena found in the conventional data analysis and in the psychological experiments for word association. It also presents the applications of the concepts and methods in quantum mechanics such as quantum logic and tensor product to document retrieval and meaning of composite words, respectively. The purpose of the paper is to give the state of the art on and to draw attention of the IR community to the geometric and quantum methods and their potential applications in IR and NLP.

More complex models can (may?) lead to better IR methods, but:

Moreover, as Hilbert space is the mathematical foundation for quantum mechanics (QM), basing IR on Hilbert space creates an analogy between IR and QM and may usefully bring some concepts and methods from QM into IR. (p.24)

is a dubious claim at best.

The “analogy” between QM and IR makes the point:

  • a quantum system ↔ a collection of objects for retrieval
  • complex Hilbert space ↔ information space
  • state vector ↔ objects in collection
  • observable ↔ query
  • measurement ↔ search
  • eigenvalues ↔ relevant or not for one object
  • probability of getting one eigenvalue ↔ relevance degree of object to query

The authors are comparing apples and oranges. For example, “complex Hilbert space” and “information space.”

A “complex Hilbert space” is a model that has been found useful with another model, one called quantum mechanics.

An “information space,” on the other hand, encompasses models known to use “complex Hilbert spaces” and more; which model applies depends on the information space of interest.

Or the notion of “observable” being paired with “query.”

Complex Hilbert spaces may be quite useful for IR, but tying IR to quantum mechanics isn’t required to make use of it.
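The non-quantum core of the analogy is just the familiar vector space model: documents and queries as vectors in a term space, relevance as a (normalized) inner product. A toy sketch, with term weights invented for illustration:

```python
import math

def cosine(u, v):
    """Normalized inner product -- the workhorse of vector-space IR."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

# Toy term space: (hilbert, retrieval, quantum)
docs = {
    "d1": [1.0, 2.0, 0.0],   # mostly about retrieval
    "d2": [2.0, 0.0, 3.0],   # mostly about quantum topics
}
query = [0.0, 1.0, 0.0]      # a one-term query: "retrieval"

ranked = sorted(docs, key=lambda d: cosine(docs[d], query), reverse=True)
print(ranked)  # ['d1', 'd2'] -- d1 ranks above d2
```

Nothing here requires quantum mechanics: the inner-product machinery stands on its own, which is the point being made about the analogy.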

Information Filtering and Retrieval: Novel Distributed Systems and Applications – DART 2012

Tuesday, June 5th, 2012

6th International Workshop on Information Filtering and Retrieval: Novel Distributed Systems and Applications – DART 2012

Paper Submission: June 21, 2012
Authors Notification: July 10, 2012
Final Paper Submission and Registration: July 24, 2012

In conjunction with International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management – IC3K 2012 – 04 – 07 October, 2012 – Barcelona, Spain.


Nowadays users are more and more interested in information rather than in mere raw data. The huge amount of accessible data sources is growing rapidly. This calls for novel systems providing effective means of searching and retrieving information with the fundamental goal of making it exploitable by humans and machines.
DART focuses on researching and studying new challenges in distributed information filtering and retrieval. In particular, DART aims to investigate novel systems and tools to distributed scenarios and environments. DART will contribute to discuss and compare suitable novel solutions based on intelligent techniques and applied in real-world applications.
Information Retrieval attempts to address similar filtering and ranking problems for pieces of information such as links, pages, and documents. Information Retrieval systems generally focus on the development of global retrieval techniques, often neglecting individual user needs and preferences.
Information Filtering has drastically changed the way information seekers find what they are searching for. In fact, they effectively prune large information spaces and help users in selecting items that best meet their needs, interests, preferences, and tastes. These systems rely strongly on the use of various machine learning tools and algorithms for learning how to rank items and predict user evaluation.

Topics of Interest

Topics of interest will include (but are not limited to):

  • Web Information Filtering and Retrieval
  • Web Personalization and Recommendation
  • Web Advertising
  • Web Agents
  • Web of Data
  • Semantic Web
  • Linked Data
  • Semantics and Ontology Engineering
  • Search for Social Networks and Social Media
  • Natural Language and Information Retrieval in the Social Web
  • Real-time Search
  • Text categorization

If you are interested and have the time (or graduate students with the time), abstracts from prior conferences are here. Would be a useful exercise to search out publicly available copies. (As far as I can tell, no abstracts from DART.)

Are visual dictionaries generalizable?

Sunday, May 13th, 2012

Are visual dictionaries generalizable? by Otavio A. B. Penatti, Eduardo Valle, and Ricardo da S. Torres

Abstract:
Mid-level features based on visual dictionaries are today a cornerstone of systems for classification and retrieval of images. Those state-of-the-art representations depend crucially on the choice of a codebook (visual dictionary), which is usually derived from the dataset. In general-purpose, dynamic image collections (e.g., the Web), one cannot have the entire collection in order to extract a representative dictionary. However, based on the hypothesis that the dictionary reflects only the diversity of low-level appearances and does not capture semantics, we argue that a dictionary based on a small subset of the data, or even on an entirely different dataset, is able to produce a good representation, provided that the chosen images span a diverse enough portion of the low-level feature space. Our experiments confirm that hypothesis, opening the opportunity to greatly alleviate the burden in generating the codebook, and confirming the feasibility of employing visual dictionaries in large-scale dynamic environments.

The authors use the Caltech-101 image set because of its “diversity.” Odd because they cite the Caltech-256 image set, which was created to answer concerns about the lack of diversity in the Caltech-101 image set.

Not sure this paper answers the issues it raises about visual dictionaries.

Wanted to bring it to your attention because representative dictionaries (as opposed to comprehensive ones) may be lurking just beyond the semantic horizon.
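The paper’s hypothesis, that a codebook built from a small subset of descriptors can still quantize the rest of the collection, can be sketched with a toy k-means “visual dictionary.” All data and parameters below are invented for illustration:

```python
import random

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(ps):
    return [sum(xs) / len(ps) for xs in zip(*ps)]

def kmeans(points, k, iters=20, seed=0):
    """Tiny k-means: the codebook is the set of learned centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[i].append(p)
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

def bag_of_words(descriptors, codebook):
    """Quantize descriptors against the codebook -> visual-word histogram."""
    hist = [0] * len(codebook)
    for d in descriptors:
        hist[min(range(len(codebook)),
                 key=lambda i: dist2(d, codebook[i]))] += 1
    return hist

# Codebook learned from a small *subset* of the collection's descriptors...
subset = [[0, 0], [0, 1], [10, 10], [10, 11]]
codebook = kmeans(subset, k=2)
# ...then applied to descriptors from an unseen image.
print(bag_of_words([[0.5, 0.5], [9.5, 10.5], [10, 10]], codebook))
```

If the subset spans the low-level feature space well enough, the histogram for unseen images is still informative, which is the generalization the authors test at scale.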

Saving the Old IR Literature: a new batch

Saturday, April 21st, 2012

Saving the Old IR Literature: a new batch

Saw a retweet of a tweet from @djoerd on this new release.

Thanks ACM SIGIR! (Special Interest Group on Information Retrieval)

Just the titles should get you interested:

  • Natural Language in Information Retrieval – Donald E. Walker, Hans Karlgren, Martin Kay – Skriptor AB, Stockholm, 1977
  • Annual Report: Automatic Informative Abstracting and Extracting – L. L. Earl – Lockheed Missiles and Space Company, 1972
  • Free Text Retrieval Evaluation – Pauline Atherton, Kenneth H. Cook, Jeffrey Katzer – Syracuse University, 1972
  • Information Storage and Retrieval: Scientific Report No. ISR-7 – Gerard Salton – The National Science Foundation, 1964
  • Information Storage and Retrieval: Scientific Report No. ISR-8 – Gerard Salton – The National Science Foundation, 1964
  • Information Storage and Retrieval: Scientific Report No. ISR-9 – Gerard Salton – The National Science Foundation, 1965
  • Information Storage and Retrieval: Scientific Report No. ISR-14 – Gerard Salton – The National Science Foundation, 1968
  • Information Storage and Retrieval: Scientific Report No. ISR-16 – Gerard Salton – The National Science Foundation, 1969
  • Automatic Indexing: A State of the Art Review – Karen Sparck Jones – Computer Laboratory, University of Cambridge – British Library Research and Development Report No. 5193, 1974
  • Final Report on International Research Forum in Information Science: The Theoretical Basis of Information Science – B.C. Vickery, S.E. Robertson, N.J. Belkin – British Library Research and Development Report No. 5262, 1975
  • Report on the Need for and Provision for an ‘IDEAL’ Information Retrieval Test Collection – K. Sparck Jones, C.J. Van Rijsbergen – Computer Laboratory, University of Cambridge – British Library Research and Development Report No. 5266, 1975
  • Report on a Design Study for the ‘IDEAL’ Information Retrieval Test Collection – K. Sparck Jones, R.G. Bates – Computer Laboratory, University of Cambridge – British Library Research and Development Report No. 5428, 1977
  • Research on Automatic Indexing 1974-1976, Volume 1: Text – K. Sparck Jones, R.G. Bates – Computer Laboratory, University of Cambridge – British Library Research and Development Report No. 5464, 1977
  • Statistical Bases of Relevance Assessment for the ‘IDEAL’ Information Retrieval Test Collection – H. Gilbert, K. Sparck Jones – Computer Laboratory, University of Cambridge – British Library Research and Development Report No. 5481, 1979
  • Design Study for an Anomalous State of Knowledge Based Information Retrieval System – N.J. Belkin, R.N. Oddy – University of Aston, Computer Centre – British Library Research and Development Report No. 5547, 1979
  • Research on Relevance Weighting, 1976-1979 – K. Sparck Jones, C.A. Webster – Computer Laboratory, University of Cambridge – British Library Research and Development Report No. 5553, 1980
  • New Models in Probabilistic Information Retrieval – C.J. van Rijsbergen, S.E. Robertson, M.F. Porter – Computer Laboratory, University of Cambridge – British Library Research and Development Report No. 5587, 1980
  • Statistical problems in the application of probabilistic models to information retrieval – S.E. Robertson, J.D. Bovey – Centre for Information Science, City University – British Library Research and Development Report No. 5739, 1982
  • A front-end for IR experiments – S.E. Robertson, J.D. Bovey – Centre for Information Science, City University – British Library Research and Development Report No. 5807, 1983
  • An operational evaluation of weighting, ranking and relevance feedback via a front-end system – S.E. Robertson, C.L. Thompson – Centre for Information Science, City University – British Library Research and Development Report No. 5549, 1987
  • Okapi at City: An evaluation facility for interactive – Stephen Walker, Micheline Hancock-Beaulieu – Centre for Information Science, City University – British Library Research and Development Report No. 6056, 1991
  • Improving Subject Retrieval in Online Catalogues: Stemming, automatic spelling correction and cross-reference tables – Stephen Walker, Richard M Jones – The Polytechnic of Central London – British Library Research Paper No. 24, 1987
  • Designing an Online Public Access Catalogue: Okapi, a catalogue on a local area network – Nathalie Nadia Mitev, Gillian M Venner, Stephen Walker – The Polytechnic of Central London – Library and Information Research Report 39, 1985
  • Improving Subject Retrieval in Online Catalogues: Relevance feedback and query expansion – Stephen Walker, Rachel De Vere – The Polytechnic of Central London – British Library Research Paper No. 72, 1989
  • Evaluation of Online Catalogues: an assessment of methods – Micheline Hancock-Beaulieu, Stephen Robertson, Colin Neilson – Centre for Information Science, City University – British Library Research Paper No. 78, 1990