Archive for the ‘Concept Detection’ Category

You Say “Concepts” I Say “Subjects”

Wednesday, August 27th, 2014

Researchers are cracking text analysis one dataset at a time by Derrick Harris.

From the post:

Google on Monday released the latest in a string of text datasets designed to make it easier for people outside its hallowed walls to build applications that can make sense of all the words surrounding them.

As explained in a blog post, the company analyzed the New York Times Annotated Corpus — a collection of millions of articles spanning 20 years, tagged for properties such as people, places and things mentioned — and created a dataset that ranks the salience (or relative importance) of every name mentioned in each one of those articles.

Essentially, the goal with the dataset is to give researchers a base understanding of which entities are important within particular pieces of content, an understanding that should then be complemented with background data sources that will provide even more information. So while the number of times a person or company is mentioned in an article can be a very strong sign of which words are important — especially when compared to the usual mention count for that word, one of the early methods for ranking search results — a more telling method of ranking importance would also leverage existing knowledge of broader concepts to capture important words that don’t stand out from a volume perspective.

A summary of some of the recent work on recognizing concepts in text and not just key words.

As topic mappers know, there is no universal one-to-one correspondence between words and subjects (“concepts” in this article). Finding “concepts” means that, whatever words triggered the recognition, we can supply other information that is known about the same concept.

It will certainly make topic map authoring easier when text analytics can generate occurrence data and decorate existing topic maps with their findings.
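To make the "mention count compared to the usual mention count" baseline from the quote concrete, here is a minimal sketch in Python. It is not Google's salience method, only a TF-IDF style score of the kind the article calls an early ranking technique. The function name, the smoothing, and the toy counts are all assumptions made for illustration.

```python
import math
from collections import Counter

def rank_by_relative_mentions(doc_mentions, background_counts, background_docs):
    """Rank entities in one article by mention count, weighted by how rare
    the entity is across a background collection (a TF-IDF style score)."""
    counts = Counter(doc_mentions)
    scores = {}
    for entity, tf in counts.items():
        # Number of background documents that mention this entity.
        df = background_counts.get(entity, 0)
        # Smoothed inverse document frequency: rarer entities score higher.
        idf = math.log((1 + background_docs) / (1 + df)) + 1.0
        scores[entity] = tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy data, not drawn from the New York Times Annotated Corpus.
mentions = ["Google", "New York Times", "Google", "Obama"]
background = {"Google": 50, "New York Times": 900, "Obama": 400}
ranking = rank_by_relative_mentions(mentions, background, background_docs=1000)
for entity, score in ranking:
    print(f"{entity}: {score:.2f}")
```

A score like this only captures entities that stand out by volume. The point of the quoted passage is that background knowledge about concepts is needed to catch the important ones that do not.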

Embedding Concepts in text for smarter searching with Solr4

Sunday, August 11th, 2013

Embedding Concepts in text for smarter searching with Solr4 by Sujit Pal.

From the post:

Storing the concept map for a document in a payload field works well for queries that can treat the document as a bag of concepts. However, if you want to consider the concept’s position(s) in the document, then you are out of luck. For queries that resolve to multiple concepts, it makes sense to rank documents with these concepts close together higher than those which had these concepts far apart, or even drop them from the results altogether.

We handle this requirement by analyzing each document against our medical taxonomy, and annotating recognized words and phrases with the appropriate concept ID before it is sent to the index. At index time, a custom token filter similar to the SynonymTokenFilter (described in the LIA2 Book) places the concept ID at the start position of the recognized word or phrase. Resolved multi-word phrases are retained as single tokens – for example, the phrase “breast cancer” becomes “breast0cancer”. This allows us to rewrite queries such as “breast cancer radiotherapy”~5 as “2790981 2791965”~5.

One obvious advantage is that synonymy is implicitly supported with the rewrite. Medical literature is rich with synonyms and acronyms – for example, “breast cancer” can be variously called “breast neoplasm”, “breast CA”, etc. Once we rewrite the query, 2790981 will match against a document annotation that is identical for each of these various synonyms.

Another advantage is the increase of precision since we are dealing with concepts rather than groups of words. For example, “radiotherapy for breast cancer patients” would not match our query since “breast cancer patient” is a different concept than “breast cancer” and we choose the longest subsequence to annotate.

Yet another advantage of this approach is that it can support mixed queries. Assume that a query can only be partially resolved to concepts. You can still issue the partially resolved query against the index, and it would pick up the records where the pattern of concept IDs and words appear.

Finally, since this is just a (slightly rewritten) Solr query, all the features of standard Lucene/Solr proximity searches are available to you.

In this post, I describe the search side components that I built to support this approach. It involves a custom TokenFilter and a custom Analyzer that wraps it, along with a few lines of configuration code. The code is in Scala and targets Solr 4.3.0.

So if Solr4 can make documents smarter, can the same be said about topics?

Recall that a “document” for Solr is defined by your indexing, not by some arbitrary byte count.

As we are indexing topics, we could add information to topics to make merging more robust.

One possible topic map flow being:

Index -> addToTopics -> Query -> Results -> Merge for Display.
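The query rewrite described in the quoted post can be sketched in a few lines. This is a minimal Python illustration of longest-match annotation, not Sujit Pal's Scala TokenFilter or any Solr API. The IDs 2790981 and 2791965 come from the quoted example (assigning 2791965 to "radiotherapy" is inferred from their order); the ID for "breast cancer patients" and the function names are hypothetical.

```python
# Toy taxonomy: lowercased phrase -> concept ID.
TAXONOMY = {
    "breast cancer": "2790981",
    "breast neoplasm": "2790981",         # synonym, same concept
    "breast ca": "2790981",               # acronym form, same concept
    "radiotherapy": "2791965",
    "breast cancer patients": "9999999",  # hypothetical ID, a different concept
}

def annotate(tokens, taxonomy, max_len=4):
    """Replace the longest recognized phrase at each position with its concept ID."""
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in taxonomy:
                out.append(taxonomy[phrase])
                i += n
                break
        else:
            # Unresolved words are kept, which gives a mixed query.
            out.append(tokens[i])
            i += 1
    return out

def rewrite_query(text, slop, taxonomy):
    """Rewrite a phrase query into a proximity query over concept IDs and words."""
    return '"{}"~{}'.format(" ".join(annotate(text.split(), taxonomy)), slop)

print(rewrite_query("breast cancer radiotherapy", 5, TAXONOMY))
print(rewrite_query("breast CA radiotherapy", 5, TAXONOMY))
print(annotate("radiotherapy for breast cancer patients".split(), TAXONOMY))
```

Because the longest phrase wins, "breast cancer patients" resolves to its own concept and no longer matches a query for "breast cancer", which is the precision gain the post describes.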


From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas

Friday, May 18th, 2012

From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas by Valentin Spitkovsky and Peter Norvig (Google Research Team).

From the post:

Human language is both rich and ambiguous. When we hear or read words, we resolve meanings to mental representations, for example recognizing and linking names to the intended persons, locations or organizations. Bridging words and meaning — from turning search queries into relevant results to suggesting targeted keywords for advertisers — is also Google’s core competency, and important for many other tasks in information retrieval and natural language processing. We are happy to release a resource, spanning 7,560,141 concepts and 175,100,788 unique text strings, that we hope will help everyone working in these areas.

How do we represent concepts? Our approach piggybacks on the unique titles of entries from an encyclopedia, which are mostly proper and common noun phrases. We consider each individual Wikipedia article as representing a concept (an entity or an idea), identified by its URL. Text strings that refer to concepts were collected using the publicly available hypertext of anchors (the text you click on in a web link) that point to each Wikipedia page, thus drawing on the vast link structure of the web. For every English article we harvested the strings associated with its incoming hyperlinks from the rest of Wikipedia, the greater web, and also anchors of parallel, non-English Wikipedia pages. Our dictionaries are cross-lingual, and any concept deemed too fine can be broadened to a desired level of generality using Wikipedia’s groupings of articles into hierarchical categories.

(examples omitted)

The database that we are providing was designed for recall. It is large and noisy, incorporating 297,073,139 distinct string-concept pairs, aggregated over 3,152,091,432 individual links, many of them referencing non-existent articles. For technical details, see our paper (to be presented at LREC 2012) and the README file accompanying the data. (emphasis added)

Did you catch those numbers?

Now there is a truly remarkable resource.

What will you make out of it?
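As a starting point, here is a minimal Python sketch of the kind of structure the post describes: anchor strings paired with the Wikipedia articles they link to, aggregated into counts. It does not read the released data files or follow their format. The function names, the lowercasing of strings, and the toy links are assumptions for illustration only.

```python
from collections import Counter, defaultdict

def build_dictionary(links):
    """Aggregate (anchor text, target article) links into per-string concept counts."""
    table = defaultdict(Counter)
    for anchor, concept in links:
        # Lowercasing is an assumption of this sketch, not a property of the resource.
        table[anchor.lower()][concept] += 1
    return table

def concepts_for(table, text):
    """Return candidate concepts for a string with their relative frequencies."""
    counts = table.get(text.lower(), Counter())
    total = sum(counts.values())
    return [(concept, n / total) for concept, n in counts.most_common()]

# Hypothetical toy links, not taken from the released dictionaries.
links = [
    ("Jaguar", "Jaguar_Cars"),
    ("jaguar", "Jaguar"),
    ("jaguar", "Jaguar"),
    ("big cat", "Jaguar"),
]
table = build_dictionary(links)
candidates = concepts_for(table, "Jaguar")
print(candidates)
```

The same table read in the other direction (concept to strings) gives the names by which a subject is known, which is where it starts to look like raw material for topic maps.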


FACTA – Finding Associated Concepts with Text Analysis

Sunday, December 4th, 2011

FACTA – Finding Associated Concepts with Text Analysis

From the Quick Start Guide:

FACTA is a simple text mining tool to help discover associations between biomedical concepts mentioned in MEDLINE articles. You can navigate these associations and their corresponding articles in a highly interactive manner. The system accepts an arbitrary query term and displays relevant concepts on the spot. A broad range of concepts are retrieved by the use of large-scale biomedical dictionaries containing the names of important concepts such as genes, proteins, diseases, and chemical compounds.

A very good example of an exploration tool that isn’t overly complex to use.

3rd Canadian Semantic Web Symposium

Tuesday, September 13th, 2011

CSWS2011: The 3rd Canadian Semantic Web Symposium
Proceedings of the 3rd Canadian Semantic Web Symposium
Vancouver, British Columbia, Canada, August 5, 2011

An interesting set of papers! I suppose I can be forgiven for looking at the text mining (Hassanpour & Das) and heterogeneous information systems (Khan, Doucette, and Cohen) papers first. 😉 More comments to follow on those.

What are your favorite papers in this batch and why?

The whole proceedings can also be downloaded as a single PDF file.

Edited by:

Christopher J. O. Baker *
Helen Chen **
Ebrahim Bagheri ***
Weichang Du ****

* University of New Brunswick, Saint John, NB, Canada, Department of Computer Science & Applied Statistics
** University of Waterloo, Waterloo, ON, Canada, School of Public Health and Health Systems
*** Athabasca University, School of Computing and Information Systems
**** University of New Brunswick, NB, Canada, Faculty of Computer Science

Table of Contents

Full Paper

  1. The Social Semantic Subweb of Virtual Patient Support Groups
    Harold Boley, Omair Shafiq, Derek Smith, Taylor Osmun
  2. Leveraging SADI Semantic Web Services to Exploit Fish Ecotoxicology Data
    Matthew M. Hindle, Alexandre Riazanov, Edward S. Goudreau, Christopher J. Martyniuk, Christopher J. O. Baker

Short Paper

  3. Towards Evaluating the Impact of Semantic Support for Curating the Fungus Scientific Literature
    Marie-Jean Meurs, Caitlin Murphy, Nona Naderi, Ingo Morgenstern, Carolina Cantu, Shary Semarjit, Greg Butler, Justin Powlowski, Adrian Tsang, René Witte
  4. Ontology based Text Mining of Concept Definitions in Biomedical Literature
    Saeed Hassanpour, Amar K. Das
  5. Social and Semantic Computing in Support of Citizen Science
    Joel Sachs, Tim Finin
  6. Unresolved Issues in Ontology Learning
    Amal Zouaq, Dragan Gaševic, Marek Hatala

Poster

  7. Towards Integration of Semantically Enabled Service Families in the Cloud
    Marko Boškovic, Ebrahim Bagheri, Georg Grossmann, Dragan Gaševic, Markus Stumptner
  8. SADI for GMOD: Semantic Web Services for Model Organism Databases
    Ben Vandervalk, Michel Dumontier, E Luke McCarthy, Mark D Wilkinson
  9. An Ontological Approach for Querying Distributed Heterogeneous Information Systems
    Atif Khan, John A. Doucette, Robin Cohen

Please see the CSWS2011 website for further details.

UIMA Concept Mapping Interface to Lucene/Neo4j Datastore

Thursday, August 11th, 2011

UIMA Concept Mapping Interface to Lucene/Neo4j Datastore

From the post:

Over the past few weeks (months?) I have been trying to build a system for concept mapping text against a structured vocabulary of concepts stored in a RDBMS. The concepts (and their associated synonyms) are passed through a custom Lucene analyzer chain and stored in a Neo4j database with a Lucene Index retrieval backend. The concept mapping interface is an UIMA aggregate Analysis Engine (AE) that uses this Neo4j/Lucene combo to annotate HTML, plain text and query strings with concept annotations.

Sounds interesting.

In particular:

…concepts (and their associated synonyms) are passed through….

sounds like topic map talk to me, partner!

Depends on where you put your emphasis.

Probabilistic User Modeling in the Presence of Drifting Concepts

Saturday, December 4th, 2010

Probabilistic User Modeling in the Presence of Drifting Concepts Author(s): Vikas Bhardwaj, Ramaswamy Devarajan


We investigate supervised prediction tasks which involve multiple agents over time, in the presence of drifting concepts. The motivation behind choosing the topic is that such tasks arise in many domains which require predicting human actions. An example of such a task is recommender systems, where it is required to predict the future ratings, given features describing items and context along with the previous ratings assigned by the users. In such a system, the relationships among the features and the class values can vary over time. A common challenge to learners in such a setting is that this variation can occur both across time for a given agent, and also across different agents, (i.e. each agent behaves differently). Furthermore, the factors causing this variation are often hidden. We explore probabilistic models suitable for this setting, along with efficient algorithms to learn the model structure. Our experiments use the Netflix Prize dataset, a real world dataset which shows the presence of time variant concepts. The results show that the approaches we describe are more accurate than alternative approaches, especially when there is a large variation among agents. All the data and source code would be made open-source under the GNU GPL.

Interesting because not only do concepts drift from user to user but modeling users as existing in neighborhoods of other users was more accurate than purely homogeneous or heterogeneous models.


  1. If there is a “neighborhood” effect on users, what, if anything, does that imply for co-occurrence of terms? (3-5 pages, no citations)
  2. How would you determine “neighborhood” boundaries for terms? (3-5 pages, citations)
  3. Do “neighborhoods” for terms vary by semantic domains? (3-5 pages, citations)

Be aware that the Netflix dataset is no longer available, possibly in response to privacy concerns. A demonstration of the utility of such concerns and their advocates.

The University of Amsterdam’s Concept Detection System at ImageCLEF 2009

Saturday, November 6th, 2010

The University of Amsterdam’s Concept Detection System at ImageCLEF 2009.
Authors: Koen E. A. van de Sande, Theo Gevers and Arnold W. M. Smeulders
Keywords: Color, Invariance, Concept Detection, Object and Scene Recognition, Bag-of-Words, Photo Annotation, Spatial Pyramid


Our group within the University of Amsterdam participated in the large-scale visual concept detection task of ImageCLEF 2009. Our experiments focus on increasing the robustness of the individual concept detectors based on the bag-of-words approach, and less on the hierarchical nature of the concept set used. To increase the robustness of individual concept detectors, our experiments emphasize in particular the role of visual sampling, the value of color invariant features, the influence of codebook construction, and the effectiveness of kernel-based learning parameters. The participation in ImageCLEF 2009 has been successful, resulting in the top ranking for the large-scale visual concept detection task in terms of both EER and AUC. For 40 out of 53 individual concepts, we obtain the best performance of all submissions to this task. For the hierarchical evaluation, which considers the whole hierarchy of concepts instead of single detectors, using the concept likelihoods estimated by our detectors directly works better than scaling these likelihoods based on the class priors.

Good example of the content to expect from ImageCLEF papers.

This is a very important area of rapidly developing research.

ImageCLEF – The CLEF Cross Language Image Retrieval Track

Saturday, November 6th, 2010

ImageCLEF – The CLEF Cross Language Image Retrieval Track.

The European side of working with digital video.

From the 2009 event website:

ImageCLEF is the cross-language image retrieval track run as part of the Cross Language Evaluation Forum (CLEF) campaign. This track evaluates retrieval of images described by text captions based on queries in a different language; both text and image matching techniques are potentially exploitable.

TREC Video Retrieval Evaluation

Saturday, November 6th, 2010

TREC Video Retrieval Evaluation.

Since I have posted several resources on digital video and concept discovery today, listing the TREC track on the same seemed appropriate.

From the website:

The TREC conference series is sponsored by the National Institute of Standards and Technology (NIST) with additional support from other U.S. government agencies. The goal of the conference series is to encourage research in information retrieval by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results. In 2001 and 2002 the TREC series sponsored a video “track” devoted to research in automatic segmentation, indexing, and content-based retrieval of digital video. Beginning in 2003, this track became an independent evaluation (TRECVID) with a workshop taking place just before TREC.

You will find publications, tools, bibliographies, data sets, etc. A first-class resource site.

Internet Multimedia Search and Mining

Saturday, November 6th, 2010

Internet Multimedia Search and Mining Authors: Xian-Sheng Hua, Marcel Worring, and Tat-Seng Chua


In this chapter, we address the visual learning of automatic concept detectors from web video as available from services like YouTube. While allowing a much more efficient, flexible, and scalable concept learning compared to expert labels, web-based detectors perform poorly when applied to different domains (such as specific TV channels). We address this domain change problem using a novel approach, which – after an initial training on web content – performs a highly efficient online adaptation on the target domain.

In quantitative experiments on data from YouTube and from the TRECVID campaign, we first validate that domain change appears to be the key problem for web-based concept learning, with much more significant impact than other phenomena like label noise. Second, the proposed adaptation is shown to improve the accuracy of web-based detectors significantly, even over SVMs trained on the target domain. Finally, we extend our approach with active learning such that adaptation can be interleaved with manual annotation for an efficient exploration of novel domains.

The authors cite authority for the proposition that by 2013, 91% of all Internet traffic will be digital video.

Perhaps, perhaps not, but in any event, “concept detection” is an important aid to topic map authors working with digital video.


  1. Later research on “concept detection” in digital video? (annotated bibliography)
  2. Use in library contexts? (3-5 pages, citations)
  3. How would you design human augmentation of automated detection? (project)