Archive for the ‘Keywords’ Category

Building a language-independent keyword-based system with the Wikipedia Miner

Monday, October 27th, 2014

Building a language-independent keyword-based system with the Wikipedia Miner by Gauthier Lemoine.

From the post:

Extracting keywords from texts and HTML pages is a common subject that opens doors to a lot of potential applications. These include classification (what is this page topic?), recommendation systems (identifying user likes to recommend the more accurate content), search engines (what is this page about?), document clustering (how can I pack different texts into a common group) and much more.

Most applications of these are usually based on only one language, usually english. However, it would be better to be able to process document in any language. For example, a case in a recommender system would be a user that speaks French and English. In his history, he gave positive ratings to a few pages containing the keyword “Airplane”. So, for next recommendations, we would boost this keyword. With a language-independent approach, we would also be able to boost pages containing “Avion”, the french term for airplane. If the user gave positive ratings to pages in English containing “Airplane”, and in French containing “Avion”, we would also be able to merge easily into the same keyword to build a language-independent user profile that will be used for accurate French and English recommendations.

This article shows one way to achieve good results using an easy strategy. It is obvious that we can achieve better results using more complex algorithms.
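The cross-language merging described above can be sketched in a few lines. This is only an illustration of the idea, not the Wikipedia Miner's API: the mapping table stands in for what Wikipedia interlanguage links would provide, and the concept ids are invented for the example.

```python
# Sketch: merging keywords across languages by resolving them to a shared
# Wikipedia concept id. INTERLANGUAGE_LINKS is a stand-in for what the
# Wikipedia Miner (or Wikipedia interlanguage links) would provide.

INTERLANGUAGE_LINKS = {
    ("en", "airplane"): "Q197",      # concept ids are illustrative only
    ("fr", "avion"): "Q197",
    ("en", "keyword"): "Q1128340",
}

def to_concept(lang, keyword):
    """Resolve a (language, keyword) pair to a language-independent id."""
    return INTERLANGUAGE_LINKS.get((lang, keyword.lower()))

def merge_profile(ratings):
    """Aggregate (lang, keyword, rating) triples into a concept profile."""
    profile = {}
    for lang, keyword, rating in ratings:
        concept = to_concept(lang, keyword)
        if concept is not None:
            profile[concept] = profile.get(concept, 0) + rating
    return profile

ratings = [("en", "Airplane", 1), ("fr", "Avion", 1), ("en", "Keyword", -1)]
print(merge_profile(ratings))  # the two airplane ratings merge under one id
```

Positive ratings on "Airplane" and "Avion" land on the same concept, which is exactly the language-independent user profile the post describes.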

The NSA can hire translators so I would not bother sharing this technique for harnessing the thousands of expert hours in Wikipedia with them.

Bear in mind that Wikipedia does not reach a large number of minority languages, dialects, and certainly not deliberate obscurity in any language. Your mileage will vary depending upon your particular use case.

A New Entity Salience Task with Millions of Training Examples

Monday, March 10th, 2014

A New Entity Salience Task with Millions of Training Examples by Dan Gillick and Jesse Dunietz.


Although many NLP systems are moving toward entity-based processing, most still identify important phrases using classical keyword-based approaches. To bridge this gap, we introduce the task of entity salience: assigning a relevance score to each entity in a document. We demonstrate how a labeled corpus for the task can be automatically generated from a corpus of documents and accompanying abstracts. We then show how a classifier with features derived from a standard NLP pipeline outperforms a strong baseline by 34%. Finally, we outline initial experiments on further improving accuracy by leveraging background knowledge about the relationships between entities.
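To make "assigning a relevance score to each entity" concrete, here is a minimal sketch of two salience features in the spirit of the paper (mention count and position of first mention) with a count-based baseline ranker. The features and data are illustrative; the paper's classifier uses a much richer feature set from a full NLP pipeline.

```python
# Sketch: simple per-entity salience features and a mention-count baseline.

def salience_features(doc_tokens, entity_mentions):
    """entity_mentions: {entity: [token offsets of its mentions]}."""
    n = max(len(doc_tokens), 1)
    feats = {}
    for entity, offsets in entity_mentions.items():
        feats[entity] = {
            "mention_count": len(offsets),
            "first_mention_frac": min(offsets) / n,  # earlier = more salient
        }
    return feats

def rank_by_count(feats):
    """Baseline: rank by raw mention count, earliest first mention on ties."""
    return sorted(feats, key=lambda e: (-feats[e]["mention_count"],
                                        feats[e]["first_mention_frac"]))

doc = "Obama met Putin in Geneva . Obama later spoke .".split()
mentions = {"Obama": [0, 6], "Putin": [2], "Geneva": [4]}
print(rank_by_count(salience_features(doc, mentions)))
```

The paper's point is that a trained classifier over features like these beats the bare-count baseline by a wide margin.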

The article concludes:

We believe entity salience is an important task with many applications. To facilitate further research, our automatically generated salience annotations, along with resolved entity ids, for the subset of the NYT corpus discussed in this paper are available here:

A classic CS article: a new approach/idea, data and experiments, plus results and code. It doesn’t get any better.

The results won’t be perfect, but the question is: Are they “acceptable results?”

Which presumes a working definition of “acceptable” that you have hammered out with your client.

I first saw this in a tweet by Stefano Bertolo.

Keyword Search, Plus a Little Magic

Wednesday, May 15th, 2013

Keyword Search, Plus a Little Magic by Geoffrey Pullum.

From the post:

I promised last week that I would discuss three developments that turned almost-useless language-connected technological capabilities into something seriously useful. The one I want to introduce first was introduced by Google toward the end of the 1990s, and it changed our whole lives, largely eliminating the need for having full sentences parsed and translated into database query language.

The hunch that the founders of Google bet on was that simple keyword search could be made vastly more useful by taking the entire set of pages containing all of the list of search words and not just returning it as the result but rather ranking its members by influentiality and showing the most influential first. What a page contains is not the only relevant thing about it: As with any academic publication, who values it and refers to it is also important. And that is (at least to some extent) revealed in the link structure of the Web.
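The ranking idea Pullum describes — influence flowing along links, most influential pages first — is the PageRank intuition. A minimal power-iteration sketch over a toy link graph follows; Google's production ranking is of course vastly more elaborate.

```python
# Minimal PageRank by power iteration: a page's score is fed by the
# scores of the pages linking to it, damped by a teleport factor.

def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns page -> score."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:                 # dangling page: spread evenly
                for q in pages:
                    new[q] += damping * rank[p] / n
            else:
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # "c": it is linked from both "b" and "d"
```

"What a page contains is not the only relevant thing about it" falls out directly: page "c" wins on inlinks, not content.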

In his first post, which wasn’t sympathetic to natural language processing, Geoffrey baited his critics into fits of frenzied refutation.

Fits of refutation that failed to note Geoffrey hadn’t completed his posts on natural language processing.

Take the keyword search posting for instance.

I won’t spoil the surprise for you but the fourth fact that Geoffrey says Google relies upon could have serious legs for topic map authoring and interface design.

And not a little insight into what we call natural language processing.

More posts are to follow in this series.

I suggest we savor each one as it appears and after reflection on the whole, sally forth onto the field of verbal combat.

User evaluation of automatically generated keywords and toponyms… [of semantic gaps]

Tuesday, January 22nd, 2013

User evaluation of automatically generated keywords and toponyms for geo-referenced images by Frank O. Ostermann, Martin Tomko, Ross Purves. (Ostermann, F. O., Tomko, M. and Purves, R. (2013), User evaluation of automatically generated keywords and toponyms for geo-referenced images. J. Am. Soc. Inf. Sci.. doi: 10.1002/asi.22738)


This article presents the results of a user evaluation of automatically generated concept keywords and place names (toponyms) for geo-referenced images. Automatically annotating images is becoming indispensable for effective information retrieval, since the number of geo-referenced images available online is growing, yet many images are insufficiently tagged or captioned to be efficiently searchable by standard information retrieval procedures. The Tripod project developed original methods for automatically annotating geo-referenced images by generating representations of the likely visible footprint of a geo-referenced image, and using this footprint to query spatial databases and web resources. These queries return raw lists of potential keywords and toponyms, which are subsequently filtered and ranked. This article reports on user experiments designed to evaluate the quality of the generated annotations. The experiments combined quantitative and qualitative approaches: To retrieve a large number of responses, participants rated the annotations in standardized online questionnaires that showed an image and its corresponding keywords. In addition, several focus groups provided rich qualitative information in open discussions. The results of the evaluation show that currently the annotation method performs better on rural images than on urban ones. Further, for each image at least one suitable keyword could be generated. The integration of heterogeneous data sources resulted in some images having a high level of noise in the form of obviously wrong or spurious keywords. The article discusses the evaluation itself and methods to improve the automatic generation of annotations.
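The footprint-then-query pipeline in the abstract can be caricatured in a few lines: approximate the visible footprint of a geo-referenced photo as a viewing cone, keep the toponyms whose coordinates fall inside it, and rank by distance. Everything here — the flat-earth geometry, the coordinates, the toponym list — is an invented simplification of what Tripod actually does.

```python
# Sketch: annotate a geo-referenced image with toponyms inside an
# approximate viewing cone, nearest first.
import math

def in_footprint(cam, heading_deg, fov_deg, max_dist, point):
    """Is `point` inside the camera's viewing cone? (flat-earth approximation)"""
    dx, dy = point[0] - cam[0], point[1] - cam[1]
    dist = math.hypot(dx, dy)
    if dist == 0 or dist > max_dist:
        return False
    bearing = math.degrees(math.atan2(dx, dy)) % 360   # 0 deg = "north" (+y)
    diff = abs((bearing - heading_deg + 180) % 360 - 180)
    return diff <= fov_deg / 2

def annotate(cam, heading_deg, fov_deg, max_dist, toponyms):
    """Return visible toponyms, nearest first (nearer = likelier caption)."""
    visible = [(math.hypot(p[0] - cam[0], p[1] - cam[1]), name)
               for name, p in toponyms.items()
               if in_footprint(cam, heading_deg, fov_deg, max_dist, p)]
    return [name for _, name in sorted(visible)]

toponyms = {"Grossmünster": (1.0, 2.0), "Uetliberg": (0.5, 8.0),
            "Behind-camera": (0.0, -3.0)}
print(annotate((0.0, 0.0), heading_deg=10, fov_deg=60, max_dist=10,
               toponyms=toponyms))
```

The "raw lists of potential keywords" problem the authors report is visible even here: everything in the cone comes back, relevant or not, so filtering and ranking carry the real weight.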

An echo of Steve Newcomb’s semantic impedance appears in this passage:

Despite many advances since Smeulders et al.’s (2002) classic paper that set out challenges in content-based image retrieval, the quality of both nonspecialist text-based and content-based image retrieval still appears to lag behind the quality of specialist text retrieval, and the semantic gap, identified by Smeulders et al. as a fundamental issue in content-based image retrieval, remains to be bridged. Smeulders defined the semantic gap as

the lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data have for a user in a given situation. (p. 1353)

In fact, text-based systems that attempt to index images based on text thought to be relevant to an image, for example, by using image captions, tags, or text found near an image in a document, suffer from an identical problem. Since text is being used as a proxy by an individual in annotating image content, those querying a system may or may not have similar worldviews or conceptualizations as the annotator. (emphasis added)

That last sentence could have come out of a topic map book.

Curious what you make of the author’s claim that spatial locations provide an “external context” that bridges the “semantic gap?”

If we all use the same map of spatial locations, are you surprised by the lack of a “semantic gap?”

Authors and Articles, Keywords, SOMs and Graphs [Oh My!]

Sunday, November 18th, 2012

Analyzing Authors and Articles Using Keyword Extraction, Self-Organizing Map and Graph Algorithms by Tommi Vatanen, Mari-Sanna Paukkeri, Ilari T. Nieminen, Timo Honkela.

An attempt to enable participants at an interdisciplinary conference to find others with similar interests and to learn about other participants.

Be aware the URL given in the article for the online demo now returns a 404.

Interesting approach but be aware that if it was using Likey as described in: A Language-Independent Approach to Keyphrase Extraction and Evaluation, the absence of phrases in the reference corpus may mean the phrases are omitted from the results.

I mention that because the reference corpus was Europarl (European Parliament Proceedings Parallel Corpus).

I would not bet on the similarities between the “European Parliament Proceedings” and the “International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning.” Would you?

Leaving those quibbles to one side, interesting work, particularly if viewed as the means to explore textual data for later editing.

CiteSeer does not report a date for this paper and it does not appear in DBLP for any of the authors. Timo Honkela’s publications page gives it the following suggested BibTeX entry:

@inproceedings{VatanenEtAl08,
  author = {Tommi Vatanen and Mari-Sanna Paukkeri and Ilari T. Nieminen and Timo Honkela},
  booktitle = {Proceedings of the AKRR08},
  pages = {105--111},
  title = {Analyzing Authors and Articles Using Keyword Extraction, Self-Organizing Map and Graph Algorithms},
  year = {2008},
}

A Language-Independent Approach to Keyphrase Extraction and Evaluation

Sunday, November 18th, 2012

A Language-Independent Approach to Keyphrase Extraction and Evaluation (2010) by Mari-Sanna Paukkeri, Ilari T. Nieminen, Matti Pöllä and Timo Honkela.


We present Likey, a language-independent keyphrase extraction method based on statistical analysis and the use of a reference corpus. Likey has a very light-weight preprocessing phase and no parameters to be tuned. Thus, it is not restricted to any single language or language family. We test Likey having exactly the same configuration with 11 European languages. Furthermore, we present an automatic evaluation method based on Wikipedia intra-linking.
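As I read the abstract, the core of the method is rank comparison: rank phrases by frequency in the document and in the reference corpus, score each phrase by the ratio of the two ranks, and keep the lowest ratios as document-salient keyphrases. The toy sketch below follows that reading (the real method differs in its details) and also shows the single-occurrence exclusion discussed below.

```python
# Toy rendering of the Likey idea: score = document rank / reference rank,
# lowest scores win; phrases occurring only once are excluded.
from collections import Counter

def ranks(freqs):
    """Map item -> frequency rank (1 = most frequent)."""
    ordered = sorted(freqs, key=freqs.get, reverse=True)
    return {item: i + 1 for i, item in enumerate(ordered)}

def likey_scores(doc_words, ref_words):
    doc_rank = ranks(Counter(doc_words))
    ref_rank = ranks(Counter(ref_words))
    default_ref = len(ref_rank) + 1          # unseen in reference: last rank
    return {w: doc_rank[w] / ref_rank.get(w, default_ref)
            for w, c in Counter(doc_words).items()
            if c > 1}                        # hapax phrases are excluded

doc = "topic map topic map merging subjects merging once".split()
ref = "the of and map subjects the of and the".split()
scores = likey_scores(doc, ref)
print(sorted(scores, key=scores.get))  # best (lowest-ratio) phrases first
```

Note how "once" and "subjects" never receive scores at all: with a single occurrence there is no frequency evidence to rank against the reference corpus.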

Useful approach for developing a rough-cut of keywords in documents. Keywords that may indicate a need for topics to represent subjects.

Interesting that:

Phrases occurring only once in the document cannot be selected as keyphrases.

I would have thought unique phrases would automatically qualify as keyphrases. But the ranking, calculated by comparing the phrase’s standing in the document against the reference corpus, excludes phrases that occur only once: with a single occurrence there is no ratio to rank by.

That sounds like a bug, not a feature, to me: phrases unique to an author are arguably unique identifications of subjects. Certainly grist for a topic map mill.

Web based demonstration:

Mari-Sanna Paukkeri: Contact details and publications.

An XML-Format for Conjectures in Geometry (Work-in-Progress)

Saturday, July 14th, 2012

An XML-Format for Conjectures in Geometry (Work-in-Progress) by Pedro Quaresma.


With a large number of software tools dedicated to the visualisation and/or demonstration of properties of geometric constructions and also with the emerging of repositories of geometric constructions, there is a strong need of linking them, and making them and their corpora, widely usable. A common setting for interoperable interactive geometry was already proposed, the i2g format, but, in this format, the conjectures and proofs counterparts are missing. A common format capable of linking all the tools in the field of geometry is missing. In this paper an extension of the i2g format is proposed, this extension is capable of describing not only the geometric constructions but also the geometric conjectures. The integration of this format into the Web-based GeoThms, TGTP and Web Geometry Laboratory systems is also discussed.

The author notes open questions as:

  • The xml format must be complemented with an extensive set of converters allowing the exchange of information between as many geometric tools as possible.
  • The databases queries, as in TGTP, raise the question of selecting appropriate keywords. A fine grain index and/or an appropriate geometry ontology should be addressed.
  • The i2gatp format does not address proofs. Should we try to create such a format? The GATPs produce proofs in quite different formats, maybe the construction of such unifying format it is not possible and/or desirable in this area.

The “keywords,” “fine grain index,” and “geometry ontology” question yells “topic map” to me.


PS: Converters and different formats also say “topic map,” just not as loudly to me. Your volume may vary. (YVMV)



LAILAPS

Wednesday, April 25th, 2012


From the website:

LAILAPS combines a keyword driven search engine for an integrative access to life science databases, machine learning for a content driven relevance ranking, recommender systems for suggestion of related data records and query refinements with a user feedback tracking system for an self learning relevance training.


  • ultra fast keyword based search
  • non-static relevance ranking
  • user specific relevance profiles
  • suggestion of related entries
  • suggestion of related query terms
  • self learning by user tracking
  • deployable at standard desktop PC
  • 100% JAVA
  • installer for in-house deployment

I like the idea of a recommender system that “suggests” related data records and query refinements. It could be wrong.

I am as guilty as anyone of thinking in terms of “correct” recommendations that always lead to relevant data.

That is applying “crisp” set thinking to what is obviously a “rough” set situation. We as readers have to sort out the items in the “rough” set and construct for ourselves a temporary and fleeting “crisp” set for some particular purpose.

If you are using LAILAPS, I would appreciate a note about your experiences and impressions.

Keyword Indexing for Books vs. Webpages

Wednesday, March 14th, 2012

I was watching a lecture on keyword indexing that started off with a demonstration of an index to a book, which was being compared to indexing web pages. The statement was made that the keyword pointed the reader to a page where that keyword could be found, much like a search engine does for a web page.

Leaving aside the more complex roles that indexes for books play, such as giving alternative terms, classifying the nature of the occurrence of the term (definition, mentioned, footnote, etc.), cross-references, etc., I wondered if there is a difference between a page reference in a book index vs. a web page reference by a search engine?

In some 19th century indexes I have used, the page references are followed by a letter of the alphabet, to indicate that the page is divided into sections, sometimes as many as a – h or even higher. Mostly those are complex reference works, dictionaries, lexicons, works of that type, where the information is fairly dense. (Do you know of any modern examples of indexes where pages are divided? A note would be appreciated.)

I have the sense that an index of a book, without sub-dividing a page, is different from an index pointing to a web page. It may be a difference that has never been made explicit but I think it is important.

Some facts about word length on a “page”:

With a short amount of content, average book page length, the user has little difficulty finding an index term on a page. But the longer the web page, the less useful our instinctive (trained?) scan of the page becomes.

In part that is because part of the page scrolls out of view, which, as you may know, doesn’t happen with a print book.

Scanning of a print book is different from scanning of a webpage. How to account for that difference I don’t know.

Before you suggest Ctrl-F, see Do You Ctrl-F?. What was it you were saying about Ctrl-F?

Web pages (or other electronic media) that don’t replicate the fixed display of book pages result in a different indexing experience for the reader.

If a search engine index could point into a page, it would still be different from a traditional index but would come closer to a traditional index.
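The 19th-century lettered-page convention mentioned above translates directly to long web pages: divide the text into sections of roughly equal length, letter them, and index each term as "page letter" so the pointer lands within view rather than merely somewhere on the page. A sketch, with an arbitrary section size:

```python
# Sketch: a sub-page index in the style of 19th-century divided pages,
# mapping each term to the lettered sections (a, b, c, ...) it occurs in.
import string

def lettered_index(words, section_words=50):
    """Return term -> sorted list of section letters for one page's words."""
    index = {}
    for pos, word in enumerate(words):
        letter = string.ascii_lowercase[pos // section_words]
        index.setdefault(word.lower(), set()).add(letter)
    return {term: sorted(letters) for term, letters in index.items()}

page = ["filler"] * 49 + ["keyword"] + ["filler"] * 49 + ["keyword"]
print(lettered_index(page)["keyword"])  # occurs in sections "a" and "b"
```

Nothing about this needs XPointer; it only needs the indexer and the display to agree on where the section boundaries fall, which is exactly the agreement a fixed printed page gives you for free.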

(The W3C has steadfastly resisted any effective subpage pointing. See the sad history of XLink/XPointer. You will probably have to ask insiders but it is a well known story.)

BTW, in case you are interested in blog length, see: Bloggers: This Is How Long Your Posts Should Be. Informative and amusing.

Keyword Searching and Browsing in Databases using BANKS

Sunday, March 11th, 2012

Keyword Searching and Browsing in Databases using BANKS

From the post:

BANKS is a system that enables keyword based searches on a relational database. As a paper that was published 10 years ago in ICDE 2002, it has won the most influential paper award for past decade this year at ICDE. Hearty congrats to the team from IIT Bombay’s CSE department.


With the growth of the Web, there has been a rapid increase in the number of users who need to access online databases without having a detailed knowledge of the schema or of query languages; even relatively simple query languages designed for non-experts are too complicated for them. We describe BANKS, a system which enables keyword-based search on relational databases, together with data and schema browsing. BANKS enables users to extract information in a simple manner without any knowledge of the schema or any need for writing complex queries. A user can get information by typing a few keywords, following hyperlinks, and interacting with controls on the displayed results.

BANKS models tuples as nodes in a graph, connected by links induced by foreign key and other relationships. Answers to a query are modeled as rooted trees connecting tuples that match individual keywords in the query. Answers are ranked using a notion of proximity coupled with a notion of prestige of nodes based on inlinks, similar to techniques developed for Web search. We present an efficient heuristic algorithm for finding and ranking query results.
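The answer model in the abstract — tuples as nodes, an answer as a rooted tree connecting one node per keyword — can be miniaturized as follows. This sketch ranks candidate roots by total path length only (BANKS also weighs node prestige from inlinks), and the toy schema is invented.

```python
# Miniature of the BANKS answer model: find the root node minimizing the
# summed shortest-path distance to one matching node per query keyword.
from collections import deque

def bfs_dists(graph, start):
    """Unweighted shortest-path distances from `start`."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def best_answer_root(graph, keyword_nodes):
    """keyword_nodes: one set of matching nodes per query keyword.
    Returns (root, cost) for the cheapest rooted answer tree."""
    dists = [{n: bfs_dists(graph, n) for n in nodes} for nodes in keyword_nodes]
    best = None
    for root in graph:
        cost = 0
        for per_kw in dists:
            reachable = [d[root] for d in per_kw.values() if root in d]
            if not reachable:
                break                      # this keyword cannot reach root
            cost += min(reachable)
        else:
            if best is None or cost < best[1]:
                best = (root, cost)
    return best

# Toy "schema": a paper tuple P joined to three author tuples by foreign keys.
graph = {"P": ["A1", "A2", "A3"],
         "A1": ["P"], "A2": ["P"], "A3": ["P"]}
print(best_answer_root(graph, [{"A1"}, {"A2"}, {"A3"}]))  # P joins all three
```

A query naming the three authors is answered by the paper tuple that joins them, which is the kind of schema-free join discovery the paper describes.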

The paper:

It is a very interesting paper.

BTW, can someone point me to the ICDE proceedings where it won the most influential paper award for the past decade? I am assuming that ICDE = International Conference on Data Engineering. I am sure I am just overlooking the award and would like to include a pointer to it in this post. Thanks!

Winning the Keyword Research Game

Wednesday, April 20th, 2011

Winning the Keyword Research Game

Webinar – Tuesday, April 26, 2011

From the website:

Learn key strategies for leveraging keyword research to spend smarter, gain competitive advantage and ultimately drive more leads!

In this webinar, WordStream founder Larry Kim shares his exclusive tips and techniques for winning the keyword research game. You’ll learn the three-step process he follows when conducting keyword research for SEO and PPC, including tips for:

  • How to dominate a search category for your brand
  • Why keyword niches are more important than single keywords
  • How to apply your keyword research in on-page SEO

Topic maps can facilitate discovery of information by different means of identifying the same subject, but only if you have recorded those different means in your topic map.

Quite often that will mean several different keywords for any given subject.

I will try to attend and report back.