Archive for the ‘Text Feature Extraction’ Category

(String/text processing)++:…

Thursday, May 15th, 2014

(String/text processing)++: stringi 0.2-3 released by Marek Gągolewski.

From the post:

A new release of the stringi package is available on CRAN (please wait a few days for Windows and OS X binary builds).

stringi is a package providing (but definitely not limiting to) replacements for nearly all the character string processing functions known from base R. While developing the package we had high performance and portability of its facilities in our minds.

Here is a very general list of the most important features available in the current version of stringi:

  • string searching:
    • with ICU (Java-like) regular expressions,
    • ICU USearch-based locale-aware string searching (quite slow, but working properly e.g. for non-Unicode normalized strings),
    • very fast, locale-independent byte-wise pattern matching;
  • joining and duplicating strings;
  • extracting and replacing substrings;
  • string trimming, padding, and text wrapping (e.g. with Knuth's dynamic word wrap algorithm);
  • text transliteration;
  • text collation (comparing, sorting);
  • text boundary analysis (e.g. for extracting individual words);
  • random string generation;
  • Unicode normalization;
  • character encoding conversion and detection;

and many more.

Interesting isn’t it? How CS keeps circling around back to strings?


A Language-Independent Approach to Keyphrase Extraction and Evaluation

Sunday, November 18th, 2012

A Language-Independent Approach to Keyphrase Extraction and Evaluation (2010) by Mari-sanna Paukkeri, Ilari T. Nieminen, Matti Pöllä and Timo Honkela.


We present Likey, a language-independent keyphrase extraction method based on statistical analysis and the use of a reference corpus. Likey has a very light-weight preprocessing phase and no parameters to be tuned. Thus, it is not restricted to any single language or language family. We test Likey having exactly the same configuration with 11 European languages. Furthermore, we present an automatic evaluation method based on Wikipedia intra-linking.

Useful approach for developing a rough-cut of keywords in documents. Keywords that may indicate a need for topics to represent subjects.

Interesting that:

Phrases occurring only once in the document cannot be selected as keyphrases.

I would have thought unique phrases would automatically qualify as keyphrases. The ranking of phrases, calculated with the reference corpus and text, excludes unique phrases, in the absence of any ratio for ranking.

That sounds like a bug and not a feature to me.

Reasoning that phrases unique to an author are unique identifications of subjects. Certainly grist for a topic map mill.

Web based demonstration:

Mari-Sanna Paukkeri: Contact details and publications.

National Centre for Text Mining (NaCTeM)

Friday, June 29th, 2012

National Centre for Text Mining (NaCTeM)

From the webpage:

The National Centre for Text Mining (NaCTeM) is the first publicly-funded text mining centre in the world. We provide text mining services in response to the requirements of the UK academic community. NaCTeM is operated by the University of Manchester with close collaboration with the University of Tokyo.

On our website, you can find pointers to sources of information about text mining such as links to

  • text mining services provided by NaCTeM
  • software tools, both those developed by the NaCTeM team and by other text mining groups
  • seminars, general events, conferences and workshops
  • tutorials and demonstrations
  • text mining publications

Let us know if you would like to include any of the above in our website.

This is a real treasure trove of software, resources and other materials.

I will be working in reports on “finds” at this site for quite some time.

Text Feature Extraction (tf-idf) – Part 1

Sunday, September 18th, 2011

Text Feature Extraction (tf-idf) – Part 1 by Christian Perone.

To give you a taste of the post:

Short introduction to Vector Space Model (VSM)

In information retrieval or text mining, the term frequency – inverse document frequency also called tf-idf, is a well know method to evaluate how important is a word in a document. tf-idf are also a very interesting way to convert the textual representation of information into a Vector Space Model (VSM), or into sparse features, we’ll discuss more about it later, but first, let’s try to understand what is tf-idf and the VSM.

VSM has a very confusing past, see for example the paper The most influential paper Gerard Salton Never Wrote that explains the history behind the ghost cited paper which in fact never existed; in sum, VSM is an algebraic model representing textual information as a vector, the components of this vector could represent the importance of a term (tf–idf) or even the absence or presence (Bag of Words) of it in a document; it is important to note that the classical VSM proposed by Salton incorporates local and global parameters/information (in a sense that it uses both the isolated term being analyzed as well the entire collection of documents). VSM, interpreted in a lato sensu, is a space where text is represented as a vector of numbers instead of its original string textual representation; the VSM represents the features extracted from the document.

The link to the The most influential paper Gerard Salton Never Wrote fails. Try the cached copy at CiteSeer: The most influential paper Gerard Salton Never Wrote.

Recommended reading.