Archive for the ‘Text Extraction’ Category

Nuremberg Trial Verdicts [70th Anniversary]

Saturday, October 1st, 2016

Nuremberg Trial Verdicts by Jenny Gesley.

From the post:

Seventy years ago – on October 1, 1946 – the Nuremberg trial, one of the most prominent trials of the last century, concluded when the International Military Tribunal (IMT) issued the verdicts for the main war criminals of the Second World War. The IMT sentenced twelve of the defendants to death, seven to terms of imprisonment ranging from ten years to life, and acquitted three.

The IMT was established on August 8, 1945 by the United Kingdom (UK), the United States of America, the French Republic, and the Union of Soviet Socialist Republics (U.S.S.R.) for the trial of war criminals whose offenses had no particular geographical location. The defendants were indicted for (1) crimes against peace, (2) war crimes, (3) crimes against humanity, and (4) a common plan or conspiracy to commit those aforementioned crimes. The trial began on November 20, 1945 and a total of 403 open sessions were held. The prosecution called thirty-three witnesses, whereas the defense questioned sixty-one witnesses, in addition to 143 witnesses who gave evidence for the defense by means of written answers to interrogatories. The hearing of evidence and the closing statements were concluded on August 31, 1946.

The individuals named as defendants in the trial were Hermann Wilhelm Göring, Rudolf Hess, Joachim von Ribbentrop, Robert Ley, Wilhelm Keitel, Ernst Kaltenbrunner, Alfred Rosenberg, Hans Frank, Wilhelm Frick, Julius Streicher, Walter Funk, Hjalmar Schacht, Karl Dönitz, Erich Raeder, Baldur von Schirach, Fritz Sauckel, Alfred Jodl, Martin Bormann, Franz von Papen, Arthur Seyss-Inquart, Albert Speer, Constantin von Neurath, Hans Fritzsche, and Gustav Krupp von Bohlen und Halbach. All individual defendants appeared before the IMT, except for Robert Ley, who committed suicide in prison on October 25, 1945; Gustav Krupp von Bohlen und Halbach, who was seriously ill; and Martin Bormann, who was not in custody and whom the IMT decided to try in absentia. Pleas of “not guilty” were entered by all the defendants.

The trial record is spread over forty-two volumes, “The Blue Series,” Trial of the Major War Criminals before the International Military Tribunal Nuremberg, 14 November 1945 – 1 October 1946.

All forty-two volumes are available in PDF format and should prove a more difficult indexing, mining, modeling, and searching challenge than Twitter feeds.

Imagine instead of “text” similarity, these volumes were mined for “deed” similarity. Similarity to deeds being performed now. By present day agents.

Instead of seldom visited dusty volumes in the library stacks, “The Blue Series” could develop a sharp bite.

A Language-Independent Approach to Keyphrase Extraction and Evaluation

Sunday, November 18th, 2012

A Language-Independent Approach to Keyphrase Extraction and Evaluation (2010) by Mari-Sanna Paukkeri, Ilari T. Nieminen, Matti Pöllä and Timo Honkela.


We present Likey, a language-independent keyphrase extraction method based on statistical analysis and the use of a reference corpus. Likey has a very light-weight preprocessing phase and no parameters to be tuned. Thus, it is not restricted to any single language or language family. We test Likey having exactly the same configuration with 11 European languages. Furthermore, we present an automatic evaluation method based on Wikipedia intra-linking.

Useful approach for developing a rough-cut of keywords in documents. Keywords that may indicate a need for topics to represent subjects.

Interesting that:

Phrases occurring only once in the document cannot be selected as keyphrases.

I would have thought unique phrases would automatically qualify as keyphrases. But the ranking of phrases, calculated from the reference corpus and the text, excludes unique phrases because a single occurrence provides no ratio for ranking.

That sounds like a bug and not a feature to me.

The reasoning is that phrases unique to an author uniquely identify subjects. Certainly grist for a topic map mill.
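To make the discussion concrete, here is a minimal sketch of the rank-ratio idea as I read it from the abstract, not the authors' code: a term's keyness is its rank in the document's frequency list divided by its rank in the reference corpus, and terms occurring only once are dropped, mirroring Likey's exclusion of unique phrases.

```python
from collections import Counter

def rank_map(freqs):
    """Map each term to its 1-based rank in a frequency-sorted list."""
    ordered = sorted(freqs, key=freqs.get, reverse=True)
    return {term: i + 1 for i, term in enumerate(ordered)}

def likey_scores(doc_tokens, ref_freqs, min_count=2):
    """Score terms by document rank / reference rank; lower = more key.
    Terms below min_count are skipped, mirroring Likey's exclusion of
    phrases that occur only once."""
    doc_freqs = Counter(doc_tokens)
    doc_ranks = rank_map(doc_freqs)
    ref_ranks = rank_map(ref_freqs)
    unseen = len(ref_ranks) + 1  # terms absent from the reference corpus
    return {
        t: doc_ranks[t] / ref_ranks.get(t, unseen)
        for t, c in doc_freqs.items() if c >= min_count
    }
```

A term that is common in the document but rare in the reference corpus gets a low score and surfaces as a keyphrase; a hapax never enters the ranking at all, which is exactly the behavior questioned above.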

Web-based demonstration:

Mari-Sanna Paukkeri: Contact details and publications.

Layout-aware text extraction from full-text PDF of scientific articles

Monday, October 8th, 2012

Layout-aware text extraction from full-text PDF of scientific articles by Cartic Ramakrishnan, Abhishek Patnia, Eduard Hovy and Gully APC Burns. (Source Code for Biology and Medicine 2012, 7:7 doi:10.1186/1751-0473-7-7)



The Portable Document Format (PDF) is the most commonly used file format for online scientific publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents a significant challenge for developers of biomedical text mining or biocuration informatics systems that use published literature as an information source. In this paper we introduce the ‘Layout-Aware PDF Text Extraction’ (LA-PDFText) system to facilitate accurate extraction of text from PDF files of research articles for use in text mining applications.


Our paper describes the construction and performance of an open source system that extracts text blocks from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize specific sections. The LA-PDFText system focuses only on the textual content of the research articles and is meant as a baseline for further experiments into more advanced extraction methods that handle multi-modal content, such as images and graphs. The system works in a three-stage process: (1) Detecting contiguous text blocks using spatial layout processing to locate and identify blocks of contiguous text, (2) Classifying text blocks into rhetorical categories using a rule-based method and (3) Stitching classified text blocks together in the correct order resulting in the extraction of text from section-wise grouped blocks. We show that our system can identify text blocks and classify them into rhetorical categories with Precision = 0.96, Recall = 0.89 and F1 = 0.91. We also present an evaluation of the accuracy of the block detection algorithm used in step 2. Additionally, we have compared the accuracy of the text extracted by LA-PDFText to the text from the Open Access subset of PubMed Central. We then compared this accuracy with that of the text extracted by the PDF2Text system, commonly used to extract text from PDF. Finally, we discuss preliminary error analysis for our system and identify further areas of improvement.


LA-PDFText is an open-source tool for accurately extracting text from full-text scientific articles. The release of the system is available at
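The three-stage shape of such a pipeline is easy to picture. The sketch below is only an illustration, not LA-PDFText's actual method: the real system works on spatial layout and font information, whereas this toy uses blank lines for block detection and invented regex rules for classification.

```python
import re

# Hypothetical rule set for illustration only: a real system keys off
# fonts and page geometry; here a rule is (category, regex on the
# block's first line).
RULES = [
    ("abstract",   re.compile(r"^Abstract\b", re.I)),
    ("references", re.compile(r"^References\b", re.I)),
    ("title",      re.compile(r"^[A-Z][^.]{10,}$")),
]

def detect_blocks(page_text):
    """Stage 1: contiguous text blocks (blank-line separated here;
    LA-PDFText uses spatial layout instead)."""
    return [b.strip() for b in page_text.split("\n\n") if b.strip()]

def classify(block):
    """Stage 2: rule-based rhetorical category for a block."""
    first = block.splitlines()[0]
    for cat, pat in RULES:
        if pat.match(first):
            return cat
    return "body"

def stitch(blocks):
    """Stage 3: group classified blocks, preserving reading order."""
    out = {}
    for b in blocks:
        out.setdefault(classify(b), []).append(b)
    return out
```

Even this toy shows why rule-based classification is attractive as a baseline: each rule is inspectable, and misclassifications point directly at the rule that fired.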

Scanning TOCs from a variety of areas can uncover goodies like this one.

What is the most recent “unexpected” paper/result outside your “field” that you have found?

Web Data Extraction, Applications and Techniques: A Survey

Tuesday, September 11th, 2012

Web Data Extraction, Applications and Techniques: A Survey by Emilio Ferrara, Pasquale De Meo, Giacomo Fiumara, Robert Baumgartner.


Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of application domains. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc application domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction.

This survey aims at providing a structured and comprehensive overview of the research efforts made in the field of Web Data Extraction. The fil rouge of our work is to provide a classification of existing approaches in terms of the applications for which they have been employed. This differentiates our work from other surveys devoted to classify existing approaches on the basis of the algorithms, techniques and tools they use.

We classified Web Data Extraction approaches into categories and, for each category, we illustrated the basic techniques along with their main variants.

We grouped existing applications in two main areas: applications at the Enterprise level and at the Social Web level. Such a classification relies on a twofold reason: on one hand, Web Data Extraction techniques emerged as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. On the other hand, Web Data Extraction techniques allow for gathering a large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users and this offers unprecedented opportunities of analyzing human behaviors on a large scale.

We discussed also about the potential of cross-fertilization, i.e., on the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain, in other domains.

A comprehensive (> 50 pages) survey of web data extraction. It supplements and updates existing work by classifying web data extraction approaches by their field of use.

Very likely to lead to adaptation of techniques from one field to another.

National Centre for Text Mining (NaCTeM)

Friday, June 29th, 2012

National Centre for Text Mining (NaCTeM)

From the webpage:

The National Centre for Text Mining (NaCTeM) is the first publicly-funded text mining centre in the world. We provide text mining services in response to the requirements of the UK academic community. NaCTeM is operated by the University of Manchester in close collaboration with the University of Tokyo.

On our website, you can find pointers to sources of information about text mining such as links to

  • text mining services provided by NaCTeM
  • software tools, both those developed by the NaCTeM team and by other text mining groups
  • seminars, general events, conferences and workshops
  • tutorials and demonstrations
  • text mining publications

Let us know if you would like to include any of the above in our website.

This is a real treasure trove of software, resources and other materials.

I will be working on reports of “finds” at this site for quite some time.

Text Analytics for Telecommunications – Part 1

Tuesday, March 20th, 2012

Text Analytics for Telecommunications – Part 1 by Themos Kalafatis.

From the post:

As discussed in the previous post, performing Text Analytics for a language for which no tools exist is not an easy task. The Case Study which I will present in the European Text Analytics Summit is about analyzing and understanding thousands of Non-English Facebook posts and Tweets for Telco Brands and their Topics, leading to what is known as Competitive Intelligence.

The Telcos used for the Case Study are Telenor, MT:S and VIP Mobile which are located in Serbia. The analysis aims to identify the perception of Customers for each of the three Companies mentioned and understand the Positive and Negative elements of each Telco as this is captured from the Voice of the Customers – Subscribers.

The start of a very useful series on non-English text analysis. The sort that is in demand by agencies of various governments.

Come to think of it, text analysis of English/non-English government information is probably in demand by non-government groups. 😉

Mining Text Data

Monday, January 23rd, 2012

Mining Text Data by Charu Aggarwal and ChengXiang Zhai, Springer, February 2012, approximately 500 pages.

From the publisher’s description:

Text mining applications have experienced tremendous advances because of web 2.0 and social networking applications. Recent advances in hardware and software technology have led to a number of unique scenarios where text mining algorithms are learned.

Mining Text Data introduces an important niche in the text analytics field, and is an edited volume contributed by leading international researchers and practitioners focused on social networks & data mining. This book contains a wide swath in topics across social networks & data mining. Each chapter contains a comprehensive survey including the key research content on the topic, and the future directions of research in the field. There is a special focus on Text Embedded with Heterogeneous and Multimedia Data which makes the mining process much more challenging. A number of methods have been designed such as transfer learning and cross-lingual mining for such cases.

Mining Text Data simplifies the content, so that advanced-level students, practitioners and researchers in computer science can benefit from this book. Academic and corporate libraries, as well as ACM, IEEE, and Management Science focused on information security, electronic commerce, databases, data mining, machine learning, and statistics are the primary buyers for this reference book.

These are not at the publisher’s site, but you can see the Table of Contents, chapter 4, “A Survey of Text Clustering Algorithms,” and chapter 6, “A Survey of Text Classification Algorithms,” at:

The two chapters you can download from Aggarwal’s website will give you a good idea of what to expect from the text.

While an excellent survey work, with chapters written by experts in various sub-fields, it also suffers from the survey work format.

For example, the bibliographies of the two sample chapters overlap. That is not surprising given the closely related subject matter, but as a reader I would be interested in discovering which works are cited in both chapters, something the back-of-chapter bibliography format permits only through repetitive manual inspection.

Although I rail against examples in standards, expanding the survey reference work format to include more details and examples would only increase its usefulness and possibly its life as a valued reference.

Which raises the question of whether survey works should appear in print at all. The research landscape is changing quickly, and a shelf life of 2 to 3 years, if that long, seems brief for the going rate of print editions. Printing chapters on demand as smaller, more timely works is a value-add proposition that Springer is in a unique position to offer its customers.

TXR: a Pattern Matching Language (Not Just)….

Sunday, January 22nd, 2012

TXR: a Pattern Matching Language (Not Just) for Convenient Text Extraction

From the webpage:

TXR (“texer” or “tee ex ar”) is a new and growing language oriented toward processing text, packaged as a utility (the txr command) that runs in POSIX environments and on Microsoft Windows.

Working with TXR is different from most text processing programming languages. Constructs in TXR aren’t imperative statements, but rather pattern-matching directives: each construct terminates by matching, failing, or throwing an exception. Searching and backtracking behaviors are implicit.

The development of TXR began when I needed a utility to be used in shell programming which would reverse the action of a “here-document”. Here-documents are a programming language feature for generating multi-line text from a boiler-plate template which contains variables to be substituted, and possibly other constructs such as various functions that generate text. Here-documents appeared in the Unix shell decades ago, but most of today’s web is basically a form of here-document, because all non-trivial websites generate HTML dynamically, substituting variable material into templates on the fly. Well, in the given situation I was programming in, I didn’t want here documents as much as “there documents”: I wanted to write a template of text containing variables, but not to generate text but to do the reverse: match the template against existing text which is similar to it, and capture pieces of it into variables. So I developed a utility to do just that: capture these variables from a template, and then generate a set of variable assignments that could be eval-ed in a shell script.

That was sometime in the middle of 2009. Since then TXR has become a lot more powerful. It has features like structured named blocks with nonlocal exits, structured exception handling, pattern matching functions, and numerous other features. TXR is powerful enough to parse grammars, yet simple to use on trivial tasks.

For things that can’t be easily done in the pattern matching language, TXR has a built-in Lisp dialect, which supports goodies like first class functions with closures over lexical environments, I/O (including string and list streams), hash tables with optional weak semantics, and arithmetic with arbitrary precision (“bignum”) integers.

A powerful tool for text extraction/manipulation.
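TXR's own machinery (directives, backtracking, its embedded Lisp) is far richer, but the core “there document” move, matching a template against existing text to capture variables instead of substituting them, can be approximated in a few lines of Python. The sketch below merely mimics TXR's @name variable convention with regular expressions; it is not TXR's engine.

```python
import re

def match_template(template, text):
    """Reverse a here-document: variables written as @name in the
    template capture the corresponding spans of the input text.
    A toy approximation of TXR's idea, not its actual syntax engine."""
    pattern = ""
    pos = 0
    for m in re.finditer(r"@(\w+)", template):
        pattern += re.escape(template[pos:m.start()])  # literal text
        pattern += "(?P<%s>.+?)" % m.group(1)          # capture variable
        pos = m.end()
    pattern += re.escape(template[pos:])
    full = re.fullmatch(pattern, text)
    return full.groupdict() if full else None
```

For example, `match_template("User @user logged in from @host.", line)` yields a dict of captured variable bindings, much like the shell-assignment output the TXR author describes generating.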

Term Extraction and Image Labeling with Python and topia.termextract

Monday, October 31st, 2011

Term Extraction and Image Labeling with Python and topia.termextract

From the post:

Here’s the question I want to answer: given an image and some related text, can I automatically find a subset of phrases in the text that describe the image?

An amusing description of the use of topia.termextract for term extraction/image labeling.
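topia.termextract relies on a part-of-speech tagger to pick out noun phrases; the dependency-free sketch below only approximates the idea with stopword-filtered word windows. The stopword list and the length weighting are my own invention, not the library's algorithm.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "on", "and",
             "to", "is", "for", "was", "it"}

def candidate_terms(text, max_len=3):
    """Very rough stand-in for a POS-based extractor: sliding windows
    of up to max_len words, skipping any window with a stopword."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    terms = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            window = words[i:i + n]
            if not STOPWORDS & set(window):
                terms[" ".join(window)] += 1
    return terms

def top_terms(text, k=5):
    """Rank candidates by frequency weighted by phrase length (longer
    phrases that still repeat are likelier descriptive labels)."""
    terms = candidate_terms(text)
    return [t for t, _ in sorted(
        terms.items(), key=lambda kv: kv[1] * len(kv[0].split()),
        reverse=True)[:k]]
```

Run over a caption-plus-article text, the top-ranked phrases are the candidate image labels, which is essentially the question the post sets out to answer.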

To be honest, even after being told the image is of Susan Coffey, an alleged celebrity, I still don’t know who it is. I haven’t seen her at Balisage so I assume she isn’t a markup person. Suggestions?

tm – Text Mining Package

Friday, October 28th, 2011

tm – Text Mining Package

From the webpage:

tm (shorthand for Text Mining Infrastructure in R) provides a framework for text mining applications within R.

The tm package offers functionality for managing text documents, abstracts the process of document manipulation and eases the usage of heterogeneous text formats in R. The package has integrated database backend support to minimize memory demands. Advanced metadata management is implemented for collections of text documents to ease the use of large, metadata-enriched document sets.

The package ships with native support for handling the Reuters-21578 data set, Gmane RSS feeds, e-mails, and several classic file formats (e.g. plain text, CSV text, or PDFs).

Admittedly, the “tm” caught my attention but a quick review confirmed that the package could be useful to topic map authors.
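For readers outside R, the core structure a framework like tm manages, a corpus turned into a document-term matrix, looks roughly like this. The sketch is a toy Python equivalent for orientation only, not tm's API.

```python
import re
from collections import Counter

def build_dtm(docs):
    """A toy document-term matrix, the central structure behind text
    mining frameworks like tm: one row per document, one column per
    vocabulary term, cells holding term counts."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    vocab = sorted(set(w for toks in tokenized for w in toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts.get(term, 0) for term in vocab])
    return vocab, rows
```

Everything downstream (clustering, classification, topic modeling) operates on this matrix, which is why tm's real implementation adds database backends to keep it from exhausting memory.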

Decision Support for e-Governance: A Text Mining Approach

Saturday, September 3rd, 2011

Decision Support for e-Governance: A Text Mining Approach by G.Koteswara Rao, and Shubhamoy Dey.


Information and communication technology has the capability to improve the process by which governments involve citizens in formulating public policy and public projects. Even though much of government regulations may now be in digital form (and often available online), due to their complexity and diversity, identifying the ones relevant to a particular context is a non-trivial task. Similarly, with the advent of a number of electronic online forums, social networking sites and blogs, the opportunity of gathering citizens’ petitions and stakeholders’ views on government policy and proposals has increased greatly, but the volume and the complexity of analyzing unstructured data makes this difficult. On the other hand, text mining has come a long way from simple keyword search, and matured into a discipline capable of dealing with much more complex tasks. In this paper we discuss how text-mining techniques can help in retrieval of information and relationships from textual data sources, thereby assisting policy makers in discovering associations between policies and citizens’ opinions expressed in electronic public forums and blogs etc. We also present here, an integrated text mining based architecture for e-governance decision support along with a discussion on the Indian scenario.

The principles of subject identity could usefully inform many aspects of this “project.” I hesitate to use the word “project” for an effort that will eventually involve twenty-two (22) official languages, several scripts and governance of several hundred million people.

A good starting point for learning about the issues facing e-Governance in India.

Learn to Use DiscoverText – Free Tutorial Webinar

Thursday, August 25th, 2011

Learn to Use DiscoverText – Free Tutorial Webinar

From the announcement:

This free, live Webinar introduces DiscoverText and key features used to ingest, filter, search & code text. We take your questions and demonstrate the newest tools, including a Do-It-Yourself (DIY) machine-learning classifier. You can create a classification scheme, train the system, and run the classifier in less than 20 minutes.

DiscoverText’s latest feature additions can be easily trained to perform customized mood, sentiment and topic classification. Any custom classification scheme or topic model can be created and implemented by the user. Once a classification scheme is created, you can then use advanced, threshold-sensitive filters to look at just the documents you want.

You can also generate interactive, custom, salient word clouds using the “Cloud Explorer” and drill into the most frequently occurring terms or use advanced search and filters to create “buckets” of text.

The system makes it possible to capture, share and crowd source text data analysis in novel ways. For example, you can collect text content off Facebook, Twitter & YouTube, as well as other social media or RSS feeds.
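DiscoverText's classifier internals are not public, but the DIY train-then-classify loop the webinar demonstrates can be sketched as a bare-bones multinomial naive Bayes. Everything below, the class, the smoothing, the tokenizer, is an illustrative stand-in, not DiscoverText's actual method.

```python
import math
import re
from collections import Counter, defaultdict

class TinyClassifier:
    """A bare-bones multinomial naive Bayes text classifier: the same
    train-then-classify loop a DIY tool wraps in a point-and-click UI."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word counts
        self.doc_counts = Counter()              # label -> doc count

    @staticmethod
    def tokens(text):
        return re.findall(r"[a-z]+", text.lower())

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(self.tokens(text))

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, float("-inf")
        for label, counts in self.word_counts.items():
            size = sum(counts.values())
            vocab = len(counts)
            score = math.log(self.doc_counts[label] / total_docs)
            for w in self.tokens(text):
                # Laplace smoothing so unseen words don't zero a class
                score += math.log((counts[w] + 1) / (size + vocab + 1))
            if score > best_score:
                best, best_score = label, score
        return best
```

Training on a handful of hand-coded examples and then classifying the rest is exactly the “train the system in under 20 minutes” workflow the announcement describes.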

Apologies: as the posting date shows, this announcement went up only the day before the webinar, so I posted it late.

It puzzles me why there is a tendency to announce webinars only a day or two in advance. Why not a week?

They have recorded prior versions of this presentation so you can still learn something about DiscoverText.


ScraperWiki

Friday, July 1st, 2011

ScraperWiki

From the About page:

What is ScraperWiki?

There’s lots of useful data on the internet – crime statistics, government spending, missing kittens…

But getting at it isn’t always easy. There’s a table here, a report there, a few web pages, PDFs, spreadsheets… And it can be scattered over thousands of different places on the web, making it hard to see the whole picture and the story behind it. It’s like trying to build something from Lego when someone has hidden the bricks all over town and you have to find them before you can start building!

To get at data, programmers write bits of code called ‘screen scrapers’, which extract the useful bits so they can be reused in other apps, or rummaged through by journalists and researchers. But these bits of code tend to break, get thrown away or forgotten once they have been used, and so the data is lost again. Which is bad.

ScraperWiki is an online tool to make that process simpler and more collaborative. Anyone can write a screen scraper using the online editor. In the free version, the code and data are shared with the world. Because it’s a wiki, other programmers can contribute to and improve the code.
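A ScraperWiki-style screen scraper is usually just a few lines around an HTML parser. Here is a minimal sketch; the table markup and field layout are invented for illustration, and a real scraper would first fetch the page with urllib before parsing.

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Pull the text of every table cell out of an HTML page: the kind
    of small, single-purpose screen scraper ScraperWiki was built to
    host, version, and keep running."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.rows, self.current = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
        elif tag == "tr":
            self.current = []

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "tr" and self.current:
            self.rows.append(self.current)

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.current.append(data.strip())
```

The fragility the About page describes is visible even here: change the page's markup and the scraper silently returns nothing, which is why hosting and collaboratively maintaining such code is the whole point of the service.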

Something to keep an eye on and whenever possible, to contribute to.

People make data difficult to access for a reason. Let’s disappoint them.

Evaluating Text Extraction Algorithms

Thursday, June 16th, 2011

Evaluating Text Extraction Algorithms

From the post:

Lately I’ve been working on evaluating and comparing algorithms capable of extracting useful content from arbitrary html documents. Before continuing I encourage you to pass through some of my previous posts, just to get a better feel of what we’re dealing with; I’ve written a short overview, compiled a list of resources if you want to dig deeper and made a feature-wise comparison of related software and APIs.

If you’re not simply creating topic map content, you are mining content from other sources, such as texts, to point to or include in a topic map. A good set of posts on tools and issues surrounding that task.

ICON Programming for Humanists, 2nd edition

Wednesday, May 18th, 2011

ICON Programming for Humanists, 2nd edition

From the foreword to the first edition:

This book teaches the principles of Icon in a very task-oriented fashion. Someone commented that if you say “Pass the salt” in correct French in an American university you get an A. If you do the same thing in France you get the salt. There is an attempt to apply this thinking here. The emphasis is on projects which might interest the student of texts and language, and Icon features are instilled incidentally to this. Actual programs are exemplified and analyzed, since by imitation students can come to devise their own projects and programs to fulfill them. A number of the illustrations come naturally enough from the field of Stylistics which is particularly apt for computerized approaches.

I can’t say that the success of ICON is a recommendation for task-oriented teaching but as I recall the first edition, I thought it was effective.

Data mining of texts is an important skill in the construction of topic maps.

This is a very good introduction to that subject.

Overview of Text Extraction Algorithms

Sunday, March 20th, 2011

Overview of Text Extraction Algorithms

Short review and pointers to posts by computer science student Tomaž Kovačič, listing resources for text extraction.

If you are building topic maps based on text extraction from web pages in particular, well worth the time to take a look.