Archive for the ‘Text Mining’ Category

How to Build a Text Mining, Machine Learning….

Monday, May 13th, 2013

How to Build a Text Mining, Machine Learning Document Classification System in R! by Timothy D’Auria.

From the description:

We show how to build a machine learning document classification system from scratch in less than 30 minutes using R. We use a text mining approach to identify the speaker of unmarked presidential campaign speeches. Applications in brand management, auditing, fraud detection, electronic medical records, and more.
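The video works in R. As a rough illustration of the idea only, here is a minimal sketch in Python of one simple approach to speaker identification, multinomial naive Bayes over word counts. It is not necessarily the method used in the video, and the speakers, speeches and function names are all invented for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(labeled_docs):
    """Count words per label. labeled_docs is a list of (label, text) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for label, text in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the label with the highest naive Bayes log-probability."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(doc_counts):
        counts = word_counts[label]
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            # Add-one (Laplace) smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training data: two invented speakers.
training = [
    ("speaker_a", "jobs economy taxes jobs growth small business"),
    ("speaker_a", "cut taxes grow the economy create jobs"),
    ("speaker_b", "health care education schools teachers health"),
    ("speaker_b", "invest in schools and health care for families"),
]
word_counts, doc_counts = train(training)
print(classify("we will create jobs and cut taxes", word_counts, doc_counts))
print(classify("better schools and health care", word_counts, doc_counts))
```

With real speeches you would want far more training text per speaker and a held-out set to check accuracy before trusting any label.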

A well-made video introduction to R and text mining.

Extracting and connecting chemical structures…

Saturday, April 27th, 2013

Extracting and connecting chemical structures from text sources using chemicalize.org by Christopher Southan and Andras Stracz.

Abstract:

Background

Exploring bioactive chemistry requires navigating between structures and data from a variety of text-based sources. While PubChem currently includes approximately 16 million document-extracted structures (15 million from patents) the extent of public inter-document and document-to-database links is still well below any estimated total, especially for journal articles. A major expansion in access to text-entombed chemistry is enabled by chemicalize.org. This on-line resource can process IUPAC names, SMILES, InChI strings, CAS numbers and drug names from pasted text, PDFs or URLs to generate structures, calculate properties and launch searches. Here, we explore its utility for answering questions related to chemical structures in documents and where these overlap with database records. These aspects are illustrated using a common theme of Dipeptidyl Peptidase 4 (DPPIV) inhibitors.

Results

Full-text open URL sources facilitated the download of over 1400 structures from a DPPIV patent and the alignment of specific examples with IC50 data. Uploading the SMILES to PubChem revealed extensive linking to patents and papers, including prior submissions from chemicalize.org as submitting source. A DPPIV medicinal chemistry paper was completely extracted and structures were aligned to the activity results table, as well as linked to other documents via PubChem. In both cases, key structures with data were partitioned from common chemistry by dividing them into individual new PDFs for conversion. Over 500 structures were also extracted from a batch of PubMed abstracts related to DPPIV inhibition. The drug structures could be stepped through each text occurrence and included some converted MeSH-only IUPAC names not linked in PubChem. Performing set intersections proved effective for detecting compounds-in-common between documents and/or merged extractions.

Conclusion

This work demonstrates the utility of chemicalize.org for the exploration of chemical structure connectivity between documents and databases, including structure searches in PubChem, InChIKey searches in Google and the chemicalize.org archive. It has the flexibility to extract text from any internal, external or Web source. It synergizes with other open tools and the application is undergoing continued development. It should thus facilitate progress in medicinal chemistry, chemical biology and other bioactive chemistry domains.
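The set-intersection step mentioned in the Results is easy to picture once each extraction is reduced to a set of structure identifiers. A minimal sketch in Python, assuming the identifiers have already been exported (the keys below are placeholders standing in for InChIKeys, not real compounds, and the function name is my own):

```python
def compounds_in_common(*extractions):
    """Return the identifiers present in every extraction (set intersection)."""
    sets = [set(ids) for ids in extractions]
    return set.intersection(*sets) if sets else set()


# Hypothetical placeholder identifiers standing in for InChIKeys.
patent = {"KEY-AAA", "KEY-BBB", "KEY-CCC"}
paper = {"KEY-BBB", "KEY-CCC", "KEY-DDD"}
abstracts = {"KEY-CCC", "KEY-EEE"}

print(sorted(compounds_in_common(patent, paper)))
print(sorted(compounds_in_common(patent, paper, abstracts)))
print(sorted(patent - paper))  # identifiers found only in the patent
```

The intersection is only as good as the identifiers: two documents naming the same compound differently must first be converted to the same key.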

A great example of building a resource to address identity issues in a specific domain.

The result speaks for itself.

PS: The results were not delayed awaiting a reformation of chemistry to use a common identifier.

PLOS Text Mining Collection

Sunday, April 21st, 2013

The PLOS Text Mining Collection has launched!

From the webpage:

Across all realms of the sciences and beyond, the rapid growth in the number of works published digitally presents new challenges and opportunities for making sense of this wealth of textual information. The maturing field of Text Mining aims to solve problems concerning the retrieval, extraction and analysis of unstructured information in digital text, and to revolutionize how scientists access and interpret data that might otherwise remain buried in the literature.

Here PLOS acknowledges the growing body of work in the area of Text Mining by bringing together major reviews and new research studies published in PLOS journals to create the PLOS Text Mining Collection. It is no coincidence that research in Text Mining in PLOS journals is burgeoning: the widespread uptake of the Open Access publishing model developed by PLOS and other publishers now makes it easier than ever to obtain, mine and redistribute data from published texts. The launch of the PLOS Text Mining Collection complements related PLOS Collections on Open Access and Altmetrics, and further underscores the importance of the PLOS Application Programming Interface, which provides an open source interface with which to mine PLOS journal content.

The Collection is now open across the PLOS journals to all authors who wish to submit research or reviews in this area. Articles are presented below in order of publication date and new articles will be added to the Collection as they are published.

An impressive start to what promises to be a very rich resource!

I first saw this at: New: PLOS Text Mining.

…2,958 Nineteenth-Century British Novels

Monday, March 18th, 2013

A Quantitative Literary History of 2,958 Nineteenth-Century British Novels: The Semantic Cohort Method by Ryan Heuser and Long Le-Khac.

From the introduction:

The nineteenth century in Britain saw tumultuous changes that reshaped the fabric of society and altered the course of modernization. It also saw the rise of the novel to the height of its cultural power as the most important literary form of the period. This paper reports on a long-term experiment in tracing such macroscopic changes in the novel during this crucial period. Specifically, we present findings on two interrelated transformations in novelistic language that reveal a systemic concretization in language and fundamental change in the social spaces of the novel. We show how these shifts have consequences for setting, characterization, and narration as well as implications for the responsiveness of the novel to the dramatic changes in British society.

This paper has a second strand as well. This project was simultaneously an experiment in developing quantitative and computational methods for tracing changes in literary language. We wanted to see how far quantifiable features such as word usage could be pushed toward the investigation of literary history. Could we leverage quantitative methods in ways that respect the nuance and complexity we value in the humanities? To this end, we present a second set of results, the techniques and methodological lessons gained in the course of designing and running this project.

This branch of the digital humanities, the macroscopic study of cultural history, is a field that is still constructing itself. The right methods and tools are not yet certain, which makes for the excitement and difficulty of the research. We found that such decisions about process cannot be made a priori, but emerge in the messy and non-linear process of working through the research, solving problems as they arise. From this comes the odd, narrative form of this paper, which aims to present the twists and turns of this process of literary and methodological insight. We have divided the paper into two major parts, the development of the methodology (Sections 1 through 3) and the story of our results (Sections 4 and 5). In actuality, these two processes occurred simultaneously; pursuing our literary-historical questions necessitated developing new methodologies. But for the sake of clarity, we present them as separate though intimately related strands.

If this sounds far afield from mining tweets, emails, corporate documents or government archives, can you articulate the difference?

Or do we reflexively treat some genres of texts as “different?”

How useful you will find some of the techniques outlined will depend on the purpose of your analysis.

If you are only doing keyword searching, this isn’t likely to be helpful.

If, on the other hand, you are attempting more sophisticated analysis, read on!

I first saw this in Nat Torkington’s Four Short Links: 18 March 2013.

Finding Shakespeare’s Favourite Words With Data Explorer

Saturday, March 16th, 2013

Finding Shakespeare’s Favourite Words With Data Explorer by Chris Webb.

From the post:

The more I play with Data Explorer, the more I think my initial assessment of it as a self-service ETL tool was wrong. As Jamie pointed out recently, it’s really the M language with a GUI on top of it and the GUI itself, while good, doesn’t begin to expose the power of the underlying language: I’d urge you to take a look at the Formula Language Specification and Library Specification documents which can be downloaded from here to see for yourself. So while it can certainly be used for self-service ETL it can do much, much more than that…

In this post I’ll show you an example of what Data Explorer can do once you go beyond the UI. Starting off with a text file containing the complete works of William Shakespeare (which can be downloaded from here – it’s strange to think that it’s just a 5.3 MB text file) I’m going to find the top 100 most frequently used words and display them in a table in Excel.
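Chris does this in M inside Data Explorer. For comparison only, the same word count can be sketched in a few lines of Python. This is not his code; the tokenizing rule (lowercase runs of letters and apostrophes) is my assumption about what counts as a word.

```python
import re
from collections import Counter


def top_words(text, n=100):
    """Return the n most frequent words in text as (word, count) pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


sample = "To be, or not to be, that is the question."
for word, count in top_words(sample, n=3):
    print(word, count)
```

To run it over the complete works, read the downloaded text file into a string and pass that to top_words in place of the sample sentence.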

If Data Explorer is a GUI on top of M (the linked description is outdated, but it is a point of origin), it goes up in importance.

From the M link:

The Microsoft code name “M” Modeling Language, hereinafter referred to as M, is a language for modeling domains using text. A domain is any collection of related concepts or objects. Modeling domain consists of selecting certain characteristics to include in the model and implicitly excluding others deemed irrelevant. Modeling using text has some advantages and disadvantages over modeling using other media such as diagrams or clay. A goal of the M language is to exploit these advantages and mitigate the disadvantages.

A key advantage of modeling in text is ease with which both computers and humans can store and process text. Text is often the most natural way to represent information for presentation and editing by people. However, the ability to extract that information for use by software has been an arcane art practiced only by the most advanced developers. The language feature of M enables information to be represented in a textual form that is tuned for both the problem domain and the target audience. The M language provides simple constructs for describing the shape of a textual language – that shape includes the input syntax as well as the structure and contents of the underlying information. To that end, M acts as both a schema language that can validate that textual input conforms to a given language as well as a transformation language that projects textual input into data structures that are amenable to further processing or storage.

I try to not run examples using Shakespeare. I get distracted by the elegance of the text, which isn’t the point of the exercise. ;-)

Document Mining with Overview:…

Friday, March 15th, 2013

Document Mining with Overview:… A Digital Tools Tutorial by Jonathan Stray.

The slides from the Overview presentation I mentioned yesterday.

One of the few webinars I have ever attended where nodding off was not a problem! Interesting stuff.

It is designed for the use case where there “…is too much material to read on deadline.”

A cross between document mining and document management.

A cross that hides a lot of the complexity from the user.

Definitely a project to watch.

Visualizing the Topical Structure of the Medical Sciences:…

Thursday, March 14th, 2013

Visualizing the Topical Structure of the Medical Sciences: A Self-Organizing Map Approach by André Skupin, Joseph R. Biberstine, Katy Börner. (Skupin A, Biberstine JR, Börner K (2013) Visualizing the Topical Structure of the Medical Sciences: A Self-Organizing Map Approach. PLoS ONE 8(3): e58779. doi:10.1371/journal.pone.0058779)

Abstract:

Background

We implement a high-resolution visualization of the medical knowledge domain using the self-organizing map (SOM) method, based on a corpus of over two million publications. While self-organizing maps have been used for document visualization for some time, (1) little is known about how to deal with truly large document collections in conjunction with a large number of SOM neurons, (2) post-training geometric and semiotic transformations of the SOM tend to be limited, and (3) no user studies have been conducted with domain experts to validate the utility and readability of the resulting visualizations. Our study makes key contributions to all of these issues.

Methodology

Documents extracted from Medline and Scopus are analyzed on the basis of indexer-assigned MeSH terms. Initial dimensionality is reduced to include only the top 10% most frequent terms and the resulting document vectors are then used to train a large SOM consisting of over 75,000 neurons. The resulting two-dimensional model of the high-dimensional input space is then transformed into a large-format map by using geographic information system (GIS) techniques and cartographic design principles. This map is then annotated and evaluated by ten experts stemming from the biomedical and other domains.

Conclusions

Study results demonstrate that it is possible to transform a very large document corpus into a map that is visually engaging and conceptually stimulating to subject experts from both inside and outside of the particular knowledge domain. The challenges of dealing with a truly large corpus come to the fore and require embracing parallelization and use of supercomputing resources to solve otherwise intractable computational tasks. Among the envisaged future efforts are the creation of a highly interactive interface and the elaboration of the notion of this map of medicine acting as a base map, onto which other knowledge artifacts could be overlaid.
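The dimensionality reduction described under Methodology (keep only the most frequent terms, then build document vectors) can be sketched as follows. This is a toy Python illustration, not the authors' pipeline: the binary encoding, the alphabetical tie-breaking and the sample index terms are my assumptions, and the SOM training itself is not shown.

```python
from collections import Counter


def top_fraction_terms(docs, fraction=0.10):
    """Return the most frequent terms, keeping the given fraction of the vocabulary."""
    doc_freq = Counter(term for terms in docs for term in set(terms))
    keep = max(1, int(len(doc_freq) * fraction))
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(doc_freq, key=lambda t: (-doc_freq[t], t))
    return ranked[:keep]


def to_vectors(docs, vocabulary):
    """Encode each document as a binary vector over the kept vocabulary."""
    vectors = []
    for terms in docs:
        present = set(terms)
        vectors.append([1 if term in present else 0 for term in vocabulary])
    return vectors


# Hypothetical documents, each a list of indexer-assigned terms.
docs = [
    ["Humans", "Neoplasms", "Mice"],
    ["Humans", "Diabetes Mellitus"],
    ["Humans", "Neoplasms", "Genomics"],
    ["Mice", "Neoplasms"],
]
vocabulary = top_fraction_terms(docs, fraction=0.4)  # 0.4 only because the toy set is tiny
vectors = to_vectors(docs, vocabulary)
print(vocabulary)
print(vectors)
```

At the scale of the paper (over two million publications) the same two steps would need sparse vectors and parallel processing, which is the point the Conclusions make.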

Impressive work to say the least!

But I was just as impressed by the future avenues for research:

Controlled Vocabularies

It appears that the use of indexer-chosen keywords, including in the case of a large controlled vocabulary (MeSH terms in this study), raises interesting questions. The rank transition diagram in particular helped to highlight the fact that different vocabulary items play different roles in indexers’ attempts to characterize the content of specific publications. The complex interplay of hierarchical relationships and functional roles of MeSH terms deserves further investigation, which may inform future efforts of how specific terms are handled in computational analysis. For example, models constructed from terms occurring at intermediate levels of the MeSH hierarchy might look and function quite different from the top-level model presented here.

User-centered Studies

Future user studies will include term differentiation tasks to help us understand whether/how users can differentiate senses of terms on the self-organizing map. When a term appears prominently in multiple places, that indicates multiple senses or contexts for that term. One study might involve subjects being shown two regions within which a particular label term appears and the abstracts of several papers containing that term. Subjects would then be asked to rate each abstract along a continuum between two extremes formed by the two senses/contexts. Studies like that will help us evaluate how understandable the local structure of the map is.

There are other, equally interesting future research questions but those are the two of most interest to me.

I take this research as evidence that managing semantic diversity is going to require human effort, augmented by automated means.

I first saw this in Nat Torkington’s Four short links: 13 March 2013.

Document Mining with Overview:… [Webinar - March 15, 2013]

Thursday, March 14th, 2013

Document Mining with Overview: A Digital Tools Tutorial

From the post:

Friday, March 15, 2013 at 2:00pm Eastern Time

Overview is a free tool for journalists that automatically organizes a large set of documents by topic, and displays them in an interactive visualization for exploration, tagging, and reporting. Journalists have already used it to report on FOIA document dumps, emails, leaks, archives, and social media data. In fact it will work on any set of documents that is mostly text. It integrates with DocumentCloud and can import your projects, or you can upload data directly in CSV form.

You can’t read 10,000 pages on deadline, but Overview can help you rapidly figure out which pages are the important ones — even if you’re not sure what you’re looking for.

This training event is part of a series on digital tools in partnership with the American Press Institute and The Poynter Institute, funded by the John S. and James L. Knight Foundation.

See more tools in the Digital Tools Catalog.

I have been meaning to learn more about “Overview” and this looks like a good opportunity.

Voyeur Tools: See Through Your Texts

Thursday, February 28th, 2013

Voyeur Tools: See Through Your Texts

From the website:

Voyeur is a web-based text analysis environment. It is designed to be user-friendly, flexible and powerful. Voyeur is part of the Hermeneuti.ca, a collaborative project to develop and theorize text analysis tools and text analysis rhetoric. This section of the Hermeneuti.ca web site provides information and documentation for users and developers of Voyeur.

What you can do with Voyeur:

  • use texts in a variety of formats including plain text, HTML, XML, PDF, RTF and MS Word
  • use texts from different locations, including URLs and uploaded files
  • perform lexical analysis including the study of frequency and distribution data; in particular
  • export data into other tools (as XML, tab separated values, etc.)
  • embed live tools into remote web sites that can accompany or complement your own content

One of the tools used in the Lincoln Logarithms project.

Lincoln Logarithms: Finding Meaning in Sermons

Thursday, February 28th, 2013

Lincoln Logarithms: Finding Meaning in Sermons

From the webpage:

Just after his death, Abraham Lincoln was hailed as a luminary, martyr, and divine messenger. We wondered if using digital tools to analyze a digitized collection of elegiac sermons might uncover patterns or new insights about his memorialization.

We explored the power and possibility of four digital tools—MALLET, Voyant, Paper Machines, and Viewshare. MALLET, Paper Machines, and Voyant all examine text. They show how words are arranged in texts, their frequency, and their proximity. Voyant and Paper Machines also allow users to make visualizations of word patterns. Viewshare allows users to create timelines, maps, and charts of bodies of material. In this project, we wanted to experiment with understanding what these tools, which are in part created to reveal, could and could not show us in a small, but rich corpus. What we have produced is an exploration of the possibilities and the constraints of these tools as applied to this collection.

The resulting digital collection: The Martyred President: Sermons Given on the Assassination of President Lincoln.

Let’s say this is not an “ahistorical” view. ;-)

Good example of exploring “unstructured” data.

A first step before authoring a topic map.

An Interactive Analysis of Tolkien’s Works

Wednesday, February 27th, 2013

An Interactive Analysis of Tolkien’s Works by Emil Johansson.

Description:

Being passionate about both Tolkien and data visualization creating an interactive analysis of Tolkien’s books seemed like a wonderful idea. To the left you will be able to explore character mentions and keyword frequency as well as sentiment analysis of the Silmarillion, the Hobbit and the Lord of the Rings. Information on editions of the books and methods used can be found in the about section.

There you will find:

WORD COUNT AND DENSITY
CHARACTER MENTIONS
KEYWORD FREQUENCY
COMMON WORDS
SENTIMENT ANALYSIS
CHARACTER CO-OCCURRENCE
CHAPTER LENGTHS
WORD APPEARANCE
POSTERS

Truly remarkable analysis and visualization!

I suspect users of this portal don’t wonder so much about “how” it is done, but concentrate on the benefits it brings.

Does that sound like a marketing idea for topic maps?

I first saw this in the DashingD3js.com Weekly Newsletter.

Text Processing (part 1) : Entity Recognition

Sunday, February 17th, 2013

Text Processing (part 1) : Entity Recognition by Ricky Ho.

From the post:

Entity recognition is commonly used to parse unstructured text document and extract useful entity information (like location, person, brand) to construct a more useful structured representation. It is one of the most common text processing to understand a text document.

I am planning to write a blog series on text processing. In this first blog of a series of basic text processing algorithm, I will introduce some basic algorithm for entity recognition.
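As a taste of what entity recognition involves, here is about the simplest technique there is: looking up known names in a dictionary (a gazetteer). This Python sketch is my own illustration and not necessarily one of the algorithms Ricky covers; the gazetteer entries and the function name are made up.

```python
import re

# Hypothetical gazetteer: known entity names mapped to an entity type.
GAZETTEER = {
    "new york": "location",
    "barack obama": "person",
    "coca-cola": "brand",
}


def find_entities(text, gazetteer=GAZETTEER):
    """Return (matched text, entity type, start offset) for each gazetteer hit."""
    hits = []
    for name, kind in gazetteer.items():
        # Word boundaries keep a name from matching inside a longer word.
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.group(0), kind, match.start()))
    return sorted(hits, key=lambda hit: hit[2])


print(find_entities("Barack Obama drank a Coca-Cola in New York."))
```

Dictionary lookup finds only the names it already knows, which is why statistical approaches exist for spotting names no dictionary has seen.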

Looking forward to this series!

Text as Data:…

Sunday, February 3rd, 2013

Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts by Justin Grimmer and Brandon M. Stewart.

Abstract:

Politics and political conflict often occur in the written and spoken word. Scholars have long recognized this, but the massive costs of analyzing even moderately sized collections of texts have hindered their use in political science research. Here lies the promise of automated text analysis: it substantially reduces the costs of analyzing large collections of text. We provide a guide to this exciting new area of research and show how, in many instances, the methods have already obtained part of their promise. But there are pitfalls to using automated methods—they are no substitute for careful thought and close reading and require extensive and problem-specific validation. We survey a wide range of new methods, provide guidance on how to validate the output of the models, and clarify misconceptions and errors in the literature. To conclude, we argue that for automated text methods to become a standard tool for political scientists, methodologists must contribute new methods and new methods of validation.

As a former political science major, I had to stop to read this article.

A wide-ranging survey of an “exciting new area of research,” but I remember content/text analysis as an undergraduate, north of forty years ago now.

True, some of the measures are new, along with better visualization techniques.

On the other hand, many of the problems of textual analysis now were the problems in textual analysis then (and before).

Highly recommended as a survey of current techniques.

A history of the “problems” of textual analysis and their resistance to various techniques will have to await another day.

Comment Visualization

Saturday, February 2nd, 2013

New from Juice Labs: A visualization tool for exploring text data by Zach Gemignani.

From the post:

Today we are pleased to release another free tool on Juice Labs. The Comment visualization is the perfect way to exploring qualitative data like text survey responses, tweets, or product reviews. A few of the fun features:

  • Color comments based on a selected value
  • Filter comments using an interactive distribution chart at the top
  • Highlight the most interesting comments by selecting the flags in the upper right
  • Show the author and other contextual information about a comment

[skipping the lamest Wikipedia edits example]

Like our other free visualization tools in Juice Labs, the Comments visualization is designed for ease of use and sharing. Just drop in your own data, choose what fields you want to show as text and as values, and the visualization will immediately reflect your choices. The save button gives you a link that includes your data and settings.

Apparently the interface starts with the lamest Wikipedia edit data.

To change that, scroll down to the Data window and hover over “Learn how.”

I have reformatted the how-to content here:

Put any comma delimited data in this box. The first row needs to contain the column names. Then, give us some hints on how to use your data.

[Pre-set column names]

[*] Use this column as the question.

[a] Use this column as the author.

[cby] Use this column to color the comments. Should be a metric. By default, the comments will be sorted in ascending order.

[-] Sort the comments in descending order of the metric value. Can only be used with [cby]

[c] Use this column as a context.

Tip: you can combine the hints like: [c-cby]

Could be an interesting tool for quick and dirty exploration of textual content.

What tools do you use for information gathering and publishing?

Thursday, January 24th, 2013

What tools do you use for information gathering and publishing? by Mac Slocum.

From the post:

Many apps claim to be the pinnacle of content consumption and distribution. Most are a tangle of silly names and bad interfaces, but some of these tools are useful. A few are downright empowering.

Finding those good ones is the tricky part. I queried O’Reilly colleagues to find out what they use and why, and that process offered a decent starting point. We put all our notes together into this public Hackpad — feel free to add to it. I also went through and plucked out some of the top choices. Those are posted below.

Information gathering, however humble it may be, is the start of any topic map authoring project.

Mac asks for the tools you use every week.

Let’s not disappoint him!

silenc: Removing the silent letters from a body of text

Sunday, January 20th, 2013

silenc: Removing the silent letters from a body of text by Nathan Yau.

From the post:

During a two-week visualization course, Momo Miyazaki, Manas Karambelkar, and Kenneth Aleksander Robertsen imagined what a body of text would be without the silent letters in silenc.

Nathan suggests it isn’t fancy on the analysis side but the views are interesting.

True enough that removing silent letters (once mapped) isn’t difficult, but the results of the technique may be more than just visually interesting.
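To show how little is left to do once the mapping exists, here is a minimal Python sketch. The hard part, the map from each word to the positions of its silent letters, is assumed here as a tiny hand-built dictionary; it is my invention for illustration and not the data or code behind silenc.

```python
import re

# Hypothetical hand-built map: word -> positions of its silent letters.
SILENT_POSITIONS = {
    "knight": [0, 3, 4],  # k, g, h
    "write": [0, 4],      # w, final e
    "island": [1],        # s
}


def strip_silent(text, silent_positions=SILENT_POSITIONS):
    """Drop mapped silent letters from each word, leaving unmapped words alone."""
    def replace(match):
        word = match.group(0)
        skip = set(silent_positions.get(word.lower(), ()))
        return "".join(ch for i, ch in enumerate(word) if i not in skip)

    return re.sub(r"[A-Za-z]+", replace, text)


print(strip_silent("The knight will write from the island."))
```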

Usage patterns of words with silent letters would be an interesting question.

Or extending the technique to remove all adjectives from a text (that would shorten ad copy).

“Seeing” text or data from a different or unexpected perspective can lead to new insights. Some useful, some less so.

But it is the job of analysis to sort them out.

Interactive Text Mining

Sunday, January 20th, 2013

An overview of the BioCreative 2012 Workshop Track III: interactive text mining task by Cecilia N. Arighi, et al. (Database (2013) 2013 : bas056 doi: 10.1093/database/bas056)

Abstract:

In many databases, biocuration primarily involves literature curation, which usually involves retrieving relevant articles, extracting information that will translate into annotations and identifying new incoming literature. As the volume of biological literature increases, the use of text mining to assist in biocuration becomes increasingly relevant. A number of groups have developed tools for text mining from a computer science/linguistics perspective, and there are many initiatives to curate some aspect of biology from the literature. Some biocuration efforts already make use of a text mining tool, but there have not been many broad-based systematic efforts to study which aspects of a text mining tool contribute to its usefulness for a curation task. Here, we report on an effort to bring together text mining tool developers and database biocurators to test the utility and usability of tools. Six text mining systems presenting diverse biocuration tasks participated in a formal evaluation, and appropriate biocurators were recruited for testing. The performance results from this evaluation indicate that some of the systems were able to improve efficiency of curation by speeding up the curation task significantly (∼1.7- to 2.5-fold) over manual curation. In addition, some of the systems were able to improve annotation accuracy when compared with the performance on the manually curated set. In terms of inter-annotator agreement, the factors that contributed to significant differences for some of the systems included the expertise of the biocurator on the given curation task, the inherent difficulty of the curation and attention to annotation guidelines. After the task, annotators were asked to complete a survey to help identify strengths and weaknesses of the various systems. 
The analysis of this survey highlights how important task completion is to the biocurators’ overall experience of a system, regardless of the system’s high score on design, learnability and usability. In addition, strategies to refine the annotation guidelines and systems documentation, to adapt the tools to the needs and query types the end user might have and to evaluate performance in terms of efficiency, user interface, result export and traditional evaluation metrics have been analyzed during this task. This analysis will help to plan for a more intense study in BioCreative IV.

Curation is an aspect of topic map authoring, albeit with the latter capturing information for later merging with other sources of information.

Definitely an article you will want to read if you are designing text mining as part of a topic map solution.

The Pacific Symposium on Biocomputing 2013 [Proceedings]

Saturday, January 19th, 2013

The Pacific Symposium on Biocomputing 2013 by Will Bush.

From the post:

For 18 years now, computational biologists have convened on the beautiful islands of Hawaii to present and discuss research emerging from new areas of biomedicine. PSB Conference Chairs Teri Klein (@teriklein), Keith Dunker, Russ Altman (@Rbaltman) and Larry Hunter (@ProfLHunter) organize innovative sessions and tutorials that are always interactive and thought-provoking. This year, sessions included Computational Drug Repositioning, Epigenomics, Aberrant Pathway and Network Activity, Personalized Medicine, Phylogenomics and Population Genomics, Post-Next Generation Sequencing, and Text and Data Mining. The Proceedings are available online here, and a few of the highlights are:

See Will’s post for the highlights. Or browse the proceedings. You are almost certainly going to find something relevant to you.

Do note Will’s use of Twitter IDs as identifiers. Unique, persistent (I assume Twitter doesn’t re-assign them), easy to access.

It wasn’t clear from Will’s post if the following image was from Biocomputing 2013 or if he stopped by a markup conference. Hard to tell. ;-)

[Image: Biocomputing 2013]

MongoDB Text Search Tutorial

Thursday, January 17th, 2013

MongoDB Text Search Tutorial by Alex Popescu.

From the post:

Today is the day of the experimental MongoDB text search feature. Tobias Trelle continues his posts about this feature providing some examples for query syntax (negation, phrase search)—according to the previous post even more advanced queries should be supported, filtering and projections, multiple text fields indexing, and adding details about the stemming solution used (Snowball).
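The negation and phrase syntax mentioned above lives inside the search string itself: a phrase is wrapped in double quotes and a leading hyphen negates a term. A minimal Python sketch that composes such a string (the helper name and sample terms are my own; the filter document at the end uses the $text query operator of later MongoDB releases, whereas the experimental feature discussed in the post was invoked through a separate text command):

```python
def build_search_string(terms=(), phrases=(), excluded=()):
    """Compose a MongoDB text search string.

    Plain terms are listed as-is, phrases are wrapped in double quotes,
    and a leading hyphen negates a term.
    """
    parts = list(terms)
    parts += ['"{}"'.format(phrase) for phrase in phrases]
    parts += ["-{}".format(term) for term in excluded]
    return " ".join(parts)


search = build_search_string(terms=["coffee"], phrases=["text search"], excluded=["tea"])
print(search)

# Filter document for the $text query operator found in later MongoDB releases.
query_filter = {"$text": {"$search": search}}
print(query_filter)
```

The sketch only builds the query; running it requires a collection with a text index and a driver connection, neither of which is shown here.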

Alex also has a list of his posts on the text search feature for MongoDB.

Taming Text is released!

Sunday, January 13th, 2013

Taming Text is released! by Mike McCandless.

From the post:

There’s a new exciting book just published from Manning, with the catchy title Taming Text, by Grant S. Ingersoll (fellow Apache Lucene committer), Thomas S. Morton, and Andrew L. Farris.

I enjoyed the (e-)book: it does a good job covering a truly immense topic that could easily have taken several books. Text processing has become vital for businesses to remain competitive in this digital age, with the amount of online unstructured content growing exponentially with time. Yet, text is also a messy and therefore challenging science: the complexities and nuances of human language don’t follow a few simple, easily codified rules and are still not fully understood today.

The book describes search techniques, including tokenization, indexing, suggest and spell correction. It also covers fuzzy string matching, named entity extraction (people, places, things), clustering, classification, tagging, and a question answering system (think Jeopardy). These topics are challenging!

N-gram processing (both character and word ngrams) is featured prominently, which makes sense as it is a surprisingly effective technique for a number of applications. The book includes helpful real-world code samples showing how to process text using modern open-source tools including OpenNLP, Tika, Lucene, Solr and Mahout.
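For readers new to n-grams, here is a minimal sketch of both kinds in Python. It is not taken from the book (whose samples use Lucene, Solr and friends); the function names are mine, and the Dice coefficient is just one common way to turn character n-grams into a fuzzy-match score.

```python
def char_ngrams(text, n):
    """Return all contiguous character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all contiguous word n-grams, splitting on whitespace."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def dice_similarity(a, b, n=2):
    """Dice coefficient over character n-gram sets (illustrative fuzzy-match score)."""
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    if not grams_a and not grams_b:
        return 1.0  # two strings too short to have n-grams count as identical
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))
```

Strings that share many character bigrams score close to 1, which is why the technique holds up against misspellings without any dictionary.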

You can see:

Table of Contents.

Sample chapter 1

Sample chapter 8

Source code (98 MB)

Or, you can do like I did, grab the source code and order the eBook (PDF) version of Taming Text.

More comments to follow!

Analyzing the Enron Data…

Saturday, December 29th, 2012

Analyzing the Enron Data: Frequency Distribution, Page Rank and Document Clustering by Sujit Pal.

From the post:

I’ve been using the Enron Dataset for a couple of projects now, and I figured that it would be interesting to see if I could glean some information out of the data. One can of course simply read the Wikipedia article, but that would be too easy and not as much fun :-) .

My focus on this analysis is on the “what” and the “who”, ie, what are the important ideas in this corpus and who are the principal players. For that I did the following:

  • Extracted the words from Lucene’s inverted index into (term, docID, freq) triples. Using this, I construct a frequency distribution of words in the corpus. Looking at the most frequent words gives us an idea of what is being discussed.
  • Extract the email (from, {to, cc, bcc}) pairs from MongoDB. Using this, I piggyback on Scalding’s PageRank implementation to produce a list of emails by page rank. This gives us an idea of the “important” players.
  • Using the triples extracted from Lucene, construct tuples of (docID, termvector), then cluster the documents using KMeans. This gives us an idea of the spread of ideas in the corpus. Originally, the idea was to use Mahout for the clustering, but I ended up using Weka instead.
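The first two steps above can be sketched in plain Python. This is not Sujit’s code (he uses Lucene, MongoDB and Scalding); it assumes the triples and sender/recipient pairs have already been extracted, and the function names, damping factor and iteration count are my own illustrative choices.

```python
from collections import Counter


def term_frequencies(triples):
    """Sum per-document frequencies from (term, doc_id, freq) triples."""
    totals = Counter()
    for term, _doc_id, freq in triples:
        totals[term] += freq
    return totals


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over directed (sender, recipient) pairs."""
    edges = list(edges)
    nodes = {node for edge in edges for node in edge}
    out_links = {node: set() for node in nodes}
    for src, dst in edges:
        out_links[src].add(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by addresses that never send mail is spread evenly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for src, targets in out_links.items():
            if not targets:
                continue
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank
```

Calling `term_frequencies(triples).most_common(20)` gives a rough “what”, and sorting the `pagerank` result by value gives a rough “who”. The clustering step is left out here.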

I also wanted to get more familiar with Scalding beyond the basic stuff I did before, so I used that where I would have used Hadoop previously. The rest of the code is in Scala as usual.

Good practice for discovery of the players and main ideas when the “fiscal cliff” document set “leaks,” as you know it will.

Relationships between players and their self-serving recountings versus the data set will make an interesting topic map.

Coreference Resolution: to What Extent Does it Help NLP Applications?

Tuesday, December 18th, 2012

Coreference Resolution: to What Extent Does it Help NLP Applications? by Ruslan Mitkov. (presentation – audio only)

The paper from the same conference:

Coreference Resolution: To What Extent Does It Help NLP Applications? by Ruslan Mitkov, Richard Evans, Constantin Orăsan, Iustin Dornescu, Miguel Rios. (Text, Speech and Dialogue, 15th International Conference, TSD 2012, Brno, Czech Republic, September 3-7, 2012. Proceedings, pp. 16-27)

Abstract:

This paper describes a study of the impact of coreference resolution on NLP applications. Further to our previous study [1], in which we investigated whether anaphora resolution could be beneficial to NLP applications, we now seek to establish whether a different, but related task—that of coreference resolution, could improve the performance of three NLP applications: text summarisation, recognising textual entailment and text classification. The study discusses experiments in which the aforementioned applications were implemented in two versions, one in which the BART coreference resolution system was integrated and one in which it was not, and then tested in processing input text. The paper discusses the results obtained.

In the presentation and in the paper, Mitkov distinguishes between anaphora and coreference resolution (from the paper):

While some authors use the terms coreference (resolution) and anaphora (resolution) interchangeably, it is worth noting that they are completely distinct terms or tasks [3]. Anaphora is cohesion which points back to some previous item, with the ‘pointing back’ word or phrase called an anaphor, and the entity to which it refers, or for which it stands, its antecedent. Coreference is the act of picking out the same referent in the real world. A specific anaphor and more than one of the preceding (or following) noun phrases may be coreferential, thus forming a coreferential chain of entities which have the same referent.

I am not sure why the “real world” is necessary in: “Coreference is the act of picking out the same referent in the real world.”

For topic maps, I would shorten it to: Coreference is the act of picking out the same referent. (full stop)

The paper is a useful review of coreference systems and quite unusually, reports a negative result:

This study sought to establish whether or not coreference resolution could have a positive impact on NLP applications, in particular on text summarisation, recognising textual entailment, and text categorisation. The evaluation results presented in Section 6 are in line with previous experiments conducted both by the present authors and other researchers: there is no statistically significant benefit brought by automatic coreference resolution to these applications. In this specific study, the employment of the coreference resolution system distributed in the BART toolkit generally evokes slight but not significant increases in performance and in some cases it even evokes a slight deterioration in the performance results of these applications. We conjecture that the lack of a positive impact is due to the success rate of the BART coreference resolution system which appears to be insufficient to boost performance of the aforementioned applications.

My conjecture is that topic maps can boost coreference resolution enough to improve performance of NLP applications, including text summarisation, recognising textual entailment, and text categorisation.

What do you think?

How would you suggest testing that conjecture?

Taming Text [Coming real soon now!]

Thursday, December 13th, 2012

Taming Text by Grant S. Ingersoll, Thomas S. Morton, and Andrew L. Farris.

During a webinar today Grant said that “Taming Text” should be out in ebook form in just a week or two.

Grant is giving up the position of being the second longest running MEAP project. (He didn’t say who was first.)

Let’s all celebrate Grant and his co-authors crossing the finish line with a record number of sales!

This promises to be a real treat!

PS: Not going to put this on my wish list, too random and clumsy a process. Will just order it direct. ;-)

Bash One-Liners Explained (series)

Wednesday, November 28th, 2012

Bash One-Liners Explained by Peteris Krumins.

The series page for posts by Peteris Krumins on Bash one-liners.

So far:

One real advantage to Bash scripts is the lack of a graphical interface to get in the way.

A real advantage with “data” files but many times “text” files as well.

eGIFT: Mining Gene Information from the Literature

Thursday, November 22nd, 2012

eGIFT: Mining Gene Information from the Literature by Catalina O Tudor, Carl J Schmidt and K Vijay-Shanker.

Abstract:

Background

With the biomedical literature continually expanding, searching PubMed for information about specific genes becomes increasingly difficult. Not only can thousands of results be returned, but gene name ambiguity leads to many irrelevant hits. As a result, it is difficult for life scientists and gene curators to rapidly get an overall picture about a specific gene from documents that mention its names and synonyms.

Results

In this paper, we present eGIFT (http://biotm.cis.udel.edu/eGIFT), a web-based tool that associates informative terms, called iTerms, and sentences containing them, with genes. To associate iTerms with a gene, eGIFT ranks iTerms about the gene, based on a score which compares the frequency of occurrence of a term in the gene’s literature to its frequency of occurrence in documents about genes in general. To retrieve a gene’s documents (Medline abstracts), eGIFT considers all gene names, aliases, and synonyms. Since many of the gene names can be ambiguous, eGIFT applies a disambiguation step to remove matches that do not correspond to this gene. Another additional filtering process is applied to retain those abstracts that focus on the gene rather than mention it in passing. eGIFT’s information for a gene is pre-computed and users of eGIFT can search for genes by using a name or an EntrezGene identifier. iTerms are grouped into different categories to facilitate a quick inspection. eGIFT also links an iTerm to sentences mentioning the term to allow users to see the relation between the iTerm and the gene. We evaluated the precision and recall of eGIFT’s iTerms for 40 genes; between 88% and 94% of the iTerms were marked as salient by our evaluators, and 94% of the UniProtKB keywords for these genes were also identified by eGIFT as iTerms.

Conclusions

Our evaluations suggest that iTerms capture highly-relevant aspects of genes. Furthermore, by showing sentences containing these terms, eGIFT can provide a quick description of a specific gene. eGIFT helps not only life scientists survey results of high-throughput experiments, but also annotators to find articles describing gene aspects and functions.

Website: http://biotm.cis.udel.edu/eGIFT
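The abstract says iTerms are ranked by comparing a term’s frequency in a gene’s literature to its frequency in documents about genes in general, but it does not give the formula. A minimal sketch of that idea, assuming a smoothed ratio of document frequencies (my assumption, not eGIFT’s published score; `iterm_scores` and `smoothing` are hypothetical names):

```python
def iterm_scores(gene_docs, background_docs, smoothing=1.0):
    """Rank terms by document frequency in gene_docs relative to background_docs.

    Each document is a set of terms. The smoothed ratio used here is an
    illustrative assumption, not the score eGIFT actually computes.
    """
    def doc_freq(docs):
        counts = {}
        for doc in docs:
            for term in set(doc):
                counts[term] = counts.get(term, 0) + 1
        return counts

    gene_df = doc_freq(gene_docs)
    back_df = doc_freq(background_docs)
    scores = {}
    for term, count in gene_df.items():
        gene_rate = count / len(gene_docs)
        # Smoothing keeps terms unseen in the background from dividing by zero.
        back_rate = (back_df.get(term, 0) + smoothing) / (len(background_docs) + smoothing)
        scores[term] = gene_rate / back_rate
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Terms common in the gene’s abstracts but rare elsewhere float to the top, while terms common everywhere sink.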

Another lesson for topic map authoring interfaces: Offer domain specific search capabilities.

Using a ****** search appliance is little better than a poke with a sharp stick in most domains. The user is left to their own devices to sort out ambiguities, discover synonyms, again and again.

Your search interface may report > 900,000 “hits,” but anything beyond the first 20 or so is wasted.

(If you get sick, get something that comes up in the first 20 “hits” in PubMed. Where most researchers stop.)

Developing a biocuration workflow for AgBase… [Authoring Interfaces]

Thursday, November 22nd, 2012

Developing a biocuration workflow for AgBase, a non-model organism database by Lakshmi Pillai, Philippe Chouvarine, Catalina O. Tudor, Carl J. Schmidt, K. Vijay-Shanker and Fiona M. McCarthy.

Abstract:

AgBase provides annotation for agricultural gene products using the Gene Ontology (GO) and Plant Ontology, as appropriate. Unlike model organism species, agricultural species have a body of literature that does not just focus on gene function; to improve efficiency, we use text mining to identify literature for curation. The first component of our annotation interface is the gene prioritization interface that ranks gene products for annotation. Biocurators select the top-ranked gene and mark annotation for these genes as ‘in progress’ or ‘completed’; links enable biocurators to move directly to our biocuration interface (BI). Our BI includes all current GO annotation for gene products and is the main interface to add/modify AgBase curation data. The BI also displays Extracting Genic Information from Text (eGIFT) results for each gene product. eGIFT is a web-based, text-mining tool that associates ranked, informative terms (iTerms) and the articles and sentences containing them, with genes. Moreover, iTerms are linked to GO terms, where they match either a GO term name or a synonym. This enables AgBase biocurators to rapidly identify literature for further curation based on possible GO terms. Because most agricultural species do not have standardized literature, eGIFT searches all gene names and synonyms to associate articles with genes. As many of the gene names can be ambiguous, eGIFT applies a disambiguation step to remove matches that do not correspond to this gene, and filtering is applied to remove abstracts that mention a gene in passing. The BI is linked to our Journal Database (JDB) where corresponding journal citations are stored. Just as importantly, biocurators also add to the JDB citations that have no GO annotation. The AgBase BI also supports bulk annotation upload to facilitate our Inferred from electronic annotation of agricultural gene products. All annotations must pass standard GO Consortium quality checking before release in AgBase.

Database URL: http://www.agbase.msstate.edu/

Another approach to biocuration. I will be posting on eGIFT separately but do note this is a domain specific tool.

The authors did not set out to create the universal curation tool but one suited to their specific data and requirements.

I think there is an important lesson here for semantic authoring interfaces. Word processors offer very generic interfaces but consequently little in the way of structure. Authoring annotated information requires more structure and that requires domain specifics.

Now there is an idea: create topic map authoring interfaces on top of a common skeleton, instead of hard-coding interfaces around how users “should” use the tool.

Accelerating literature curation with text-mining tools:…

Monday, November 19th, 2012

Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts by Chih-Hsuan Wei, Bethany R. Harris, Donghui Li, Tanya Z. Berardini, Eva Huala, Hung-Yu Kao and Zhiyong Lu.

Abstract:

Today’s biomedical research has become heavily dependent on access to the biological knowledge encoded in expert curated biological databases. As the volume of biological literature grows rapidly, it becomes increasingly difficult for biocurators to keep up with the literature because manual curation is an expensive and time-consuming endeavour. Past research has suggested that computer-assisted curation can improve efficiency, but few text-mining systems have been formally evaluated in this regard. Through participation in the interactive text-mining track of the BioCreative 2012 workshop, we developed PubTator, a PubMed-like system that assists with two specific human curation tasks: document triage and bioconcept annotation. On the basis of evaluation results from two external user groups, we find that the accuracy of PubTator-assisted curation is comparable with that of manual curation and that PubTator can significantly increase human curatorial speed. These encouraging findings warrant further investigation with a larger number of publications to be annotated.

Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/

Presentation on PubTator (slides, PDF).

Hmmm, curating abstracts. That sounds like annotating subjects in documents, doesn’t it? Or something very close. ;-)

If we start off with a set of subjects, that eases topic map authoring because users are assisted by automatic creation of topic map machinery. Creation triggered by identification of subjects and associations.

Users don’t have to start with bare ground to build a topic map.

Clever users build (and sell) forms, frames, components and modules that serve as the scaffolding for other topic maps.

… text mining in the FlyBase genetic literature curation workflow

Sunday, November 18th, 2012

Opportunities for text mining in the FlyBase genetic literature curation workflow by Peter McQuilton. (Database (2012) 2012 : bas039 doi: 10.1093/database/bas039)

Abstract:

FlyBase is the model organism database for Drosophila genetic and genomic information. Over the last 20 years, FlyBase has had to adapt and change to keep abreast of advances in biology and database design. We are continually looking for ways to improve curation efficiency and efficacy. Genetic literature curation focuses on the extraction of genetic entities (e.g. genes, mutant alleles, transgenic constructs) and their associated phenotypes and Gene Ontology terms from the published literature. Over 2000 Drosophila research articles are now published every year. These articles are becoming ever more data-rich and there is a growing need for text mining to shoulder some of the burden of paper triage and data extraction. In this article, we describe our curation workflow, along with some of the problems and bottlenecks therein, and highlight the opportunities for text mining. We do so in the hope of encouraging the BioCreative community to help us to develop effective methods to mine this torrent of information.

Database URL: http://flybase.org

Would you believe that ambiguity is problem #1 and describing relationships is another one?

The most common problem encountered during curation is an ambiguous genetic entity (gene, mutant allele, transgene, etc.). This situation can arise when no unique identifier (such as a FlyBase gene identifier (FBgn) or a computed gene (CG) number for genes), or an accurate and explicit reference for a mutant or transgenic line is given. Ambiguity is a particular problem when a generic symbol/name is used (e.g. ‘Actin’ or UAS-Notch), or when a symbol/name is used that is a synonym for a different entity (e.g. ‘ras’ is the current FlyBase symbol for the ‘raspberry’ gene, FBgn0003204, but is often used in the literature to refer to the ‘Ras85D’ gene, FBgn0003205). A further issue is that some symbols only differ in case-sensitivity for the first character, for example, the gene symbols ‘dl’ (dorsal) and ‘Dl’ (Delta). These ambiguities can usually be resolved by searching for associated details about the entity in the article (e.g. the use of a specific mutant allele can identify the gene being discussed) or by consulting the supplemental information for additional details. Sometimes we have to do some analysis ourselves, such as performing a BLAST search using any sequence data present in the article or supplementary files or executing an in-house script to report those entities used by a specified author in previously curated articles. As a final step, if we cannot resolve a problem, we email the corresponding author for clarification. If the ambiguity still cannot be resolved, then a curator will either associate a generic/unspecified entry for that entity with the article, or else omit the entity and add a (non-public) note to the curation record explaining the situation, with the hope that future publications will resolve the issue.

One of the more esoteric problems found in curation is the fact that multiple relationships exist between the curated data types. For example, the ‘dppEP2232 allele’ is caused by the ‘P{EP}dppEP2232 insertion’ and disrupts the ‘dpp gene’. This can cause problems for text-mining assisted curation, as the data can be attributed to the wrong object due to sentence structure or the requirement of background or contextual knowledge found in other parts of the article. In cases like this, detailed knowledge of the FlyBase proforma and curation rules, as well as a good knowledge of Drosophila biology, is necessary to ensure the correct proforma field is filled in. This is one of the reasons why we believe text-mining methods will assist manual curation rather than replace it in the near term.
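The symbol ambiguity described in the first passage is easy to reproduce with a toy lookup. A minimal sketch, assuming a tiny hand-built table: the two ‘ras’ identifiers are the ones quoted above, the identifiers for ‘dl’ and ‘Dl’ are placeholders (the excerpt does not give them), and the function names and status labels are mine, not FlyBase’s.

```python
from collections import defaultdict

# (identifier, current symbol, synonyms). The first two identifiers come from
# the quoted passage; the 'dl' and 'Dl' identifiers are made-up placeholders.
GENES = [
    ("FBgn0003204", "ras", []),
    ("FBgn0003205", "Ras85D", ["ras"]),
    ("FBgn-PLACEHOLDER-1", "dl", ["dorsal"]),
    ("FBgn-PLACEHOLDER-2", "Dl", ["Delta"]),
]


def build_index(genes):
    """Index identifiers by exact name and by case-folded name."""
    exact = defaultdict(set)
    folded = defaultdict(set)
    for gene_id, symbol, synonyms in genes:
        for name in [symbol, *synonyms]:
            exact[name].add(gene_id)
            folded[name.lower()].add(gene_id)
    return exact, folded


def resolve(name, exact, folded):
    """Return (status, candidate identifiers) for a gene mention."""
    candidates = exact.get(name, set())
    if len(candidates) == 1:
        # Flag names where another entity differs only by letter case.
        near = folded.get(name.lower(), set()) - candidates
        status = "unique-but-case-sensitive" if near else "unique"
        return status, sorted(candidates)
    if candidates:
        return "ambiguous", sorted(candidates)
    return "unknown", []
```

A mention of ‘ras’ comes back with two candidates, which is exactly the point where a curator (or the surrounding article text) has to decide.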

I like the “manual curation” line. Curation is a task best performed by a sentient being.

A Language-Independent Approach to Keyphrase Extraction and Evaluation

Sunday, November 18th, 2012

A Language-Independent Approach to Keyphrase Extraction and Evaluation (2010) by Mari-sanna Paukkeri, Ilari T. Nieminen, Matti Pöllä and Timo Honkela.

Abstract:

We present Likey, a language-independent keyphrase extraction method based on statistical analysis and the use of a reference corpus. Likey has a very light-weight preprocessing phase and no parameters to be tuned. Thus, it is not restricted to any single language or language family. We test Likey having exactly the same configuration with 11 European languages. Furthermore, we present an automatic evaluation method based on Wikipedia intra-linking.

Useful approach for developing a rough-cut of keywords in documents. Keywords that may indicate a need for topics to represent subjects.

Interesting that:

Phrases occurring only once in the document cannot be selected as keyphrases.

I would have thought unique phrases would automatically qualify as keyphrases. The ranking of phrases, calculated with the reference corpus and text, excludes unique phrases, in the absence of any ratio for ranking.

That sounds like a bug and not a feature to me.

Reasoning that phrases unique to an author are unique identifications of subjects. Certainly grist for a topic map mill.
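To see why once-only phrases drop out, here is a minimal sketch of ranking by a document-rank to reference-rank ratio. It is my reading of the approach, not the authors’ code: single words stand in for phrases, and the tie handling, the rank given to phrases missing from the reference corpus, and the names `ranks` and `rank_ratio_keyphrases` are all assumptions.

```python
from collections import Counter


def ranks(counts):
    """Map each phrase to its frequency rank; ties share a rank, 1 = most frequent."""
    ordered = sorted(set(counts.values()), reverse=True)
    rank_of_count = {count: i + 1 for i, count in enumerate(ordered)}
    return {phrase: rank_of_count[c] for phrase, c in counts.items()}


def rank_ratio_keyphrases(doc_tokens, ref_tokens, min_count=2):
    """Order phrases by (rank in document) / (rank in reference corpus), lowest first."""
    doc_counts = Counter(doc_tokens)
    ref_counts = Counter(ref_tokens)
    doc_rank = ranks(doc_counts)
    ref_rank = ranks(ref_counts)
    # Assumed rank for phrases absent from the reference: one past the last rank.
    unseen_rank = len(set(ref_counts.values())) + 1
    scored = []
    for phrase, count in doc_counts.items():
        if count < min_count:
            continue  # phrases occurring only once are never selected
        ratio = doc_rank[phrase] / ref_rank.get(phrase, unseen_rank)
        scored.append((ratio, phrase))
    return [phrase for _ratio, phrase in sorted(scored)]
```

Setting `min_count=1` lets unique phrases back in, but they all tie at the bottom document rank, so the ratio then depends only on the reference corpus.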

Web based demonstration: http://cog.hut.fi/likeydemo/.

Mari-Sanna Paukkeri: Contact details and publications.

Layout-aware text extraction from full-text PDF of scientific articles

Monday, October 8th, 2012

Layout-aware text extraction from full-text PDF of scientific articles by Cartic Ramakrishnan, Abhishek Patnia, Eduard Hovy and Gully APC Burns. (Source Code for Biology and Medicine 2012, 7:7 doi:10.1186/1751-0473-7-7)

Abstract:

Background

The Portable Document Format (PDF) is the most commonly used file format for online scientific publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents a significant challenge for developers of biomedical text mining or biocuration informatics systems that use published literature as an information source. In this paper we introduce the ‘Layout-Aware PDF Text Extraction’ (LA-PDFText) system to facilitate accurate extraction of text from PDF files of research articles for use in text mining applications.

Results

Our paper describes the construction and performance of an open source system that extracts text blocks from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize specific sections. The LA-PDFText system focuses only on the textual content of the research articles and is meant as a baseline for further experiments into more advanced extraction methods that handle multi-modal content, such as images and graphs. The system works in a three-stage process: (1) Detecting contiguous text blocks using spatial layout processing to locate and identify blocks of contiguous text, (2) Classifying text blocks into rhetorical categories using a rule-based method and (3) Stitching classified text blocks together in the correct order resulting in the extraction of text from section-wise grouped blocks. We show that our system can identify text blocks and classify them into rhetorical categories with Precision = 0.96%, Recall = 0.89% and F1 = 0.91%. We also present an evaluation of the accuracy of the block detection algorithm used in step 2. Additionally, we have compared the accuracy of the text extracted by LA-PDFText to the text from the Open Access subset of PubMed Central. We then compared this accuracy with that of the text extracted by the PDF2Text system, commonly used to extract text from PDF. Finally, we discuss preliminary error analysis for our system and identify further areas of improvement.

Conclusions

LA-PDFText is an open-source tool for accurately extracting text from full-text scientific articles. The release of the system is available at http://code.google.com/p/lapdftext/.

Scanning TOCs from a variety of areas can uncover goodies like this one.

What is the most recent “unexpected” paper/result outside your “field” that you have found?