Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 28, 2013

Theory and Applications for Advanced Text Mining

Filed under: Text Analytics,Text Mining — Patrick Durusau @ 7:30 pm

Theory and Applications for Advanced Text Mining edited by Shigeaki Sakurai.

From the post:

Book chapters include:

  • Survey on Kernel-Based Relation Extraction by Hanmin Jung, Sung-Pil Choi, Seungwoo Lee and Sa-Kwang Song
  • Analysis for Finding Innovative Concepts Based on Temporal Patterns of Terms in Documents by Hidenao Abe
  • Text Clumping for Technical Intelligence by Alan Porter and Yi Zhang
  • A Semantic-Based Framework for Summarization and Page Segmentation in Web Mining by Alessio Leoncini, Fabio Sangiacomo, Paolo Gastaldo and Rodolfo Zunino
  • Ontology Learning Using Word Net Lexical Expansion and Text Mining by Hiep Luong, Susan Gauch and Qiang Wang
  • Automatic Compilation of Travel Information from Texts: A Survey by Hidetsugu Nanba, Aya Ishino and Toshiyuki Takezawa
  • Analyses on Text Data Related to the Safety of Drug Use Based on Text Mining Techniques by Masaomi Kimura
  • Biomedical Named Entity Recognition: A Survey of Machine-Learning Tools by David Campos, Sergio Matos and Jose Luis Oliveira
  • Toward Computational Processing of Less Resourced Languages: Primarily Experiments for Moroccan Amazigh Language by Fadoua Ataa Allah and Siham Boulaknadel

Download the book or the chapters at:

www.intechopen.com/books/theory-and-applications-for-advanced-text-mining

Is it just me, or are more data mining/analysis books appearing as open texts alongside traditional print publication than, say, five years ago?

October 4, 2013

The Irony of Obamacare:…

Filed under: Politics,Text Analytics,Text Mining — Patrick Durusau @ 3:49 pm

The Irony of Obamacare: Republicans Thought of It First by Meghan Foley.

From the post:

“An irony of the Patient Protection and Affordable Care Act (Obamacare) is that one of its key provisions, the individual insurance mandate, has conservative origins. In Congress, the requirement that individuals purchase health insurance first emerged in Republican health care reform bills introduced in 1993 as alternatives to the Clinton plan. The mandate was also a prominent feature of the Massachusetts plan passed under Governor Mitt Romney in 2006. According to Romney, ‘we got the idea of an individual mandate from [Newt Gingrich], and [Newt] got it from the Heritage Foundation.’” – Tracing the Flow of Policy Ideas in Legislatures: A Text Reuse Approach

That irony led John Wilkerson of the University of Washington and his colleagues David Smith and Nick Stramp to study the legislative history of the health care reform law using a text-analysis system to understand its origins.

Scholars rely almost exclusively on floor roll call voting patterns to assess partisan cooperation in Congress, according to findings in the paper, Tracing the Flow of Policy Ideas in Legislatures: A Text Reuse Approach. By that standard, the Affordable Care Act was a highly partisan bill. Yet a different story emerges when the source of the reform’s policy ideas is analyzed. The authors’ findings showed that a number of GOP policy ideas overlap with provisions in the Affordable Care Act: Of the 906-page law, 3 percent of the “policy ideas” used wording similar to bills sponsored by House Republicans and 8 percent used wording similar to bills sponsored by Senate Republicans.

In the paper, the authors say:

Our approach is to focus on legislative text. We assume that two bills share a policy idea when they share similar text. Of course, this raises many questions about whether similar text does actually capture shared policy ideas. This paper constitutes an early cut at the question.

The same thinking, similar text = similar ideas, permeates prior art searches on patents as well.
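
To make the "similar text = similar ideas" notion concrete, here is a toy Python sketch of text reuse detection using the standard library's difflib. The bill excerpts are invented and the paper's own alignment method is considerably more sophisticated, but the idea is the same: long runs of shared wording are treated as shared policy ideas, short overlaps as legislative boilerplate.

```python
from difflib import SequenceMatcher

# Two invented bill excerpts, standing in for sections of real legislation.
bill_a = ("Each individual shall maintain minimum essential health coverage "
          "for each month beginning after December 31.")
bill_b = ("Every individual shall maintain minimum essential health coverage "
          "for any month beginning after the effective date.")

words_a, words_b = bill_a.split(), bill_b.split()
matcher = SequenceMatcher(None, words_a, words_b)

# Report word-level matching blocks longer than a threshold as shared language.
for block in matcher.get_matching_blocks():
    if block.size >= 4:  # ignore short incidental overlaps
        shared = " ".join(words_a[block.a:block.a + block.size])
        print(f"shared span ({block.size} words): {shared}")

print(f"overall similarity ratio: {matcher.ratio():.2f}")
```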

A more fruitful search would compare donor statements, proposals, and literature for similar language/ideas.

In that regard, members of the United States Congress are just messengers.

PS: Thanks to Sam Hunting for the pointer to this article!

September 4, 2013

Text Analysis With R

Filed under: R,Text Analytics — Patrick Durusau @ 6:59 pm

Text Analysis With R for Students of Literature by Matthew L. Jockers.

A draft text asking for feedback, but it has to be more enjoyable than some of the standards I have been reading. 😉

For that matter, some of the techniques Matthew describes should be useful in working with standards drafts.

April 10, 2013

Apache cTAKES

Apache cTAKES

From the webpage:

Apache clinical Text Analysis and Knowledge Extraction System (cTAKES) is an open-source natural language processing system for information extraction from electronic medical record clinical free-text. It processes clinical notes, identifying types of clinical named entities from various dictionaries including the Unified Medical Language System (UMLS) – medications, diseases/disorders, signs/symptoms, anatomical sites and procedures. Each named entity has attributes for the text span, the ontology mapping code, subject (patient, family member, etc.) and context (negated/not negated, conditional, generic, degree of certainty). Some of the attributes are expressed as relations, for example the location of a clinical condition (locationOf relation) or the severity of a clinical condition (degreeOf relation).

Apache cTAKES was built using the Apache UIMA Unstructured Information Management Architecture engineering framework and Apache OpenNLP natural language processing toolkit. Its components are specifically trained for the clinical domain out of diverse manually annotated datasets, and create rich linguistic and semantic annotations that can be utilized by clinical decision support systems and clinical research. cTAKES has been used in a variety of use cases in the domain of biomedicine such as phenotype discovery, translational science, pharmacogenomics and pharmacogenetics.

Apache cTAKES employs a number of rule-based and machine learning methods. Apache cTAKES components include:

  1. Sentence boundary detection
  2. Tokenization (rule-based)
  3. Morphologic normalization
  4. POS tagging
  5. Shallow parsing
  6. Named Entity Recognition
    • Dictionary mapping
    • Semantic typing is based on these UMLS semantic types: diseases/disorders, signs/symptoms, anatomical sites, procedures, medications
  7. Assertion module
  8. Dependency parser
  9. Constituency parser
  10. Semantic Role Labeler
  11. Coreference resolver
  12. Relation extractor
  13. Drug Profile module
  14. Smoking status classifier

The goal of cTAKES is to be a world-class natural language processing system in the healthcare domain. cTAKES can be used in a great variety of retrievals and use cases. It is intended to be modular and expandable at the information model and method level.
The cTAKES community is committed to best practices and R&D (research and development) by using cutting edge technologies and novel research. The idea is to quickly translate the best performing methods into cTAKES code.

Processing a text with cTAKES is a process of adding semantic information to the text.

As you can imagine, the better the semantics that are added, the better searching and other functions become.
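
To make that concrete, here is a toy Python sketch of two of the steps listed above, dictionary mapping and a crude negation attribute. This is not cTAKES itself (which is a Java/UIMA system) and the dictionary entries and codes are placeholders, but it shows the kind of semantic information that gets attached to the text.

```python
import re

# Invented mini-dictionary: surface form -> (semantic type, placeholder code).
DICTIONARY = {
    "chest pain": ("sign/symptom", "SS-0001"),
    "aspirin":    ("medication",   "MED-0042"),
    "pneumonia":  ("disease",      "DIS-0310"),
}
NEGATION_CUES = ("no ", "denies ", "without ")

def annotate(note: str):
    """Return dictionary-mapped entities with a crude negation attribute."""
    annotations = []
    lowered = note.lower()
    for term, (sem_type, code) in DICTIONARY.items():
        for match in re.finditer(re.escape(term), lowered):
            # Look a short window back for a negation cue (very rough heuristic).
            window = lowered[max(0, match.start() - 20):match.start()]
            annotations.append({
                "span": (match.start(), match.end()),
                "text": note[match.start():match.end()],
                "type": sem_type,
                "code": code,
                "negated": any(cue in window for cue in NEGATION_CUES),
            })
    return annotations

note = "Patient denies chest pain. Started aspirin for suspected pneumonia."
for ann in annotate(note):
    print(ann)
```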

In order to make added semantic information interoperable, well, that’s a topic map question.

I first saw this in a tweet by Tim O’Reilly.

March 18, 2013

…2,958 Nineteenth-Century British Novels

Filed under: Literature,Text Analytics,Text Mining — Patrick Durusau @ 10:27 am

A Quantitative Literary History of 2,958 Nineteenth-Century British Novels: The Semantic Cohort Method by Ryan Heuser and Long Le-Khac.

From the introduction:

The nineteenth century in Britain saw tumultuous changes that reshaped the fabric of society and altered the course of modernization. It also saw the rise of the novel to the height of its cultural power as the most important literary form of the period. This paper reports on a long-term experiment in tracing such macroscopic changes in the novel during this crucial period. Specifically, we present findings on two interrelated transformations in novelistic language that reveal a systemic concretization in language and fundamental change in the social spaces of the novel. We show how these shifts have consequences for setting, characterization, and narration as well as implications for the responsiveness of the novel to the dramatic changes in British society.

This paper has a second strand as well. This project was simultaneously an experiment in developing quantitative and computational methods for tracing changes in literary language. We wanted to see how far quantifiable features such as word usage could be pushed toward the investigation of literary history. Could we leverage quantitative methods in ways that respect the nuance and complexity we value in the humanities? To this end, we present a second set of results, the techniques and methodological lessons gained in the course of designing and running this project.

This branch of the digital humanities, the macroscopic study of cultural history, is a field that is still constructing itself. The right methods and tools are not yet certain, which makes for the excitement and difficulty of the research. We found that such decisions about process cannot be made a priori, but emerge in the messy and non-linear process of working through the research, solving problems as they arise. From this comes the odd, narrative form of this paper, which aims to present the twists and turns of this process of literary and methodological insight. We have divided the paper into two major parts, the development of the methodology (Sections 1 through 3) and the story of our results (Sections 4 and 5). In actuality, these two processes occurred simultaneously; pursuing our literary-historical questions necessitated developing new methodologies. But for the sake of clarity, we present them as separate though intimately related strands.

If this sounds far afield from mining tweets, emails, corporate documents or government archives, can you articulate the difference?

Or do we reflexively treat some genres of texts as “different?”

How useful you will find some of the techniques outlined will depend on the purpose of your analysis.

If you are only doing key-word searching, this isn’t likely to be helpful.

If on the other hand, you are attempting more sophisticated analysis, read on!

I first saw this in Nat Torkington’s Four Short Links: 18 March 2013.

February 27, 2013

An Interactive Analysis of Tolkien’s Works

Filed under: Graphics,Literature,Text Analytics,Text Mining,Visualization — Patrick Durusau @ 2:54 pm

An Interactive Analysis of Tolkien’s Works by Emil Johansson.

Description:

Being passionate about both Tolkien and data visualization, creating an interactive analysis of Tolkien’s books seemed like a wonderful idea. To the left you will be able to explore character mentions and keyword frequency as well as sentiment analysis of the Silmarillion, the Hobbit and the Lord of the Rings. Information on editions of the books and methods used can be found in the about section.

There you will find:

WORD COUNT AND DENSITY
CHARACTER MENTIONS
KEYWORD FREQUENCY
COMMON WORDS
SENTIMENT ANALYSIS
CHARACTER CO-OCCURRENCE
CHAPTER LENGTHS
WORD APPEARANCE
POSTERS

Truly remarkable analysis and visualization!

I suspect users of this portal don’t wonder so much about “how” it is done, but concentrate on the benefits it brings.

Does that sound like a marketing idea for topic maps?
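
For readers who do wonder about the "how," here is a minimal Python sketch of the character-mention and co-occurrence counting behind views like these. The paragraphs and character list are invented; a real run would load the full text of each book.

```python
from collections import Counter
from itertools import combinations

# Invented snippets standing in for the full text of a novel.
paragraphs = [
    "Frodo and Sam climbed the stairs while Gollum watched from below.",
    "Gandalf rode ahead; Frodo could no longer see him.",
    "Sam thought of Frodo and of the long road home.",
]
characters = ["Frodo", "Sam", "Gandalf", "Gollum"]

mentions = Counter()
cooccurrence = Counter()

for para in paragraphs:
    present = [name for name in characters if name in para]
    mentions.update(present)
    # Count each unordered pair of characters appearing in the same paragraph.
    cooccurrence.update(combinations(sorted(present), 2))

print("mentions:", dict(mentions))
print("co-occurrence:", dict(cooccurrence))
```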

I first saw this in the DashingD3js.com Weekly Newsletter.

February 3, 2013

Text as Data:…

Filed under: Data Analysis,Text Analytics,Text Mining,Texts — Patrick Durusau @ 6:58 pm

Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts by Justin Grimmer and Brandon M. Stewart.

Abstract:

Politics and political conflict often occur in the written and spoken word. Scholars have long recognized this, but the massive costs of analyzing even moderately sized collections of texts have hindered their use in political science research. Here lies the promise of automated text analysis: it substantially reduces the costs of analyzing large collections of text. We provide a guide to this exciting new area of research and show how, in many instances, the methods have already obtained part of their promise. But there are pitfalls to using automated methods—they are no substitute for careful thought and close reading and require extensive and problem-specific validation. We survey a wide range of new methods, provide guidance on how to validate the output of the models, and clarify misconceptions and errors in the literature. To conclude, we argue that for automated text methods to become a standard tool for political scientists, methodologists must contribute new methods and new methods of validation.

As a former political science major, I had to stop to read this article.

A wide-ranging survey of an “exciting new area of research,” but I remember content/text analysis as an undergraduate, north of forty years ago now.

True, some of the measures are new, along with better visualization techniques.

On the other hand, many of the problems of textual analysis now were the problems in textual analysis then (and before).

Highly recommended as a survey of current techniques.

A history of the “problems” of textual analysis and their resistance to various techniques will have to await another day.

January 20, 2013

silenc: Removing the silent letters from a body of text

Filed under: Graphics,Text Analytics,Text Mining,Texts,Visualization — Patrick Durusau @ 8:05 pm

silenc: Removing the silent letters from a body of text by Nathan Yau.

From the post:

During a two-week visualization course, Momo Miyazaki, Manas Karambelkar, and Kenneth Aleksander Robertsen imagined what a body of text would be without the silent letters in silenc.

Nathan suggests it isn’t fancy on the analysis side, but the views are interesting.

True enough that removing silent letters (once mapped) isn’t difficult, but the results of the technique may be more than just visually interesting.
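
Here is a minimal Python sketch of that "once mapped" step. The silent-letter table covers a handful of words and is invented by hand; the silenc project derived its mappings from phonetic data rather than from a hand-built table.

```python
# Invented mapping: word -> indices of its silent letters.
SILENT = {
    "knight": [0, 3],   # k, g
    "island": [1],      # s
    "debt":   [2],      # b
    "half":   [2],      # l
}

def drop_silent(text: str) -> str:
    out_words = []
    for word in text.split():
        key = word.lower().strip(".,;")
        indices = set(SILENT.get(key, []))
        out_words.append("".join(ch for i, ch in enumerate(word) if i not in indices))
    return " ".join(out_words)

print(drop_silent("The knight paid his debt on the island."))
# -> The niht paid his det on the iland.
```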

Usage patterns of words with silent letters would be an interesting question.

Or extending the technique to remove all adjectives from a text (that would shorten ad copy).

“Seeing” text or data from a different or unexpected perspective can lead to new insights. Some useful, some less so.

But it is the job of analysis to sort them out.

January 13, 2013

Taming Text is released!

Filed under: Text Analytics,Text Mining — Patrick Durusau @ 8:09 pm

Taming Text is released! by Mike McCandless.

From the post:

There’s a new exciting book just published from Manning, with the catchy title Taming Text, by Grant S. Ingersoll (fellow Apache Lucene committer), Thomas S. Morton, and Andrew L. Farris.

I enjoyed the (e-)book: it does a good job covering a truly immense topic that could easily have taken several books. Text processing has become vital for businesses to remain competitive in this digital age, with the amount of online unstructured content growing exponentially with time. Yet, text is also a messy and therefore challenging science: the complexities and nuances of human language don’t follow a few simple, easily codified rules and are still not fully understood today.

The book describes search techniques, including tokenization, indexing, suggest and spell correction. It also covers fuzzy string matching, named entity extraction (people, places, things), clustering, classification, tagging, and a question answering system (think Jeopardy). These topics are challenging!

N-gram processing (both character and word ngrams) is featured prominently, which makes sense as it is a surprisingly effective technique for a number of applications. The book includes helpful real-world code samples showing how to process text using modern open-source tools including OpenNLP, Tika, Lucene, Solr and Mahout.

You can see:

Table of Contents.

Sample chapter 1

Sample chapter 8

Source code (98 MB)

Or, you can do like I did, grab the source code and order the eBook (PDF) version of Taming Text.

More comments to follow!
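
Until then, since n-gram processing figures so prominently in the book, here is a minimal Python sketch of character and word n-gram extraction. The book's own examples use the Java tools it covers; this is just the flavor of the technique.

```python
def char_ngrams(text: str, n: int):
    """Overlapping character n-grams, a cheap but surprisingly effective feature."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(text: str, n: int):
    """Overlapping word n-grams as tuples."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(char_ngrams("taming text", 3))             # ['tam', 'ami', 'min', ...]
print(word_ngrams("there is a new exciting book", 2))
```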

December 29, 2012

Analyzing the Enron Data…

Filed under: Clustering,PageRank,Statistics,Text Analytics,Text Mining — Patrick Durusau @ 6:07 am

Analyzing the Enron Data: Frequency Distribution, Page Rank and Document Clustering by Sujit Pal.

From the post:

I’ve been using the Enron Dataset for a couple of projects now, and I figured that it would be interesting to see if I could glean some information out of the data. One can of course simply read the Wikipedia article, but that would be too easy and not as much fun :-).

My focus on this analysis is on the “what” and the “who”, ie, what are the important ideas in this corpus and who are the principal players. For that I did the following:

  • Extracted the words from Lucene’s inverted index into (term, docID, freq) triples. Using this, I construct a frequency distribution of words in the corpus. Looking at the most frequent words gives us an idea of what is being discussed.
  • Extract the email (from, {to, cc, bcc}) pairs from MongoDB. Using this, I piggyback on Scalding’s PageRank implementation to produce a list of emails by page rank. This gives us an idea of the “important” players.
  • Using the triples extracted from Lucene, construct tuples of (docID, termvector), then cluster the documents using KMeans. This gives us an idea of the spread of ideas in the corpus. Originally, the idea was to use Mahout for the clustering, but I ended up using Weka instead.

I also wanted to get more familiar with Scalding beyond the basic stuff I did before, so I used that where I would have used Hadoop previously. The rest of the code is in Scala as usual.

Good practice for discovery of the players and main ideas when the “fiscal cliff” document set “leaks,” as you know it will.

Relationships between players and their self-serving recountings versus the data set will make an interesting topic map.
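
If you want to try the "who are the important players" step on your own mail data, here is a toy Python sketch of PageRank over sender/recipient pairs using networkx. The addresses are invented; Sujit's run used Scalding's PageRank implementation over the full corpus.

```python
import networkx as nx

# Invented (from, to) pairs standing in for the extracted email headers.
edges = [
    ("alice@enron.com", "bob@enron.com"),
    ("carol@enron.com", "bob@enron.com"),
    ("bob@enron.com",   "dave@enron.com"),
    ("dave@enron.com",  "bob@enron.com"),
    ("erin@enron.com",  "dave@enron.com"),
]

graph = nx.DiGraph()
graph.add_edges_from(edges)

# Higher PageRank scores suggest more "important" players in the mail flow.
for addr, score in sorted(nx.pagerank(graph, alpha=0.85).items(),
                          key=lambda kv: kv[1], reverse=True):
    print(f"{score:.3f}  {addr}")
```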

December 12, 2012

UC Irvine Extension Announces Winter Predictive Analytics and Info System Courses

Filed under: Predictive Analytics,Text Analytics — Patrick Durusau @ 8:42 pm

UC Irvine Extension Announces Winter Predictive Analytics and Info System Courses

From the post:

Predictive Analytics Certificate Program:

This program is designed for professionals who are using or wish to use Predictive Analytics to optimize business performance at a variety of levels. UC Irvine Extension is offering the following webinar and two courses during winter quarter:

Predictive Analytics Special Topic Webinar: Text Analytics & Text Mining (Jan. 15, 11:30 a.m. to 12:30 p.m., PST) – This free webinar will provide participants with the introductory concepts of text analytics and text mining that are used to recognize how stored, unstructured data represents an extremely valuable source of business information.

Course: Effective Data Preparation (Jan. 7 to Feb. 24) – This online course will address how to extract stored data elements, transform their formats, and derive new relationships among them, in order to produce a dataset suitable for analytical modeling. Course instructor Dr. Robert Nisbet, chief scientist at Smogfarm, which studies crowd psychology, will provide attendees with the skills to produce a fully processed data set compatible for building powerful predictive models.

Course: Text Analytics & Text Mining (Jan. 28 to March 24) – This new online course instructed by Dr. Gary Miner, author of Handbook of Statistical Analysis & Data Mining Applications and Practical Text Mining, will focus on basic concepts of textual information including tokenization and part-of-speech tagging. The course will expose participants to practical techniques for text extraction and text mining, document clustering and classification, information retrieval, and the enhancement of structured data.

Just so you know, the webinar is free but Effective Data Preparation and Text Analytics & Text Mining are $695.00 each.

I am always made more curious by the omission of the most obvious questions from an FAQ, or by placement of that information in very non-prominent places.

I suspect the courses are well worth the price, but why not be up front with the charges?

August 24, 2012

Going Beyond the Numbers:…

Filed under: Analytics,Text Analytics,Text Mining — Patrick Durusau @ 1:39 pm

Going Beyond the Numbers: How to Incorporate Textual Data into the Analytics Program by Cindi Thompson.

From the post:

Leveraging the value of text-based data by applying text analytics can help companies gain competitive advantage and an improved bottom line, yet many companies are still letting their document repositories and external sources of unstructured information lie fallow.

That’s no surprise, since the application of analytics techniques to textual data and other unstructured content is challenging and requires a relatively unfamiliar skill set. Yet applying business and industry knowledge and starting small can yield satisfying results.

Capturing More Value from Data with Text Analytics

There’s more to data than the numerical organizational data generated by transactional and business intelligence systems. Although the statistics are difficult to pin down, it’s safe to say that the majority of business information for a typical company is stored in documents and other unstructured data sources, not in structured databases. In addition, there is a huge amount of business-relevant information in documents and text that reside outside the enterprise. To ignore the information hidden in text is to risk missing opportunities, including the chance to:

  • Capture early signals of customer discontent.
  • Quickly target product deficiencies.
  • Detect fraud.
  • Route documents to those who can effectively leverage them.
  • Comply with regulations such as XBRL coding or redaction of personally identifiable information.
  • Better understand the events, people, places and dates associated with a large set of numerical data.
  • Track competitive intelligence.

To be sure, textual data is messy and poses difficulties.

But, as Cindi points out, there are golden benefits in those hills of textual data.

July 14, 2012

Finding Structure in Text, Genome and Other Symbolic Sequences

Filed under: Genome,Statistics,Symbol,Text Analytics,Text Corpus,Text Mining — Patrick Durusau @ 8:58 am

Finding Structure in Text, Genome and Other Symbolic Sequences by Ted Dunning. (thesis, 1998)

Abstract:

The statistical methods derived and described in this thesis provide new ways to elucidate the structural properties of text and other symbolic sequences. Generically, these methods allow detection of a difference in the frequency of a single feature, the detection of a difference between the frequencies of an ensemble of features and the attribution of the source of a text. These three abstract tasks suffice to solve problems in a wide variety of settings. Furthermore, the techniques described in this thesis can be extended to provide a wide range of additional tests beyond the ones described here.

A variety of applications for these methods are examined in detail. These applications are drawn from the area of text analysis and genetic sequence analysis. The textually oriented tasks include finding interesting collocations and co-occurrent phrases, language identification, and information retrieval. The biologically oriented tasks include species identification and the discovery of previously unreported long range structure in genes. In the applications reported here where direct comparison is possible, the performance of these new methods substantially exceeds the state of the art.

Overall, the methods described here provide new and effective ways to analyse text and other symbolic sequences. Their particular strength is that they deal well with situations where relatively little data are available. Since these methods are abstract in nature, they can be applied in novel situations with relative ease.

Recently posted but dating from 1998.

Older materials are interesting because the careers of their authors can be tracked, say at DBLP: Ted Dunning.

Or it can lead you to check an author in CiteSeer:

Accurate Methods for the Statistics of Surprise and Coincidence (1993)

Abstract:

Much work has been done on the statistical analysis of text. In some cases reported in the literature, inappropriate statistical methods have been used, and statistical significance of results have not been addressed. In particular, asymptotic normality assumptions have often been used unjustifiably, leading to flawed results. This assumption of normal distribution limits the ability to analyze rare events. Unfortunately rare events do make up a large fraction of real text. However, more applicable methods based on likelihood ratio tests are available that yield good results with relatively small samples. These tests can be implemented efficiently, and have been used for the detection of composite terms and for the determination of domain-specific terms. In some cases, these measures perform much better than the methods previously used. In cases where traditional contingency table methods work well, the likelihood ratio tests described here are nearly identical. This paper describes the basis of a measure based on likelihood ratios that can be applied to the analysis of text.

Which has over 600 citations, only one of which is from the author. (I could comment about a well known self-citing ontologist but I won’t.)
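
The workhorse of that 1993 paper is a log-likelihood ratio (G2) computed over a 2x2 contingency table, which behaves far better than normality-based statistics on rare events. A minimal Python sketch, with invented counts for a toy bigram:

```python
import math

def g2(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table.

    k11 = count(word1 followed by word2), k12 = count(word1, not word2),
    k21 = count(word2, not word1),        k22 = count(neither).
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:
                expected = rows[i] * cols[j] / total
                g += o * math.log(o / expected)
    return 2 * g

# Invented counts for the bigram "strong tea" in a toy corpus.
print(f"G2 = {g2(k11=30, k12=1000, k21=200, k22=100000):.1f}")
```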

The observations in the thesis about “large” data sets are dated but it merits your attention as fundamental work in the field of textual analysis.

As a bonus, it is quite well written and makes an enjoyable read.

July 9, 2012

TUSTEP is open source – with TXSTEP providing a new XML interface

Filed under: Text Analytics,Text Mining,TUSTEP/TXSTEP,XML — Patrick Durusau @ 9:15 am

TUSTEP is open source – with TXSTEP providing a new XML interface

I won’t recount how many years ago I first received email from Wilhelm Ott about TUSTEP. 😉

From the TUSTEP homepage:

TUSTEP is a professional toolbox for the scholarly processing of textual data (including those in non-Latin scripts) with a strong focus on humanities applications. It contains modules for all stages of scholarly text data processing, starting from data capture and including information retrieval, text collation, text analysis, sorting and ordering, rule-based text manipulation, and output in electronic or conventional form (including typesetting in professional quality).

Since the title “big data” is taken, perhaps we should take “complex data” for texts.

If you are exploring textual data in any detail or with XML, you should take a look at the TUSTEP project and its new XML interface, TXSTEP.

Or consider contributing to the project as well.

Wilhelm Ott writes (in part):

We are pleased to announce that, starting with the release 2012, TUSTEP is available as open source software. It is distributed under the Revised BSD Licence and can be downloaded from www.tustep.org.

TUSTEP has a long tradition as a highly flexible, reliable, efficient suite of programs for humanities computing. It started in the early 70ies as a tool for supporting humanities projects at the University of Tübingen, relying on own funds of the University. From 1985 to 1989, a substantial grant from the Land Baden-Württemberg officially opened its distribution beyond the limits of the University and started its success as a highly appreciated research tool for many projects at about a hundred universities and academic institutions in the German speaking part of the world, represented since 1993 in the International TUSTEP User Group (ITUG). Reports on important projects relying on TUSTEP and a list of publications (including lexicographic works and critical editions) can be found on the tustep webpage.

TXSTEP, presently being developed in cooperation with Stuttgart Media University, offers a new XML-based user interface to the TUSTEP programs. Compared to the original TUSTEP commands, we see important advantages:

  • it will offer an up-to-date established syntax for scripting;
  • it will show the typical benefits of working with an XML editor, like content completion, highlighting, showing annotations, and, of course, verifying the code;
  • it will offer – to a certain degree – a self teaching environment by commenting on the scope of every step;
  • it will help to avoid many syntactical errors, even compared to the original TUSTEP scripting environment;
  • the syntax is in English, providing a more widespread usability than TUSTEP’s German command language.

At the TEI conference last year in Würzburg, we presented a first prototype to an international audience. We look forward to DH2012 in Hamburg next week where, during the Poster Session, a more enhanced version which already contains most of TUSTEP’s functions will be presented. A demonstration of TXSTEP’s functionality will include tasks which cannot easily be performed by existing XML tools.

After the demo, you are invited to download a test version of TXSTEP to play with, to comment on it and to help make it a great and flexible tool for everyday – and complex – questions.

OK, I confess a fascination with complex textual analysis.

July 7, 2012

On the origin of long-range correlations in texts

Filed under: Natural Language Processing,Text Analytics — Patrick Durusau @ 2:53 pm

On the origin of long-range correlations in texts by Eduardo G. Altmann, Giampaolo Cristadoro, and Mirko Degli Esposti.

Abstract:

The complexity of human interactions with social and natural phenomena is mirrored in the way we describe our experiences through natural language. In order to retain and convey such a high dimensional information, the statistical properties of our linguistic output has to be highly correlated in time. An example are the robust observations, still largely not understood, of correlations on arbitrary long scales in literary texts. In this paper we explain how long-range correlations flow from highly structured linguistic levels down to the building blocks of a text (words, letters, etc..). By combining calculations and data analysis we show that correlations take form of a bursty sequence of events once we approach the semantically relevant topics of the text. The mechanisms we identify are fairly general and can be equally applied to other hierarchical settings.

Another area of arXiv.org, Physics > Data Analysis, Statistics and Probability, to monitor. 😉

The authors used ten (10) novels from Project Gutenberg:

  • Alice’s Adventures in Wonderland
  • The Adventures of Tom Sawyer
  • Pride and Prejudice
  • Life on the Mississippi
  • The Jungle
  • The Voyage of the Beagle
  • Moby Dick; or The Whale
  • Ulysses
  • Don Quixote
  • War and Peace

Interesting research that will take a while to digest, but I have to wonder: why these ten (10) novels?

Or perhaps better, in an age of “big data,” why only ten (10)?

Why not the entire corpus of Project Gutenberg?

Or perhaps the texts of Wikipedia in its multitude of languages?

Reasoning that if the results represent an insight about natural language, they should be applicable beyond English. Yes?

If this is your area, comments and suggestions would be most welcome.
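
If you want to poke at the phenomenon yourself, here is a minimal Python sketch: turn a novel into a binary keyword-occurrence series and look at its autocorrelation over increasing lags. The filename is hypothetical (any Project Gutenberg plain-text file will do), and the paper's analysis of correlations across linguistic levels goes well beyond this.

```python
import numpy as np

# Hypothetical local copy of a Project Gutenberg plain-text novel.
with open("moby_dick.txt", encoding="utf-8") as f:
    words = f.read().lower().split()

# Binary series: 1 where the keyword occurs, 0 elsewhere, then mean-centered.
keyword = "whale"
series = np.array([1.0 if w.strip('.,;:!?"') == keyword else 0.0 for w in words])
series -= series.mean()

def autocorr(x, lag):
    """Normalized autocorrelation of the series at a given lag."""
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

for lag in (1, 10, 100, 1000):
    print(f"lag {lag:>5}: {autocorr(series, lag):+.4f}")
```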

June 29, 2012

National Centre for Text Mining (NaCTeM)

Filed under: Text Analytics,Text Extraction,Text Feature Extraction,Text Mining — Patrick Durusau @ 3:15 pm

National Centre for Text Mining (NaCTeM)

From the webpage:

The National Centre for Text Mining (NaCTeM) is the first publicly-funded text mining centre in the world. We provide text mining services in response to the requirements of the UK academic community. NaCTeM is operated by the University of Manchester with close collaboration with the University of Tokyo.

On our website, you can find pointers to sources of information about text mining such as links to

  • text mining services provided by NaCTeM
  • software tools, both those developed by the NaCTeM team and by other text mining groups
  • seminars, general events, conferences and workshops
  • tutorials and demonstrations
  • text mining publications

Let us know if you would like to include any of the above in our website.

This is a real treasure trove of software, resources and other materials.

I will be working “finds” at this site into reports for quite some time.

May 29, 2012

ProseVis

Filed under: Data Mining,Graphics,Text Analytics,Text Mining,Visualization — Patrick Durusau @ 10:12 am

ProseVis

A tool for exploring texts on a non-word basis.

Or in the words of the project:

ProseVis is a visualization tool developed as part of a use case supported by the Andrew W. Mellon Foundation through a grant titled “SEASR Services,” in which we seek to identify other features than the “word” to analyze texts. These features comprise sound including parts-of-speech, accent, phoneme, stress, tone, break index.

ProseVis allows a reader to map the features extracted from OpenMary (http://mary.dfki.de/) Text-to-speech System and predictive classification data to the “original” text. We developed this project with the ultimate goal of facilitating a reader’s ability to analyze and disseminate the results in human readable form. Research has shown that mapping the data to the text in its original form allows for the kind of human reading that literary scholars engage: words in the context of phrases, sentences, lines, stanzas, and paragraphs (Clement 2008). Recreating the context of the page not only allows for the simultaneous consideration of multiple representations of knowledge or readings (since every reader’s perspective on the context will be different) but it also allows for a more transparent view of the underlying data. If a human can see the data (the syllables, the sounds, the parts-of-speech) within the context in which they are used to reading, with the data mapped back onto the full text, then the reader is empowered within this familiar context to read what might otherwise be an unfamiliar tabular representation of the text. For these reasons, we developed ProseVis as a reader interface to allow scholars to work with the data in a language or context in which we are used to saying things about the world.

Textual analysis tools are “smoking gun” detectors.

A CEO is unlikely to make inappropriate comments in a spreadsheet or data feed. Emails, on the other hand… 😉

Big or little data, the goal is to have the “right” data.

May 19, 2012

From the Bin Laden Letters: Reactions in the Islamist Blogosphere

Filed under: Intelligence,Text Analytics — Patrick Durusau @ 4:41 pm

From the Bin Laden Letters: Reactions in the Islamist Blogosphere

From the post:

Following our initial analysis of the Osama bin Laden letters released by the Combating Terrorism Center (CTC) at West Point, we’ll more closely examine interesting moments from the letters and size them up against what was publicly reported as happening in the world in order to gain a deeper perspective on what was known or unknown at the time.

There was a frenzy of summarization and highlight reel reporting in the wake of the Abbottabad documents being publicly released. Some focused on the idea that Osama bin Laden was ostracized, some pointed to the seeming obsession with image in the media, and others simply took a chance to jab at Joe Biden for the suggestions made about his lack of preparedness for the presidency.

What we’ll do in this post is take a different approach, and rather than focus on analyst viewpoints we’ll compare reactions to the Abbottabad documents from a unique source – Islamist discussion forums.

There we find rebukes over the veracity of the documents released, support for the efforts of operatives such as Faisal Shahzad, and a little interest in the Arab Spring.

Interesting visualizations as always.

The question I would ask as a consumer of such information services is: How do I integrate this analysis with in-house analysis tools?

Or perhaps better: How do I evaluate non-direct references to particular persons or places? That is, a person or place is implied but not named. What do I know about the basis for such an identification?

April 30, 2012

Text Analytics: Yesterday, Today and Tomorrow

Filed under: Marketing,Text Analytics — Patrick Durusau @ 3:17 pm

Text Analytics: Yesterday, Today and Tomorrow

Another Tony Russell-Rose post that I ran across over the weekend:

Here’s something I’ve been meaning to share for a while: the slides for a talk entitled “Text Analytics: Yesterday, Today and Tomorrow”, co-authored with colleagues Vladimir Zelevinsky and Michael Ferretti. In this we outline some of the key challenges in text analytics, describe some of Endeca’s current research in this area, examine the current state of the text analytics market and explore some of the prospects for the future.

I was amused to read on slide 40:

Solutions still not standardized

Users differ in their views of the world of texts, solutions, data, formats, data structures, and analysis.

Anyone offering a “standardized” solution is selling their view of the world.

As a user/potential customer, I am rather attached to my view of the world. You?

April 29, 2012

Prostitutes Appeal to Pope: Text Analytics applied to Search

Filed under: Ambiguity,Search Analytics,Searching,Text Analytics — Patrick Durusau @ 3:48 pm

Prostitutes Appeal to Pope: Text Analytics applied to Search by Tony Russell-Rose.

It is hard for me to visit Tony’s site and not come away with several posts he has written that I want to mention. Today was no different.

Here is a sampling of what Tony talks about in this post:

Consider the following newspaper headlines, all of which appeared unambiguous to the original writer:

  • DRUNK GETS NINE YEARS IN VIOLIN CASE
  • PROSTITUTES APPEAL TO POPE
  • STOLEN PAINTING FOUND BY TREE
  • RED TAPE HOLDS UP NEW BRIDGE
  • DEER KILL 300,000
  • RESIDENTS CAN DROP OFF TREES
  • INCLUDE CHILDREN WHEN BAKING COOKIES
  • MINERS REFUSE TO WORK AFTER DEATH

Although humorous, they illustrate much of the ambiguity in natural language, and just how much pragmatic and linguistic knowledge must be employed by NLP tools to function accurately.
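
To see how an off-the-shelf tool copes, you can run a few of the headlines through NLTK's stock part-of-speech tagger. A minimal sketch; the downloadable resource names vary a bit between NLTK versions.

```python
import nltk

# One-time model downloads; exact resource names depend on your NLTK version.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

headlines = [
    "Prostitutes appeal to Pope",
    "Red tape holds up new bridge",
    "Stolen painting found by tree",
]

for headline in headlines:
    print(nltk.pos_tag(nltk.word_tokenize(headline)))
```

The tags alone will not tell you which reading the tagger "meant"; the ambiguity lives in attachment and word sense, which is precisely the point.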

A very informative and highly amusing post.

What better way to start the week?

Enjoy!

Text Analytics Summit Europe – highlights and reflections

Filed under: Analytics,Natural Language Processing,Text Analytics — Patrick Durusau @ 2:01 pm

Text Analytics Summit Europe – highlights and reflections by Tony Russell-Rose.

Earlier this week I had the privilege of attending the Text Analytics Summit Europe at the Royal Garden Hotel in Kensington. Some of you may of course recognise this hotel as the base for Justin Bieber’s recent visit to London, but sadly (or is that fortunately?) he didn’t join us. Next time, maybe…

Ranking reasons to attend:

  • #1 Text Analytics Summit Europe – meet other attendees, presentations
  • #2 Kensington Gardens and Hyde Park (been there, it is more impressive than you can imagine)
  • #N +1 Justin Bieber being in London (or any other location)

I was disappointed by the lack of links to slides or videos of the presentations.

Tony’s post does have pointers to people and resources you may have missed.

Question: Do you think “text analytics” and “data mining” are different? If so, how?

April 17, 2012

Superfastmatch: A text comparison tool

Filed under: Duplicates,News,Text Analytics — Patrick Durusau @ 7:12 pm

Superfastmatch: A text comparison tool by Donovan Hide.

Slides on a Chrome extension that compares news stories for unique content.

Would be interesting to compare 24-hour news channels both to themselves and to others on the basis of duplicate content.

You could even produce a 15-minute highlights version of the news, delivering most of the non-duplicate content (omitting the commercials as well) for any 24-hour period.
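
Superfastmatch is engineered for scale, but the underlying idea, scoring two stories by the overlap of their hashed word windows, fits in a few lines of Python. The wire-copy snippets below are invented.

```python
def shingles(text: str, k: int = 5) -> set:
    """Set of hashed k-word windows ("shingles") for a document."""
    words = text.lower().split()
    return {hash(" ".join(words[i:i + k])) for i in range(len(words) - k + 1)}

def overlap(a: str, b: str, k: int = 5) -> float:
    """Jaccard overlap of shingle sets; values near 1.0 mean near-duplicate copy."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

story_1 = ("The central bank raised interest rates by a quarter point on Tuesday, "
           "citing persistent inflation across the services sector.")
story_2 = ("Citing persistent inflation across the services sector, the central bank "
           "raised interest rates by a quarter point on Tuesday, officials said.")

print(f"duplicate-content score: {overlap(story_1, story_2):.2f}")
```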

Until then, visit this project and see what you think.

March 22, 2012

Text Analytics in Telecommunications – Part 3

Filed under: Machine Learning,Text Analytics — Patrick Durusau @ 7:41 pm

Text Analytics in Telecommunications – Part 3 by Themos Kalafatis.

From the post:

It is well known that FaceBook contains a multitude of information that can be potentially analyzed. A FaceBook page contains several entries (Posts, Photos, Comments, etc) which in turn generate Likes. This data can be analyzed to better understand the behavior of consumers towards a Brand, Product or Service.

Let’s look at the analysis of the three FaceBook pages of MT:S, Telenor and VIP Mobile Telcos in Serbia as an example. The question that this analysis tries to answer is whether we can identify words and phrases that frequently appear in posts that generate any kind of reaction (a “Like”, or a Comment) vs words and topics that do not tend to generate reactions . If we are able to differentiate these words then we get an idea on what consumers tend to value more : If a post is of no value to us then we will not tend to Like it and/or comment it.

To perform this analysis we need a list of several thousands of posts (their text) and also the number of Likes and Comments that each post has received. If any post has generated a Like and/or a Comment then we flag that post as having generated a reaction. The next step is to feed that information to a machine learning algorithm to identify which words have discriminative power (=which words appear more frequently in posts that are liked and/or commented and also which words do not produce any reaction.)

It would be more helpful if the “machine learning algorithm” used in this case were identified, along with the data set in question.
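
One plausible approach, sketched in Python with invented posts and scikit-learn (not necessarily what Themos used): fit a bag-of-words classifier on a reacted/ignored label and read off the heaviest positive weights.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Invented posts and reaction flags (1 = drew a Like/Comment, 0 = no reaction).
posts = [
    "win free tickets to the concert this weekend",
    "new roaming tariffs take effect next month",
    "double data for all prepaid subscribers today",
    "scheduled network maintenance in the northern region",
    "vote for your favourite song and win a phone",
    "updated terms and conditions now available online",
]
reacted = [1, 0, 1, 0, 1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(posts)
model = LogisticRegression().fit(X, reacted)

# Words with the largest positive weights are the most reaction-prone.
weights = sorted(zip(model.coef_[0], vectorizer.get_feature_names_out()), reverse=True)
print("most discriminative words:", [word for _, word in weights[:5]])
```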

I suppose we will learn more after the presentation at the European Text Analytics Summit, although we would like to learn more sooner! 😉

March 21, 2012

Text Analytics for Telecommunications – Part 2

Filed under: Telecommunications,Text Analytics — Patrick Durusau @ 3:30 pm

Text Analytics for Telecommunications – Part 2 by Themos Kalafatis.

From the post:

In the previous post we have seen the problems that a highly inflected language creates and also a very basic example of Competitive Intelligence. The Case Study that i will present in the forthcoming European Text Analytics Summit is about the analysis of Telco Subscriber conversations on FaceBook and Twitter that involve Telenor, MT:S and VIP Mobile located in Serbia.

It is time to see what Topics are found in subscriber conversations. Each Telco has its own FaceBook page which contains posts and comments generated by page curators and subscribers. Each post and comment also generates “Likes” and “Shares”. Several types of analysis can be performed to find out :

  1. What kind of Topics are discussed in posts and comments of each Telco FaceBook page?
  2. What is the sentiment?
  3. Which posts (and comments) tend to be liked and shared (=generate Interest and reactions)?

Themos continues his series on text analytics for Telcos.

Here he moves into Facebook comments and analysis of the same.

March 20, 2012

Text Analytics for Telecommunications – Part 1

Filed under: Telecommunications,Text Analytics,Text Extraction — Patrick Durusau @ 3:54 pm

Text Analytics for Telecommunications – Part 1 by Themos Kalafatis.

From the post:

As discussed in the previous post, performing Text Analytics for a language for which no tools exist is not an easy task. The Case Study which i will present in the European Text Analytics Summit is about analyzing and understanding thousands of Non-English FaceBook posts and Tweets for Telco Brands and their Topics, leading to what is known as Competitive Intelligence.

The Telcos used for the Case Study are Telenor, MT:S and VIP Mobile which are located in Serbia. The analysis aims to identify the perception of Customers for each of the three Companies mentioned and understand the Positive and Negative elements of each Telco as this is captured from the Voice of the Customers – Subscribers.

The start of a very useful series on non-English text analysis. The sort that is in demand by agencies of various governments.

Come to think of it, text analysis of English/non-English government information is probably in demand by non-government groups. 😉

December 24, 2011

RTextTools v1.3.2 Released

Filed under: R,Text Analytics — Patrick Durusau @ 4:42 pm

RTextTools v1.3.2 Released

From the post:

RTextTools was updated to version 1.3.2 today, adding support for n-gram token analysis, a faster maximum entropy algorithm, and numerous bug fixes. The source code has been synced with the Google Code repository, so please feel free to check out a copy and add your own features!

With the core feature set of RTextTools finalized, the next major release (v1.4.0) will focus on optimizing existing code and refining the API for the package. Furthermore, my goal is to add compressed sparse matrix support for all nine algorithms to reduce memory consumption; currently maximum entropy, support vector machines, and glmnet support compressed sparse matrices.

If you are doing text analysis to extract subjects and their properties or have an interest in contributing to a project on text analysis, this may be your chance.

December 21, 2011

Reusable TokenStreams

Filed under: Lucene,Text Analytics — Patrick Durusau @ 7:21 pm

Reusable TokenStreams by Chris Male.

Abstract:

This white paper covers how Lucene’s text analysis system works today and explores the system and provides an understanding of what a TokenStream is, what the difference between Analyzers, TokenFilters and Tokenizers are, and how reuse impacts the design and implementation of each of these components.

Useful treatment of Lucene’s text analysis features. Those are still developing and more changes are promised (but left rather vague) for the future.

One feature covered that is of particular interest is the ability to associate geographic location data with terms deemed to represent locations.

Occurs to me that such a feature could also be used to annotate terms during text analysis to associate subject identifiers with those terms.

An application doesn’t have to “understand” that terms have different meanings so long as it can distinguish one from another based on annotations. (Or map them together despite different identifiers.)
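
A toy Python sketch of that annotation idea. A real implementation would be a Lucene TokenFilter written in Java, and the identifiers below are invented.

```python
# Invented gazetteer mapping surface forms to subject identifiers; a Lucene
# TokenFilter would attach these as token attributes during analysis.
SUBJECT_IDS = {
    "paris": ["http://example.org/id/paris-france",
              "http://example.org/id/paris-texas"],
    "lucene": ["http://example.org/id/apache-lucene"],
}

def analyze(text: str):
    """Yield (token, subject-identifier list) pairs; unknown tokens get none."""
    for token in text.lower().split():
        token = token.strip(".,")
        yield token, SUBJECT_IDS.get(token, [])

for token, ids in analyze("Paris is mentioned in the Lucene index."):
    print(token, ids)
```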

December 17, 2011

Content Analysis

Filed under: Content Analysis,Law - Sources,Legal Informatics,Text Analytics — Patrick Durusau @ 6:33 am

Content Analysis by Michael Heise.

From the post:

Dan Katz (MSU) let me know about a beta release of a new website, Legal Language Explorer, that will likely interest anyone who does content analysis as well as those looking for a neat (and, according to Jason Mazzone, addictive) toy to burn some time. The site, according to Dan, allows users: “the chance [free of charge] to search the history of the United States Supreme Court (1791-2005) for any phrase and get a frequency plot and the full text case results for that phrase.” Dan also reports that the developers hope to expand coverage beyond Supreme Court decisions in the future.

The site needs a For Amusement Only sticker. Legal language changes over time and probably no place more so than in Supreme Court decisions.

It was a standing joke in law school that the bar association sponsored the “Avoid Probate” sort of books. If you really want to incur legal fees, just try self-help. Same is true for this site. Use it to argue with your friends, settle bets during football games, etc. Don’t rely on it during nighttime, roadside encounters with folks carrying weapons and radios to summon help (police).

IBM Redbooks Reveals Content Analytics

Filed under: Analytics,Data Mining,Entity Extraction,Text Analytics — Patrick Durusau @ 6:31 am

IBM Redbooks Reveals Content Analytics

From Beyond Search:

IBM Redbooks has put out some juicy reading for the azure chip consultants wanting to get smart quickly with IBM Content Analytics Version 2.2: Discovering Actionable Insight from Your Content. The sixteen chapters of this book take the reader from an overview of IBM content analytics, through understanding the details, to troubleshooting tips. The above link provides an abstract of the book, as well as links to download it as a PDF, view in HTML/Java, or order a hardcopy.

Abstract:

With IBM® Content Analytics Version 2.2, you can unlock the value of unstructured content and gain new business insight. IBM Content Analytics Version 2.2 provides a robust interface for exploratory analytics of unstructured content. It empowers a new class of analytical applications that use this content. Through content analysis, IBM Content Analytics provides enterprises with tools to better identify new revenue opportunities, improve customer satisfaction, and provide early problem detection.

To help you achieve the most from your unstructured content, this IBM Redbooks® publication provides in-depth information about Content Analytics. This book examines the power and capabilities of Content Analytics, explores how it works, and explains how to design, prepare, install, configure, and use it to discover actionable business insights.

This book explains how to use the automatic text classification capability, from the IBM Classification Module, with Content Analytics. It explains how to use the LanguageWare® Resource Workbench to create custom annotators. It also explains how to work with the IBM Content Assessment offering to timely decommission obsolete and unnecessary content while preserving and using content that has business value.

The target audience of this book is decision makers, business users, and IT architects and specialists who want to understand and use their enterprise content to improve and enhance their business operations. It is also intended as a technical guide for use with the online information center to configure and perform content analysis with Content Analytics.

The cover article points out the Redbooks have an IBM slant, which isn’t surprising. When you need big iron for an enterprise project, that IBM is one of a handful of possible players isn’t surprising either.

December 4, 2011

FACTA

Filed under: Associations,Bioinformatics,Biomedical,Concept Detection,Text Analytics — Patrick Durusau @ 8:16 pm

FACTA – Finding Associated Concepts with Text Analysis

From the Quick Start Guide:

FACTA is a simple text mining tool to help discover associations between biomedical concepts mentioned in MEDLINE articles. You can navigate these associations and their corresponding articles in a highly interactive manner. The system accepts an arbitrary query term and displays relevant concepts on the spot. A broad range of concepts are retrieved by the use of large-scale biomedical dictionaries containing the names of important concepts such as genes, proteins, diseases, and chemical compounds.

A very good example of an exploration tool that isn’t overly complex to use.
