Archive for the ‘Text Analytics’ Category
Wednesday, April 10th, 2013
Apache cTAKES
From the webpage:
Apache clinical Text Analysis and Knowledge Extraction System (cTAKES) is an open-source natural language processing system for information extraction from electronic medical record clinical free-text. It processes clinical notes, identifying types of clinical named entities from various dictionaries including the Unified Medical Language System (UMLS) – medications, diseases/disorders, signs/symptoms, anatomical sites and procedures. Each named entity has attributes for the text span, the ontology mapping code, subject (patient, family member, etc.) and context (negated/not negated, conditional, generic, degree of certainty). Some of the attributes are expressed as relations, for example the location of a clinical condition (locationOf relation) or the severity of a clinical condition (degreeOf relation).
Apache cTAKES was built using the Apache UIMA Unstructured Information Management Architecture engineering framework and Apache OpenNLP natural language processing toolkit. Its components are specifically trained for the clinical domain out of diverse manually annotated datasets, and create rich linguistic and semantic annotations that can be utilized by clinical decision support systems and clinical research. cTAKES has been used in a variety of use cases in the domain of biomedicine such as phenotype discovery, translational science, pharmacogenomics and pharmacogenetics.
Apache cTAKES employs a number of rule-based and machine learning methods. Apache cTAKES components include:
- Sentence boundary detection
- Tokenization (rule-based)
- Morphologic normalization
- POS tagging
- Shallow parsing
- Named Entity Recognition
- Dictionary mapping
- Semantic typing is based on these UMLS semantic types: diseases/disorders, signs/symptoms, anatomical sites, procedures, medications
- Assertion module
- Dependency parser
- Constituency parser
- Semantic Role Labeler
- Coreference resolver
- Relation extractor
- Drug Profile module
- Smoking status classifier
The goal of cTAKES is to be a world-class natural language processing system in the healthcare domain. cTAKES can be used in a great variety of retrievals and use cases. It is intended to be modular and expandable at the information model and method level.
The cTAKES community is committed to best practices and R&D (research and development) by using cutting edge technologies and novel research. The idea is to quickly translate the best performing methods into cTAKES code.
Processing a text with cTAKES is a process of adding semantic information to the text.
As you can imagine, the better the semantics that are added, the better searching and other functions become.
In order to make added semantic information interoperable, well, that’s a topic map question.
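To make "adding semantic information to the text" concrete, here is a toy sketch of dictionary mapping plus a crude negation check. It is not cTAKES code and does not use the cTAKES or UIMA APIs; the dictionary entries, the codes (`X001` and so on) and the negation cue list are all invented for illustration.

```python
import re

# Hypothetical mini-dictionary: surface term -> (semantic type, made-up code).
# Real systems map to UMLS concepts; these codes are placeholders.
DICTIONARY = {
    "chest pain": ("sign/symptom", "X001"),
    "pneumonia": ("disease/disorder", "X002"),
    "aspirin": ("medication", "X003"),
}
NEGATION_CUES = ("no", "denies", "without")  # toy cue list, an assumption


def annotate(text):
    """Return one annotation dict per dictionary term found in text."""
    lowered = text.lower()
    annotations = []
    for term, (sem_type, code) in DICTIONARY.items():
        for match in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            # Check the three word tokens before the match for a negation cue.
            preceding = re.findall(r"[a-z]+", lowered[:match.start()])[-3:]
            annotations.append({
                "span": (match.start(), match.end()),
                "text": text[match.start():match.end()],
                "type": sem_type,
                "code": code,
                "negated": any(tok in NEGATION_CUES for tok in preceding),
            })
    return sorted(annotations, key=lambda a: a["span"])
```

Each annotation carries a text span, a type, a code and a negation flag, which loosely mirrors the attributes listed in the cTAKES description. The three-token window ignores sentence boundaries, so treat it as a sketch only.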
I first saw this in a tweet by Tim O’Reilly.
Posted in Knowledge Discovery, Medical Informatics, Natural Language Processing, Text Analytics | No Comments »
Monday, March 18th, 2013
A Quantitative Literary History of 2,958 Nineteenth-Century British Novels: The Semantic Cohort Method by Ryan Heuser and Long Le-Khac.
From the introduction:
The nineteenth century in Britain saw tumultuous changes that reshaped the fabric of society and altered the course of modernization. It also saw the rise of the novel to the height of its cultural power as the most important literary form of the period. This paper reports on a long-term experiment in tracing such macroscopic changes in the novel during this crucial period. Specifically, we present findings on two interrelated transformations in novelistic language that reveal a systemic concretization in language and fundamental change in the social spaces of the novel. We show how these shifts have consequences for setting, characterization, and narration as well as implications for the responsiveness of the novel to the dramatic changes in British society.
This paper has a second strand as well. This project was simultaneously an experiment in developing quantitative and computational methods for tracing changes in literary language. We wanted to see how far quantifiable features such as word usage could be pushed toward the investigation of literary history. Could we leverage quantitative methods in ways that respect the nuance and complexity we value in the humanities? To this end, we present a second set of results, the techniques and methodological lessons gained in the course of designing and running this project.
This branch of the digital humanities, the macroscopic study of cultural history, is a field that is still constructing itself. The right methods and tools are not yet certain, which makes for the excitement and difficulty of the research. We found that such decisions about process cannot be made a priori, but emerge in the messy and non-linear process of working through the research, solving problems as they arise. From this comes the odd, narrative form of this paper, which aims to present the twists and turns of this process of literary and methodological insight. We have divided the paper into two major parts, the development of the methodology (Sections 1 through 3) and the story of our results (Sections 4 and 5). In actuality, these two processes occurred simultaneously; pursuing our literary-historical questions necessitated developing new methodologies. But for the sake of clarity, we present them as separate though intimately related strands.
If this sounds far afield from mining tweets, emails, corporate documents or government archives, can you articulate the difference?
Or do we reflexively treat some genres of texts as “different?”
How useful you will find some of the techniques outlined will depend on the purpose of your analysis.
If you are only doing key-word searching, this isn’t likely to be helpful.
If on the other hand, you are attempting more sophisticated analysis, read on!
I first saw this in Nat Torkington’s Four Short Links: 18 March 2013.
Posted in Literature, Text Analytics, Text Mining | No Comments »
Wednesday, February 27th, 2013
An Interactive Analysis of Tolkien’s Works by Emil Johansson.
Description:
Being passionate about both Tolkien and data visualization creating an interactive analysis of Tolkien’s books seemed like a wonderful idea. To the left you will be able to explore character mentions and keyword frequency as well as sentiment analysis of the Silmarillion, the Hobbit and the Lord of the Rings. Information on editions of the books and methods used can be found in the about section.
There you will find:
WORD COUNT AND DENSITY
CHARACTER MENTIONS
KEYWORD FREQUENCY
COMMON WORDS
SENTIMENT ANALYSIS
CHARACTER CO-OCCURRENCE
CHAPTER LENGTHS
WORD APPEARANCE
POSTERS
Truly remarkable analysis and visualization!
I suspect users of this portal don’t wonder so much about “how” it is done, but concentrate on the benefits it brings.
Does that sound like a marketing idea for topic maps?
I first saw this in the DashingD3js.com Weekly Newsletter.
Posted in Graphics, Literature, Text Analytics, Text Mining, Visualization | No Comments »
Sunday, February 3rd, 2013
Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts by Justin Grimmer and Brandon M. Stewart.
Abstract:
Politics and political conflict often occur in the written and spoken word. Scholars have long recognized this, but the massive costs of analyzing even moderately sized collections of texts have hindered their use in political science research. Here lies the promise of automated text analysis: it substantially reduces the costs of analyzing large collections of text. We provide a guide to this exciting new area of research and show how, in many instances, the methods have already obtained part of their promise. But there are pitfalls to using automated methods—they are no substitute for careful thought and close reading and require extensive and problem-specific validation. We survey a wide range of new methods, provide guidance on how to validate the output of the models, and clarify misconceptions and errors in the literature. To conclude, we argue that for automated text methods to become a standard tool for political scientists, methodologists must contribute new methods and new methods of validation.
As a former political science major, I had to stop to read this article.
A wide-ranging survey of an “exciting new area of research,” but I remember content/text analysis from my undergraduate days, north of forty years ago now.
True, some of the measures are new, along with better visualization techniques.
On the other hand, many of the problems of textual analysis now were the problems in textual analysis then (and before).
Highly recommended as a survey of current techniques.
A history of the “problems” of textual analysis and their resistance to various techniques will have to await another day.
Posted in Data Analysis, Text Analytics, Text Mining, Texts | No Comments »
Sunday, January 20th, 2013
silenc: Removing the silent letters from a body of text by Nathan Yau.
From the post:
During a two-week visualization course, Momo Miyazaki, Manas Karambelkar, and Kenneth Aleksander Robertsen imagined what a body of text would be without the silent letters in silenc.
Nathan suggests it isn’t fancy on the analysis side but the views are interesting.
True enough that removing silent letters (once mapped) isn’t difficult, but the results of the technique may be more than just visually interesting.
Usage patterns of words with silent letters would be an interesting question.
Or extending the technique to remove all adjectives from a text (that would shorten ad copy).
“Seeing” text or data from a different or unexpected perspective can lead to new insights. Some useful, some less so.
But it is the job of analysis to sort them out.
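As a rough sketch of why removing silent letters is easy "once mapped": given a lookup from words to the positions of their silent letters, the removal itself is a few lines. The mapping below is a tiny hand-made assumption, not the data the silenc project used.

```python
import re

# Hypothetical hand-built mapping: word -> indices of its silent letters.
SILENT = {
    "knight": (0, 3, 4),  # k, g, h
    "write": (0, 4),      # w, final e
    "doubt": (3,),        # b
    "island": (1,),       # s
}


def strip_silent(text):
    """Drop mapped silent letters from every known word, keep the rest as-is."""
    def fix(match):
        word = match.group(0)
        silent = SILENT.get(word.lower(), ())
        return "".join(ch for i, ch in enumerate(word) if i not in silent)

    return re.sub(r"[A-Za-z]+", fix, text)
```

For example, `strip_silent("The Knight had no doubt")` returns `"The nit had no dout"`. The hard part is building the mapping, not applying it.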
Posted in Graphics, Text Analytics, Text Mining, Texts, Visualization | No Comments »
Sunday, January 13th, 2013
Taming Text is released! by Mike McCandless.
From the post:
There’s a new exciting book just published from Manning, with the catchy title Taming Text, by Grant S. Ingersoll (fellow Apache Lucene committer), Thomas S. Morton, and Andrew L. Farris.
I enjoyed the (e-)book: it does a good job covering a truly immense topic that could easily have taken several books. Text processing has become vital for businesses to remain competitive in this digital age, with the amount of online unstructured content growing exponentially with time. Yet, text is also a messy and therefore challenging science: the complexities and nuances of human language don’t follow a few simple, easily codified rules and are still not fully understood today.
The book describes search techniques, including tokenization, indexing, suggest and spell correction. It also covers fuzzy string matching, named entity extraction (people, places, things), clustering, classification, tagging, and a question answering system (think Jeopardy). These topics are challenging!
N-gram processing (both character and word ngrams) is featured prominently, which makes sense as it is a surprisingly effective technique for a number of applications. The book includes helpful real-world code samples showing how to process text using modern open-source tools including OpenNLP, Tika, Lucene, Solr and Mahout.
You can see:
Table of Contents.
Sample chapter 1
Sample chapter 8
Source code (98 MB)
Or, you can do like I did, grab the source code and order the eBook (PDF) version of Taming Text.
More comments to follow!
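For readers new to the n-gram processing mentioned in the review, here is a minimal sketch of character and word n-grams. It is plain Python for illustration, not code from the book, which works with tools such as Lucene and OpenNLP.

```python
def char_ngrams(text, n):
    """All contiguous character sequences of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """All contiguous word sequences of length n, split on whitespace."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
```

`char_ngrams("text", 2)` gives `["te", "ex", "xt"]`, and `word_ngrams("taming text is fun", 2)` gives the three adjacent word pairs. Both return an empty list when the input is shorter than `n`.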
Posted in Text Analytics, Text Mining | No Comments »
Wednesday, December 12th, 2012
UC Irvine Extension Announces Winter Predictive Analytics and Info System Courses
From the post:
Predictive Analytics Certificate Program:
This program is designed for professionals who are using or wish to use Predictive Analytics to optimize business performance at a variety of levels. UC Irvine Extension is offering the following webinar and two courses during winter quarter:
Predictive Analytics Special Topic Webinar: Text Analytics & Text Mining (Jan. 15, 11:30 a.m. to 12:30 p.m., PST) - This free webinar will provide participants with the introductory concepts of text analytics and text mining that are used to recognize how stored, unstructured data represents an extremely valuable source of business information.
Course: Effective Data Preparation (Jan. 7 to Feb. 24) - This online course will address how to extract stored data elements, transform their formats, and derive new relationships among them, in order to produce a dataset suitable for analytical modeling. Course instructor Dr. Robert Nisbet, chief scientist at Smogfarm, which studies crowd psychology, will provide attendees with the skills to produce a fully processed data set compatible for building powerful predictive models.
Course: Text Analytics & Text Mining (Jan. 28 to March 24) - This new online course instructed by Dr. Gary Miner, author of Handbook of Statistical Analysis & Data Mining Applications and Practical Text Mining, will focus on basic concepts of textual information including tokenization and part-of-speech tagging. The course will expose participants to practical techniques for text extraction and text mining, document clustering and classification, information retrieval, and the enhancement of structured data.
Just so you know, the webinar is free but Effective Data Preparation and Text Analytics & Text Mining are $695.00 each.
I am always made more curious by the omission of the most obvious questions from an FAQ, or by the placement of that information in very non-prominent places.
I suspect the courses are well worth the price, but why not be up front with the charges?
Posted in Predictive Analytics, Text Analytics | No Comments »
Friday, August 24th, 2012
Going Beyond the Numbers: How to Incorporate Textual Data into the Analytics Program by Cindi Thompson.
From the post:
Leveraging the value of text-based data by applying text analytics can help companies gain competitive advantage and an improved bottom line, yet many companies are still letting their document repositories and external sources of unstructured information lie fallow.
That’s no surprise, since the application of analytics techniques to textual data and other unstructured content is challenging and requires a relatively unfamiliar skill set. Yet applying business and industry knowledge and starting small can yield satisfying results.
Capturing More Value from Data with Text Analytics
There’s more to data than the numerical organizational data generated by transactional and business intelligence systems. Although the statistics are difficult to pin down, it’s safe to say that the majority of business information for a typical company is stored in documents and other unstructured data sources, not in structured databases. In addition, there is a huge amount of business-relevant information in documents and text that reside outside the enterprise. To ignore the information hidden in text is to risk missing opportunities, including the chance to:
- Capture early signals of customer discontent.
- Quickly target product deficiencies.
- Detect fraud.
- Route documents to those who can effectively leverage them.
- Comply with regulations such as XBRL coding or redaction of personally identifiable information.
- Better understand the events, people, places and dates associated with a large set of numerical data.
- Track competitive intelligence.
…
To be sure, textual data is messy and poses difficulties.
But, as Cindi points out, there are golden benefits in those hills of textual data.
Posted in Analytics, Text Analytics, Text Mining | No Comments »
Saturday, July 14th, 2012
Finding Structure in Text, Genome and Other Symbolic Sequences by Ted Dunning. (thesis, 1998)
Abstract:
The statistical methods derived and described in this thesis provide new ways to elucidate the structural properties of text and other symbolic sequences. Generically, these methods allow detection of a difference in the frequency of a single feature, the detection of a difference between the frequencies of an ensemble of features and the attribution of the source of a text. These three abstract tasks suffice to solve problems in a wide variety of settings. Furthermore, the techniques described in this thesis can be extended to provide a wide range of additional tests beyond the ones described here.
A variety of applications for these methods are examined in detail. These applications are drawn from the area of text analysis and genetic sequence analysis. The textually oriented tasks include finding interesting collocations and cooccurrent phrases, language identification, and information retrieval. The biologically oriented tasks include species identification and the discovery of previously unreported long range structure in genes. In the applications reported here where direct comparison is possible, the performance of these new methods substantially exceeds the state of the art.
Overall, the methods described here provide new and effective ways to analyse text and other symbolic sequences. Their particular strength is that they deal well with situations where relatively little data are available. Since these methods are abstract in nature, they can be applied in novel situations with relative ease.
Recently posted but dating from 1998.
Older materials are interesting because the careers of their authors can be tracked, say at DBLP Ted Dunning.
Or it can lead you to check an author in Citeseer:
Accurate Methods for the Statistics of Surprise and Coincidence (1993)
Abstract:
Much work has been done on the statistical analysis of text. In some cases reported in the literature, inappropriate statistical methods have been used, and statistical significance of results have not been addressed. In particular, asymptotic normality assumptions have often been used unjustifiably, leading to flawed results. This assumption of normal distribution limits the ability to analyze rare events. Unfortunately rare events do make up a large fraction of real text. However, more applicable methods based on likelihood ratio tests are available that yield good results with relatively small samples. These tests can be implemented efficiently, and have been used for the detection of composite terms and for the determination of domain-specific terms. In some cases, these measures perform much better than the methods previously used. In cases where traditional contingency table methods work well, the likelihood ratio tests described here are nearly identical. This paper describes the basis of a measure based on likelihood ratios that can be applied to the analysis of text.
Which has over 600 citations, only one of which is from the author. (I could comment about a well-known self-citing ontologist but I won’t.)
The observations in the thesis about “large” data sets are dated but it merits your attention as fundamental work in the field of textual analysis.
As a bonus, it is quite well written and makes an enjoyable read.
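The likelihood ratio test from the 1993 abstract is usually computed as the G² statistic over a 2x2 contingency table. The sketch below uses the standard textbook formulation, G² = 2 Σ k·ln(k / expected); the function name and argument layout are my own, and the paper's notation may differ.

```python
import math


def g2(k11, k12, k21, k22):
    """Log-likelihood ratio statistic (G^2) for a 2x2 contingency table.

    For a word pair "A B": k11 = A followed by B, k12 = A followed by not-B,
    k21 = not-A followed by B, k22 = neither.
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    score = 0.0
    for i in range(2):
        for j in range(2):
            k = observed[i][j]
            if k > 0:  # a zero cell contributes nothing (0 * ln 0 is taken as 0)
                expected = rows[i] * cols[j] / total
                score += k * math.log(k / expected)
    return 2.0 * score
```

A table whose cells match the independence expectation scores zero, and the score grows as the observed counts depart from it. Unlike tests that assume normality, it stays usable when some counts are small, which is the point the abstract makes about rare events.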
Posted in Genome, Statistics, Symbol, Text Analytics, Text Corpus, Text Mining | No Comments »
Monday, July 9th, 2012
TUSTEP is open source – with TXSTEP providing a new XML interface
I won’t recount how many years ago I first received email from Wilhelm Ott about TUSTEP.
From the TUSTEP homepage:
TUSTEP is a professional toolbox for scholarly processing textual data (including those in non-latin scripts) with a strong focus on humanities applications. It contains modules for all stages of scholarly text data processing, starting from data capture and including information retrieval, text collation, text analysis, sorting and ordering, rule-based text manipulation, and output in electronic or conventional form (including typesetting in professional quality).
Since the title “big data” is taken, perhaps we should take “complex data” for texts.
If you are exploring textual data in any detail or with XML, you should take a look at the TUSTEP project and its new XML interface, TXSTEP.
Or consider contributing to the project as well.
Wilhelm Ott writes (in part):
We are pleased to announce that, starting with the release 2012, TUSTEP is available as open source software. It is distributed under the Revised BSD Licence and can be downloaded from www.tustep.org.
TUSTEP has a long tradition as a highly flexible, reliable, efficient suite of programs for humanities computing. It started in the early 70ies as a tool for supporting humanities projects at the University of Tübingen, relying on own funds of the University. From 1985 to 1989, a substantial grant from the Land Baden-Württemberg officially opened its distribution beyond the limits of the University and started its success as a highly appreciated research tool for many projects at about a hundred universities and academic institutions in the German speaking part of the world, represented since 1993 in the International TUSTEP User Group (ITUG). Reports on important projects relying on TUSTEP and a list of publications (including lexicographic works and critical editions) can be found on the tustep webpage.
…
TXSTEP, presently being developed in cooperation with Stuttgart Media University, offers a new XML-based user interface to the TUSTEP programs. Compared to the original TUSTEP commands, we see important advantages:
- it will offer an up-to-date established syntax for scripting;
- it will show the typical benefits of working with an XML editor, like content completion, highlighting, showing annotations, and, of course, verifying the code;
- it will offer – to a certain degree – a self teaching environment by commenting on the scope of every step;
- it will help to avoid many syntactical errors, even compared to the original TUSTEP scripting environment;
- the syntax is in English, providing a more widespread usability than TUSTEP’s German command language.
At the TEI conference last year in Würzburg, we presented a first prototype to an international audience. We look forward to DH2012 in Hamburg next week where, during the Poster Session, a more enhanced version which already contains most of TUSTEPs functions will be presented. A demonstration of TXSTEPs functionality will include tasks which can not easily be performed by existing XML tools.
After the demo, you are invited to download a test version of TXSTEP to play with, to comment on it and to help make it a great and flexible tool for everyday – and complex – questions.
OK, I confess a fascination with complex textual analysis.
Posted in TUSTEP/TXSTEP, Text Analytics, Text Mining, XML | No Comments »
Saturday, July 7th, 2012
On the origin of long-range correlations in texts by Eduardo G. Altmann, Giampaolo Cristadoro, and Mirko Degli Esposti.
Abstract:
The complexity of human interactions with social and natural phenomena is mirrored in the way we describe our experiences through natural language. In order to retain and convey such a high dimensional information, the statistical properties of our linguistic output has to be highly correlated in time. An example are the robust observations, still largely not understood, of correlations on arbitrary long scales in literary texts. In this paper we explain how long-range correlations flow from highly structured linguistic levels down to the building blocks of a text (words, letters, etc.). By combining calculations and data analysis we show that correlations take form of a bursty sequence of events once we approach the semantically relevant topics of the text. The mechanisms we identify are fairly general and can be equally applied to other hierarchical settings.
Another area of arXiv.org, Physics > Data Analysis, Statistics and Probability, to monitor.
The authors used ten (10) novels from Project Gutenberg:
- Alice’s Adventures in Wonderland
- The Adventures of Tom Sawyer
- Pride and Prejudice
- Life on the Mississippi
- The Jungle
- The Voyage of the Beagle
- Moby Dick; or The Whale
- Ulysses
- Don Quixote
- War and Peace
Interesting research that will take a while to digest but I have to wonder why these ten (10) novels?
Or perhaps better, in an age of “big data,” why only ten (10)?
Why not the entire corpus of Project Gutenberg?
Or perhaps the texts of Wikipedia in its multitude of languages?
Reasoning that if the results represent an insight about natural language, they should be applicable beyond English. Yes?
If this is your area, comments and suggestions would be most welcome.
Posted in Natural Language Processing, Text Analytics | No Comments »
Tuesday, May 29th, 2012
ProseVis
A tool for exploring texts on a non-word basis.
Or in the words of the project:
ProseVis is a visualization tool developed as part of a use case supported by the Andrew W. Mellon Foundation through a grant titled “SEASR Services,” in which we seek to identify other features than the “word” to analyze texts. These features comprise sound including parts-of-speech, accent, phoneme, stress, tone, break index.
ProseVis allows a reader to map the features extracted from OpenMary (http://mary.dfki.de/) Text-to-speech System and predictive classification data to the “original” text. We developed this project with the ultimate goal of facilitating a reader’s ability to analyze and disseminate the results in human readable form. Research has shown that mapping the data to the text in its original form allows for the kind of human reading that literary scholars engage: words in the context of phrases, sentences, lines, stanzas, and paragraphs (Clement 2008). Recreating the context of the page not only allows for the simultaneous consideration of multiple representations of knowledge or readings (since every reader’s perspective on the context will be different) but it also allows for a more transparent view of the underlying data. If a human can see the data (the syllables, the sounds, the parts-of-speech) within the context in which they are used to reading, with the data mapped back onto the full text, then the reader is empowered within this familiar context to read what might otherwise be an unfamiliar tabular representation of the text. For these reasons, we developed ProseVis as a reader interface to allow scholars to work with the data in a language or context in which we are used to saying things about the world.
Textual analysis tools are “smoking gun” detectors.
A CEO is unlikely to make inappropriate comments in a spreadsheet or data feed. Emails on the other hand…
Big or little data, the goal is to have the “right” data.
Posted in Data Mining, Graphics, Text Analytics, Text Mining, Visualization | No Comments »
Saturday, May 19th, 2012
From the Bin Laden Letters: Reactions in the Islamist Blogosphere
From the post:
Following our initial analysis of the Osama bin Laden letters released by the Combating Terrorism Center (CTC) at West Point, we’ll more closely examine interesting moments from the letters and size them up against what was publicly reported as happening in the world in order to gain a deeper perspective on what was known or unknown at the time.
There was a frenzy of summarization and highlight reel reporting in the wake of the Abbottabad documents being publicly released. Some focused on the idea that Osama bin Laden was ostracized, some pointed to the seeming obsession with image in the media, and others simply took a chance to jab at Joe Biden for the suggestions made about his lack of preparedness for the presidency.
What we’ll do in this post is take a different approach, and rather than focus on analyst viewpoints we’ll compare reactions to the Abbottabad documents from a unique source – Islamist discussion forums.
There we find rebukes over the veracity of the documents released, support for the efforts of operatives such as Faisal Shahzad, and a little interest in the Arab Spring.
Interesting visualizations as always.
The question I would ask as a consumer of such information services is: How do I integrate this analysis with in-house analysis tools?
Or perhaps better: How do I evaluate non-direct references to particular persons or places? That is, a person or place is implied but not named. What do I know about the basis for such an identification?
Posted in Intelligence, Text Analytics | No Comments »
Monday, April 30th, 2012
Text Analytics: Yesterday, Today and Tomorrow
Another Tony Russell-Rose post that I ran across over the weekend:
Here’s something I’ve been meaning to share for a while: the slides for a talk entitled “Text Analytics: Yesterday, Today and Tomorrow”, co-authored with colleagues Vladimir Zelevinsky and Michael Ferretti. In this we outline some of the key challenges in text analytics, describe some of Endeca’s current research in this area, examine the current state of the text analytics market and explore some of the prospects for the future.
I was amused to read on slide 40:
Solutions still not standardized
Users differ in their views of the world of texts, solutions, data, formats, data structures, and analysis.
Anyone offering a “standardized” solution is selling their view of the world.
As a user/potential customer, I am rather attached to my view of the world. You?
Posted in Marketing, Text Analytics | No Comments »
Sunday, April 29th, 2012
Prostitutes Appeal to Pope: Text Analytics applied to Search by Tony Russell-Rose.
It is hard for me to visit Tony’s site and not come away with several posts he has written that I want to mention. Today was no different.
Here is a sampling of what Tony talks about in this post:
Consider the following newspaper headlines, all of which appeared unambiguous to the original writer:
- DRUNK GETS NINE YEARS IN VIOLIN CASE
- PROSTITUTES APPEAL TO POPE
- STOLEN PAINTING FOUND BY TREE
- RED TAPE HOLDS UP NEW BRIDGE
- DEER KILL 300,000
- RESIDENTS CAN DROP OFF TREES
- INCLUDE CHILDREN WHEN BAKING COOKIES
- MINERS REFUSE TO WORK AFTER DEATH
Although humorous, they illustrate much of the ambiguity in natural language, and just how much pragmatic and linguistic knowledge must be employed by NLP tools to function accurately.
A very informative and highly amusing post.
What better way to start the week?
Enjoy!
Posted in Ambiguity, Search Analytics, Searching, Text Analytics | No Comments »
Sunday, April 29th, 2012
Text Analytics Summit Europe – highlights and reflections by Tony Russell-Rose.
Earlier this week I had the privilege of attending the Text Analytics Summit Europe at the Royal Garden Hotel in Kensington. Some of you may of course recognise this hotel as the base for Justin Bieber’s recent visit to London, but sadly (or is that fortunately?) he didn’t join us. Next time, maybe…
Ranking reasons to attend:
- #1 Text Analytics Summit Europe – meet other attendees, presentations
- #2 Kensington Gardens and Hyde Park (been there, it is more impressive than you can imagine)
- #N +1 Justin Bieber being in London (or any other location)
I was disappointed by the lack of links to slides or videos of the presentations.
Tony’s post does have pointers to people and resources you may have missed.
Question: Do you think “text analytics” and “data mining” are different? If so, how?
Posted in Analytics, Natural Language Processing, Text Analytics | No Comments »
Tuesday, April 17th, 2012
Superfastmatch: A text comparison tool by Donovan Hide.
Slides on a Chrome extension that compares news stories for unique content.
Would be interesting to compare 24-hour news channels both to themselves and to others on the basis of duplicate content.
Could even have a 15-minute highlights-of-the-news program that delivers most of the non-duplicate content (well, omitting the commercials as well) for any 24-hour period.
Until then, visit this project and see what you think.
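To give a feel for how duplicate content between two stories might be measured, here is a minimal sketch using word shingles and Jaccard overlap. This is a generic technique chosen for illustration; I am not claiming it is how Superfastmatch works, and the function names are my own.

```python
def shingles(text, size=3):
    """Set of all runs of `size` consecutive lowercase words."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def overlap(a, b, size=3):
    """Jaccard similarity of the two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa or not sb:
        return 0.0  # at least one text is too short to form a shingle
    return len(sa & sb) / len(sa | sb)
```

Two stories that share half of their three-word runs score 0.5. Whatever falls outside the shared shingles is the "non-duplicate content" a highlights program would want to keep.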
Posted in Duplicates, News, Text Analytics | No Comments »
Thursday, March 22nd, 2012
Text Analytics in Telecommunications – Part 3 by Themos Kalafatis.
From the post:
It is well known that FaceBook contains a multitude of information that can be potentially analyzed. A FaceBook page contains several entries (Posts, Photos, Comments, etc) which in turn generate Likes. This data can be analyzed to better understand the behavior of consumers towards a Brand, Product or Service.
Let’s look at the analysis of the three FaceBook pages of MT:S, Telenor and VIP Mobile Telcos in Serbia as an example. The question that this analysis tries to answer is whether we can identify words and phrases that frequently appear in posts that generate any kind of reaction (a “Like”, or a Comment) vs words and topics that do not tend to generate reactions . If we are able to differentiate these words then we get an idea on what consumers tend to value more : If a post is of no value to us then we will not tend to Like it and/or comment it.
To perform this analysis we need a list of several thousands of posts (their text) and also the number of Likes and Comments that each post has received. If any post has generated a Like and/or a Comment then we flag that post as having generated a reaction. The next step is to feed that information to a machine learning algorithm to identify which words have discriminative power (=which words appear more frequently in posts that are liked and/or commented and also which words do not produce any reaction.)
It would be more helpful if the “machine learning algorithm” used in this case were identified, along with the data set in question.
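Since the post does not name its algorithm, the following is only a stand-in of my own choosing: a smoothed log-odds score per word, computed from posts flagged as having drawn a reaction or not. The function name, the `(text, likes, comments)` input shape, and the add-one smoothing are all assumptions, not anything taken from the post.

```python
import math
import re
from collections import Counter


def discriminative_words(posts, top_n=5):
    """Rank words by smoothed log-odds of appearing in posts that drew a reaction.

    posts: iterable of (text, likes, comments) tuples (assumed input shape).
    Returns (most_reactive, least_reactive) lists of (word, score) pairs.
    """
    reacted, ignored = Counter(), Counter()
    n_reacted = n_ignored = 0
    for text, likes, comments in posts:
        # Count each word once per post (document frequency).
        words = set(re.findall(r"\w+", text.lower()))
        if likes > 0 or comments > 0:  # the "generated a reaction" flag
            reacted.update(words)
            n_reacted += 1
        else:
            ignored.update(words)
            n_ignored += 1

    scores = {}
    for word in set(reacted) | set(ignored):
        # Add-one smoothing keeps the ratio finite for words seen in one class only.
        p_reacted = (reacted[word] + 1) / (n_reacted + 2)
        p_ignored = (ignored[word] + 1) / (n_ignored + 2)
        scores[word] = math.log(p_reacted / p_ignored)

    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n], ranked[-top_n:][::-1]
```

Positive scores mark words that lean toward posts with reactions, negative scores the opposite. For a highly inflected language such as Serbian, some form of stemming or lemmatization would presumably have to come before the counting.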
I suppose we will learn more after the presentation at the European Text Analytics Summit, although we would like to learn more sooner!
Posted in Machine Learning, Text Analytics | No Comments »
Wednesday, March 21st, 2012
Text Analytics for Telecommunications – Part 2 by Themos Kalafatis.
From the post:
In the previous post we have seen the problems that a highly inflected language creates and also a very basic example of Competitive Intelligence. The Case Study that i will present in the forthcoming European Text Analytics Summit is about the analysis of Telco Subscriber conversations on FaceBook and Twitter that involve Telenor, MT:S and VIP Mobile located in Serbia.
It is time to see what Topics are found in subscriber conversations. Each Telco has its own FaceBook page which contains posts and comments generated by page curators and subscribers. Each post and comment also generates “Likes” and “Shares”. Several types of analysis can be performed to find out :
- What kind of Topics are discussed in posts and comments of each Telco FaceBook page?
- What is the sentiment?
- Which posts (and comments) tend to be liked and shared (=generate Interest and reactions)?
Themos continues his series on text analytics for Telcos.
Here he moves into Facebook comments and analysis of the same.
Posted in Telecommunications, Text Analytics | 3 Comments »
Saturday, December 24th, 2011
RTextTools v1.3.2 Released
From the post:
RTextTools was updated to version 1.3.2 today, adding support for n-gram token analysis, a faster maximum entropy algorithm, and numerous bug fixes. The source code has been synced with the Google Code repository, so please feel free to check out a copy and add your own features!
With the core feature set of RTextTools finalized, the next major release (v1.4.0) will focus on optimizing existing code and refining the API for the package. Furthermore, my goal is to add compressed sparse matrix support for all nine algorithms to reduce memory consumption; currently maximum entropy, support vector machines, and glmnet support compressed sparse matrices.
If you are doing text analysis to extract subjects and their properties or have an interest in contributing to a project on text analysis, this may be your chance.
Posted in R, Text Analytics | No Comments »
Wednesday, December 21st, 2011
Reusable TokenStreams by Chris Male.
Abstract:
This white paper covers how Lucene’s text analysis system works today and explores the system and provides an understanding of what a TokenStream is, what the difference between Analyzers, TokenFilters and Tokenizers are, and how reuse impacts the design and implementation of each of these components.
Useful treatment of Lucene’s text analysis features. Those are still developing and more changes are promised (but left rather vague) for the future.
One covered feature of particular interest is the ability to associate geographic location data with terms deemed to represent locations.
Occurs to me that such a feature could also be used to annotate terms during text analysis to associate subject identifiers with those terms.
An application doesn’t have to “understand” that terms have different meanings so long as it can distinguish one from another based on annotations. (Or map them together despite different identifiers.)
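To make the annotation idea concrete, here is a rough sketch of a tokenizer followed by chained filters, the last of which attaches subject identifiers to terms. It imitates the tokenizer-plus-filters shape in plain Python and is not Lucene's API; the function names, the dictionary token format, and the example identifier are hypothetical.

```python
import re


def tokenize(text):
    """Yield one dict per word token, with its text and character offsets."""
    for match in re.finditer(r"\w+", text):
        yield {"term": match.group(), "start": match.start(), "end": match.end()}


def lowercase(tokens):
    """Filter stage: normalize each term to lower case."""
    for token in tokens:
        yield {**token, "term": token["term"].lower()}


def annotate_subjects(tokens, subject_ids):
    """Filter stage: attach subject identifiers to terms found in a lookup table.

    subject_ids maps a normalized term to a list of identifiers (hypothetical).
    Terms not in the table pass through with an empty list.
    """
    for token in tokens:
        yield {**token, "subjects": subject_ids.get(token["term"], [])}


# Hypothetical lookup table; the identifier is a placeholder, not a real one.
SUBJECT_IDS = {"paris": ["http://example.org/subject/paris-france"]}

stream = annotate_subjects(lowercase(tokenize("Flights to Paris")), SUBJECT_IDS)
for token in stream:
    print(token)
```

Downstream code only has to compare the `subjects` lists: two different terms carrying the same identifier can be mapped together, and one term carrying different identifiers in different contexts can be kept apart.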
Posted in Lucene, Text Analytics | No Comments »
Saturday, December 17th, 2011
Content Analysis by Michael Heise.
From the post:
Dan Katz (MSU) let me know about a beta release of new website, Legal Language Explorer, that will likely interest anyone who does content analysis as well as those looking for a neat (and, according to Jason Mazzone, addictive) toy to burn some time. The site, according to Dan, allows users: “the chance [free of charge] to search the history of the United States Supreme Court (1791-2005) for any phrase and get a frequency plot and the full text case results for that phrase.” Dan also reports that the developers hope to expand coverage beyond Supreme Court decisions in the future.
The site needs a For Amusement Only sticker. Legal language changes over time and probably no place more so than in Supreme Court decisions.
It was a standing joke in law school that the bar association sponsored the “Avoid Probate” sort of books. If you really want to incur legal fees, just try self-help. Same is true for this site. Use it to argue with your friends, settle bets during football games, etc. Don’t rely on it during nighttime, roadside encounters with folks carrying weapons and radios to summon help (the police).
Posted in Content Analysis, Law - Sources, Legal Informatics, Text Analytics | No Comments »
Sunday, December 4th, 2011
FACTA – Finding Associated Concepts with Text Analysis
From the Quick Start Guide:
FACTA is a simple text mining tool to help discover associations between biomedical concepts mentioned in MEDLINE articles. You can navigate these associations and their corresponding articles in a highly interactive manner. The system accepts an arbitrary query term and displays relevant concepts on the spot. A broad range of concepts are retrieved by the use of large-scale biomedical dictionaries containing the names of important concepts such as genes, proteins, diseases, and chemical compounds.
A very good example of an exploration tool that isn’t overly complex to use.
Posted in Associations, Bioinformatics, Biomedical, Concept Detection, Text Analytics | No Comments »
Monday, November 28th, 2011
New Insights from Text Analytics by Themos Kalafatis.
From the post:
“I have been trying repeatedly to solve my billing problem through customer care. I first talked with someone called Mrs Jane Doe. She said she should transfer my call to another representative from the sales department. Yet another rep from the sales department informed me that i should be talking with the Billing department instead. Unfortunately my bad experience of being transferred through various representatives was not over because the Billing department informed me that i should speak to the……”
Currently Text Analytics software will identify key elements of the above text but a very important piece of information goes unnoticed. It is the sequence of events which takes place :
(Jane Doe => Sales Dept =>Billing Dept =>…)
Is your software capturing sequences?
If not, how would you go about doing it?
And once captured, how do you represent it in a topic map?
PS: I would have isolated more segments in the sequence. How about you?
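As one naive answer to the "how would you go about doing it?" question, here is a minimal sketch that records the order in which known parties are mentioned. It assumes you already have a list of phrases for each party, which is the hard part in practice; the function name and label table are my own illustration, not anything from the post.

```python
import re


def mention_sequence(text, labels):
    """Return labels in order of mention, collapsing immediate repeats.

    labels maps a display label to a list of phrases that signal it.
    """
    hits = []
    for label, phrases in labels.items():
        for phrase in phrases:
            for match in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
                hits.append((match.start(), label))
    hits.sort()  # order by character position in the text

    sequence = []
    for _, label in hits:
        if not sequence or sequence[-1] != label:
            sequence.append(label)
    return sequence


# Hypothetical label table for the complaint quoted above.
LABELS = {
    "Customer Care": ["customer care"],
    "Jane Doe": ["Jane Doe"],
    "Sales Dept": ["sales department"],
    "Billing Dept": ["billing department"],
}
```

Run over the complaint quoted above, this picks up "Customer Care" as a segment ahead of Jane Doe, which is the sort of extra segment I had in mind. Order of mention is only a proxy for order of events, so a sentence like "before I reached Billing I had called Sales" would fool it.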
Posted in Sequence Detection, Text Analytics | No Comments »
Monday, November 21st, 2011
TextMinr
In pre-beta (can signal interest now) but:
Text Mining As A Service – Coming Soon!
What if you could incorporate state-of-the-art text mining, language processing & analytics into your apps and systems without having to learn the science or pay an arm and a leg for the software?
Soon you will be able to!
We aim to provide our text mining technology as a simple, affordable pay-as-you-go service, available through a web dashboard and a set of REST API’s.
If you are already familiar with these tools and your data sets, this could be a useful convenience.
If you aren’t familiar with these tools and your data sets, this could be a recipe for disaster.
Like SurveyMonkey.
In the hands of a survey construction expert, with testing of the questions, etc., I am sure SurveyMonkey can be a very useful tool.
In the hands of management, who want to justify decisions where surveys can be used, SurveyMonkey is positively dangerous.
Ask yourself this: Why in an age of SurveyMonkey, do politicians pay pollsters big bucks?
Do you suspect there is some difference between a professional pollster and SurveyMonkey?
Same distance between TextMinr and professional text analysis.
Or perhaps better, you get what you pay for.
Posted in Data Mining, Language, Text Analytics | No Comments »
Saturday, November 12th, 2011
Big Data and Text by Bill Inmon.
From the post:
Let’s take a look at big data. Corporations have discovered that there is a lot more data out there than they had ever imagined. There are log tapes, emails and tweets. There are registration records, phone records and TV log records. There are images and medical images. In short, there is an amazing amount of data.
Back in the good old days, there was just plain old transaction data. Bank teller machines. Airline reservation data. Point of sale records. We didn’t know how good we had it in those days. Why back in the good old days, a designer could create a data model and expect the data to fit reasonably well into the data model. Or the designer could define a record type to the database management system. The system would capture and store huge numbers of records that had the same structure. The only thing that was different was the content of the records.
Ah, the good old days – where there was at least a semblance of order when it came to managing and understanding data.
Take a look at the world now. There just is no structure to some of the big data types. Or if there is an order, it is well hidden. Really messing things up is the fact that much of big data is in the form of text. And text defies structure. Trying to put text into a standard database management system is like trying to put a really square peg into a really round hole.
While reading this post (only part of which appears here) it occurred to me that “unstructured data” is being used to mean data that lacks the appearance of outward semantics. That is, for any database table, you can show it to a variety of users and all of them will claim to understand the meanings both explicit and implicit in the tables. At least until they are asked to merge databases together as part of a reorganization of a business operation. Then out come old notebooks, emails, guesses and questions for older staff.
True, having outward structure can help, but the divide really isn’t between structured and unstructured data. Mostly because both of them normally lack any explicit semantics.
Posted in BigData, Text Analytics | No Comments »