Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 28, 2011

New Insights from Text Analytics

Filed under: Sequence Detection,Text Analytics — Patrick Durusau @ 7:12 pm

New Insights from Text Analytics by Themos Kalafatis.

From the post:

“I have been trying repeatedly to solve my billing problem through customer care. I first talked with someone called Mrs Jane Doe. She said she should transfer my call to another representative from the sales department. Yet another rep from the sales department informed me that i should be talking with the Billing department instead. Unfortunately my bad experience of being transferred through various representatives was not over because the Billing department informed me that i should speak to the……”

Currently Text Analytics software will identify key elements of the above text but a very important piece of information goes unnoticed. It is the sequence of events which takes place:

(Jane Doe => Sales Dept => Billing Dept => …)

Is your software capturing sequences?

If not, how would you go about doing it?

And once captured, how do you represent it in a topic map?

PS: I would have isolated more segments in the sequence. How about you?
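If you wanted a quick experiment with capturing the handoff sequence, here is a minimal Python sketch; the patterns are invented for this one example and stand in for real entity recognition, so treat it as a starting point rather than a method.

```python
import re

# Toy patterns for the actors in the example complaint; a production system
# would use a trained named-entity recognizer rather than hand-written rules.
ACTOR_PATTERNS = [
    re.compile(r"Mrs?\.?\s+[A-Z][a-z]+\s+[A-Z][a-z]+"),               # "Mrs Jane Doe"
    re.compile(r"\b(?:sales|billing)\s+department\b", re.IGNORECASE),
]

def handoff_sequence(text):
    """Return distinct actors in the order they first appear in the text."""
    hits = []
    for pattern in ACTOR_PATTERNS:
        for match in pattern.finditer(text):
            hits.append((match.start(), match.group().title()))
    seen, sequence = set(), []
    for _, actor in sorted(hits):          # position order = order of events
        if actor not in seen:
            seen.add(actor)
            sequence.append(actor)
    return sequence

complaint = ("I first talked with someone called Mrs Jane Doe. She said she "
             "should transfer my call to another representative from the sales "
             "department. Yet another rep from the sales department informed me "
             "that I should be talking with the Billing department instead.")
print(" => ".join(handoff_sequence(complaint)))
# Mrs Jane Doe => Sales Department => Billing Department
```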

November 21, 2011

TextMinr

Filed under: Data Mining,Language,Text Analytics — Patrick Durusau @ 7:31 pm

TextMinr

In pre-beta (you can signal interest now), but:

Text Mining As A Service – Coming Soon!

What if you could incorporate state-of-the-art text mining, language processing & analytics into your apps and systems without having to learn the science or pay an arm and a leg for the software?

Soon you will be able to!

We aim to provide our text mining technology as a simple, affordable pay-as-you-go service, available through a web dashboard and a set of REST API’s.

If you are already familiar with these tools and your data sets, this could be a useful convenience.

If you aren’t familiar with these tools and your data sets, this could be a recipe for disaster.

Like SurveyMonkey.

In the hands of a survey construction expert, with testing of the questions, etc., I am sure SurveyMonkey can be a very useful tool.

In the hands of management, who want to justify decisions where surveys can be used, SurveyMonkey is positively dangerous.

Ask yourself this: Why, in an age of SurveyMonkey, do politicians pay pollsters big bucks?

Do you suspect there is a difference between a professional pollster and SurveyMonkey?

The same distance lies between TextMinr and professional text analysis.

Or perhaps better, you get what you pay for.

November 12, 2011

Big Data and Text

Filed under: BigData,Text Analytics — Patrick Durusau @ 8:44 pm

Big Data and Text by Bill Inmon.

From the post:

Let’s take a look at big data. Corporations have discovered that there is a lot more data out there than they had ever imagined. There are log tapes, emails and tweets. There are registration records, phone records and TV log records. There are images and medical images. In short, there is an amazing amount of data.

Back in the good old days, there was just plain old transaction data. Bank teller machines. Airline reservation data. Point of sale records. We didn’t know how good we had it in those days. Why back in the good old days, a designer could create a data model and expect the data to fit reasonably well into the data model. Or the designer could define a record type to the database management system. The system would capture and store huge numbers of records that had the same structure. The only thing that was different was the content of the records.

Ah, the good old days – where there was at least a semblance of order when it came to managing and understanding data.

Take a look at the world now. There just is no structure to some of the big data types. Or if there is an order, it is well hidden. Really messing things up is the fact that much of big data is in the form of text. And text defies structure. Trying to put text into a standard database management system is like trying to put a really square peg into a really round hole.

While reading this post (only part of which appears here) it occurred to me that “unstructured data” is being used to mean data that lacks the appearance of outward semantics. That is, for any database table, you can show it to a variety of users and all of them will claim to understand the meanings both explicit and implicit in the tables. At least until they are asked to merge databases together as part of a reorganization of a business operation. Then out come old notebooks, emails, guesses and questions for older staff.

True, having outward structure can help, but the divide really isn’t between structured and unstructured data, mostly because both of them normally lack any explicit semantics.

September 22, 2011

Sparse Machine Learning Methods for Understanding Large Text Corpora

Filed under: Machine Learning,Sparse Learning,Text Analytics — Patrick Durusau @ 6:30 pm

Sparse Machine Learning Methods for Understanding Large Text Corpora (pdf) by Laurent El Ghaoui, Guan-Cheng Li, Viet-An Duong, Vu Pham, Ashok Srivastava, and Kanishka Bhaduri. Status: Accepted for publication in Proc. Conference on Intelligent Data Understanding, 2011.

Abstract:

Sparse machine learning has recently emerged as powerful tool to obtain models of high-dimensional data with high degree of interpretability, at low computational cost. This paper posits that these methods can be extremely useful for understanding large collections of text documents, without requiring user expertise in machine learning. Our approach relies on three main ingredients: (a) multi-document text summarization and (b) comparative summarization of two corpora, both using sparse regression or classification; (c) sparse principal components and sparse graphical models for unsupervised analysis and visualization of large text corpora. We validate our approach using a corpus of Aviation Safety Reporting System (ASRS) reports and demonstrate that the methods can reveal causal and contributing factors in runway incursions. Furthermore, we show that the methods automatically discover four main tasks that pilots perform during flight, which can aid in further understanding the causal and contributing factors to runway incursions and other drivers for aviation safety incidents.

I suppose it depends on your background (mine includes a law degree and a decade of practice) but when I read:

The ASRS data contains several of the crucial challenges involved under the general banner of “large-scale text data understanding”. First, its scale is huge, and growing rapidly, making the need for automated analyses of the processed reports more crucial than ever. Another issue is that the reports themselves are far from being syntactically correct, with lots of abbreviations, orthographic and grammatical errors, and other shortcuts. Thus we are not facing a corpora with well-structured language having clearly defined rules, as we would if we were to consider a corpus of laws or bills or any other well-redacted data set.

I thought I would fall out of my chair. I don’t think I have ever heard of a “corpus of laws or bills” being described as a “…well-redacted data set.”

There was a bill passed in the US Congress last year that, despite being acted on by both Houses and who knows how many production specialists, was passed without a name.

Apologies for the digression.

From the paper:

Our paper makes the claim that sparse learning methods can be very useful to the understanding large text databases. Of course, machine learning methods in general have already been successfully applied to text classification and clustering, as evidenced for example by [21]. We will show that sparsity is an important added property that is a crucial component in any tool aiming at providing interpretable statistical analysis, allowing in particular efficient multi-document summarization, comparison, and visualization of huge-scale text corpora.

You will need to read the paper for the details but I think it clearly demonstrates that sparse learning methods are useful for exploring large text databases. While it may be the case that your users have a view of their data, it is equally likely that you will be called upon to mine a text database and to originate a navigation overlay for it. That will require exploring the data and developing an understanding of it.
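If you want a feel for what “sparse” buys you before reading the paper, here is a rough sketch, not the authors’ code: an L1-penalized classifier over TF-IDF features keeps only a few non-zero weights, and the surviving terms read as a comparative summary of the two document classes. The scikit-learn calls and the toy labels below are my assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Two tiny document classes standing in for, e.g., two kinds of ASRS reports.
docs = [
    "runway incursion during taxi after a landing clearance mix-up",
    "aircraft crossed the hold short line onto the active runway",
    "altitude deviation caused by autopilot mode confusion in cruise",
    "crew missed the assigned altitude during the climb",
]
labels = [1, 1, 0, 0]   # 1 = runway incident, 0 = altitude incident

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# The L1 penalty drives most coefficients to exactly zero, leaving a short,
# human-readable list of discriminating terms (the "sparse" part).
clf = LogisticRegression(penalty="l1", solver="liblinear", C=10.0)
clf.fit(X, labels)

for term, weight in zip(vectorizer.get_feature_names_out(), clf.coef_[0]):
    if weight != 0:
        print(f"{term}: {weight:+.3f}")
```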

For all the projections of the need for data analysts and their technical skills, without insight and imagination they will just be going through the motions.

(Applying sparse learning methods to new areas is an example of imagination.)

September 19, 2011

DiscoverText

Filed under: Data Mining,Text Analytics — Patrick Durusau @ 7:51 pm

DiscoverText

From the webpage:

DiscoverText helps you gain valuable insight about customers, products, employees, citizens, research data, and more through powerful text analytic methods. DiscoverText combines search, human judgments and inferences with automated software algorithms to create an active machine-learning loop.

DiscoverText is currently used for text analytics, market research, eDiscovery, FOIA processing, employee engagement analytics, health informatics, processing public comments by government agencies and university basic research.

Before I sign up for the free trial version, do you have any experience with this product? Suggested data sets that make it shine or not shine so much?

August 19, 2011

MONK

Filed under: Data Mining,Digital Library,Semantics,Text Analytics — Patrick Durusau @ 8:32 pm

MONK

From the Introduction:

The MONK Project provides access to the digitized texts described above along with tools to enable literary research through the discovery, exploration, and visualization of patterns. Users typically start a project with one of the toolsets that has been predefined by the MONK team. Each toolset is made up of individual tools (e.g. a search tool, a browsing tool, a rating tool, and a visualization), and these tools are applied to worksets of texts selected by the user from the MONK datastore. Worksets and results can be saved for later use or modification, and results can be exported in some standard formats (e.g., CSV files).

The public data set:

This instance of the MONK Project includes approximately 525 works of American literature from the 18th and 19th centuries, and 37 plays and 5 works of poetry by William Shakespeare provided by the scholars and libraries at Northwestern University, Indiana University, the University of North Carolina at Chapel Hill, and the University of Virginia. These texts are available to all users, regardless of institutional affiliation.

Digging a bit further:

Each of these texts is normalized (using Abbot, a complex XSL stylesheet) to a TEI schema designed for analytic purposes (TEI-A), and each text has been “adorned” (using Morphadorner) with tokenization, sentence boundaries, standard spellings, parts of speech and lemmata, before being ingested (using Prior) into a database that provides Java access methods for extracting data for many purposes, including searching for objects; direct presentation in end-user applications as tables, lists, concordances, or visualizations; getting feature counts and frequencies for analysis by data-mining and other analytic procedures; and getting tokenized streams of text for working with n-gram and other colocation analyses, repetition analyses, and corpus query-language pattern-matching operations. Finally, MONK’s quantitative analytics (naive Bayesian analysis, support vector machines, Dunnings log likelihood, and raw frequency comparisons), are run through the SEASR environment.
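For those who have not met Dunning’s log-likelihood before, here is a small sketch of the usual two-corpus form; the counts are invented, and MONK runs its version through SEASR rather than anything like this.

```python
import math

def dunning_g2(count_a, total_a, count_b, total_b):
    """Dunning's log-likelihood (G2) for one word across two corpora.

    count_a, count_b: occurrences of the word in corpus A and corpus B
    total_a, total_b: total word tokens in corpus A and corpus B
    """
    expected_a = total_a * (count_a + count_b) / (total_a + total_b)
    expected_b = total_b * (count_a + count_b) / (total_a + total_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:                    # treat 0 * ln(0) as 0
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2

# Invented counts: "thee" in a Shakespeare workset vs. an American-literature workset.
print(round(dunning_g2(120, 50000, 5, 80000), 2))
```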

Here’s my topic maps question: So, how do I reliably combine the results from a subfield that uses a different vocabulary than my own? For that matter, how do I discover it in the first place?

I think the MONK project is quite remarkable but lament the impending repetition of research across such a vast archive simply because it is unknown or expressed in a “foreign” tongue.

July 4, 2011

OrganiK Knowledge Management System

Filed under: Filters,Indexing,Knowledge Management,Recommendation,Text Analytics — Patrick Durusau @ 6:03 pm

OrganiK Knowledge Management System (wiki)

OrganiK Knowledge Management System (homepage)

I encountered the OrganiK project while searching for something else (naturally). 😉

From the homepage:

Objectives of the Project

The aim of the OrganiK project is to research and develop an innovative knowledge management system that enables the semantic fusion of enterprise social software applications. The system accumulates information that can be exchanged among one or several collaborating companies. This enables an effective management of organisational knowledge and can be adapted to functional requirements of smaller and knowledge-intensive companies.


Main distinguishing features

The set of OrganiK KM Client Interfaces comprises of a Wiki, a Blog, a Social Bookmarking and a Search Component that together constitute a Collaborative Workspace for SME knowledge workers. Each of the components consists of a Web-based client interface and a corresponding server engine.
The components that comprise the Business Logic Layer of the OrganiK KM Server are:

  • the Recommender System,
  • the Semantic Text Analyser,
  • the Collaborative Filtering Engine
  • the Full-text Indexer


Interesting project but the latest news item dates from 2008. Not encouraging.

I checked the source code and the most recent update was August, 2010. Much more encouraging.

I have written to ask for more recent news.

June 12, 2011

A Few Subjects Go A Long Way

Filed under: Data Analysis,Language,Linguistics,Text Analytics — Patrick Durusau @ 4:11 pm

A post by Rich Cooper (Rich AT EnglishLogicKernel DOT com) Analyzing Patent Claims demonstrates the power of small vocabularies (sets of subjects) for the analysis of patent claims.

It is a reminder that a topic map author need not identify every possible subject, but only so many of those as necessary. Other subjects abound and await other authors who wish to formally recognize them.

It is also a reminder that a topic map need only be as complex or as complete as necessary for a particular task. My topic map may not be useful for Mongolian herdsmen or even the local bank. But the test isn’t abstract; it is practical: does it meet the needs of its intended audience?

May 18, 2011

ICON Programming for Humanists, 2nd edition

Filed under: Data Mining,Indexing,Text Analytics,Text Extraction — Patrick Durusau @ 6:50 pm

ICON Programming for Humanists, 2nd edition

From the foreword to the first edition:

This book teaches the principles of Icon in a very task-oriented fashion. Someone commented that if you say “Pass the salt” in correct French in an American university you get an A. If you do the same thing in France you get the salt. There is an attempt to apply this thinking here. The emphasis is on projects which might interest the student of texts and language, and Icon features are instilled incidentally to this. Actual programs are exemplified and analyzed, since by imitation students can come to devise their own projects and programs to fulfill them. A number of the illustrations come naturally enough from the field of Stylistics which is particularly apt for computerized approaches.

I can’t say that the success of ICON is a recommendation for task-oriented teaching, but as I recall the first edition, I thought it was effective.

Data mining of texts is an important skill in the construction of topic maps.

This is a very good introduction to that subject.

April 24, 2011

It’s All Semantic With the New Text-Processing API

Filed under: Natural Language Processing,Text Analytics — Patrick Durusau @ 5:34 pm

It’s All Semantic With the New Text-Processing API

From the post at ProgrammableWeb:

Now I don’t have a master’s degree in Natural language processing, and you just might need one to get your hands dirty with this API. I see the text-processing.com API as offering a mid-level utility for incorporation in a web app. You might take text samples from your source, feed them through the Text-Processing API and analyze those results a bit further before presenting anything to your user.

This offering appears to be the result of a one man effort. Jacob Perkins designed his API as RESTful with JSON responses. It’s free and open for the meantime, but it sounds like Perkins may polish the service a bit and start charging for access. There could be a real market here since only a handful of the 58 semantic APIs in our directory offer results at the technical level.

Interesting and not all that surprising.

Could be a good way to see if you are interested in going further with natural language processing.
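If you want to poke at it, a minimal call might look like the following. The endpoint and form field are taken from the service’s documentation as I read it, so treat them as assumptions and check the current API reference before building on them.

```python
import requests

# Assumed endpoint and form field based on the text-processing.com docs;
# verify against the current API reference before relying on them.
URL = "http://text-processing.com/api/sentiment/"

def sentiment(text):
    """POST a text sample and return the parsed JSON response."""
    response = requests.post(URL, data={"text": text}, timeout=10)
    response.raise_for_status()
    return response.json()

print(sentiment("The new release fixed every bug I cared about."))
```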

April 18, 2011

Classify content with XQuery

Filed under: Classification,Text Analytics,XQuery — Patrick Durusau @ 1:40 pm

Classify content with XQuery by James R. Fuller (jim.fuller@webcomposite.com), Technical Director, Webcomposite.

Summary: With the expanding growth of semi-structured and unstructured data (XML) comes the need to categorize and classify content to make querying easier, faster, and more relevant. In this article, try several techniques using XQuery to automatically tag XML documents with content categorization based on the analysis of their content and structure.

Good article on the use of XQuery for basic text analysis and how to invoke web services while using XQuery for more sophisticated text analysis.
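Fuller works entirely in XQuery; purely as an illustration of the underlying idea (the keyword rules are invented for the example), the same kind of rule-based category tagging looks like this in Python.

```python
from xml.etree import ElementTree as ET

# Invented keyword rules; the article builds the equivalent logic in XQuery.
RULES = {
    "finance": {"invoice", "payment", "account"},
    "support": {"error", "crash", "ticket"},
}

def classify(xml_text):
    """Attach a <category> element for each rule whose keywords appear in the text."""
    root = ET.fromstring(xml_text)
    words = {w.strip(".,").lower() for w in " ".join(root.itertext()).split()}
    for name, keywords in RULES.items():
        if words & keywords:
            ET.SubElement(root, "category").text = name
    return ET.tostring(root, encoding="unicode")

doc = "<doc><title>Login crash</title><body>The app reports an error on start.</body></doc>"
print(classify(doc))
```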

March 13, 2011

Text Analytics Tools and Runtime for IBM LanguageWare

Filed under: Text Analytics,Topic Maps — Patrick Durusau @ 4:26 pm

Text Analytics Tools and Runtime for IBM LanguageWare

From the website:

IBM LanguageWare is a technology which provides a full range of text analysis functions. It is used extensively throughout the IBM product suite and is successfully deployed in solutions which focus on mining facts from large repositories of text. With support for more than 20 languages, LanguageWare is the ideal solution for extracting the value locked up in unstructured text information and exposing it to business applications. With the emerging importance of Business Intelligence and the explosion in text-based information, the need to exploit this “hidden” information has never been so great. LanguageWare technology not only provides the functionality to address this need, it also makes it easier than ever to create, manage and deploy analysis engines and their resources.

It comprises Java libraries with a large set of features and the linguistic resources that supplement them. It also comprises an easy-to-use Eclipse-based development environment for building custom text analysis applications. In a few clicks, it is possible to create and deploy UIMA (Unstructured Information Management Architecture) annotators that perform everything from simple dictionary lookups to more sophisticated syntactic and semantic analysis of texts using dictionaries, rules and ontologies.

The LanguageWare libraries provide the following non-exhaustive list of features: dictionary look-up and fuzzy look-up, lexical analysis, language identification, spelling correction, hyphenation, normalization, part-of-speech disambiguation, syntactic parsing, semantic analysis, facts/entities extraction and relationship extraction. For more details see the documentation.

The LanguageWare Resource Workbench provides a complete development environment for the building and customization of dictionaries, rules, ontologies and associated UIMA annotators. This environment removes the need for specialist knowledge of the underlying technologies of natural language processing or UIMA. In doing so, it allows the user to focus on the concepts and relationships of interest, and to develop analyzers which extract them from text without having to write any code. The resulting application code is wrapped as UIMA annotators, which can be seamlessly plugged into any application that is UIMA-compliant.
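To make the first feature on that list concrete (and only that; this is a standard-library toy, not LanguageWare’s API), dictionary look-up with fuzzy matching might look like the following.

```python
import difflib

# A hypothetical domain dictionary; LanguageWare manages these as resources in
# its Eclipse-based workbench, not as Python sets.
DICTIONARY = {"hypertension", "hyperglycemia", "tachycardia", "bradycardia"}

def fuzzy_lookup(token, cutoff=0.8):
    """Return dictionary entries that exactly or approximately match a token."""
    token = token.lower()
    if token in DICTIONARY:
        return [token]
    return difflib.get_close_matches(token, DICTIONARY, n=3, cutoff=cutoff)

print(fuzzy_lookup("hypertenson"))   # the misspelling still resolves to "hypertension"
```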

IBM has attracted a lot of attention with its Jeopardy-playing “Watson,” and that isn’t necessarily a bad thing.

Personally I am hopeful that it will spur a greater interest in both the humanities and CS: the humanities because CS in their absence lacks a lot of interesting problems, and CS because that can result in software for the rest of us to use.

Many years ago, before CS became professional, or at least as professional as it is now, there was a healthy mixture of mathematicians, engineers, humanists and what would become computer scientists in computer science projects.

This software package may be a good way to attract a better cross-section of people to a project.

I am not sure whether finding collaborators will be easier in a university setting (with sharp department lines) or in a public setting where people may be looking for projects outside of work in the public interest.

Possible project questions:

  1. Define a project where you would use these text analytic tools. (3-5 pages, no citations)
  2. What other disciplines would you involve and how would you persuade them to participate? (3-5 pages, no citations)
  3. How would you involve topic maps in your project and why? (3-5 pages, no citations)
  4. How would you use these tools to populate your topic maps? (5-7 pages, no citations)

January 19, 2011

Text Analytics: Yesterday, Today and Tomorrow

Filed under: Text Analytics — Patrick Durusau @ 9:23 am

Text Analytics: Yesterday, Today and Tomorrow by Tony Russell-Rose and his colleagues Vladimir Zelevinsky and Michael Ferretti.

Nothing particularly new but a highly entertaining account of text analytics and its increasing importance.

Part 1 you could let managers view without assistance.

Parts 2 and 3, well, you had better be there to provide some contextual information.


November 25, 2010

LingPipe Blog

Filed under: Data Mining,Natural Language Processing,Text Analytics — Patrick Durusau @ 11:07 am

LingPipe Blog: Natural Language Processing and Text Analytics

Blog for the LingPipe Toolkit.

If you want to move beyond hand-authored topic maps, NLP and other techniques are in your future.

Imagine using LingPipe to generate entity profiles that you then edit (or not) and market for particular data resources.

On entity profiles, see: Sig.ma.

November 24, 2010

Text Visualization for Visual Text Analytics

Filed under: Authoring Topic Maps,Text Analytics,Visualization — Patrick Durusau @ 7:32 pm

Text Visualization for Visual Text Analytics by John Risch, Anne Kao, Stephen R. Poteet, and Y. J. Jason Wu.

Abstract:

The term visual text analytics describes a class of information analysis techniques and processes that enable knowledge discovery via the use of interactive graphical representations of textual data. These techniques enable discovery and understanding via the recruitment of human visual pattern recognition and spatial reasoning capabilities. Visual text analytics is a subclass of visual data mining / visual analytics, which more generally encompasses analytical techniques that employ visualization of non-physically-based (or “abstract”) data of all types. Text visualization is a key component in visual text analytics. While the term “text visualization” has been used to describe a variety of methods for visualizing both structured and unstructured characteristics of text-based data, it is most closely associated with techniques for depicting the semantic characteristics of the free-text components of documents in large document collections. In contrast with text clustering techniques which serve only to partition text corpora into sets of related items, these so-called semantic mapping methods also typically strive to depict detailed inter- and intra-set similarity structure. Text analytics software typically couples semantic mapping techniques with additional visualization techniques to enable interactive comparison of semantic structure with other characteristics of the information, such as publication date or citation information. In this way, value can be derived from the material in the form of multidimensional relationship patterns existing among the discrete items in the collection. The ultimate goal of these techniques is to enable human understanding and reasoning about the contents of large and complexly related text collections.

Not the latest word in the area but a useful survey of the issues that arise in text visualization.

Text visualization is important for the creation of topic maps as well as the viewing of information discovered by use of a topic map.
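Here is a bare-bones version of the “semantic mapping” idea from the abstract, nothing like the systems the authors survey, but enough to see the shape of it: project TF-IDF vectors down to two dimensions and plot them, so related documents land near one another. The library calls are assumptions on my part.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "topic maps merge subjects across vocabularies",
    "subject identity and merging in topic maps",
    "convolutional networks for image recognition",
    "image classification with deep neural networks",
]

X = TfidfVectorizer().fit_transform(docs)
coords = TruncatedSVD(n_components=2).fit_transform(X)   # a crude "semantic map"

plt.scatter(coords[:, 0], coords[:, 1])
for i, (x, y) in enumerate(coords):
    plt.annotate(str(i), (x, y))      # label each point with its document index
plt.title("Toy semantic map: nearby points are similar documents")
plt.show()
```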

Questions:

  1. Update the bibliography of this paper for the techniques discussed.
  2. Are there new text visualization techniques?
  3. How would you use the techniques in this paper or newer ones, for authoring topic maps? (3-5 pages, citations)

Text Analysis with LingPipe 4. Draft 0.2

Filed under: Data Mining,Natural Language Processing,Text Analytics — Patrick Durusau @ 9:53 am

Text Analysis with LingPipe 4. Draft 0.2

Draft 0.2 is up to 363 pages.

Chapters:

  1. Getting Started
  2. Characters and Strings
  3. Regular Expressions
  4. Input and Output
  5. Handlers, Parsers, and Corpora
  6. Classifiers and Evaluation
  7. Naive Bayes Classifiers (not done)
  8. Tokenization
  9. Symbol Tables
  10. Sentence Boundary Detection (not done)
  11. Latent Dirichlet Allocation
  12. Singular Value Decomposition (not done)

Extensive annexes.

The book is projected to grow by another 1,000 or so pages, so the (not done) chapters will appear along with additional material in other chapters.

Readers welcome!

Christmas came early this year!

Questions:

  1. Class presentation demonstrating use of one of the techniques on a library-related data set.
  2. Compare and contrast two of the techniques on a library-related data set. (Project)
  3. Annotated and updated bibliography for any chapter.

Update: Same questions as before but look at the updated version of the book (split into text processing and NLP as separate parts): LingPipe and Text Processing Books.

October 1, 2010

Tell me more, not just “more of the same”

Tell me more, not just “more of the same” by Francisco Iacobelli, Larry Birnbaum, and Kristian J. Hammond.

Keywords: dimensions of similarity, information retrieval, new information detection

Abstract:

The Web makes it possible for news readers to learn more about virtually any story that interests them. Media outlets and search engines typically augment their information with links to similar stories. It is up to the user to determine what new information is added by them, if any. In this paper we present Tell Me More, a system that performs this task automatically: given a seed news story, it mines the web for similar stories reported by different sources and selects snippets of text from those stories which offer new information beyond the seed story. New content may be classified as supplying: additional quotes, additional actors, additional figures and additional information depending on the criteria used to select it. In this paper we describe how the system identifies new and informative content with respect to a news story. We also show that providing an explicit categorization of new information is more useful than a binary classification (new/not-new). Lastly, we show encouraging results from a preliminary evaluation of the system that validates our approach and encourages further study.

If you are interested in the automatic extraction, classification and delivery of information, this article is for you.

I think there are (at least) two interesting ways for “Tell Me More” to develop:

First, persist the results of entity recognition together with other data (story, author, date, etc.) in the form of associations (with appropriate roles, etc.).

Second, and perhaps more importantly, enable users to add or correct information presented as part of a mapping of information about particular entities.
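A minimal sketch of the first suggestion, my own structure rather than anything in Tell Me More: persist each recognized entity together with its story metadata as a topic-map-style association.

```python
from dataclasses import dataclass, field

@dataclass
class Association:
    """A topic-map-style association linking an entity to the story it appeared in."""
    association_type: str
    roles: dict = field(default_factory=dict)

def entity_mention(entity, story_url, author, date):
    # The role names are illustrative; a real topic map would use published
    # subject identifiers for both the roles and their players.
    return Association(
        association_type="mentioned-in",
        roles={"entity": entity, "story": story_url, "author": author, "date": date},
    )

print(entity_mention("Jane Doe", "http://example.com/seed-story", "A. Reporter", "2010-10-01"))
```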

September 29, 2010

Natural Language Toolkit

Natural Language Toolkit is a set of Python modules for natural language processing and text analytics. Brought to my attention by Kirk Lowery.

Two near term tasks come to mind:

  • Feature comparison to LingPipe
  • Finding linguistic software useful for topic maps

Suggestions of other toolkits welcome!
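A first taste, which is also a handy starting point for the LingPipe comparison: tokenize, tag and chunk named entities. This assumes the standard NLTK data packages download cleanly on your install.

```python
import nltk

# One-time downloads of the models this example relies on.
for resource in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(resource, quiet=True)

sentence = "The MONK Project includes plays by William Shakespeare from Northwestern University."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)            # part-of-speech tags
tree = nltk.ne_chunk(tagged)             # named-entity chunks as a tree

for subtree in tree.subtrees():
    if subtree.label() != "S":
        print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))
```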

