Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

December 22, 2013

Creating Data from Text…

Filed under: Data Mining,OpenRefine,Text Mining — Patrick Durusau @ 7:42 pm

Creating Data from Text – Regular Expressions in OpenRefine by Tony Hirst.

From the post:

Although data can take many forms, when generating visualisations, running statistical analyses, or simply querying the data so we can have a conversation with it, life is often made much easier by representing the data in a simple tabular form. A typical format would have one row per item and particular columns containing information or values about one specific attribute of the data item. Where column values are text based, rather than numerical items or dates, it can also help if text strings are ‘normalised’, coming from a fixed, controlled vocabulary (such as items selected from a drop down list) or fixed pattern (for example, a UK postcode in its ‘standard’ form with a space separating the two parts of the postcode).

Tables are also quick to spot as data, of course, even if they appear in a web page or PDF document, where we may have to do a little work to get the data as displayed into a table we can actually work with in a spreadsheet or analysis package.

More often than not, however, we come across situations where a data set is effectively encoded into a more rambling piece of text. One of the testbeds I used to use a lot for practising my data skills was Formula One motor sport, and though I’ve largely had a year away from that during 2013, it’s something I hope to return to in 2014. So here’s an example from F1 of recreational data activity that provided a bit of entertainment for me earlier this week. It comes from the VivaF1 blog in the form of a collation of sentences, by Grand Prix, about the penalties issued over the course of each race weekend. (The original data is published via PDF based press releases on the FIA website.)

This is a great step-by-step example of data extraction using regular expressions in OpenRefine.
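To give a flavor of the kind of pattern matching the post walks through (OpenRefine itself uses GREL or Jython expressions; this is just a plain-Python sketch, and the penalty sentence is an invented paraphrase rather than actual VivaF1 text):

```python
import re

# Invented penalty sentence, loosely in the style of the race-weekend summaries.
sentence = "Drive-through penalty for car 11 for speeding in the pit lane."

pattern = re.compile(
    r"(?P<penalty>Drive-through penalty|Five-place grid penalty|Reprimand)"
    r".*?car (?P<car>\d+)"
    r".*?for (?P<reason>.+?)\.",
    re.IGNORECASE,
)

match = pattern.search(sentence)
if match:
    print(match.groupdict())
    # {'penalty': 'Drive-through penalty', 'car': '11', 'reason': 'speeding in the pit lane'}
```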

If you don’t know OpenRefine, you should.

Debating possible or potential semantics is one thing.

Extracting, processing, and discovering the semantics of data is another.

In part because the latter is what most clients are willing to pay for. 😉

PS: The eBook version of Using OpenRefine is on sale now for $5.00 (http://www.packtpub.com/openrefine-guide-for-data-analysis-and-linking-dataset-to-the-web/book). A tweet from Packt Publishing says the sale runs through January 3, 2014.

December 21, 2013

Document visualization: an overview of current research

Filed under: Data Explorer,Graphics,Text Mining,Visualization — Patrick Durusau @ 3:13 pm

Document visualization: an overview of current research by Qihong Gan, Min Zhu, Mingzhao Li, Ting Liang, Yu Cao, Baoyao Zhou.

Abstract:

As the number of sources and quantity of document information explodes, efficient and intuitive visualization tools are desperately needed to assist users in understanding the contents and features of a document, while discovering hidden information. This overview introduces fundamental concepts of and designs for document visualization, a number of representative methods in the field, and challenges as well as promising directions of future development. The focus is on explaining the rationale and characteristics of representative document visualization methods for each category. A discussion of the limitations of our classification and a comparison of reviewed methods are presented at the end. This overview also aims to point out theoretical and practical challenges in document visualization.

The authors evaluate document visualization methods against the following goals:

  • Overview. Gain an overview of the entire collection.
  • Zoom. Zoom in on items of interest.
  • Filter. Filter out uninteresting items.
  • Details-on-demand. Select an item or group and get details when needed.
  • Relate. View relationship among items.
  • History. Keep a history of actions to support undo, replay, and progressive refinement.
  • Extract. Allow extraction of sub-collections and of the query parameters.

A useful review of tools for exploring texts!

December 9, 2013

Highlighting text in text mining

Filed under: R,Text Mining — Patrick Durusau @ 10:35 am

Highlighting text in text mining by Scott Chamberlain.

From the post:

rplos is an R package to facilitate easy search and full-text retrieval from all Public Library of Science (PLOS) articles, and we have a little feature which we aren't sure is useful or not. I don't actually do any text-mining for my research, so perhaps text-mining folks can give some feedback.

You can quickly get a lot of results back using rplos, so perhaps it is useful to quickly browse what you got. What better tool than a browser to browse? Enter highplos and highbrow. highplos uses the Solr capabilities of the PLOS search API, and lets you get back a string with the term you searched for highlighted (by default with <em> tag for italics).

The rplos package has various metric and retrieval functions in addition to its main search function.

A product of the ROpenSci project.

December 5, 2013

TextBlob: Simplified Text Processing

Filed under: Natural Language Processing,Parsing,Text Mining — Patrick Durusau @ 7:31 pm

TextBlob: Simplified Text Processing

From the webpage:

TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

….

TextBlob stands on the giant shoulders of NLTK and pattern, and plays nicely with both.

Features

  • Noun phrase extraction
  • Part-of-speech tagging
  • Sentiment analysis
  • Classification (Naive Bayes, Decision Tree)
  • Language translation and detection powered by Google Translate
  • Tokenization (splitting text into words and sentences)
  • Word and phrase frequencies
  • Parsing
  • n-grams
  • Word inflection (pluralization and singularization) and lemmatization
  • Spelling correction
  • JSON serialization
  • Add new models or languages through extensions
  • WordNet integration

Knowing that TextBlob plays well with NLTK is a big plus!
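A quick taste of the API (a minimal sketch; assumes `pip install textblob` plus the corpora fetched by `python -m textblob.download_corpora`):

```python
from textblob import TextBlob

text = ("TextBlob sits on top of NLTK and pattern. "
        "It makes common NLP tasks pleasantly simple.")
blob = TextBlob(text)

print(blob.tags)          # part-of-speech tags: [('TextBlob', 'NNP'), ...]
print(blob.noun_phrases)  # noun phrase extraction
print(blob.sentiment)     # Sentiment(polarity=..., subjectivity=...)
print(blob.words.count("nlp", case_sensitive=False))  # word frequency -> 1
```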

November 13, 2013

BioCreative Resources (and proceedings)

Filed under: Bioinformatics,Curation,Text Mining — Patrick Durusau @ 7:25 pm

BioCreative Resources (and proceedings)

From the Overview page:

The growing interest in information retrieval (IR), information extraction (IE) and text mining applied to the biological literature is related to the increasing accumulation of scientific literature (PubMed has currently (2005) over 15,000,000 entries) as well as the accelerated discovery of biological information obtained through characterization of biological entities (such as genes and proteins) using high-throughput and large-scale experimental techniques [1].

Computational techniques which process the biomedical literature are useful to enhance the efficient access to relevant textual information for biologists, bioinformaticians as well as for database curators. Many systems have been implemented which address the identification of gene/protein mentions in text or the extraction of text-based protein-protein interactions and of functional annotations using information extraction and text mining approaches [2].

To be able to evaluate performance of existing tools, as well as to allow comparison between different strategies, common evaluation standards as well as data sets are crucial. In the past, most of the implementations have focused on different problems, often using private data sets. As a result, it has been difficult to determine how good the existing systems were or to reproduce the results. It is thus cumbersome to determine whether the systems would scale to real applications, and what performance could be expected using a different evaluation data set [3-4].

The importance of assessing and comparing different computational methods has been realized previously by both the bioinformatics and the NLP communities. Researchers in natural language processing (NLP) and information extraction (IE) have, for many years now, used common evaluations to accelerate their research progress, e.g., via the Message Understanding Conferences (MUCs) [5] and the Text Retrieval Conferences (TREC) [6]. This not only resulted in the formulation of common goals but also made it possible to compare different systems and gave a certain transparency to the field. With the introduction of a common evaluation and standardized evaluation metrics, it has become possible to compare approaches, to assess what techniques did and did not work, and to make progress. This progress has resulted in the creation of standard tools available to the general research community.

The field of bioinformatics also has a tradition of competitions, for example, in protein structure prediction (CASP [7]) or gene predictions in entire genomes (at the “Genome Based Gene Structure Determination” symposium held on the Wellcome Trust Genome Campus).

There has been a lot of activity in the field of text mining in biology, including sessions at the Pacific Symposium on Biocomputing (PSB [8]), the Intelligent Systems for Molecular Biology (ISMB) and European Conference on Computational Biology (ECCB) conferences [9] as well as workshops and sessions on language and biology in computational linguistics (the Association of Computational Linguistics BioNLP SIGs).

A small number of complementary evaluations of text mining systems in biology have been recently carried out, starting with the KDD cup [10] and the genomics track at the TREC conference [11]. Therefore we decided to set up the first BioCreAtIvE challenge which was concerned with the identification of gene mentions in text [12], to link texts to actual gene entries, as provided by existing biological databases, [13] as well as extraction of human gene product (Gene Ontology) annotations from full text articles [14]. The success of this first challenge evaluation as well as the lessons learned from it motivated us to carry out the second BioCreAtIvE, which should allow us to monitor improvements and build on the experience and data derived from the first BioCreAtIvE challenge. As in the previous BioCreAtIvE, the main focus is on biologically relevant tasks, which should result in benefits for the biomedical text mining community, the biology and biological database community, as well as the bioinformatics community.

A gold mine of resources if you are interested in bioinformatics, curation or IR in general.

Including the BioCreative Proceedings for 2013:

BioCreative IV Proceedings vol. 1

BioCreative IV Proceedings vol. 2

October 29, 2013

Useful Unix/Linux One-Liners for Bioinformatics

Filed under: Bioinformatics,Linux OS,Text Mining,Uncategorized — Patrick Durusau @ 6:36 pm

Useful Unix/Linux One-Liners for Bioinformatics by Stephen Turner.

From the post:

Much of the work that bioinformaticians do is munging and wrangling around massive amounts of text. While there are some “standardized” file formats (FASTQ, SAM, VCF, etc.) and some tools for manipulating them (fastx toolkit, samtools, vcftools, etc.), there are still times where knowing a little bit of Unix/Linux is extremely helpful, namely awk, sed, cut, grep, GNU parallel, and others.

This is by no means an exhaustive catalog, but I’ve put together a short list of examples using various Unix/Linux utilities for text manipulation, from the very basic (e.g., sum a column) to the very advanced (munge a FASTQ file and print the total number of reads, total number of unique reads, percentage of unique reads, most abundant sequence, and its frequency). Most of these examples (with the exception of the SeqTK examples) use built-in utilities installed on nearly every Linux system. These examples are a combination of tactics I used every day and examples culled from other sources listed at the top of the page.

What one-liners do you have lying about?

For what data sets?

October 28, 2013

Theory and Applications for Advanced Text Mining

Filed under: Text Analytics,Text Mining — Patrick Durusau @ 7:30 pm

Theory and Applications for Advanced Text Mining edited by Shigeaki Sakurai.

From the post:

Book chapters include:

  • Survey on Kernel-Based Relation Extraction by Hanmin Jung, Sung-Pil Choi, Seungwoo Lee and Sa-Kwang Song
  • Analysis for Finding Innovative Concepts Based on Temporal Patterns of Terms in Documents by Hidenao Abe
  • Text Clumping for Technical Intelligence by Alan Porter and Yi Zhang
  • A Semantic-Based Framework for Summarization and Page Segmentation in Web Mining by Alessio Leoncini, Fabio Sangiacomo, Paolo Gastaldo and Rodolfo Zunino
  • Ontology Learning Using Word Net Lexical Expansion and Text Mining by Hiep Luong, Susan Gauch and Qiang Wang
  • Automatic Compilation of Travel Information from Texts: A Survey by Hidetsugu Nanba, Aya Ishino and Toshiyuki Takezawa
  • Analyses on Text Data Related to the Safety of Drug Use Based on Text Mining Techniques by Masaomi Kimura
  • Biomedical Named Entity Recognition: A Survey of Machine-Learning Tools by David Campos, Sergio Matos and Jose Luis Oliveira
  • Toward Computational Processing of Less Resourced Languages: Primarily Experiments for Moroccan Amazigh Language by Fadoua Ataa Allah and Siham Boulaknadel

Download the book or the chapters at:

www.intechopen.com/books/theory-and-applications-for-advanced-text-mining

Is it just me, or have more data mining/analysis books been appearing as open texts alongside traditional print publication than, say, five years ago?

October 4, 2013

The Irony of Obamacare:…

Filed under: Politics,Text Analytics,Text Mining — Patrick Durusau @ 3:49 pm

The Irony of Obamacare: Republicans Thought of It First by Meghan Foley.

From the post:

“An irony of the Patient Protection and Affordable Care Act (Obamacare) is that one of its key provisions, the individual insurance mandate, has conservative origins. In Congress, the requirement that individuals purchase health insurance first emerged in Republican health care reform bills introduced in 1993 as alternatives to the Clinton plan. The mandate was also a prominent feature of the Massachusetts plan passed under Governor Mitt Romney in 2006. According to Romney, ‘we got the idea of an individual mandate from [Newt Gingrich], and [Newt] got it from the Heritage Foundation.’” – Tracing the Flow of Policy Ideas in Legislatures: A Text Reuse Approach

That irony led John Wilkerson of the University of Washington and his colleagues David Smith and Nick Stramp to study the legislative history of the health care reform law using a text-analysis system to understand its origins.

Scholars rely almost exclusively on floor roll call voting patterns to assess partisan cooperation in Congress, according to findings in the paper, Tracing the Flow of Policy Ideas in Legislatures: A Text Reuse Approach. By that standard, the Affordable Care Act was a highly partisan bill. Yet a different story emerges when the source of the reform’s policy is analyzed. The authors’ findings showed that a number of GOP policy ideas overlap with provisions in the Affordable Care Act: Of the 906-page law, 3 percent of the “policy ideas” used wording similar to bills sponsored by House Republicans and 8 percent used wording similar to bills sponsored by Senate Republicans.

In the paper, the authors say:

Our approach is to focus on legislative text. We assume that two bills share a policy idea when they share similar text. Of course, this raises many questions about whether similar text does actually capture shared policy ideas. This paper constitutes an early cut at the question.

The same thinking, similar text = similar ideas, permeates prior art searches on patents as well.

A more fruitful search would be of donor statements, proposals, and literature for similar language/ideas.

In that regard, members of the United States Congress are just messengers.
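For the curious, the core of a “similar text = similar ideas” measure can be sketched with word shingles and Jaccard similarity (a toy illustration with invented bill language, not the authors’ actual pipeline):

```python
def ngram_set(text, n=4):
    """Word n-grams (shingles) of a passage, lower-cased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b, n=4):
    """Share of n-grams two passages have in common (0.0 to 1.0)."""
    sa, sb = ngram_set(a, n), ngram_set(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Invented, simplified bill language for illustration only.
bill_a = "each individual shall ensure that the individual is covered under minimum essential coverage"
bill_b = "an individual shall ensure that the individual is covered under a qualified health plan"
print(round(jaccard(bill_a, bill_b), 3))   # -> 0.4
```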

PS: Thanks to Sam Hunting for the pointer to this article!

August 31, 2013

textfsm

Filed under: Parsing,State Machine,Text Mining — Patrick Durusau @ 6:20 pm

textfsm

From the webpage:

Python module which implements a template based state machine for parsing semi-formatted text. Originally developed to allow programmatic access to information returned from the command line interface (CLI) of networking devices.

TextFSM was developed internally at Google and released under the Apache 2.0 licence for the benefit of the wider community.

See: TextFSMHowto for details.

TextFSM looks like a useful Python module for extracting data from “semi-structured” text.
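A minimal sketch of the template idea (the template syntax is TextFSM’s; the interface lines below are invented sample input):

```python
import io
import textfsm

TEMPLATE = """\
Value INTERFACE (\\S+)
Value STATUS (up|down)

Start
  ^${INTERFACE} is ${STATUS} -> Record
"""

# Invented CLI-style output, just to exercise the template.
RAW = """\
GigabitEthernet0/1 is up
GigabitEthernet0/2 is down
"""

fsm = textfsm.TextFSM(io.StringIO(TEMPLATE))
print(fsm.header)           # ['INTERFACE', 'STATUS']
print(fsm.ParseText(RAW))   # [['GigabitEthernet0/1', 'up'], ['GigabitEthernet0/2', 'down']]
```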

I first saw this in Nat Torkington’s Four short links: 29 August 2013.

August 17, 2013

frak

Filed under: Regex,Regexes,Text Mining — Patrick Durusau @ 1:57 pm

frak

From the webpage:

frak transforms collections of strings into regular expressions for matching those strings. The primary goal of this library is to generate regular expressions from a known set of inputs which avoid backtracking as much as possible.

This looks quite useful for text mining.

A large amount of which is on the near horizon.
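frak itself is a Clojure library, but the underlying idea (collapse a set of literal strings into one trie-shaped regular expression) is easy to sketch in Python. This is a rough toy without frak’s backtracking-avoidance tuning:

```python
import re

def trie_regex(words):
    """Build a single regex source string that matches exactly the given strings."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}                      # end-of-word marker

    def to_pattern(node):
        ends_here = "" in node
        parts = [re.escape(ch) + to_pattern(child)
                 for ch, child in sorted(node.items()) if ch != ""]
        if not parts:
            return ""
        alt = parts[0] if len(parts) == 1 else "(?:" + "|".join(parts) + ")"
        return "(?:" + alt + ")?" if ends_here else alt

    return to_pattern(trie)

pattern = re.compile(r"\b" + trie_regex(["cat", "cats", "cart", "dog"]) + r"\b")
print(pattern.pattern)                                # \b(?:ca(?:rt|t(?:s)?)|dog)\b
print(pattern.findall("dogs chase cats, not carts"))  # ['cats']
```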

I first saw this in Nat Torkington’s Four short links: 16 August 2013.

June 11, 2013

Orthogonal Range Searching for Text Indexing

Filed under: Indexing,Text Mining — Patrick Durusau @ 10:32 am

Orthogonal Range Searching for Text Indexing by Moshe Lewenstein.

Abstract:

Text indexing, the problem in which one desires to preprocess a (usually large) text for future (shorter) queries, has been researched ever since the suffix tree was invented in the early 70’s. With textual data continuing to increase and with changes in the way it is accessed, new data structures and new algorithmic methods are continuously required. Therefore, text indexing is of utmost importance and is a very active research domain.

Orthogonal range searching, classically associated with the computational geometry community, is one of the tools that has increasingly become important for various text indexing applications. Initially, in the mid 90’s there were a couple of results recognizing this connection. In the last few years we have seen an increase in use of this method and are reaching a deeper understanding of the range searching uses for text indexing.

From the paper:

Orthogonal range searching refers to the preprocessing of a collection of points in d-dimensional space to allow queries on ranges defined by rectangles whose sides are aligned with the coordinate axes (orthogonal).

If you are not already familiar with this area, you may find Lecture 11: Orthogonal Range Searching useful.
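To make the query model concrete, here is a deliberately naive Python sketch of an orthogonal range query (my illustration, not from the paper; real structures such as range trees answer these queries in polylogarithmic time):

```python
from bisect import bisect_left, bisect_right

class RangeIndex2D:
    """Toy orthogonal range search: sort by x, binary-search the x-range,
    then scan for y. Range trees or wavelet trees do this far faster."""

    def __init__(self, points):
        self.points = sorted(points)
        self.xs = [x for x, _ in self.points]

    def query(self, x1, x2, y1, y2):
        lo = bisect_left(self.xs, x1)
        hi = bisect_right(self.xs, x2)
        return [(x, y) for x, y in self.points[lo:hi] if y1 <= y <= y2]

# In text indexing the points are often (position in text, rank of the suffix
# starting there), so a rectangle query asks for pattern occurrences that fall
# inside a region of the text.
idx = RangeIndex2D([(1, 5), (3, 2), (4, 9), (7, 4)])
print(idx.query(2, 7, 1, 5))   # -> [(3, 2), (7, 4)]
```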

In a very real sense, indexing, as in a human indexer, lies at the heart of topic maps.

A human indexer recognizes synonyms, relationships represented by synonyms and distinguishes other uses of identifiers.

Topic maps are an effort to record that process so it can be followed mechanically by a calculator.

Mechanical indexing is a powerful tool in the hands of a human indexer, whether working on a traditional index or its successor, a topic map.

What type of mechanical indexing are you using?

June 4, 2013

Textual Processing of Legal Cases

Filed under: Annotation,Law - Sources,Text Mining — Patrick Durusau @ 2:05 pm

Textual Processing of Legal Cases by Adam Wyner.

A presentation on Adam’s Crowdsourced Legal Case Annotation project.

Very useful if you are interested in guidance on legal case annotation.

Of course I see the UI as using topics behind the UI’s identifications and associations between those topics.

But none of that has to be exposed to the user.

May 13, 2013

How to Build a Text Mining, Machine Learning….

Filed under: Document Classification,Machine Learning,R,Text Mining — Patrick Durusau @ 3:51 pm

How to Build a Text Mining, Machine Learning Document Classification System in R! by Timothy D’Auria.

From the description:

We show how to build a machine learning document classification system from scratch in less than 30 minutes using R. We use a text mining approach to identify the speaker of unmarked presidential campaign speeches. Applications in brand management, auditing, fraud detection, electronic medical records, and more.

Well made video introduction to R and text mining.

April 27, 2013

Extracting and connecting chemical structures…

Filed under: Cheminformatics,Data Mining,Text Mining — Patrick Durusau @ 6:00 pm

Extracting and connecting chemical structures from text sources using chemicalize.org by Christopher Southan and Andras Stracz.

Abstract:

Background

Exploring bioactive chemistry requires navigating between structures and data from a variety of text-based sources. While PubChem currently includes approximately 16 million document-extracted structures (15 million from patents) the extent of public inter-document and document-to-database links is still well below any estimated total, especially for journal articles. A major expansion in access to text-entombed chemistry is enabled by chemicalize.org. This on-line resource can process IUPAC names, SMILES, InChI strings, CAS numbers and drug names from pasted text, PDFs or URLs to generate structures, calculate properties and launch searches. Here, we explore its utility for answering questions related to chemical structures in documents and where these overlap with database records. These aspects are illustrated using a common theme of Dipeptidyl Peptidase 4 (DPPIV) inhibitors.

Results

Full-text open URL sources facilitated the download of over 1400 structures from a DPPIV patent and the alignment of specific examples with IC50 data. Uploading the SMILES to PubChem revealed extensive linking to patents and papers, including prior submissions from chemicalize.org as submitting source. A DPPIV medicinal chemistry paper was completely extracted and structures were aligned to the activity results table, as well as linked to other documents via PubChem. In both cases, key structures with data were partitioned from common chemistry by dividing them into individual new PDFs for conversion. Over 500 structures were also extracted from a batch of PubMed abstracts related to DPPIV inhibition. The drug structures could be stepped through each text occurrence and included some converted MeSH-only IUPAC names not linked in PubChem. Performing set intersections proved effective for detecting compounds-in-common between documents and/or merged extractions.

Conclusion

This work demonstrates the utility of chemicalize.org for the exploration of chemical structure connectivity between documents and databases, including structure searches in PubChem, InChIKey searches in Google and the chemicalize.org archive. It has the flexibility to extract text from any internal, external or Web source. It synergizes with other open tools and the application is undergoing continued development. It should thus facilitate progress in medicinal chemistry, chemical biology and other bioactive chemistry domains.

A great example of building a resource to address identity issues in a specific domain.

The result speaks for itself.

PS: The results were not delayed awaiting a reformation of chemistry to use a common identifier.

April 21, 2013

PLOS Text Mining Collection

Filed under: Data Mining,Text Mining — Patrick Durusau @ 3:43 pm

The PLOS Text Mining Collection has launched!

From the webpage:

Across all realms of the sciences and beyond, the rapid growth in the number of works published digitally presents new challenges and opportunities for making sense of this wealth of textual information. The maturing field of Text Mining aims to solve problems concerning the retrieval, extraction and analysis of unstructured information in digital text, and to revolutionize how scientists access and interpret data that might otherwise remain buried in the literature.

Here PLOS acknowledges the growing body of work in the area of Text Mining by bringing together major reviews and new research studies published in PLOS journals to create the PLOS Text Mining Collection. It is no coincidence that research in Text Mining in PLOS journals is burgeoning: the widespread uptake of the Open Access publishing model developed by PLOS and other publishers now makes it easier than ever to obtain, mine and redistribute data from published texts. The launch of the PLOS Text Mining Collection complements related PLOS Collections on Open Access and Altmetrics, and further underscores the importance of the PLOS Application Programming Interface, which provides an open source interface with which to mine PLOS journal content.

The Collection is now open across the PLOS journals to all authors who wish to submit research or reviews in this area. Articles are presented below in order of publication date and new articles will be added to the Collection as they are published.

An impressive start to what promises to be a very rich resource!

I first saw this at: New: PLOS Text Mining.

March 18, 2013

…2,958 Nineteenth-Century British Novels

Filed under: Literature,Text Analytics,Text Mining — Patrick Durusau @ 10:27 am

A Quantitative Literary History of 2,958 Nineteenth-Century British Novels: The Semantic Cohort Method by Ryan Heuser and Long Le-Khac.

From the introduction:

The nineteenth century in Britain saw tumultuous changes that reshaped the fabric of society and altered the course of modernization. It also saw the rise of the novel to the height of its cultural power as the most important literary form of the period. This paper reports on a long-term experiment in tracing such macroscopic changes in the novel during this crucial period. Specifically, we present findings on two interrelated transformations in novelistic language that reveal a systemic concretization in language and fundamental change in the social spaces of the novel. We show how these shifts have consequences for setting, characterization, and narration as well as implications for the responsiveness of the novel to the dramatic changes in British society.

This paper has a second strand as well. This project was simultaneously an experiment in developing quantitative and computational methods for tracing changes in literary language. We wanted to see how far quantifiable features such as word usage could be pushed toward the investigation of literary history. Could we leverage quantitative methods in ways that respect the nuance and complexity we value in the humanities? To this end, we present a second set of results, the techniques and methodological lessons gained in the course of designing and running this project.

This branch of the digital humanities, the macroscopic study of cultural history, is a field that is still constructing itself. The right methods and tools are not yet certain, which makes for the excitement and difficulty of the research. We found that such decisions about process cannot be made a priori, but emerge in the messy and non-linear process of working through the research, solving problems as they arise. From this comes the odd, narrative form of this paper, which aims to present the twists and turns of this process of literary and methodological insight. We have divided the paper into two major parts, the development of the methodology (Sections 1 through 3) and the story of our results (Sections 4 and 5). In actuality, these two processes occurred simultaneously; pursuing our literary-historical questions necessitated developing new methodologies. But for the sake of clarity, we present them as separate though intimately related strands.

If this sounds far afield from mining tweets, emails, corporate documents or government archives, can you articulate the difference?

Or do we reflexively treat some genres of texts as “different?”

How useful you will find some of the techniques outlined will depend on the purpose of your analysis.

If you are only doing key-word searching, this isn’t likely to be helpful.

If on the other hand, you are attempting more sophisticated analysis, read on!

I first saw this in Nat Torkington’s Four Short Links: 18 March 2013.

March 16, 2013

Finding Shakespeare’s Favourite Words With Data Explorer

Filed under: Data Explorer,Data Mining,Excel,Microsoft,Text Mining — Patrick Durusau @ 2:07 pm

Finding Shakespeare’s Favourite Words With Data Explorer by Chris Webb.

From the post:

The more I play with Data Explorer, the more I think my initial assessment of it as a self-service ETL tool was wrong. As Jamie pointed out recently, it’s really the M language with a GUI on top of it and the GUI itself, while good, doesn’t begin to expose the power of the underlying language: I’d urge you to take a look at the Formula Language Specification and Library Specification documents which can be downloaded from here to see for yourself. So while it can certainly be used for self-service ETL it can do much, much more than that…

In this post I’ll show you an example of what Data Explorer can do once you go beyond the UI. Starting off with a text file containing the complete works of William Shakespeare (which can be downloaded from here – it’s strange to think that it’s just a 5.3 MB text file) I’m going to find the top 100 most frequently used words and display them in a table in Excel.

If Data Explorer is a GUI on top of M (outdated but a point of origin), it goes up in importance.

From the M link:

The Microsoft code name “M” Modeling Language, hereinafter referred to as M, is a language for modeling domains using text. A domain is any collection of related concepts or objects. Modeling a domain consists of selecting certain characteristics to include in the model and implicitly excluding others deemed irrelevant. Modeling using text has some advantages and disadvantages over modeling using other media such as diagrams or clay. A goal of the M language is to exploit these advantages and mitigate the disadvantages.

A key advantage of modeling in text is ease with which both computers and humans can store and process text. Text is often the most natural way to represent information for presentation and editing by people. However, the ability to extract that information for use by software has been an arcane art practiced only by the most advanced developers. The language feature of M enables information to be represented in a textual form that is tuned for both the problem domain and the target audience. The M language provides simple constructs for describing the shape of a textual language – that shape includes the input syntax as well as the structure and contents of the underlying information. To that end, M acts as both a schema language that can validate that textual input conforms to a given language as well as a transformation language that projects textual input into data structures that are amenable to further processing or storage.

I try to not run examples using Shakespeare. I get distracted by the elegance of the text, which isn’t the point of the exercise. 😉
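For comparison, the top-100 word count itself (leaving aside the Data Explorer/M machinery the post is really about) is a few lines of Python; the file path below is a stand-in for wherever you saved the 5.3 MB text file:

```python
import re
from collections import Counter

# Hypothetical local path to the complete-works text file.
with open("shakespeare.txt", encoding="utf-8") as f:
    text = f.read().lower()

words = re.findall(r"[a-z']+", text)        # crude tokenization
top100 = Counter(words).most_common(100)

for word, count in top100[:10]:             # peek at the first ten
    print(f"{word}\t{count}")
```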

March 15, 2013

Document Mining with Overview:…

Filed under: Document Classification,Document Management,News,Reporting,Text Mining — Patrick Durusau @ 5:24 pm

Document Mining with Overview:… A Digital Tools Tutorial by Jonathan Stray.

The slides from the Overview presentation I mentioned yesterday.

One of the few webinars I have ever attended where nodding off was not a problem! Interesting stuff.

It is designed for the use case where there “…is too much material to read on deadline.”

A cross between document mining and document management.

A cross that hides a lot of the complexity from the user.

Definitely a project to watch.

March 14, 2013

Visualizing the Topical Structure of the Medical Sciences:…

Filed under: Medical Informatics,PubMed,Self Organizing Maps (SOMs),Text Mining — Patrick Durusau @ 2:48 pm

Visualizing the Topical Structure of the Medical Sciences: A Self-Organizing Map Approach by André Skupin, Joseph R. Biberstine, Katy Börner. (Skupin A, Biberstine JR, Börner K (2013) Visualizing the Topical Structure of the Medical Sciences: A Self-Organizing Map Approach. PLoS ONE 8(3): e58779. doi:10.1371/journal.pone.0058779)

Abstract:

Background

We implement a high-resolution visualization of the medical knowledge domain using the self-organizing map (SOM) method, based on a corpus of over two million publications. While self-organizing maps have been used for document visualization for some time, (1) little is known about how to deal with truly large document collections in conjunction with a large number of SOM neurons, (2) post-training geometric and semiotic transformations of the SOM tend to be limited, and (3) no user studies have been conducted with domain experts to validate the utility and readability of the resulting visualizations. Our study makes key contributions to all of these issues.

Methodology

Documents extracted from Medline and Scopus are analyzed on the basis of indexer-assigned MeSH terms. Initial dimensionality is reduced to include only the top 10% most frequent terms and the resulting document vectors are then used to train a large SOM consisting of over 75,000 neurons. The resulting two-dimensional model of the high-dimensional input space is then transformed into a large-format map by using geographic information system (GIS) techniques and cartographic design principles. This map is then annotated and evaluated by ten experts stemming from the biomedical and other domains.

Conclusions

Study results demonstrate that it is possible to transform a very large document corpus into a map that is visually engaging and conceptually stimulating to subject experts from both inside and outside of the particular knowledge domain. The challenges of dealing with a truly large corpus come to the fore and require embracing parallelization and use of supercomputing resources to solve otherwise intractable computational tasks. Among the envisaged future efforts are the creation of a highly interactive interface and the elaboration of the notion of this map of medicine acting as a base map, onto which other knowledge artifacts could be overlaid.

Impressive work to say the least!
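For readers new to the method, here is a toy self-organizing map in Python/NumPy (my own minimal sketch on random vectors, nothing like the 75,000-neuron, MeSH-based, GIS-rendered map the paper describes):

```python
import numpy as np

def train_som(data, grid=(10, 10), iters=1000, lr0=0.5, sigma0=3.0, seed=0):
    """Tiny SOM: data is (n_docs, n_terms); returns a (h, w, n_terms) weight grid."""
    rng = np.random.default_rng(seed)
    h, w = grid
    weights = rng.random((h, w, data.shape[1]))
    ii, jj = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    for t in range(iters):
        x = data[rng.integers(len(data))]
        # Best-matching unit for this document vector.
        d = np.linalg.norm(weights - x, axis=2)
        bi, bj = np.unravel_index(np.argmin(d), d.shape)
        # Decaying learning rate and neighbourhood radius.
        lr = lr0 * np.exp(-t / iters)
        sigma = sigma0 * np.exp(-t / iters)
        influence = np.exp(-((ii - bi) ** 2 + (jj - bj) ** 2) / (2 * sigma ** 2))
        weights += lr * influence[..., None] * (x - weights)
    return weights

# 200 fake "documents" over 50 terms; real input would be MeSH-term vectors.
docs = np.random.default_rng(1).random((200, 50))
print(train_som(docs).shape)   # (10, 10, 50)
```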

But I was just as impressed by the future avenues for research:

Controlled Vocabularies

It appears that the use of indexer-chosen keywords, including in the case of a large controlled vocabulary-MeSH terms in this study-raises interesting questions. The rank transition diagram in particular helped to highlight the fact that different vocabulary items play different roles in indexers’ attempts to characterize the content of specific publications. The complex interplay of hierarchical relationships and functional roles of MeSH terms deserves further investigation, which may inform future efforts of how specific terms are handled in computational analysis. For example, models constructed from terms occurring at intermediate levels of the MeSH hierarchy might look and function quite different from the top-level model presented here.

User-centered Studies

Future user studies will include term differentiation tasks to help us understand whether/how users can differentiate senses of terms on the self-organizing map. When a term appears prominently in multiple places, that indicates multiple senses or contexts for that term. One study might involve subjects being shown two regions within which a particular label term appears and the abstracts of several papers containing that term. Subjects would then be asked to rate each abstract along a continuum between two extremes formed by the two senses/contexts. Studies like that will help us evaluate how understandable the local structure of the map is.

There are other, equally interesting future research questions but those are the two of most interest to me.

I take this research as evidence that managing semantic diversity is going to require human effort, augmented by automated means.

I first saw this in Nat Torkington’s Four short links: 13 March 2013.

Document Mining with Overview:… [Webinar – March 15, 2013]

Filed under: News,Reporting,Text Mining — Patrick Durusau @ 9:34 am

Document Mining with Overview: A Digital Tools Tutorial

From the post:

Friday, March 15, 2013 at 2:00pm Eastern Time

Overview is a free tool for journalists that automatically organizes a large set of documents by topic, and displays them in an interactive visualization for exploration, tagging, and reporting. Journalists have already used it to report on FOIA document dumps, emails, leaks, archives, and social media data. In fact it will work on any set of documents that is mostly text. It integrates with DocumentCloud and can import your projects, or you can upload data directly in CSV form.

You can’t read 10,000 pages on deadline, but Overview can help you rapidly figure out which pages are the important ones — even if you’re not sure what you’re looking for.

This training event is part of a series on digital tools in partnership with the American Press Institute and The Poynter Institute, funded by the John S. and James L. Knight Foundation.

See more tools in the Digital Tools Catalog.

I have been meaning to learn more about “Overview” and this looks like a good opportunity.

February 28, 2013

Voyeur Tools: See Through Your Texts

Filed under: Text Mining,Texts,Visualization,Voyeur — Patrick Durusau @ 5:26 pm

Voyeur Tools: See Through Your Texts

From the website:

Voyeur is a web-based text analysis environment. It is designed to be user-friendly, flexible and powerful. Voyeur is part of the Hermeneuti.ca, a collaborative project to develop and theorize text analysis tools and text analysis rhetoric. This section of the Hermeneuti.ca web site provides information and documentation for users and developers of Voyeur.

What you can do with Voyeur:

  • use texts in a variety of formats including plain text, HTML, XML, PDF, RTF and MS Word
  • use texts from different locations, including URLs and uploaded files
  • perform lexical analysis including the study of frequency and distribution data; in particular
  • export data into other tools (as XML, tab separated values, etc.)
  • embed live tools into remote web sites that can accompany or complement your own content

One of the tools used in the Lincoln Logarithms project.

Lincoln Logarithms: Finding Meaning in Sermons

Filed under: MALLET,Natural Language Processing,Text Corpus,Text Mining — Patrick Durusau @ 1:31 pm

Lincoln Logarithms: Finding Meaning in Sermons

From the webpage:

Just after his death, Abraham Lincoln was hailed as a luminary, martyr, and divine messenger. We wondered if using digital tools to analyze a digitized collection of elegiac sermons might uncover patterns or new insights about his memorialization.

We explored the power and possibility of four digital tools—MALLET, Voyant, Paper Machines, and Viewshare. MALLET, Paper Machines, and Voyant all examine text. They show how words are arranged in texts, their frequency, and their proximity. Voyant and Paper Machines also allow users to make visualizations of word patterns. Viewshare allows users to create timelines, maps, and charts of bodies of material. In this project, we wanted to experiment with understanding what these tools, which are in part created to reveal, could and could not show us in a small, but rich corpus. What we have produced is an exploration of the possibilities and the constraints of these tools as applied to this collection.

The resulting digital collection: The Martyred President: Sermons Given on the Assassination of President Lincoln.

Let’s say this is not an “ahistorical” view. 😉

Good example of exploring “unstructured” data.

A first step before authoring a topic map.
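MALLET is a Java toolkit, but the same topic-modeling idea is easy to try from Python with gensim (a toy sketch; the token lists are invented stand-ins for tokenized, stop-word-filtered sermons):

```python
from gensim import corpora, models

# Invented mini-corpus standing in for the sermon collection.
docs = [
    ["lincoln", "martyr", "nation", "grief"],
    ["union", "war", "nation", "peace"],
    ["providence", "divine", "martyr", "lincoln"],
    ["war", "slavery", "union", "freedom"],
]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=10, random_state=42)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```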

February 27, 2013

An Interactive Analysis of Tolkien’s Works

Filed under: Graphics,Literature,Text Analytics,Text Mining,Visualization — Patrick Durusau @ 2:54 pm

An Interactive Analysis of Tolkien’s Works by Emil Johansson.

Description:

Being passionate about both Tolkien and data visualization creating an interactive analysis of Tolkien’s books seemed like a wonderful idea. To the left you will be able to explore character mentions and keyword frequency as well as sentiment analysis of the Silmarillion, the Hobbit and the Lord of the Rings. Information on editions of the books and methods used can be found in the about section.

There you will find:

WORD COUNT AND DENSITY
CHARACTER MENTIONS
KEYWORD FREQUENCY
COMMON WORDS
SENTIMENT ANALYSIS
CHARACTER CO-OCCURRENCE
CHAPTER LENGTHS
WORD APPEARANCE
POSTERS

Truly remarkable analysis and visualization!

I suspect users of this portal don’t wonder so much about “how” it is done, but concentrate on the benefits it brings.

Does that sound like a marketing idea for topic maps?

I first saw this in the DashingD3js.com Weekly Newsletter.

February 17, 2013

Text Processing (part 1) : Entity Recognition

Filed under: Entity Extraction,Text Mining — Patrick Durusau @ 8:17 pm

Text Processing (part 1) : Entity Recognition by Ricky Ho.

From the post:

Entity recognition is commonly used to parse unstructured text documents and extract useful entity information (like location, person, brand) to construct a more useful structured representation. It is one of the most common text processing tasks for understanding a text document.

I am planning to write a blog series on text processing. In this first blog of a series on basic text processing algorithms, I will introduce some basic algorithms for entity recognition.

Looking forward to this series!
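While waiting for the series, an easy way to experiment is NLTK’s off-the-shelf named entity chunker (a minimal sketch, not Ricky’s code; assumes the standard NLTK data packages have been downloaded, and the sentence is invented):

```python
import nltk

# One-time downloads, commented out after the first run:
# for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
#     nltk.download(pkg)

sentence = "Abraham Lincoln was memorialized in sermons across New York and Boston."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged)    # labels chunks PERSON, GPE, ORGANIZATION, ...

for subtree in tree:
    if hasattr(subtree, "label"):             # entity chunks are subtrees
        entity = " ".join(tok for tok, pos in subtree.leaves())
        print(subtree.label(), "->", entity)  # e.g. PERSON -> Abraham Lincoln
```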

February 3, 2013

Text as Data:…

Filed under: Data Analysis,Text Analytics,Text Mining,Texts — Patrick Durusau @ 6:58 pm

Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts by Justin Grimmer and Brandon M. Stewart.

Abstract:

Politics and political conflict often occur in the written and spoken word. Scholars have long recognized this, but the massive costs of analyzing even moderately sized collections of texts have hindered their use in political science research. Here lies the promise of automated text analysis: it substantially reduces the costs of analyzing large collections of text. We provide a guide to this exciting new area of research and show how, in many instances, the methods have already obtained part of their promise. But there are pitfalls to using automated methods—they are no substitute for careful thought and close reading and require extensive and problem-specific validation. We survey a wide range of new methods, provide guidance on how to validate the output of the models, and clarify misconceptions and errors in the literature. To conclude, we argue that for automated text methods to become a standard tool for political scientists, methodologists must contribute new methods and new methods of validation.

As a former political science major, I had to stop to read this article.

A wide-ranging survey of an “exciting new area of research,” but I remember content/text analysis as an undergraduate, north of forty years ago now.

True, some of the measures are new, along with better visualization techniques.

On the other hand, many of the problems of textual analysis now were the problems in textual analysis then (and before).

Highly recommended as a survey of current techniques.

A history of the “problems” of textual analysis and their resistance to various techniques will have to await another day.

February 2, 2013

Comment Visualization

Filed under: Text Mining,Texts,Visualization — Patrick Durusau @ 3:09 pm

New from Juice Labs: A visualization tool for exploring text data by Zach Gemignani.

From the post:

Today we are pleased to release another free tool on Juice Labs. The Comment visualization is the perfect way to explore qualitative data like text survey responses, tweets, or product reviews. A few of the fun features:

  • Color comments based on a selected value
  • Filter comments using an interactive distribution chart at the top
  • Highlight the most interesting comments by selecting the flags in the upper right
  • Show the author and other contextual information about a comment

[skipping the lamest Wikipedia edits example]

Like our other free visualization tools in Juice Labs, the Comments visualization is designed for ease of use and sharing. Just drop in your own data, choose what fields you want to show as text and as values, and the visualization will immediately reflect your choices. The save button gives you a link that includes your data and settings.

Apparently the interface starts with the lamest Wikipedia edit data.

To change that, you have to scroll down to the Data window and hover over “Learn how.”

I have reformatted the how-to content here:

Put any comma delimited data in this box. The first row needs to contain the column names. Then, give us some hints on how to use your data.

[Pre-set column names]

[*] Use this column as the question.

[a] Use this column as the author.

[cby] Use this column to color the comments. Should be a metric. By default, the comments will be sorted in ascending order.

[-] Sort the comments in descending order of the metric value. Can only be used with [cby]

[c] Use this column as a context.

Tip: you can combine the hints like: [c-cby]

Could be an interesting tool for quick and dirty exploration of textual content.

January 24, 2013

What tools do you use for information gathering and publishing?

Filed under: Data Mining,Publishing,Text Mining — Patrick Durusau @ 8:07 pm

What tools do you use for information gathering and publishing? by Mac Slocum.

From the post:

Many apps claim to be the pinnacle of content consumption and distribution. Most are a tangle of silly names and bad interfaces, but some of these tools are useful. A few are downright empowering.

Finding those good ones is the tricky part. I queried O’Reilly colleagues to find out what they use and why, and that process offered a decent starting point. We put all our notes together into this public Hackpad — feel free to add to it. I also went through and plucked out some of the top choices. Those are posted below.

Information gathering, however humble it may be, is the start of any topic map authoring project.

Mac asks for the tools you use every week.

Let’s not disappoint him!

January 20, 2013

silenc: Removing the silent letters from a body of text

Filed under: Graphics,Text Analytics,Text Mining,Texts,Visualization — Patrick Durusau @ 8:05 pm

silenc: Removing the silent letters from a body of text by Nathan Yau.

From the post:

During a two-week visualization course, Momo Miyazaki, Manas Karambelkar, and Kenneth Aleksander Robertsen imagined what a body of text would be without the silent letters in silenc.

Nathan suggests it isn’t fancy on the analysis side, but the views are interesting.

True enough that removing silent letters (once mapped) isn’t difficult, but the results of the technique may be more than just visually interesting.

Usage patterns of words with silent letters would be an interesting question.

Or extending the technique to remove all adjectives from a text (that would shorten ad copy).

“Seeing” text or data from a different or unexpected perspective can lead to new insights. Some useful, some less so.

But it is the job of analysis to sort them out.

Interactive Text Mining

Filed under: Annotation,Bioinformatics,Curation,Text Mining — Patrick Durusau @ 8:03 pm

An overview of the BioCreative 2012 Workshop Track III: interactive text mining task by Cecilia N. Arighi et al. (Database (2013) 2013: bas056, doi: 10.1093/database/bas056)

Abstract:

In many databases, biocuration primarily involves literature curation, which usually involves retrieving relevant articles, extracting information that will translate into annotations and identifying new incoming literature. As the volume of biological literature increases, the use of text mining to assist in biocuration becomes increasingly relevant. A number of groups have developed tools for text mining from a computer science/linguistics perspective, and there are many initiatives to curate some aspect of biology from the literature. Some biocuration efforts already make use of a text mining tool, but there have not been many broad-based systematic efforts to study which aspects of a text mining tool contribute to its usefulness for a curation task. Here, we report on an effort to bring together text mining tool developers and database biocurators to test the utility and usability of tools. Six text mining systems presenting diverse biocuration tasks participated in a formal evaluation, and appropriate biocurators were recruited for testing. The performance results from this evaluation indicate that some of the systems were able to improve efficiency of curation by speeding up the curation task significantly (∼1.7- to 2.5-fold) over manual curation. In addition, some of the systems were able to improve annotation accuracy when compared with the performance on the manually curated set. In terms of inter-annotator agreement, the factors that contributed to significant differences for some of the systems included the expertise of the biocurator on the given curation task, the inherent difficulty of the curation and attention to annotation guidelines. After the task, annotators were asked to complete a survey to help identify strengths and weaknesses of the various systems. The analysis of this survey highlights how important task completion is to the biocurators’ overall experience of a system, regardless of the system’s high score on design, learnability and usability. In addition, strategies to refine the annotation guidelines and systems documentation, to adapt the tools to the needs and query types the end user might have and to evaluate performance in terms of efficiency, user interface, result export and traditional evaluation metrics have been analyzed during this task. This analysis will help to plan for a more intense study in BioCreative IV.

Curation is an aspect of topic map authoring, albeit with the latter capturing information for later merging with other sources of information.

Definitely an article you will want to read if you are designing text mining as part of a topic map solution.

January 19, 2013

The Pacific Symposium on Biocomputing 2013 [Proceedings]

Filed under: Bioinformatics,Biomedical,Data Mining,Text Mining — Patrick Durusau @ 7:09 pm

The Pacific Symposium on Biocomputing 2013 by Will Bush.

From the post:

For 18 years now, computational biologists have convened on the beautiful islands of Hawaii to present and discuss research emerging from new areas of biomedicine. PSB Conference Chairs Teri Klein (@teriklein), Keith Dunker, Russ Altman (@Rbaltman) and Larry Hunter (@ProfLHunter) organize innovative sessions and tutorials that are always interactive and thought-provoking. This year, sessions included Computational Drug Repositioning, Epigenomics, Aberrant Pathway and Network Activity, Personalized Medicine, Phylogenomics and Population Genomics, Post-Next Generation Sequencing, and Text and Data Mining. The Proceedings are available online here, and a few of the highlights are:

See Will’s post for the highlights. Or browse the proceedings. You are almost certainly going to find something relevant to you.

Do note Will’s use of Twitter IDs as identifiers. Unique, persistent (I assume Twitter doesn’t re-assign them), easy to access.

It wasn’t clear from Will’s post if the following image was from Biocomputing 2013 or if he stopped by a markup conference. Hard to tell. 😉

Biocomputing 2013
