Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

January 17, 2013

MongoDB Text Search Tutorial

Filed under: MongoDB,Search Engines,Searching,Text Mining — Patrick Durusau @ 7:26 pm

MongoDB Text Search Tutorial by Alex Popescu.

From the post:

Today is the day of the experimental MongoDB text search feature. Tobias Trelle continues his posts about this feature, providing some examples of query syntax (negation, phrase search; according to the previous post, even more advanced queries should be supported), filtering and projections, multiple text field indexing, and details about the stemming solution used (Snowball).

Alex also has a list of his posts on the text search feature for MongoDB.
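For a sense of what the feature looks like in practice, here is a minimal sketch (mine, not Tobias's) of the same query syntax using pymongo against a recent MongoDB release, where text search is no longer experimental. The collection and field names are invented.

from pymongo import MongoClient, TEXT

client = MongoClient("mongodb://localhost:27017")
articles = client["demo"]["articles"]

# A collection gets one text index; here it covers a single field.
articles.create_index([("body", TEXT)])

# Plain keyword search.
keyword_hits = articles.find({"$text": {"$search": "stemming"}})

# Phrase search (embedded quotes) and negation (leading minus),
# the two cases Trelle walks through.
phrase_and_negation = articles.find(
    {"$text": {"$search": '"text search" -experimental'}}
)

for doc in phrase_and_negation:
    print(doc.get("body"))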

January 13, 2013

Taming Text is released!

Filed under: Text Analytics,Text Mining — Patrick Durusau @ 8:09 pm

Taming Text is released! by Mike McCandless.

From the post:

There’s a new exciting book just published from Manning, with the catchy title Taming Text, by Grant S. Ingersoll (fellow Apache Lucene committer), Thomas S. Morton, and Andrew L. Farris.

I enjoyed the (e-)book: it does a good job covering a truly immense topic that could easily have taken several books. Text processing has become vital for businesses to remain competitive in this digital age, with the amount of online unstructured content growing exponentially with time. Yet, text is also a messy and therefore challenging science: the complexities and nuances of human language don’t follow a few simple, easily codified rules and are still not fully understood today.

The book describes search techniques, including tokenization, indexing, suggest and spell correction. It also covers fuzzy string matching, named entity extraction (people, places, things), clustering, classification, tagging, and a question answering system (think Jeopardy). These topics are challenging!

N-gram processing (both character and word ngrams) is featured prominently, which makes sense as it is a surprisingly effective technique for a number of applications. The book includes helpful real-world code samples showing how to process text using modern open-source tools including OpenNLP, Tika, Lucene, Solr and Mahout.

You can see:

  • Table of Contents
  • Sample chapter 1
  • Sample chapter 8
  • Source code (98 MB)

Or you can do as I did: grab the source code and order the eBook (PDF) version of Taming Text.

More comments to follow!
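Since n-grams figure so prominently in the book, here is a quick standard-library illustration (mine, not the book's code) of character and word n-grams:

def char_ngrams(text, n):
    """All overlapping character n-grams of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(text, n):
    """All overlapping word n-grams of length n."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(char_ngrams("taming", 3))
# ['tam', 'ami', 'min', 'ing']
print(word_ngrams("taming text is released", 2))
# [('taming', 'text'), ('text', 'is'), ('is', 'released')]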

December 29, 2012

Analyzing the Enron Data…

Filed under: Clustering,PageRank,Statistics,Text Analytics,Text Mining — Patrick Durusau @ 6:07 am

Analyzing the Enron Data: Frequency Distribution, Page Rank and Document Clustering by Sujit Pal.

From the post:

I’ve been using the Enron Dataset for a couple of projects now, and I figured that it would be interesting to see if I could glean some information out of the data. One can of course simply read the Wikipedia article, but that would be too easy and not as much fun :-).

My focus in this analysis is on the “what” and the “who”, i.e., what are the important ideas in this corpus and who are the principal players. For that I did the following:

  • Extracted the words from Lucene’s inverted index into (term, docID, freq) triples. Using this, I construct a frequency distribution of words in the corpus. Looking at the most frequent words gives us an idea of what is being discussed.
  • Extract the email (from, {to, cc, bcc}) pairs from MongoDB. Using this, I piggyback on Scalding’s PageRank implementation to produce a list of emails by page rank. This gives us an idea of the “important” players.
  • Using the triples extracted from Lucene, construct tuples of (docID, termvector), then cluster the documents using KMeans. This gives us an idea of the spread of ideas in the corpus. Originally, the idea was to use Mahout for the clustering, but I ended up using Weka instead.

I also wanted to get more familiar with Scalding beyond the basic stuff I did before, so I used that where I would have used Hadoop previously. The rest of the code is in Scala as usual.

Good practice for discovery of the players and main ideas when the “fiscal cliff” document set “leaks,” as you know it will.

Relationships between players and their self-serving recountings versus the data set will make an interesting topic map.
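As a rough sketch of the “who” step, here is the same idea with networkx standing in for Scalding, so this is not Sujit's implementation. The addresses below are placeholders; his edges came from the (from, {to, cc, bcc}) pairs stored in MongoDB.

import networkx as nx

# Hypothetical sender -> recipient edges extracted from the mail headers.
edges = [
    ("kenneth.lay@enron.com", "jeff.skilling@enron.com"),
    ("jeff.skilling@enron.com", "andrew.fastow@enron.com"),
    ("andrew.fastow@enron.com", "kenneth.lay@enron.com"),
]

graph = nx.DiGraph()
graph.add_edges_from(edges)

# PageRank over the directed email graph approximates "importance".
ranks = nx.pagerank(graph, alpha=0.85)
for address, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{score:.3f}  {address}")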

December 18, 2012

Coreference Resolution: to What Extent Does it Help NLP Applications?

Coreference Resolution: to What Extent Does it Help NLP Applications? by Ruslan Mitkov. (presentation – audio only)

The paper from the same conference:

Coreference Resolution: To What Extent Does It Help NLP Applications? by Ruslan Mitkov, Richard Evans, Constantin Orăsan, Iustin Dornescu, Miguel Rios. (Text, Speech and Dialogue, 15th International Conference, TSD 2012, Brno, Czech Republic, September 3-7, 2012. Proceedings, pp. 16-27)

Abstract:

This paper describes a study of the impact of coreference resolution on NLP applications. Further to our previous study [1], in which we investigated whether anaphora resolution could be beneficial to NLP applications, we now seek to establish whether a different, but related task—that of coreference resolution, could improve the performance of three NLP applications: text summarisation, recognising textual entailment and text classification. The study discusses experiments in which the aforementioned applications were implemented in two versions, one in which the BART coreference resolution system was integrated and one in which it was not, and then tested in processing input text. The paper discusses the results obtained.

In the presentation and in the paper, Mitkov distinguishes between anaphora and coreference resolution (from the paper):

While some authors use the terms coreference (resolution) and anaphora (resolution) interchangeably, it is worth noting that they are completely distinct terms or tasks [3]. Anaphora is cohesion which points back to some previous item, with the ‘pointing back’ word or phrase called an anaphor, and the entity to which it refers, or for which it stands, its antecedent. Coreference is the act of picking out the same referent in the real world. A specific anaphor and more than one of the preceding (or following) noun phrases may be coreferential, thus forming a coreferential chain of entities which have the same referent.

I am not sure why the “real world” is necessary in: “Coreference is the act of picking out the same referent in the real world.”

For topic maps, I would shorten it to: Coreference is the act of picking out the same referent. (full stop)

The paper is a useful review of coreference systems and, quite unusually, reports a negative result:

This study sought to establish whether or not coreference resolution could have a positive impact on NLP applications, in particular on text summarisation, recognising textual entailment, and text categorisation. The evaluation results presented in Section 6 are in line with previous experiments conducted both by the present authors and other researchers: there is no statistically significant benefit brought by automatic coreference resolution to these applications. In this specific study, the employment of the coreference resolution system distributed in the BART toolkit generally evokes slight but not significant increases in performance and in some cases it even evokes a slight deterioration in the performance results of these applications. We conjecture that the lack of a positive impact is due to the success rate of the BART coreference resolution system which appears to be insufficient to boost performance of the aforementioned applications.

My conjecture is that topic maps can boost coreference resolution enough to improve performance of NLP applications, including text summarisation, recognising textual entailment, and text categorisation.

What do you think?

How would you suggest testing that conjecture?

December 13, 2012

Taming Text [Coming real soon now!]

Filed under: Lucene,Mahout,Solr,Text Mining — Patrick Durusau @ 3:14 pm

Taming Text by Grant S. Ingersoll, Thomas S. Morton, and Andrew L. Farris.

During a webinar today Grant said that “Taming Text” should be out in ebook form in just a week or two.

Grant is giving up the distinction of having the second longest running MEAP project. (He didn’t say which project was first.)

Let’s all celebrate Grant and his co-authors crossing the finish line with a record number of sales!

This promises to be a real treat!

PS: Not going to put this on my wish list, too random and clumsy a process. Will just order it direct. 😉

November 28, 2012

Bash One-Liners Explained (series)

Filed under: Bash,Data Mining,String Matching,Text Mining — Patrick Durusau @ 10:26 am

Bash One-Liners Explained by Peteris Krumins.

The series page collects the posts Peteris Krumins has published so far on Bash one-liners.

One real advantage of Bash scripts is the lack of a graphical interface to get in the way.

A real advantage with “data” files, and many times with “text” files as well.

November 22, 2012

eGIFT: Mining Gene Information from the Literature

eGIFT: Mining Gene Information from the Literature by Catalina O Tudor, Carl J Schmidt and K Vijay-Shanker.

Abstract:

Background

With the biomedical literature continually expanding, searching PubMed for information about specific genes becomes increasingly difficult. Not only can thousands of results be returned, but gene name ambiguity leads to many irrelevant hits. As a result, it is difficult for life scientists and gene curators to rapidly get an overall picture about a specific gene from documents that mention its names and synonyms.

Results

In this paper, we present eGIFT (http://biotm.cis.udel.edu/eGIFT), a web-based tool that associates informative terms, called iTerms, and sentences containing them, with genes. To associate iTerms with a gene, eGIFT ranks iTerms about the gene, based on a score which compares the frequency of occurrence of a term in the gene’s literature to its frequency of occurrence in documents about genes in general. To retrieve a gene’s documents (Medline abstracts), eGIFT considers all gene names, aliases, and synonyms. Since many of the gene names can be ambiguous, eGIFT applies a disambiguation step to remove matches that do not correspond to this gene. Another additional filtering process is applied to retain those abstracts that focus on the gene rather than mention it in passing. eGIFT’s information for a gene is pre-computed and users of eGIFT can search for genes by using a name or an EntrezGene identifier. iTerms are grouped into different categories to facilitate a quick inspection. eGIFT also links an iTerm to sentences mentioning the term to allow users to see the relation between the iTerm and the gene. We evaluated the precision and recall of eGIFT’s iTerms for 40 genes; between 88% and 94% of the iTerms were marked as salient by our evaluators, and 94% of the UniProtKB keywords for these genes were also identified by eGIFT as iTerms.

Conclusions

Our evaluations suggest that iTerms capture highly-relevant aspects of genes. Furthermore, by showing sentences containing these terms, eGIFT can provide a quick description of a specific gene. eGIFT helps not only life scientists survey results of high-throughput experiments, but also annotators to find articles describing gene aspects and functions.

Website: http://biotm.cis.udel.edu/eGIFT

Another lesson for topic map authoring interfaces: Offer domain specific search capabilities.

Using a ****** search appliance is little better than a poke with a sharp stick in most domains. The user is left to their own devices to sort out ambiguities and discover synonyms, again and again.

Your search interface may report > 900,000 “hits,” but anything beyond the first 20 or so is wasted.

(If you get sick, hope it is something that comes up in the first 20 “hits” in PubMed. That is where most researchers stop.)
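The iTerm ranking eGIFT describes, comparing a term's frequency in one gene's abstracts with its frequency in the background gene literature, is a term-specificity score. A minimal sketch of that idea follows (a smoothed relative-frequency ratio, not eGIFT's actual formula):

from collections import Counter

def specificity_scores(gene_tokens, background_tokens, min_count=3):
    """Rank terms that are unusually frequent in one gene's abstracts."""
    gene_counts = Counter(gene_tokens)
    background_counts = Counter(background_tokens)
    gene_total = sum(gene_counts.values())
    background_total = sum(background_counts.values())

    scores = {}
    for term, count in gene_counts.items():
        if count < min_count:
            continue
        gene_rate = count / gene_total
        # Add-one smoothing so terms absent from the background still score.
        background_rate = (background_counts[term] + 1) / (background_total + 1)
        scores[term] = gene_rate / background_rate
    return sorted(scores.items(), key=lambda kv: -kv[1])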

Developing a biocuration workflow for AgBase… [Authoring Interfaces]

Filed under: Bioinformatics,Biomedical,Curation,Genomics,Text Mining — Patrick Durusau @ 9:50 am

Developing a biocuration workflow for AgBase, a non-model organism database by Lakshmi Pillai, Philippe Chouvarine, Catalina O. Tudor, Carl J. Schmidt, K. Vijay-Shanker and Fiona M. McCarthy.

Abstract:

AgBase provides annotation for agricultural gene products using the Gene Ontology (GO) and Plant Ontology, as appropriate. Unlike model organism species, agricultural species have a body of literature that does not just focus on gene function; to improve efficiency, we use text mining to identify literature for curation. The first component of our annotation interface is the gene prioritization interface that ranks gene products for annotation. Biocurators select the top-ranked gene and mark annotation for these genes as ‘in progress’ or ‘completed’; links enable biocurators to move directly to our biocuration interface (BI). Our BI includes all current GO annotation for gene products and is the main interface to add/modify AgBase curation data. The BI also displays Extracting Genic Information from Text (eGIFT) results for each gene product. eGIFT is a web-based, text-mining tool that associates ranked, informative terms (iTerms) and the articles and sentences containing them, with genes. Moreover, iTerms are linked to GO terms, where they match either a GO term name or a synonym. This enables AgBase biocurators to rapidly identify literature for further curation based on possible GO terms. Because most agricultural species do not have standardized literature, eGIFT searches all gene names and synonyms to associate articles with genes. As many of the gene names can be ambiguous, eGIFT applies a disambiguation step to remove matches that do not correspond to this gene, and filtering is applied to remove abstracts that mention a gene in passing. The BI is linked to our Journal Database (JDB) where corresponding journal citations are stored. Just as importantly, biocurators also add to the JDB citations that have no GO annotation. The AgBase BI also supports bulk annotation upload to facilitate our Inferred from electronic annotation of agricultural gene products. All annotations must pass standard GO Consortium quality checking before release in AgBase.

Database URL: http://www.agbase.msstate.edu/

Another approach to biocuration. I will be posting on eGIFT separately but do note this is a domain-specific tool.

The authors did not set out to create the universal curation tool but one suited to their specific data and requirements.

I think there is an important lesson here for semantic authoring interfaces. Word processors offer very generic interfaces but consequently little in the way of structure. Authoring annotated information requires more structure and that requires domain specifics.

Now there is an idea: create topic map authoring interfaces on top of a common skeleton, instead of hard-coding interfaces around how users “should” use the tool.

November 19, 2012

Accelerating literature curation with text-mining tools:…

Filed under: Bioinformatics,Curation,Literature,Text Mining — Patrick Durusau @ 7:35 pm

Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts by Chih-Hsuan Wei, Bethany R. Harris, Donghui Li, Tanya Z. Berardini, Eva Huala, Hung-Yu Kao and Zhiyong Lu.

Abstract:

Today’s biomedical research has become heavily dependent on access to the biological knowledge encoded in expert curated biological databases. As the volume of biological literature grows rapidly, it becomes increasingly difficult for biocurators to keep up with the literature because manual curation is an expensive and time-consuming endeavour. Past research has suggested that computer-assisted curation can improve efficiency, but few text-mining systems have been formally evaluated in this regard. Through participation in the interactive text-mining track of the BioCreative 2012 workshop, we developed PubTator, a PubMed-like system that assists with two specific human curation tasks: document triage and bioconcept annotation. On the basis of evaluation results from two external user groups, we find that the accuracy of PubTator-assisted curation is comparable with that of manual curation and that PubTator can significantly increase human curatorial speed. These encouraging findings warrant further investigation with a larger number of publications to be annotated.

Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/

Presentation on PubTator (slides, PDF).

Hmmm, curating abstracts. That sounds like annotating subjects in documents, doesn’t it? Or something very close. 😉

If we start off with a set of subjects, that eases topic map authoring because users are assisted by automatic creation of topic map machinery, triggered by identification of subjects and associations.

Users don’t have to start with bare ground to build a topic map.

Clever users build (and sell) forms, frames, components and modules that serve as the scaffolding for other topic maps.

November 18, 2012

… text mining in the FlyBase genetic literature curation workflow

Filed under: Curation,Genomics,Text Mining — Patrick Durusau @ 5:47 pm

Opportunities for text mining in the FlyBase genetic literature curation workflow by Peter McQuilton. (Database (2012) 2012 : bas039 doi: 10.1093/database/bas039)

Abstract:

FlyBase is the model organism database for Drosophila genetic and genomic information. Over the last 20 years, FlyBase has had to adapt and change to keep abreast of advances in biology and database design. We are continually looking for ways to improve curation efficiency and efficacy. Genetic literature curation focuses on the extraction of genetic entities (e.g. genes, mutant alleles, transgenic constructs) and their associated phenotypes and Gene Ontology terms from the published literature. Over 2000 Drosophila research articles are now published every year. These articles are becoming ever more data-rich and there is a growing need for text mining to shoulder some of the burden of paper triage and data extraction. In this article, we describe our curation workflow, along with some of the problems and bottlenecks therein, and highlight the opportunities for text mining. We do so in the hope of encouraging the BioCreative community to help us to develop effective methods to mine this torrent of information.

Database URL: http://flybase.org

Would you believe that ambiguity is problem #1 and describing relationships is another one?

The most common problem encountered during curation is an ambiguous genetic entity (gene, mutant allele, transgene, etc.). This situation can arise when no unique identifier (such as a FlyBase gene identifier (FBgn) or a computed gene (CG) number for genes), or an accurate and explicit reference for a mutant or transgenic line is given. Ambiguity is a particular problem when a generic symbol/name is used (e.g. ‘Actin’ or UAS-Notch), or when a symbol/name is used that is a synonym for a different entity (e.g. ‘ras’ is the current FlyBase symbol for the ‘raspberry’ gene, FBgn0003204, but is often used in the literature to refer to the ‘Ras85D’ gene, FBgn0003205). A further issue is that some symbols only differ in case-sensitivity for the first character, for example, the gene symbols ‘dl’ (dorsal) and ‘Dl’ (Delta). These ambiguities can usually be resolved by searching for associated details about the entity in the article (e.g. the use of a specific mutant allele can identify the gene being discussed) or by consulting the supplemental information for additional details. Sometimes we have to do some analysis ourselves, such as performing a BLAST search using any sequence data present in the article or supplementary files or executing an in-house script to report those entities used by a specified author in previously curated articles. As a final step, if we cannot resolve a problem, we email the corresponding author for clarification. If the ambiguity still cannot be resolved, then a curator will either associate a generic/unspecified entry for that entity with the article, or else omit the entity and add a (non-public) note to the curation record explaining the situation, with the hope that future publications will resolve the issue.

One of the more esoteric problems found in curation is the fact that multiple relationships exist between the curated data types. For example, the ‘dppEP2232 allele’ is caused by the ‘P{EP}dppEP2232 insertion’ and disrupts the ‘dpp gene’. This can cause problems for text-mining assisted curation, as the data can be attributed to the wrong object due to sentence structure or the requirement of background or contextual knowledge found in other parts of the article. In cases like this, detailed knowledge of the FlyBase proforma and curation rules, as well as a good knowledge of Drosophila biology, is necessary to ensure the correct proforma field is filled in. This is one of the reasons why we believe text-mining methods will assist manual curation rather than replace it in the near term.

I like the “manual curation” line. Curation is a task best performed by a sentient being.
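A toy sketch of the lookup problem the curators describe: symbols that differ only in case (‘dl’ versus ‘Dl’) and symbols used in the literature as synonyms for a different gene (‘ras’). Only the two FBgn numbers for ‘raspberry’ and ‘Ras85D’ come from the quoted passage; everything else in the mapping is invented.

SYMBOL_TO_GENES = {
    "dl": ["dorsal"],
    "Dl": ["Delta"],
    # 'ras' is the FlyBase symbol for raspberry (FBgn0003204) but is often
    # used in papers for Ras85D (FBgn0003205): an ambiguous lookup.
    "ras": ["raspberry (FBgn0003204)", "Ras85D (FBgn0003205)"],
}

def resolve(symbol):
    """Case-sensitive lookup; ambiguous symbols get flagged for manual review."""
    genes = SYMBOL_TO_GENES.get(symbol, [])
    if not genes:
        return "unknown symbol: " + symbol
    if len(genes) == 1:
        return genes[0]
    return "ambiguous (" + ", ".join(genes) + "); needs curator review"

print(resolve("dl"))   # -> dorsal
print(resolve("Dl"))   # -> Delta
print(resolve("ras"))  # -> flagged as ambiguous for curator review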

A Language-Independent Approach to Keyphrase Extraction and Evaluation

A Language-Independent Approach to Keyphrase Extraction and Evaluation (2010) by Mari-Sanna Paukkeri, Ilari T. Nieminen, Matti Pöllä and Timo Honkela.

Abstract:

We present Likey, a language-independent keyphrase extraction method based on statistical analysis and the use of a reference corpus. Likey has a very light-weight preprocessing phase and no parameters to be tuned. Thus, it is not restricted to any single language or language family. We test Likey having exactly the same configuration with 11 European languages. Furthermore, we present an automatic evaluation method based on Wikipedia intra-linking.

Useful approach for developing a rough-cut of keywords in documents. Keywords that may indicate a need for topics to represent subjects.

Interesting that:

Phrases occurring only once in the document cannot be selected as keyphrases.

I would have thought unique phrases would automatically qualify as keyphrases. The ranking of phrases, calculated with the reference corpus and text, excludes unique phrases, in the absence of any ratio for ranking.

That sounds like a bug and not a feature to me.

Reasoning that phrases unique to an author are unique identifications of subjects. Certainly grist for a topic map mill.

Web based demonstration: http://cog.hut.fi/likeydemo/.

Mari-Sanna Paukkeri: Contact details and publications.
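A simplified sketch in the spirit of Likey, not its exact ranking: score each candidate phrase by how much more frequent it is in the document than in a reference corpus and, as in the paper, drop phrases that occur only once in the document.

from collections import Counter

def keyphrase_candidates(doc_phrases, ref_phrases, top_k=10):
    """Rank phrases that stand out against a reference corpus."""
    doc = Counter(doc_phrases)
    ref = Counter(ref_phrases)
    doc_total = sum(doc.values())
    ref_total = sum(ref.values())

    scored = []
    for phrase, count in doc.items():
        if count < 2:  # single occurrences are excluded, as in Likey
            continue
        doc_rate = count / doc_total
        ref_rate = (ref[phrase] + 1) / (ref_total + 1)  # smoothed
        scored.append((doc_rate / ref_rate, phrase))
    return [phrase for _, phrase in sorted(scored, reverse=True)[:top_k]]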

October 8, 2012

Layout-aware text extraction from full-text PDF of scientific articles

Filed under: PDF,Text Extraction,Text Mining — Patrick Durusau @ 9:24 am

Layout-aware text extraction from full-text PDF of scientific articles by Cartic Ramakrishnan, Abhishek Patnia, Eduard Hovy and Gully APC Burns. (Source Code for Biology and Medicine 2012, 7:7 doi:10.1186/1751-0473-7-7)

Abstract:

Background

The Portable Document Format (PDF) is the most commonly used file format for online scientific publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents a significant challenge for developers of biomedical text mining or biocuration informatics systems that use published literature as an information source. In this paper we introduce the ‘Layout-Aware PDF Text Extraction’ (LA-PDFText) system to facilitate accurate extraction of text from PDF files of research articles for use in text mining applications.

Results

Our paper describes the construction and performance of an open source system that extracts text blocks from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize specific sections. The LA-PDFText system focuses only on the textual content of the research articles and is meant as a baseline for further experiments into more advanced extraction methods that handle multi-modal content, such as images and graphs. The system works in a three-stage process: (1) Detecting contiguous text blocks using spatial layout processing to locate and identify blocks of contiguous text, (2) Classifying text blocks into rhetorical categories using a rule-based method and (3) Stitching classified text blocks together in the correct order resulting in the extraction of text from section-wise grouped blocks. We show that our system can identify text blocks and classify them into rhetorical categories with Precision = 0.96, Recall = 0.89 and F1 = 0.91. We also present an evaluation of the accuracy of the block detection algorithm used in step 2. Additionally, we have compared the accuracy of the text extracted by LA-PDFText to the text from the Open Access subset of PubMed Central. We then compared this accuracy with that of the text extracted by the PDF2Text system, commonly used to extract text from PDF. Finally, we discuss preliminary error analysis for our system and identify further areas of improvement.

Conclusions

LA-PDFText is an open-source tool for accurately extracting text from full-text scientific articles. The release of the system is available at http://code.google.com/p/lapdftext/.

Scanning TOCs from a variety of areas can uncover goodies like this one.

What is the most recent “unexpected” paper/result outside your “field” that you have found?
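LA-PDFText itself is an open-source Java system; purely to illustrate its second stage, here is a toy rule-based classifier that labels already-detected text blocks with rhetorical categories. The rules are invented, not the system's.

import re

SECTION_RULES = [
    (re.compile(r"^\s*abstract\b", re.I), "abstract"),
    (re.compile(r"^\s*(methods?|materials and methods)\b", re.I), "methods"),
    (re.compile(r"^\s*results?\b", re.I), "results"),
    (re.compile(r"^\s*(discussion|conclusions?)\b", re.I), "discussion"),
    (re.compile(r"^\s*references\b", re.I), "references"),
]

def classify_block(block_text, previous_label="front-matter"):
    """Label a block by its first line; otherwise inherit the running section."""
    stripped = block_text.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    for pattern, label in SECTION_RULES:
        if pattern.match(first_line):
            return label
    return previous_label

print(classify_block("Abstract\nThe Portable Document Format (PDF) is ..."))
# -> abstract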

September 11, 2012

Web Data Extraction, Applications and Techniques: A Survey

Filed under: Data Mining,Extraction,Machine Learning,Text Extraction,Text Mining — Patrick Durusau @ 5:05 am

Web Data Extraction, Applications and Techniques: A Survey by Emilio Ferrara, Pasquale De Meo, Giacomo Fiumara, Robert Baumgartner.

Abstract:

Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of application domains. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc application domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction.

This survey aims at providing a structured and comprehensive overview of the research efforts made in the field of Web Data Extraction. The fil rouge of our work is to provide a classification of existing approaches in terms of the applications for which they have been employed. This differentiates our work from other surveys devoted to classify existing approaches on the basis of the algorithms, techniques and tools they use.

We classified Web Data Extraction approaches into categories and, for each category, we illustrated the basic techniques along with their main variants.

We grouped existing applications in two main areas: applications at the Enterprise level and at the Social Web level. Such a classification relies on a twofold reason: on one hand, Web Data Extraction techniques emerged as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. On the other hand, Web Data Extraction techniques allow for gathering a large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users and this offers unprecedented opportunities of analyzing human behaviors on a large scale.

We discussed also about the potential of cross-fertilization, i.e., on the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain, in other domains.

Comprehensive (> 50 pages) survey of web data extraction. Supplements and updates existing work by its focus on classifying web data extraction approaches by field of use.

Very likely to lead to adaptation of techniques from one field to another.

September 1, 2012

NEH Institute Working With Text In a Digital Age

Filed under: Humanities,Text Corpus,Text Encoding Initiative (TEI),Text Mining — Patrick Durusau @ 12:37 pm

NEH Institute Working With Text In a Digital Age

From the webpage:

The goal of this demo/sample code is to provide a platform which institute participants can use to complete an exercise to create a miniature digital edition. We will use these editions as concrete examples for discussion of decisions and issues to consider when creating digital editions from TEI XML, annotations and other related resources.

Some specific items for consideration and discussion through this exercise:

  • Creating identifiers for your texts.
  • Establishing markup guidelines and best practices.
  • Use of inline annotations versus standoff markup.
  • Dealing with overlapping hierarchies.
  • OAC (Open Annotation Collaboration)
  • Leveraging annotation tools.
  • Applying Linked Data concepts.
  • Distribution formats: optimizing for display vs for enabling data reuse.

Excellent resource!

Offers a way to learn/test digital edition skills.

You can use it as a template to produce similar materials with texts of greater interest to you.

The act of encoding asks what subjects you are going to recognize and under what conditions. Good practice for topic map construction.

Not to mention that historical editions of a text have made similar, possibly differing decisions on the same text.

Topic maps are a natural way to present such choices on their own merits, as well as being able to compare and contrast those choices.

I first saw this at The banquet of the digital scholars.

The banquet of the digital scholars

Filed under: Humanities,Text Corpus,Text Encoding Initiative (TEI),Text Mining — Patrick Durusau @ 10:32 am

The banquet of the digital scholars

The actual workshop title: Humanities Hackathon on editing Athenaeus and on the Reinvention of the Edition in a Digital Space


September 30, 2012 Registration Deadline

October 10-12, 2012
Universität Leipzig (ULEI) & Deutsches Archäologisches Institut (DAI) Berlin

Abstract:

The University of Leipzig will host a hackathon that addresses two basic tasks. On the one hand, we will focus upon the challenges of creating a digital edition for the Greek author Athenaeus, whose work cites more than a thousand earlier sources and is one of the major sources for lost works of Greek poetry and prose. At the same time, we use the case Athenaeus to develop our understanding of how to organize a truly born-digital edition, one that not only includes machine actionable citations and variant readings but also collations of multiple print editions, metrical analyses, named entity identification, linguistic features such as morphology, syntax, word sense, and co-reference analysis, and alignment between the Greek original and one or more later translations.

After some details:

Overview:
The Deipnosophists (Δειπνοσοφισταί, or “Banquet of the Sophists”) by Athenaeus of Naucratis is a 3rd century AD fictitious account of several banquet conversations on food, literature, and arts held in Rome by twenty-two learned men. This complex and fascinating work is not only an erudite and literary encyclopedia of a myriad of curiosities about classical antiquity, but also an invaluable collection of quotations and text re-uses of ancient authors, ranging from Homer to tragic and comic poets and lost historians. Since the large majority of the works cited by Athenaeus is nowadays lost, this compilation is a sort of reference tool for every scholar of Greek theater, poetry, historiography, botany, zoology, and many other topics.

Athenaeus’ work is a mine of thousands of quotations, but we still lack a comprehensive survey of its sources. The aim of this “humanities hackathon” is to provide a case study for drawing a spectrum of quoting habits of classical authors and their attitude to text reuse. Athenaeus, in fact, shapes a library of forgotten authors, which goes beyond the limits of a physical building and becomes an intellectual space of human knowledge. By doing so, he is both a witness of the Hellenistic bibliographical methods and a forerunner of the modern concept of hypertext, where sequential reading is substituted by hierarchical and logical connections among words and fragments of texts. Quantity, variety, and precision of Athenaeus’ citations make the Deipnosophists an excellent training ground for the development of a digital system of reference linking for primary sources. Athenaeus’ standard citation includes (a) the name of the author with additional information like ethnic origin and literary category, (b) the title of the work, and (c) the book number (e.g., Deipn. 2.71b). He often remembers the amount of papyrus scrolls of huge works (e.g., 6.229d-e; 6.249a), while distinguishing various editions of the same comedy (e.g., 1.29a; 4.171c; 6.247c; 7.299b; 9.367f) and different titles of the same work (e.g., 1.4e).

He also adds biographical information to identify homonymous authors and classify them according to literary genres, intellectual disciplines and schools (e.g., 1.13b; 6.234f; 9.387b). He provides chronological and historical indications to date authors (e.g., 10.453c; 13.599c), and he often copies the first lines of a work following a method that probably goes back to the Pinakes of Callimachus (e.g., 1.4e; 3.85f; 8.342d; 5.209f; 13.573f-574a).

Last but not least, the study of Athenaeus’ “citation system” is also a great methodological contribution to the domain of “fragmentary literature”, since one of the main concerns of this field is the relation between the fragment (quotation) and its context of transmission. Having this goal in mind, the textual analysis of the Deipnosophists will make possible to enumerate a series of recurring patterns, which include a wide typology of textual reproductions and linguistic features helpful to identify and classify hidden quotations of lost authors.

The 21st century has “big data” in the form of sensor streams and Twitter feeds, but “complex data” in the humanities pre-dates “big data” by a considerable margin.

If you are interested in being challenged by complexity and not simply the size of your data, take a closer look at this project.

Greek is a little late to be of interest to me but there are older texts that could benefit from a similar treatment.

BTW, while you are thinking about this project/text, consider how you would merge prior scholarship, digital and otherwise, with what originates here and what follows it in the decades to come.

August 24, 2012

Going Beyond the Numbers:…

Filed under: Analytics,Text Analytics,Text Mining — Patrick Durusau @ 1:39 pm

Going Beyond the Numbers: How to Incorporate Textual Data into the Analytics Program by Cindi Thompson.

From the post:

Leveraging the value of text-based data by applying text analytics can help companies gain competitive advantage and an improved bottom line, yet many companies are still letting their document repositories and external sources of unstructured information lie fallow.

That’s no surprise, since the application of analytics techniques to textual data and other unstructured content is challenging and requires a relatively unfamiliar skill set. Yet applying business and industry knowledge and starting small can yield satisfying results.

Capturing More Value from Data with Text Analytics

There’s more to data than the numerical organizational data generated by transactional and business intelligence systems. Although the statistics are difficult to pin down, it’s safe to say that the majority of business information for a typical company is stored in documents and other unstructured data sources, not in structured databases. In addition, there is a huge amount of business-relevant information in documents and text that reside outside the enterprise. To ignore the information hidden in text is to risk missing opportunities, including the chance to:

  • Capture early signals of customer discontent.
  • Quickly target product deficiencies.
  • Detect fraud.
  • Route documents to those who can effectively leverage them.
  • Comply with regulations such as XBRL coding or redaction of personally identifiable information.
  • Better understand the events, people, places and dates associated with a large set of numerical data.
  • Track competitive intelligence.

To be sure, textual data is messy and poses difficulties.

But, as Cindi points out, there are golden benefits in those hills of textual data.

August 15, 2012

Where to start with text mining

Filed under: Digital Research,Text Mining — Patrick Durusau @ 2:40 pm

Where to start with text mining by Ted Underwood.

From the post:

This post is less a coherent argument than an outline of discussion topics I’m proposing for a workshop at NASSR2012 (a conference of Romanticists). But I’m putting this on the blog since some of the links might be useful for a broader audience. Also, we won’t really cover all this material, so the blog post may give workshop participants a chance to explore things I only gestured at in person.

In the morning I’ll give a few examples of concrete literary results produced by text mining. I’ll start the afternoon workshop by opening two questions for discussion: first, what are the obstacles confronting a literary scholar who might want to experiment with quantitative methods? Second, how do those methods actually work, and what are their limits?

I’ll also invite participants to play around with a collection of 818 works between 1780 and 1859, using an R program I’ve provided for the occasion. Links for these materials are at the end of this post.

Something to pass along to any humanities scholars you know who aren’t already into text mining.

I first saw this at: primer for digital humanities.

August 12, 2012

DATA MINING: Accelerating Drug Discovery by Text Mining of Patents

Filed under: Contest,Data Mining,Drug Discovery,Patents,Text Mining — Patrick Durusau @ 1:34 pm

DATA MINING: Accelerating Drug Discovery by Text Mining of Patents

From the contest page:

Patent documents contain important research that is valuable to the industry, business, law, and policy-making communities. Take the patent documents from the United States Patent and Trademark Office (USPTO) as examples. The structured data include: filing date, application date, assignees, UPC (US Patent Classification) codes, IPC codes, and others, while the unstructured segments include: title, abstract, claims, and description of the invention. The description of the invention can be further segmented into field of the invention, background, summary, and detailed description.

Given a set of “Source” patents or documents, we can use text mining to identify patents that are “similar” and “relevant” for the purpose of discovery of drug variants. These relevant patents could further be clustered and visualized appropriately to reveal implicit, previously unknown, and potentially useful patterns.

The eventual goal is to obtain a focused and relevant subset of patents, relationships and patterns to accelerate discovery of variations or evolutions of the drugs represented by the “source” patents.

Timeline:

  • July 19, 2012 – Start of the Contest Part 1
  • August 23, 2012 – Deadline for Submission of Ontology deliverables
  • August 24 to August 29, 2012 – Crowdsourced And Expert Evaluation for Part 1. NO SUBMISSIONS ACCEPTED for contest during this week.
  • Milestone 1: August 30, 2012 – Winner for Part 1 contest announced and Ontology release to the community for Contest Part 2
  • Aug. 31 to Sept. 21, 2012 – Contest Part 2 Begins – Data Exploration / Text Mining of Patent Data
  • Milestone 2: Sept. 21, 2012 – Deadline for Submission Contest Part 2. FULL CONTEST CLOSING.
  • Sept. 22 to Oct. 5, 2012 – Crowdsourced and Expert Evaluation for contest Part 2
  • Milestone 3: Oct. 5, 2012 – Conditional Winners Announcement 

Possibly fertile ground for demonstrating the value of topic maps.

Particularly if you think of topic maps as curating search strategies and results.

Think about that for a moment: curating search strategies and results.

We have all asked reference librarians or other power searchers for assistance and watched while they discovered resources we didn’t imagine existed.

What if, for medical expert searchers, we curate the “search request” along with the “search strategy” and the “result” of that search?

Such that we can match future search requests up with likely search strategies?

What we are capturing is the expert’s understanding and recognition of subjects not apparent to the average user. Capturing it in such a way as to make use of it again in the future.

If you aren’t interested in medical research, how about: Accelerating Discovery of Trolls by Text Mining of Patents? 😉

I first saw this at KDNuggets.


Update: 13 August 2012

Tweet by Lars Marius Garshol points to: Patent troll Intellectual Ventures is more like a HYDRA.

Even a low-end estimate – the patents actually recorded in the USPTO as being assigned to one of those shells – identifies around 10,000 patents held by the firm.

At the upper end of the researchers’ estimates, Intellectual Ventures would rank as the fifth-largest patent holder in the United States and among the top fifteen patent holders worldwide.

As sad as that sounds, remember this is one (1) troll. There are others.

August 8, 2012

BioContext: an integrated text mining system…

Filed under: Bioinformatics,Biomedical,Entity Extraction,Text Mining — Patrick Durusau @ 1:49 pm

BioContext: an integrated text mining system for large-scale extraction and contextualization of biomolecular events by Martin Gerner, Farzaneh Sarafraz, Casey M. Bergman, and Goran Nenadic. (Bioinformatics (2012) 28 (16): 2154-2161. doi: 10.1093/bioinformatics/bts332)

Abstract:

Motivation: Although the amount of data in biology is rapidly increasing, critical information for understanding biological events like phosphorylation or gene expression remains locked in the biomedical literature. Most current text mining (TM) approaches to extract information about biological events are focused on either limited-scale studies and/or abstracts, with data extracted lacking context and rarely available to support further research.

Results: Here we present BioContext, an integrated TM system which extracts, extends and integrates results from a number of tools performing entity recognition, biomolecular event extraction and contextualization. Application of our system to 10.9 million MEDLINE abstracts and 234 000 open-access full-text articles from PubMed Central yielded over 36 million mentions representing 11.4 million distinct events. Event participants included over 290 000 distinct genes/proteins that are mentioned more than 80 million times and linked where possible to Entrez Gene identifiers. Over a third of events contain contextual information such as the anatomical location of the event occurrence or whether the event is reported as negated or speculative.

Availability: The BioContext pipeline is available for download (under the BSD license) at http://www.biocontext.org, along with the extracted data which is also available for online browsing.

If you are interested in text mining by professionals, this is a good place to start.

Should be of particular interest to anyone interested in mining literature for construction of a topic map.

August 2, 2012

Using machine learning to extract quotes from text

Filed under: Machine Learning,Text Mining — Patrick Durusau @ 2:44 pm

Using machine learning to extract quotes from text by Chase Davis.

From the post:

Since we launched our Politics Verbatim project a couple of years ago, I’ve been hung up on what should be a simple problem: How can we automate the extraction of quotes from news articles, so it doesn’t take a squad of bored-out-of-their-minds interns to keep track of what politicians say in the news?

You’d be surprised at how tricky this is. At first glance, it looks like something a couple of regular expressions could solve. Just find the text with quotes in it, then pull out the words in between! But what about “air quotes?” Or indirect quotes (“John said he hates cheeseburgers.”)? Suffice it to say, there are plenty of edge cases that make this problem harder than it looks.

When I took over management of the combined Center for Investigative Reporting/Bay Citizen technology team a couple of months ago, I encouraged everyone to have a personal project on the back burner – an itch they wanted to scratch either during slow work days or (in this case) on nights and weekends.

This is mine: the citizen-quotes project, an app that uses simple machine learning techniques to extract more than 40,000 quotes from every article that ran on The Bay Citizen since it launched in 2010. The goal was to build something that accounts for the limitations of the traditional method of solving quote extraction – regular expressions and pattern matching. And sure enough, it does a pretty good job.

Illustrates the application of machine learning to a non-trivial text analysis problem.
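To see why regular expressions only take you so far, here is a minimal regex baseline of the kind Chase describes outgrowing: it catches a direct quote with a trailing “said X” attribution, skips short spans such as “air quotes,” and has no answer at all for indirect speech.

import re

QUOTE_PATTERN = re.compile(
    r'[“"](?P<quote>[^”"]{20,})[”"][,]?\s*'          # a quoted span of 20+ characters
    r'(?:said|says)\s+(?P<speaker>[A-Z][\w.\- ]+)'   # a naive "said Someone" attribution
)

text = ('"We will not raise taxes on working families," said Jane Doe, '
        'who later used "air quotes" to describe the plan.')

for match in QUOTE_PATTERN.finditer(text):
    print(match.group("speaker"), "->", match.group("quote"))
# Jane Doe -> We will not raise taxes on working families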

July 26, 2012

Schema.org and One Hundred Years of Search

Filed under: Indexing,Searching,Text Mining,Web History — Patrick Durusau @ 2:13 pm

Schema.org and One Hundred Years of Search by Dan Brickley.

From the post:

Slides and video are already in the Web, but I wanted to post this as an excuse to plug the new Web History Community Group that Max and I have just started at W3C. The talk was part of the Libraries, Media and the Semantic Web meetup hosted by the BBC in March. It gave an opportunity to run through some forgotten history, linking Paul Otlet, the Universal Decimal Classification, schema.org and some 100 year old search logs from Otlet’s Mundaneum. Having worked with the BBC Lonclass system (a descendant of Otlet’s UDC), and collaborated with the Aida Slavic of the UDC on their publication of Linked Data, I was happy to be given the chance to try to spell out these hidden connections. It also turned out that Google colleagues have been working to support the Mundaneum and the memory of this early work, and I’m happy that the talk led to discussions with both the Mundaneum and Computer History Museum about the new Web History group at W3C.

Sounds like a great starting point!

But the intellectual history of indexing and search runs far deeper than one hundred years. Our current efforts are likely to profit from a deeper knowledge of our roots.

Network biology methods integrating biological data for translational science

Filed under: Bioinformatics,Text Mining — Patrick Durusau @ 1:35 pm

Network biology methods integrating biological data for translational science by Gurkan Bebek, Mehmet Koyutürk, Nathan D. Price, and Mark R. Chance. (Brief Bioinform (2012) 13 (4): 446-459. doi: 10.1093/bib/bbr075)

Abstract:

The explosion of biomedical data, both on the genomic and proteomic side as well as clinical data, will require complex integration and analysis to provide new molecular variables to better understand the molecular basis of phenotype. Currently, much data exist in silos and is not analyzed in frameworks where all data are brought to bear in the development of biomarkers and novel functional targets. This is beginning to change. Network biology approaches, which emphasize the interactions between genes, proteins and metabolites provide a framework for data integration such that genome, proteome, metabolome and other -omics data can be jointly analyzed to understand and predict disease phenotypes. In this review, recent advances in network biology approaches and results are identified. A common theme is the potential for network analysis to provide multiplexed and functionally connected biomarkers for analyzing the molecular basis of disease, thus changing our approaches to analyzing and modeling genome- and proteome-wide data.

Integrating as well as filtering data for various modeling purposes is standard topic map fare.

Looking forward to complex integration needs driving further development of topic maps!

Mining the pharmacogenomics literature—a survey of the state of the art

Filed under: Bioinformatics,Genome,Pharmaceutical Research,Text Mining — Patrick Durusau @ 1:23 pm

Mining the pharmacogenomics literature—a survey of the state of the art by Udo Hahn, K. Bretonnel Cohen, and Yael Garten. (Brief Bioinform (2012) 13 (4): 460-494. doi: 10.1093/bib/bbs018)

Abstract:

This article surveys efforts on text mining of the pharmacogenomics literature, mainly from the period 2008 to 2011. Pharmacogenomics (or pharmacogenetics) is the field that studies how human genetic variation impacts drug response. Therefore, publications span the intersection of research in genotypes, phenotypes and pharmacology, a topic that has increasingly become a focus of active research in recent years. This survey covers efforts dealing with the automatic recognition of relevant named entities (e.g. genes, gene variants and proteins, diseases and other pathological phenomena, drugs and other chemicals relevant for medical treatment), as well as various forms of relations between them. A wide range of text genres is considered, such as scientific publications (abstracts, as well as full texts), patent texts and clinical narratives. We also discuss infrastructure and resources needed for advanced text analytics, e.g. document corpora annotated with corresponding semantic metadata (gold standards and training data), biomedical terminologies and ontologies providing domain-specific background knowledge at different levels of formality and specificity, software architectures for building complex and scalable text analytics pipelines and Web services grounded to them, as well as comprehensive ways to disseminate and interact with the typically huge amounts of semiformal knowledge structures extracted by text mining tools. Finally, we consider some of the novel applications that have already been developed in the field of pharmacogenomic text mining and point out perspectives for future research.

At thirty-six (36) pages and well over 200 references, this is going to take a while to digest.

Some questions to be thinking about while reading:

How are entity recognition issues same/different?

What techniques have you seen before? How different/same?

What other techniques would you suggest?

July 25, 2012

Using MySQL Full-Text Search in Entity Framework

Filed under: Full-Text Search,MySQL,Searching,Text Mining — Patrick Durusau @ 6:14 pm

Using MySQL Full-Text Search in Entity Framework

Another database/text search post not for the faint of heart.

MySQL database supports an advanced functionality of full-text search (FTS) and full-text indexing described comprehensively in the documentation:

We decided to meet the needs of our users willing to take advantage of the full-text search in Entity Framework and implemented the full-text search functionality in our Devart dotConnect for MySQL ADO.NET Entity Framework provider.

Hard to say why Beyond Search picked up the Oracle post but left the MySQL one hanging.

I haven’t gone out and counted noses but I suspect there are a lot more installs of MySQL than Oracle 11g. Just my guess. Don’t buy or sell stock based on my guesses.
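Whatever the Entity Framework layer looks like, underneath it is MySQL's FULLTEXT index and MATCH ... AGAINST. Here is a bare sketch of that underlying feature through a DB-API connection (table and column names are invented, and a driver using the %s parameter style, such as mysql-connector, is assumed):

def search_articles(conn, terms):
    """Natural-language full-text search over a MySQL table."""
    # Requires: ALTER TABLE articles ADD FULLTEXT INDEX ft_title_body (title, body);
    cursor = conn.cursor()
    cursor.execute(
        """
        SELECT id, title,
               MATCH(title, body) AGAINST (%s IN NATURAL LANGUAGE MODE) AS score
        FROM articles
        WHERE MATCH(title, body) AGAINST (%s IN NATURAL LANGUAGE MODE)
        ORDER BY score DESC
        """,
        (terms, terms),
    )
    return cursor.fetchall()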

Using Oracle Full-Text Search in Entity Framework

Filed under: Full-Text Search,Oracle,Searching,Text Mining — Patrick Durusau @ 4:05 pm

Using Oracle Full-Text Search in Entity Framework

From the post:

Oracle database supports an advanced functionality of full-text search (FTS) called Oracle Text, which is described comprehensively in the documentation:

We decided to meet the needs of our users willing to take advantage of the full-text search in Entity Framework and implemented the basic Oracle Text functionality in our Devart dotConnect for Oracle ADO.NET Entity Framework provider.

Just in case you run across a client using Oracle to store text data. 😉

I first saw this at Beyond Search (As Stephen implies, it is not a resource for casual data miners.)

July 14, 2012

Finding Structure in Text, Genome and Other Symbolic Sequences

Filed under: Genome,Statistics,Symbol,Text Analytics,Text Corpus,Text Mining — Patrick Durusau @ 8:58 am

Finding Structure in Text, Genome and Other Symbolic Sequences by Ted Dunning. (thesis, 1998)

Abstract:

The statistical methods derived and described in this thesis provide new ways to elucidate the structural properties of text and other symbolic sequences. Generically, these methods allow detection of a difference in the frequency of a single feature, the detection of a difference between the frequencies of an ensemble of features and the attribution of the source of a text. These three abstract tasks suffice to solve problems in a wide variety of settings. Furthermore, the techniques described in this thesis can be extended to provide a wide range of additional tests beyond the ones described here.

A variety of applications for these methods are examined in detail. These applications are drawn from the area of text analysis and genetic sequence analysis. The textually oriented tasks include finding interesting collocations and cooccurent phrases, language identification, and information retrieval. The biologically oriented tasks include species identification and the discovery of previously unreported long range structure in genes. In the applications reported here where direct comparison is possible, the performance of these new methods substantially exceeds the state of the art.

Overall, the methods described here provide new and effective ways to analyse text and other symbolic sequences. Their particular strength is that they deal well with situations where relatively little data are available. Since these methods are abstract in nature, they can be applied in novel situations with relative ease.

Recently posted but dating from 1998.

Older materials are interesting because the careers of their authors can be tracked, say at DBLP: Ted Dunning.

Or it can lead you to check an author in Citeseer:

Accurate Methods for the Statistics of Surprise and Coincidence (1993)

Abstract:

Much work has been done on the statistical analysis of text. In some cases reported in the literature, inappropriate statistical methods have been used, and statistical significance of results have not been addressed. In particular, asymptotic normality assumptions have often been used unjustifiably, leading to flawed results. This assumption of normal distribution limits the ability to analyze rare events. Unfortunately rare events do make up a large fraction of real text. However, more applicable methods based on likelihood ratio tests are available that yield good results with relatively small samples. These tests can be implemented efficiently, and have been used for the detection of composite terms and for the determination of domain-specific terms. In some cases, these measures perform much better than the methods previously used. In cases where traditional contingency table methods work well, the likelihood ratio tests described here are nearly identical. This paper describes the basis of a measure based on likelihood ratios that can be applied to the analysis of text.

Which has over 600 citations, only one of which is from the author. (I could comment about a well-known self-citing ontologist but I won’t.)

The observations in the thesis about “large” data sets are dated but it merits your attention as fundamental work in the field of textual analysis.

As a bonus, it is quite well written and makes an enjoyable read.
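For the curious, the G² statistic from the 1993 paper fits in a few lines. The sketch below follows the entropy formulation Dunning has since popularised (it is essentially the version found in Apache Mahout): k11 counts the bigram, k12 and k21 count each word occurring without the other, and k22 counts everything else. The example counts are illustrative only.

from math import log

def _xlogx(x):
    return 0.0 if x == 0 else x * log(x)

def _entropy(*counts):
    return _xlogx(sum(counts)) - sum(_xlogx(c) for c in counts)

def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 for a 2x2 contingency table of bigram counts."""
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

# "new york" seen 110 times, "new" without "york" 2442 times, "york" without
# "new" 111 times, out of 1,000,000 bigrams (made-up numbers):
print(log_likelihood_ratio(110, 2442, 111, 997337))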

July 9, 2012

TUSTEP is open source – with TXSTEP providing a new XML interface

Filed under: Text Analytics,Text Mining,TUSTEP/TXSTEP,XML — Patrick Durusau @ 9:15 am

TUSTEP is open source – with TXSTEP providing a new XML interface

I won’t recount how many years ago I first received email from Wilhelm Ott about TUSTEP. 😉

From the TUSTEP homepage:

TUSTEP is a professional toolbox for scholarly processing textual data (including those in non-latin scripts) with a strong focus on humanities applications. It contains modules for all stages of scholarly text data processing, starting from data capture and including information retrieval, text collation, text analysis, sorting and ordering, rule-based text manipulation, and output in electronic or conventional form (including typesetting in professional quality).

Since the title “big data” is taken, perhaps we should take “complex data” for texts.

If you are exploring textual data in any detail or with XML, you should take a look at the TUSTEP project and its new XML interface, TXSTEP.

Or consider contributing to the project as well.

Wilhelm Ott writes (in part):

We are pleased to announce that, starting with the release 2012, TUSTEP is available as open source software. It is distributed under the Revised BSD Licence and can be downloaded from www.tustep.org.

TUSTEP has a long tradition as a highly flexible, reliable, efficient suite of programs for humanities computing. It started in the early 70ies as a tool for supporting humanities projects at the University of Tübingen, relying on own funds of the University. From 1985 to 1989, a substantial grant from the Land Baden-Württemberg officially opened its distribution beyond the limits of the University and started its success as a highly appreciated research tool for many projects at about a hundred universities and academic institutions in the German speaking part of the world, represented since 1993 in the International TUSTEP User Group (ITUG). Reports on important projects relying on TUSTEP and a list of publications (including lexicographic works and critical editions) can be found on the tustep webpage.

TXSTEP, presently being developed in cooperation with Stuttgart Media University, offers a new XML-based user interface to the TUSTEP programs. Compared to the original TUSTEP commands, we see important advantages:

  • it will offer an up-to-date established syntax for scripting;
  • it will show the typical benefits of working with an XML editor, like content completion, highlighting, showing annotations, and, of course, verifying the code;
  • it will offer – to a certain degree – a self teaching environment by commenting on the scope of every step;
  • it will help to avoid many syntactical errors, even compared to the original TUSTEP scripting environment;
  • the syntax is in English, providing a more widespread usability than TUSTEP’s German command language.

At the TEI conference last year in Würzburg, we presented a first prototype to an international audience. We look forward to DH2012 in Hamburg next week where, during the Poster Session, a more enhanced version which already contains most of TUSTEP’s functions will be presented. A demonstration of TXSTEP’s functionality will include tasks which can not easily be performed by existing XML tools.

After the demo, you are invited to download a test version of TXSTEP to play with, to comment on it and to help make it a great and flexible tool for everyday – and complex – questions.

OK, I confess a fascination with complex textual analysis.

July 7, 2012

Natural Language Processing | Hub

Natural Language Processing | Hub

From the “about” page:

NLP|Hub is an aggregator of news about Natural Language Processing and other related topics, such as Text Mining, Information Retrieval, Linguistics or Machine Learning.

NLP|Hub finds, collects and arranges related news from different sites, from academic webs to company blogs.

NLP|Hub is a product of Cilenis, a company specialized in Natural Language Processing.

If you have interesting posts for NLP|Hub, or if you do not want NLP|Hub indexing your text, please contact us at info@cilenis.com

Definitely going on my short list of sites to check!

June 29, 2012

National Centre for Text Mining (NaCTeM)

Filed under: Text Analytics,Text Extraction,Text Feature Extraction,Text Mining — Patrick Durusau @ 3:15 pm

National Centre for Text Mining (NaCTeM)

From the webpage:

The National Centre for Text Mining (NaCTeM) is the first publicly-funded text mining centre in the world. We provide text mining services in response to the requirements of the UK academic community. NaCTeM is operated by the University of Manchester with close collaboration with the University of Tokyo.

On our website, you can find pointers to sources of information about text mining such as links to

  • text mining services provided by NaCTeM
  • software tools, both those developed by the NaCTeM team and by other text mining groups
  • seminars, general events, conferences and workshops
  • tutorials and demonstrations
  • text mining publications

Let us know if you would like to include any of the above in our website.

This is a real treasure trove of software, resources and other materials.

I will be working “finds” from this site into reports for quite some time.

June 26, 2012

Text Mining & R

Filed under: R,Text Mining — Patrick Durusau @ 6:48 pm

As I dug deeper into Wordcloud of the Arizona et al. v. United States opinion, I ran across several resources on the tm package for text mining in R.

First, if you are in an R shell:


> library("tm")
> vignette("tm")

Produces an eight (8) page overview of the package.

Next stop should be An Introduction to Text Mining in R (R News volume 8/2, 2008, pages 19-22).

Demonstrations of stylometry using the Wizard of Oz book series and analysis of email archives either as RSS feeds or in mbox format.

If you are still curious, check out Text Mining Infrastructure in R, by Ingo Feinerer, Kurt Hornik and David Meyer. Journal of Statistical Software, March 2008, Volume 25, Issue 5.

Runs a little over fifty (50) pages.

The package is reported to be designed for extension and since this paper was published in 2008, I expect there are extensions not reflected in these resources.

Suggestions/pointers quite welcome!

