Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

January 2, 2015

Early English Books Online – Good News and Bad News

Early English Books Online

The very good news is that 25,000 volumes from the Early English Books Online collection have been made available to the public!

From the webpage:

The EEBO corpus consists of the works represented in the English Short Title Catalogue I and II (based on the Pollard & Redgrave and Wing short title catalogs), as well as the Thomason Tracts and the Early English Books Tract Supplement. Together these trace the history of English thought from the first book printed in English in 1475 through to 1700. The content covers literature, philosophy, politics, religion, geography, science and all other areas of human endeavor. The assembled collection of more than 125,000 volumes is a mainstay for understanding the development of Western culture in general and the Anglo-American world in particular. The STC collections have perhaps been most widely used by scholars of English, linguistics, and history, but these resources also include core texts in religious studies, art, women’s studies, history of science, law, and music.

Even better news from Sebastian Rahtz (Chief Data Architect, IT Services, University of Oxford):

The University of Oxford is now making this collection, together with Gale Cengage’s Eighteenth Century Collections Online (ECCO), and Readex’s Evans Early American Imprints, available in various formats (TEI P5 XML, HTML and ePub) initially via the University of Oxford Text Archive at http://www.ota.ox.ac.uk/tcp/, and offering the source XML for community collaborative editing via Github. For the convenience of UK universities who subscribe to JISC Historic Books, a link to page images is also provided. We hope that the XML will serve as the base for enhancements and corrections.

This catalogue also lists EEBO Phase 2 texts, but the HTML and ePub versions of these can only be accessed by members of the University of Oxford.

[Technical note]
Those interested in working on the TEI P5 XML versions of the texts can check them out of Github, via https://github.com/textcreationpartnership/, where each of the texts is in its own repository (eg https://github.com/textcreationpartnership/A00021). There is a CSV file listing all the texts at https://raw.githubusercontent.com/textcreationpartnership/Texts/master/TCP.csv, and a simple Linux/OSX shell script to clone all 32853 unrestricted repositories at https://raw.githubusercontent.com/textcreationpartnership/Texts/master/cloneall.sh
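If you only want a subset rather than all 32,853 repositories, a few lines of Python can drive the same clone step from the CSV listing. This is a rough sketch: it assumes the first column of TCP.csv is the TCP identifier (e.g. A00021) and that the first row is a header, so check the file before relying on it.

    import csv
    import subprocess
    import urllib.request

    CSV_URL = ("https://raw.githubusercontent.com/"
               "textcreationpartnership/Texts/master/TCP.csv")

    # Fetch the catalogue of TCP texts.
    with urllib.request.urlopen(CSV_URL) as response:
        rows = list(csv.reader(response.read().decode("utf-8").splitlines()))

    # Assumption: column 0 holds the TCP identifier; row 0 is a header.
    # Clone only the first ten repositories as a test run.
    for row in rows[1:11]:
        tcp_id = row[0].strip()
        repo = "https://github.com/textcreationpartnership/%s.git" % tcp_id
        subprocess.run(["git", "clone", "--depth", "1", repo], check=False)

The provided cloneall.sh remains the authoritative route for a full mirror; the sketch above just makes it easy to sample a handful of texts first.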

Now for the BAD NEWS:

An additional 45,000 books:

Currently, EEBO-TCP Phase II texts are available to authorized users at partner libraries. Once the project is done, the corpus will be available for sale exclusively through ProQuest for five years. Then, the texts will be released freely to the public.

Can you guess why the public is barred from what are obviously public domain texts?

Because our funding is limited, we aim to key as many different works as possible, in the language in which our staff has the most expertise.

Academic projects are supposed to fund themselves and be self-sustaining. When anyone asks about the sustainability of an academic project, ask them when their country's military was last "self-sustaining." The U.S. has spent $2.6 trillion on a "war on terrorism" and has nothing to show for it other than dead and injured military personnel, perversion of budgetary policies, and loss of privacy on a worldwide scale.

It is hard to imagine that lifetime access for everyone on Earth could not be secured for less than $1 trillion. No more special pricing and contracts if you are in countries A to Zed. Eliminate all that paperwork for publishers; for access, all anyone would need is a connection to the Internet. The publishers would have a guaranteed income stream and less overhead from sales personnel, administrative staff, etc. And people would have access (whether used or not) to educate themselves, to make new discoveries, etc.

My proposal does not involve payments to large military contractors or subversion of legitimate governments or imposition of American values on other cultures. Leaving those drawbacks to one side, what do you think about it otherwise?

December 17, 2014

Leveraging UIMA in Spark

Filed under: Spark,Text Mining,UIMA — Patrick Durusau @ 5:01 pm

Leveraging UIMA in Spark by Philip Ogren.

Description:

Much of the Big Data that Spark wielders tackle is unstructured text that requires text processing techniques. For example, performing named entity extraction on tweets or sentiment analysis on customer reviews are common activities. The Unstructured Information Management Architecture (UIMA) framework is an Apache project that provides APIs and infrastructure for building complex and robust text analytics systems. A typical system built on UIMA defines a collection of analysis engines (e.g., a tokenizer, part-of-speech tagger, named entity recognizer, etc.) which are executed according to arbitrarily complex flow control definitions. The framework makes it possible to have interoperable components in which best-of-breed solutions can be mixed and matched and chained together to create sophisticated text processing pipelines. However, UIMA can seem like a heavy weight solution that has a sprawling API, is cumbersome to configure, and is difficult to execute. Furthermore, UIMA provides its own distributed computing infrastructure and run time processing engines that overlap, in their own way, with Spark functionality. In order for Spark to benefit from UIMA, the latter must be light-weight and nimble and not impose its architecture and tooling onto Spark.

In this talk, I will introduce a project that I started called uimaFIT which is now part of the UIMA project (http://uima.apache.org/uimafit.html). With uimaFIT it is possible to adopt UIMA in a very light-weight way and leverage it for what it does best: text processing. An entire UIMA pipeline can be encapsulated inside a single function call that takes, for example, a string input parameter and returns named entities found in the input string. This allows one to call a Spark RDD transform (e.g. map) that performs named entity recognition (or whatever text processing tasks your UIMA components accomplish) on string values in your RDD. This approach requires little UIMA tooling or configuration and effectively reduces UIMA to a text processing library that can be called rather than requiring full-scale adoption of another platform. I will prepare a companion resource for this talk that will provide a complete, self-contained, working example of how to leverage UIMA using uimaFIT from within Spark.
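The core idea, wrapping an entire text-processing pipeline behind a single function that is then applied inside an RDD transform, is easy to picture even without the Java tooling. Here is a rough PySpark sketch of that shape; uimaFIT itself is Java, so extract_entities below is only a stand-in for a real UIMA/uimaFIT pipeline, not Philip's code or API.

    from pyspark import SparkContext

    def extract_entities(text):
        # Stand-in for a real pipeline (a uimaFIT-wrapped UIMA analysis
        # engine in Philip's talk); here we just pretend that
        # capitalized tokens are named entities.
        return [tok for tok in text.split() if tok.istitle()]

    sc = SparkContext(appName="ner-sketch")
    tweets = sc.parallelize([
        "Apache Spark meets Apache UIMA in Boston",
        "sentiment analysis on customer reviews",
    ])

    # The whole text-processing pipeline hides behind one function call,
    # so Spark sees nothing more than an ordinary map over strings.
    print(tweets.map(extract_entities).collect())

The point of the pattern is exactly what the description says: Spark never needs to know about the framework behind the function.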

The necessity of creating light-weight ways to bridge the gaps between applications and frameworks is a signal that every solution is trying to be the complete solution. Since we have different views of what any "complete" solution would look like, wheels are re-invented time and time again, along with all the parts necessary to use those wheels, resulting in a tremendous duplication of effort.

A component-based approach attempts to do one thing. Doing any one thing well is challenging enough. (Self-test: How many applications do more than one thing well? Assuming they do one thing well. BTW, for programmers, the test isn't that other programs fail to do it any better.)

Until more demand results in easy to pipeline components, Philip’s uimaFIT is a great way to incorporate text processing from UIMA into Spark.

Enjoy!

December 15, 2014

Some tools for lifting the patent data treasure

Filed under: Deduplication,Patents,Record Linkage,Text Mining — Patrick Durusau @ 11:57 am

Some tools for lifting the patent data treasure by Michele Peruzzi and Georg Zachmann.

From the post:

…Our work can be summarized as follows:

  1. We provide an algorithm that allows researchers to find the duplicates inside Patstat in an efficient way
  2. We provide an algorithm to connect Patstat to other kinds of information (CITL, Amadeus)
  3. We publish the results of our work in the form of source code and data for Patstat Oct. 2011.

More technically, we used or developed probabilistic supervised machine-learning algorithms that minimize the need for manual checks on the data, while keeping performance at a reasonably high level.

The post has links for source code and data for these three papers:

A flexible, scaleable approach to the international patent “name game” by Mark Huberty, Amma Serwaah, and Georg Zachmann

In this paper, we address the problem of having duplicated patent applicants’ names in the data. We use an algorithm that efficiently de-duplicates the data, needs minimal manual input and works well even on consumer-grade computers. Comparisons between entries are not limited to their names, and thus this algorithm is an improvement over earlier ones that required extensive manual work or overly cautious clean-up of the names.

A scaleable approach to emissions-innovation record linkage by Mark Huberty, Amma Serwaah, and Georg Zachmann

PATSTAT has patent applications as its focus. This means it lacks important information on the applicants and/or the inventors. In order to have more information on the applicants, we link PATSTAT to the CITL database. This way the patenting behaviour can be linked to climate policy. Because of the structure of the data, we can adapt the deduplication algorithm to use it as a matching tool, retaining all of its advantages.

Remerge: regression-based record linkage with an application to PATSTAT by Michele Peruzzi, Georg Zachmann, Reinhilde Veugelers

We further extend the information content in PATSTAT by linking it to Amadeus, a large database of companies that includes financial information. Patent microdata is now linked to financial performance data of companies. This algorithm compares records using multiple variables, learning their relative weights by asking the user to find the correct links in a small subset of the data. Since it is not limited to comparisons among names, it is an improvement over earlier efforts and is not overly dependent on the name-cleaning procedure in use. It is also relatively easy to adapt the algorithm to other databases, since it uses the familiar concept of regression analysis.

Record linkage is a form of merging that originated in epidemiology in the late 1940s. To "link" (read merge) records across different formats, records were transposed into a uniform format and "linking" characteristics chosen to gather matching records together. It is a very powerful technique that has been in continuous use and development ever since.
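A minimal sketch of that classic workflow in Python, under the assumption that the uniform format is just a lower-cased name plus a year and the "linking" characteristic is name similarity (the field names and sample records are invented; real systems use far richer comparison and blocking):

    from difflib import SequenceMatcher

    def normalize(record):
        # Transpose a source-specific record into a uniform format.
        return (record["name"].lower().strip(), record.get("year"))

    def linked(rec_a, rec_b, threshold=0.85):
        name_a, year_a = normalize(rec_a)
        name_b, year_b = normalize(rec_b)
        same_year = year_a is None or year_b is None or year_a == year_b
        similarity = SequenceMatcher(None, name_a, name_b).ratio()
        return same_year and similarity >= threshold

    patstat_rec = {"name": "ACME Widgets GmbH", "year": 2011}
    amadeus_rec = {"name": "Acme Widgets GMBH ", "year": 2011}
    print(linked(patstat_rec, amadeus_rec))  # True: gathered as one applicant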

One major difference from topic maps is that record linkage has undisclosed subjects, that is, the subjects that make up the common format and the association of the original data sets with that format. I assume in many cases the mapping is documented, but it doesn't appear as part of the final work product, thereby rendering the merging process opaque and inaccessible to future researchers. All you can say is "…this is the data set that emerged from the record linkage."

Sufficient for some purposes but if you want to reduce the 80% of your time that is spent munging data that has been munged before, it is better to have the mapping documented and to use disclosed subjects with identifying properties.

Having said all of that, these are tools you can use now on patents and/or extend them to other data sets. The disambiguation problems addressed for patents are the common ones you have encountered with other names for entities.

If a topic map underlies your analysis, you will spend less time on the next analysis of the same information. Think of it as reducing your intellectual overhead in subsequent data sets.

Income – Less overhead = Greater revenue for you. 😉

PS: Don't be confused: you are looking for the EPO Worldwide Patent Statistical Database (PATSTAT). Naturally there is a US site, http://www.patstats.org/, which covers only patent litigation statistics.

PPS: Sam Hunting, the source of so many interesting resources, pointed me to this post.

Sony Pictures Demands That News Agencies Delete ‘Stolen’ Data

Filed under: News,Reporting,Text Analytics,Text Mining — Patrick Durusau @ 10:31 am

Sony Pictures Demands That News Agencies Delete ‘Stolen’ Data by Michael Cieply and Brooks Barnes.

From the article:

Sony Pictures Entertainment warned media outlets on Sunday against using the mountains of corporate data revealed by hackers who raided the studio’s computer systems in an attack that became public last month.

In a sharply worded letter sent to news organizations, including The New York Times, David Boies, a prominent lawyer hired by Sony, characterized the documents as “stolen information” and demanded that they be avoided, and destroyed if they had already been downloaded or otherwise acquired.

The studio “does not consent to your possession, review, copying, dissemination, publication, uploading, downloading or making any use” of the information, Mr. Boies wrote in the three-page letter, which was distributed Sunday morning.

Since I wrote about the foolish accusations against North Korea by Sony, I thought it only fair to warn you that the idlers at Sony have decided to threaten everyone else.

A rather big leap from trash talking about North Korea to accusing the rest of the world of being interested in their incestuous bickering.

I certainly don’t want a copy of their movies, released or unreleased. Too much noise and too little signal for the space they would take. But, since Sony has gotten on its “let’s threaten everybody” hobby-horse, I do hope the location of the Sony documents suddenly appears in many more inboxes. patrick@durusau.net. 😉

How would you display choice snippets and those who uttered them when a webpage loads?

The bitching and catching by Sony are sure signs that something went terribly wrong internally. The current circus is an attempt to distract the public from that failure. Probably a member of management with highly inappropriate security clearance because “…they are important!”

Granting management inappropriate security clearances to networks is a sign of poor systems administration. I wonder when that shoe is going to drop?

December 7, 2014

Missing From Michael Brown Grand Jury Transcripts

Filed under: Ferguson,Skepticism,Text Mining — Patrick Durusau @ 7:40 am

What’s missing from the Michael Brown grand jury transcripts? Index pages. For 22 out of 24 volumes of grand jury transcripts, the index page is missing. Here’s the list:

  • volume 1 – page 4 missing
  • volume 2 – page 4 missing
  • volume 3 – page 4 missing
  • volume 4 – page 4 missing
  • volume 5 – page 4 missing
  • volume 6 – page 4 missing
  • volume 7 – page 4 missing
  • volume 8 – page 4 missing
  • volume 9 – page 4 missing
  • volume 10 – page 4 missing
  • volume 11 – page 4 missing
  • volume 12 – page 4 missing
  • volume 13 – page 4 missing
  • volume 14 – page 4 missing
  • volume 15 – page 4 missing
  • volume 16 – page 4 missing
  • volume 17 – page 4 missing
  • volume 18 – page 4 missing
  • volume 19 – page 4 missing
  • volume 20 – page 4 missing
  • volume 21 – page 4 present
  • volume 22 – page 4 missing
  • volume 23 – page 4 missing
  • volume 24 – page 4 present

As you can see from the indexes in volumes 21 and 24, they are not terribly useful, but they are better than combing through twenty-four volumes (4,799 pages of text) to find where a witness testifies.

Someone (a court reporter?) made a conscious decision to take action that makes the transcripts harder to use.

Perhaps this is, as they say, “chance.”

Stay tuned for posts later this week that upgrade that to “coincidence” and beyond.

November 25, 2014

Documents Released in the Ferguson Case

Filed under: Data Mining,Ferguson,Text Mining — Patrick Durusau @ 4:15 pm

Documents Released in the Ferguson Case (New York Times)

The New York Times has posted the following documents from the Ferguson case:

  • 24 Volumes of Grand Jury Testimony
  • 30 Interviews of Witnesses by Law Enforcement Officials
  • 23 Forensic and Other Reports
  • 254 Photographs

Assume you are interested in organizing these materials for rapid access and cross-linking between them.

What are your requirements?

  1. Accessing Grand Jury Testimony by volume and page number?
  2. Accessing Interviews of Witnesses by report and page number?
  3. Linking people to reports, testimony and statements?
  4. Linking comments to particular photographs?
  5. Linking comments to a timeline?
  6. Linking Forensic reports to witness statements and/or testimony?
  7. Linking physical evidence into witness statements and/or testimony?
  8. Others?

It’s a lot of material so which requirements, these or others, would be your first priority?

It’s not a death march project but on the other hand, you need to get the most valuable tasks done first.

Suggestions?

November 19, 2014

Mining Idioms from Source Code

Filed under: Programming,Text Mining — Patrick Durusau @ 3:52 pm

Mining Idioms from Source Code by Miltiadis Allamanis and Charles Sutton.

Abstract:

We present the first method for automatically mining code idioms from a corpus of previously written, idiomatic software projects. We take the view that a code idiom is a syntactic fragment that recurs across projects and has a single semantic role. Idioms may have metavariables, such as the body of a for loop. Modern IDEs commonly provide facilities for manually defining idioms and inserting them on demand, but this does not help programmers to write idiomatic code in languages or using libraries with which they are unfamiliar. We present HAGGIS, a system for mining code idioms that builds on recent advanced techniques from statistical natural language processing, namely, nonparametric Bayesian probabilistic tree substitution grammars. We apply HAGGIS to several of the most popular open source projects from GitHub. We present a wide range of evidence that the resulting idioms are semantically meaningful, demonstrating that they do indeed recur across software projects and that they occur more frequently in illustrative code examples collected from a Q&A site. Manual examination of the most common idioms indicate that they describe important program concepts, including object creation, exception handling, and resource management.

A deeply interesting paper that identifies code idioms without the idioms being specified in advance.

Opens up a path to further investigation of programming idioms and annotation of such idioms.

I first saw this in: Mining Idioms from Source Code – Miltiadis Allamanis, a review of a presentation by Felienne Hermans.

October 10, 2014

Annotated Chemical Patent Corpus: A Gold Standard for Text Mining

Filed under: Cheminformatics,Chemistry,Corpora,Text Corpus,Text Mining — Patrick Durusau @ 8:37 am

Annotated Chemical Patent Corpus: A Gold Standard for Text Mining by Saber A. Akhondi, et al. (Published: September 30, 2014 DOI: 10.1371/journal.pone.0107477)

Abstract:

Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Extracting chemical and biological entities from patents through manual extraction by expert curators can take substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, targets, and modes of action. Spelling mistakes and spurious line break due to optical character recognition errors were also annotated. A subset of 47 patents was annotated by at least three annotator groups, from which harmonized annotations and inter-annotator agreement scores were derived. One group annotated the full set. The patent corpus includes 400,125 annotations for the full set and 36,537 annotations for the harmonized set. All patents and annotated entities are publicly available at www.biosemantics.org.

Highly recommended both as a "gold standard" for chemical patent text mining and as the state of the art in developing such a standard.

To say nothing of annotation as a means of automatic creation of topic maps where entities are imbued with subject identity properties.

I first saw this in a tweet by ChemConnector.

October 3, 2014

Open Challenges for Data Stream Mining Research

Filed under: BigData,Data Mining,Data Streams,Text Mining — Patrick Durusau @ 4:58 pm

Open Challenges for Data Stream Mining Research, SIGKDD Explorations, Volume 16, Number 1, June 2014.

Abstract:

Every day, huge volumes of sensory, transactional, and web data are continuously generated as streams, which need to be analyzed online as they arrive. Streaming data can be considered as one of the main sources of what is called big data. While predictive modeling for data streams and big data have received a lot of attention over the last decade, many research approaches are typically designed for well-behaved controlled problem settings, over-looking important challenges imposed by real-world applications. This article presents a discussion on eight open challenges for data stream mining. Our goal is to identify gaps between current research and meaningful applications, highlight open problems, and define new application-relevant research directions for data stream mining. The identified challenges cover the full cycle of knowledge discovery and involve such problems as: protecting data privacy, dealing with legacy systems, handling incomplete and delayed information, analysis of complex data, and evaluation of stream algorithms. The resulting analysis is illustrated by practical applications and provides general suggestions concerning lines of future research in data stream mining.

Under entity stream mining, the authors describe the challenge of aggregation:

The first challenge of entity stream mining task concerns information summarization: how to aggregate into each entity e at each time point t the information available on it from the other streams? What information should be stored for each entity? How to deal with differences in the speeds of individual streams? How to learn over the streams efficiently? Answering those questions in a seamless way would allow us to deploy conventional stream mining methods for entity stream mining after aggregation.

Sounds remarkably like an issue for topic maps, doesn't it? Well, not topic maps in the sense that every entity has an IRI subjectIdentifier, but in the sense that merging rules define the basis on which two or more entities are considered to represent the same subject.
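To make that concrete, here is a toy aggregation sketch (not the paper's method): a merging rule in the topic map sense is just a key function over identifying properties, and aggregation collects everything from the individual streams under that key. The stream names and fields are invented for illustration.

    from collections import defaultdict

    def merge_key(item):
        # Merging rule: items agreeing on these identifying properties
        # are treated as the same entity.
        return (item["name"].lower(), item.get("country"))

    def aggregate(streams):
        entities = defaultdict(lambda: {"mentions": 0, "sources": set()})
        for source, items in streams.items():
            for item in items:
                summary = entities[merge_key(item)]
                summary["mentions"] += 1
                summary["sources"].add(source)
        return dict(entities)

    streams = {
        "news":   [{"name": "Acme", "country": "DE"}],
        "tweets": [{"name": "ACME", "country": "DE"}],
    }
    print(aggregate(streams))  # one entity, seen in both streams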

The entire issue is on “big data” and if you are looking for research “gaps,” it is a great starting point. Table of Contents: SIGKDD explorations, Volume 16, Number 1, June 2014.

I included the TOC link because for reasons only known to staff at the ACM, the articles in this issue don’t show up in the library index. One of the many “features” of the ACM Digital Library.

That is in addition to the committee which oversees the Digital Library being undisclosed to members and available for contact only through staff.

September 10, 2014

QPDF – PDF Transformations

Filed under: PDF,Text Mining — Patrick Durusau @ 9:48 am

QPDF – PDF Transformations

From the webpage:

QPDF is a command-line program that does structural, content-preserving transformations on PDF files. It could have been called something like pdf-to-pdf. It also provides many useful capabilities to developers of PDF-producing software or for people who just want to look at the innards of a PDF file to learn more about how they work.

QPDF is capable of creating linearized (also known as web-optimized) files and encrypted files. It is also capable of converting PDF files with object streams (also known as compressed objects) to files with no compressed objects or to generate object streams from files that don’t have them (or even those that already do). QPDF also supports a special mode designed to allow you to edit the content of PDF files in a text editor….

Government agencies often publish information in PDF, and that PDF often has restrictions on copying and printing.

I have briefly tested QPDF and it does take care of copying and printing restrictions. Be aware that QPDF has many other capabilities as well.
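For scripted use, a thin Python wrapper over the qpdf command line is enough. The sketch below assumes qpdf is on your PATH and uses its --decrypt option, which rewrites the file without its encryption dictionary (where the copy/print restrictions live).

    import subprocess
    import sys

    def strip_restrictions(src, dst):
        # qpdf --decrypt produces a content-identical PDF with the
        # encryption (and its usage restrictions) removed.
        subprocess.run(["qpdf", "--decrypt", src, dst], check=True)

    if __name__ == "__main__":
        strip_restrictions(sys.argv[1], sys.argv[2])

Usage: python strip_restrictions.py locked.pdf unlocked.pdf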

September 3, 2014

Data Sciencing by Numbers:…

Filed under: Data Mining,Text Analytics,Text Mining — Patrick Durusau @ 3:28 pm

Data Sciencing by Numbers: A Walk-through for Basic Text Analysis by Jason Baldridge.

From the post:

My previous post “Titillating Titles: Uncoding SXSW Proposal Titles with Text Analytics” discusses a simple exploration I did into algorithmically rating SXSW titles, most of which I did while on a plane trip last week. What I did was pretty basic, and to demonstrate that, I’m following up that post with one that explicitly shows you how you can do it yourself, provided you have access to a Mac or Unix machine.

There are three main components to doing what I did for the blog post:

  • Topic modeling code: the Mallet toolkit’s implementation of Latent Dirichlet Allocation
  • Language modeling code: the BerkeleyLM Java package for training and using n-gram language models
  • Unix command line tools for processing raw text files with standard tools and the topic modeling and language modeling code
I'll assume you can use the Unix command line at at least a basic level, and I've packaged up the topic modeling and language modeling code in the Github repository maul to make it easy to try them out. To keep it really simple: you can download the Maul code and then follow the instructions in the Maul README. (By the way, by giving it the name "maul" I don't want to convey that it is important or anything — it is just a name I gave the repository, which is just a wrapper around other people's code.)

Jason's post should help get you started doing data exercises. It is up to you if you continue those exercises and branch out to other data and new tools.

    Like everything else, data exploration proficiency requires regular exercise.

    Are you keeping a data exercise calendar?

    I first saw this in a post by Jason Baldridge.

    August 27, 2014

    You Say “Concepts” I Say “Subjects”

    Researchers are cracking text analysis one dataset at a time by Derrick Harris.

    From the post:

    Google on Monday released the latest in a string of text datasets designed to make it easier for people outside its hallowed walls to build applications that can make sense of all the words surrounding them.

    As explained in a blog post, the company analyzed the New York Times Annotated Corpus — a collection of millions of articles spanning 20 years, tagged for properties such as people, places and things mentioned — and created a dataset that ranks the salience (or relative importance) of every name mentioned in each one of those articles.

    Essentially, the goal with the dataset is to give researchers a base understanding of which entities are important within particular pieces of content, an understanding that should then be complemented with background data sources that will provide even more information. So while the number of times a person or company is mentioned in an article can be a very strong sign of which words are important — especially when compared to the usual mention count for that word, one of the early methods for ranking search results — a more telling method of ranking importance would also leverage existing knowledge of broader concepts to capture important words that don’t stand out from a volume perspective.

    A summary of some of the recent work on recognizing concepts in text and not just key words.

As topic mappers know, there is no universal one-to-one correspondence between words and subjects ("concepts" in this article). Finding "concepts" means that whatever words triggered that recognition, we can supply other information that is known about the same concept.

    Certainly will make topic map authoring easier when text analytics can generate occurrence data and decorate existing topic maps with their findings.
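The mention-count baseline described above is simple enough to sketch: score each word in a document by how much more often it appears there than in a background corpus. This is only a naive illustration of the idea, not Google's salience method, and the function name and toy data are mine.

    from collections import Counter

    def naive_salience(doc_tokens, background_counts, background_total):
        doc_counts = Counter(doc_tokens)
        doc_total = len(doc_tokens)
        scores = {}
        for word, count in doc_counts.items():
            doc_rate = count / doc_total
            # Add-one smoothing so unseen background words do not blow up.
            background_rate = (background_counts.get(word, 0) + 1) / background_total
            scores[word] = doc_rate / background_rate
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    background = Counter("the patent office granted the patent to the firm".split())
    doc = "acme filed a patent and acme won".split()
    print(naive_salience(doc, background, sum(background.values())))

A real salience model, as the article notes, would also lean on background knowledge about the entities rather than counts alone.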

    July 19, 2014

    Ad-hoc Biocuration Workflows?

    Filed under: Bioinformatics,Text Mining — Patrick Durusau @ 6:54 pm

    Text-mining-assisted biocuration workflows in Argo by Rafal Rak, et al. (Database (2014) 2014 : bau070 doi: 10.1093/database/bau070)

    Abstract:

    Biocuration activities have been broadly categorized into the selection of relevant documents, the annotation of biological concepts of interest and identification of interactions between the concepts. Text mining has been shown to have a potential to significantly reduce the effort of biocurators in all the three activities, and various semi-automatic methodologies have been integrated into curation pipelines to support them. We investigate the suitability of Argo, a workbench for building text-mining solutions with the use of a rich graphical user interface, for the process of biocuration. Central to Argo are customizable workflows that users compose by arranging available elementary analytics to form task-specific processing units. A built-in manual annotation editor is the single most used biocuration tool of the workbench, as it allows users to create annotations directly in text, as well as modify or delete annotations created by automatic processing components. Apart from syntactic and semantic analytics, the ever-growing library of components includes several data readers and consumers that support well-established as well as emerging data interchange formats such as XMI, RDF and BioC, which facilitate the interoperability of Argo with other platforms or resources. To validate the suitability of Argo for curation activities, we participated in the BioCreative IV challenge whose purpose was to evaluate Web-based systems addressing user-defined biocuration tasks. Argo proved to have the edge over other systems in terms of flexibility of defining biocuration tasks. As expected, the versatility of the workbench inevitably lengthened the time the curators spent on learning the system before taking on the task, which may have affected the usability of Argo. The participation in the challenge gave us an opportunity to gather valuable feedback and identify areas of improvement, some of which have already been introduced.

    Database URL: http://argo.nactem.ac.uk

    From the introduction:

    Data curation from biomedical literature had been traditionally carried out as an entirely manual effort, in which a curator handpicks relevant documents and creates annotations for elements of interest from scratch. To increase the efficiency of this task, text-mining methodologies have been integrated into curation pipelines. In curating the Biomolecular Interaction Network Database (1), a protein–protein interaction extraction system was used and was shown to be effective in reducing the curation work-load by 70% (2). Similarly, a usability study revealed that the time needed to curate FlyBase records (3) was reduced by 20% with the use of a gene mention recognizer (4). Textpresso (5), a text-mining tool that marks up biomedical entities of interest, was used to semi-automatically curate mentions of Caenorhabditis elegans proteins from the literature and brought about an 8-fold increase in curation efficiency (6). More recently, the series of BioCreative workshops (http://www.biocreative.org) have fostered the synergy between biocuration efforts and text-mining solutions. The user-interactive track of the latest workshop saw nine Web-based systems featuring rich graphical user interfaces designed to perform text-mining-assisted biocuration tasks. The tasks can be broadly categorized into the selection of documents for curation, the annotation of mentions of relevant biological entities in text and the annotation of interactions between biological entities (7).

    Argo is a truly impressive text-mining-assisted biocuration application but the first line of a biocuration article needs to read:

    Data curation from biomedical literature had been traditionally carried out as an entirely ad-hoc effort, after the author has submitted their paper for publication.

    There is an enormous backlog of material that desperately needs biocuration and Argo (and other systems) have a vital role to play in that effort.

    However, the situation of ad-hoc biocuration is never going to improve unless and until biocuration is addressed in the authoring of papers to appear in biomedical literature.

Who better to answer questions or resolve ambiguities that appear in biocuration than the authors of the papers?

    That would require working to extend MS Office and Apache OpenOffice, to name two of the more common authoring platforms.

    But the return would be higher quality publications earlier in the publication cycle, which would enable publishers to provide enhanced services based upon higher quality products and enhance tracing and searching of the end products.

    No offense to ad-hoc efforts but higher quality sooner in the publication process seems like an unbeatable deal.

    July 4, 2014

    The Restatement Project

    Filed under: Law,Law - Sources,Text Mining — Patrick Durusau @ 4:15 pm

    Rough Consensus, Running Standards: The Restatement Project by Jason Boehmig, Tim Hwang, and Paul Sawaya.

    From part 3:

    Supported by a grant from the Knight Foundation Prototype Fund, Restatement is a simple, rough-and-ready system which automatically parses legal text into a basic machine-readable JSON format. It has also been released under the permissive terms of the MIT License, to encourage active experimentation and implementation.

    The concept is to develop an easily-extensible system which parses through legal text and looks for some common features to render into a standard format. Our general design principle in developing the parser was to begin with only the most simple features common to nearly all legal documents. This includes the parsing of headers, section information, and “blanks” for inputs in legal documents like contracts. As a demonstration of the potential application of Restatement, we’re also designing a viewer that takes documents rendered in the Restatement format and displays them in a simple, beautiful, web-readable version.

    I skipped the sections justifying the project because in my circles, the need for text mining is presumed and the interesting questions are about the text and/or the techniques for mining.

    As you might suspect, I have my doubts about using JSON for legal texts but for a first cut, let’s hope the project is successful. There is always time to convert to a more robust format at some later point, in response to a particular need.

    Definitely a project to watch or assist if you are considering creating a domain specific conversion editor.

    July 3, 2014

    rplos Tutorial

    Filed under: R,Science,Text Mining — Patrick Durusau @ 2:14 pm

    rplos Tutorial

    From the webpage:

    The rplos package interacts with the API services of PLoS (Public Library of Science) Journals. In order to use rplos, you need to obtain your own key to their API services. Instruction for obtaining and installing keys so they load automatically when you launch R are on our GitHub Wiki page Installation and use of API keys.

    This tutorial will go through three use cases to demonstrate the kinds
    of things possible in rplos.

    • Search across PLoS papers in various sections of papers
    • Search for terms and visualize results as a histogram OR as a plot through time
    • Text mining of scientific literature

    Another source of grist for your topic map mill!

    July 2, 2014

    Verticalize

    Filed under: Bioinformatics,Data Mining,Text Mining — Patrick Durusau @ 3:05 pm

    Verticalize by Pierre Lindenbum.

    From the webpage:

    Simple tool to verticalize text delimited files.

    Pierre works in bioinformatics and is the author of many useful tools.

    Definitely one for the *nix toolbox.

    Testing LDA

    Filed under: Latent Dirichlet Allocation (LDA),Text Mining,Tweets — Patrick Durusau @ 2:12 pm

    Using Latent Dirichlet Allocation to Categorize My Twitter Feed by Joseph Misiti.

    From the post:

    Over the past 3 years, I have tweeted about 4100 times, mostly URLS, and mostly about machine learning, statistics, big data, etc. I spent some time this past weekend seeing if I could categorize the tweets using Latent Dirichlet Allocation. For a great introduction to Latent Dirichlet Allocation (LDA), you can read the following link here. For the more mathematically inclined, you can read through this excellent paper which explains LDA in a lot more detail.

The first step to categorizing my tweets was pulling the data. I initially downloaded and installed Twython and tried to pull all of my tweets using the Twitter API, but quickly realized there was an archive button under settings. So I stopped writing code and just double clicked the archive button. Apparently 4100 tweets is fairly easy to archive, because I received an email from Twitter within 15 seconds with a download link.

    When you read Joseph’s post, note that he doesn’t use the content of his tweets but rather the content of the URLs he tweeted as the subject of the LDA analysis.

    Still a valid corpus for LDA analysis but I would not characterize it as “categorizing” his tweet feed, meaning the tweets, but rather “categorizing” the content he tweeted about. Not the same thing.

    A useful exercise because it uses LDA on a corpus with which you should be familiar, the materials you tweeted about.

    As opposed to using LDA on a corpus that is less well known to you and you are reduced to running sanity checks with no real feel for the data.

It would be an interesting exercise to discover the top topics for the corpus you tweeted about (Joseph's post) and also for the corpus of #tags that you used in your tweets. Are they the same or different?
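If you want to try that comparison, a bare-bones LDA run with gensim looks roughly like the sketch below. Joseph used his own code and corpus; the function name, toy token lists, and parameter choices here are mine, purely for illustration.

    from gensim import corpora, models

    def top_topics(token_lists, num_topics=2):
        # token_lists: one tokenized "document" per item, e.g. the text
        # behind each tweeted URL, or the hashtags from each tweet.
        dictionary = corpora.Dictionary(token_lists)
        bow = [dictionary.doc2bow(tokens) for tokens in token_lists]
        lda = models.LdaModel(bow, num_topics=num_topics,
                              id2word=dictionary, passes=10)
        return lda.print_topics()

    url_corpus = [["machine", "learning", "statistics"], ["big", "data", "spark"]]
    tag_corpus = [["machinelearning", "bigdata"], ["statistics", "rstats"]]
    for corpus in (url_corpus, tag_corpus):
        print(top_topics(corpus))

Running the same function over both corpora is the quickest way to see whether the tweeted content and the hashtags tell the same story.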

    I first saw this in a tweet by Christophe Lalanne.

    June 21, 2014

    Egas:…

    Filed under: Bioinformatics,Biomedical,Medical Informatics,Text Mining — Patrick Durusau @ 7:42 pm

    Egas: a collaborative and interactive document curation platform by David Campos, el al.

    Abstract:

    With the overwhelming amount of biomedical textual information being produced, several manual curation efforts have been set up to extract and store concepts and their relationships into structured resources. As manual annotation is a demanding and expensive task, computerized solutions were developed to perform such tasks automatically. However, high-end information extraction techniques are still not widely used by biomedical research communities, mainly because of the lack of standards and limitations in usability. Interactive annotation tools intend to fill this gap, taking advantage of automatic techniques and existing knowledge bases to assist expert curators in their daily tasks. This article presents Egas, a web-based platform for biomedical text mining and assisted curation with highly usable interfaces for manual and automatic in-line annotation of concepts and relations. A comprehensive set of de facto standard knowledge bases are integrated and indexed to provide straightforward concept normalization features. Real-time collaboration and conversation functionalities allow discussing details of the annotation task as well as providing instant feedback of curator’s interactions. Egas also provides interfaces for on-demand management of the annotation task settings and guidelines, and supports standard formats and literature services to import and export documents. By taking advantage of Egas, we participated in the BioCreative IV interactive annotation task, targeting the assisted identification of protein–protein interactions described in PubMed abstracts related to neuropathological disorders. When evaluated by expert curators, it obtained positive scores in terms of usability, reliability and performance. These results, together with the provided innovative features, place Egas as a state-of-the-art solution for fast and accurate curation of information, facilitating the task of creating and updating knowledge bases and annotated resources.

    Database URL: http://bioinformatics.ua.pt/egas

    Read this article and/or visit the webpage and tell me this doesn’t have topic map editor written all over it!

    Domain specific to be sure but any decent interface for authoring topic maps is going to be domain specific.

    Very, very impressive!

    I am following up with the team to check on the availability of the software.

    May 31, 2014

    Conference on Weblogs and Social Media (Proceedings)

    Filed under: Blogs,Social Media,Social Networks,Text Mining — Patrick Durusau @ 1:53 pm

    Proceedings of the Eighth International Conference on Weblogs and Social Media

    A great collection of fifty-eight papers and thirty-one posters on weblogs and social media.

Not directly applicable to topic maps, but social media messages are as confused and ambiguous as any area could be. Perhaps more so, but there isn't a reliable measure for semantic confusion that I am aware of to compare different media.

    These papers may give you some insight into social media and useful ways for processing its messages.

    I first saw this in a tweet by Ben Hachey.

    May 15, 2014

    (String/text processing)++:…

    Filed under: String Matching,Text Feature Extraction,Text Mining,Unicode — Patrick Durusau @ 2:49 pm

    (String/text processing)++: stringi 0.2-3 released by Marek Gągolewski.

    From the post:

    A new release of the stringi package is available on CRAN (please wait a few days for Windows and OS X binary builds).

    stringi is a package providing (but definitely not limiting to) replacements for nearly all the character string processing functions known from base R. While developing the package we had high performance and portability of its facilities in our minds.

    Here is a very general list of the most important features available in the current version of stringi:

    • string searching:
      • with ICU (Java-like) regular expressions,
      • ICU USearch-based locale-aware string searching (quite slow, but working properly e.g. for non-Unicode normalized strings),
      • very fast, locale-independent byte-wise pattern matching;
    • joining and duplicating strings;
    • extracting and replacing substrings;
    • string trimming, padding, and text wrapping (e.g. with Knuth's dynamic word wrap algorithm);
    • text transliteration;
    • text collation (comparing, sorting);
    • text boundary analysis (e.g. for extracting individual words);
    • random string generation;
    • Unicode normalization;
    • character encoding conversion and detection;

    and many more.

    Interesting isn’t it? How CS keeps circling around back to strings?

    Enjoy!

    May 13, 2014

    Text Coherence

    Filed under: Text Analytics,Text Coherence,Text Mining,Topic Models (LDA) — Patrick Durusau @ 2:12 pm

Christopher Phipps mentioned Automatic Evaluation of Text Coherence: Models and Representations by Mirella Lapata and Regina Barzilay in a tweet today. Running that article down, I discovered it was published in the proceedings of the International Joint Conference on Artificial Intelligence in 2005.

    Useful but a bit dated.

    A more recent resource: A Bibliography of Coherence and Cohesion, Wolfram Bublitz (Universität Augsburg). Last updated: 2010.

The Bublitz bibliography is more recent, but a current bibliography would be even more useful.

    Can you suggest a more recent bibliography on text coherence/cohesion?

    I ask because while looking for such a bibliography, I encountered: Improving Topic Coherence with Regularized Topic Models by David Newman, Edwin V. Bonilla, and, Wray Buntine.

    The abstract reads:

    Topic models have the potential to improve search and browsing by extracting useful semantic themes from web pages and other text documents. When learned topics are coherent and interpretable, they can be valuable for faceted browsing, results set diversity analysis, and document retrieval. However, when dealing with small collections or noisy text (e.g. web search result snippets or blog posts), learned topics can be less coherent, less interpretable, and less useful. To over-come this, we propose two methods to regularize the learning of topic models. Our regularizers work by creating a structured prior over words that reflect broad patterns in the external data. Using thirteen datasets we show that both regularizers improve topic coherence and interpretability while learning a faithful representation of the collection of interest. Overall, this work makes topic models more useful across a broader range of text data.

    I don’t think the “…small collections or noisy text (e.g. web search result snippets or blog posts), learned topics can be less coherent, less interpretable, and less useful” is a surprise to anyone. I take that as the traditional “garbage in, garbage out.”

    However, “regularizers” may be useful for automatic/assisted authoring of topics in the topic map sense of the word topic. Assuming you want to mine “small or small and noisy texts.” The authors say the technique should apply to large texts and promise future research on applying “regularizers” to large texts.

    I checked the authors’ recent publications but didn’t see anything I would call a “large” text application of “regularizers.” Open area of research if you want to take the lead.

    May 12, 2014

    GATE 8.0

    Filed under: Annotation,Linguistics,Text Analytics,Text Corpus,Text Mining — Patrick Durusau @ 2:34 pm

    GATE (general architecture for text engineering) 8.0

    From the download page:

    Release 8.0 (May 11th 2014)

    Most users should download the installer package (~450MB):

    If the installer does not work for you, you can download one of the following packages instead. See the user guide for installation instructions:

    The BIN, SRC and ALL packages all include the full set of GATE plugins and all the libraries GATE requires to run, including sample trained models for the LingPipe and OpenNLP plugins.

    Version 8.0 requires Java 7 or 8, and Mac users must install the full JDK, not just the JRE.

    Four major changes in this release:

    1. Requires Java 7 or later to run
    2. Tools for Twitter.
    3. ANNIE (named entity annotation pipeline) Refreshed.
    4. Tools for Crowd Sourcing.

    Not bad for a project that will turn twenty (20) next year!

    More resources:

    UsersGuide

    Nightly Snapshots

    Mastering a substantial portion of GATE should keep you in nearly constant demand.

    April 18, 2014

    Saving Output of nltk Text.Concordance()

    Filed under: Concordance,NLTK,Text Mining — Patrick Durusau @ 4:15 pm

    Saving Output of NLTK Text.Concordance() by Kok Hua.

    From the post:

    In NLP, sometimes users would like to search for series of phrases that contain particular keyword in a passage or web page.

NLTK provides the function concordance() to locate and print series of phrases that contain the keyword. However, the function only prints the output. The user is not able to save the results for further processing unless they redirect stdout.

The function below emulates the concordance function and returns the list of phrases for further processing. It uses the NLTK ConcordanceIndex, which keeps track of the keyword indexes in the passage/text and retrieves the surrounding words.
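A minimal sketch along those lines, using NLTK's ConcordanceIndex directly (the function name and window width are mine; Kok Hua's version may differ in detail):

    from nltk.text import ConcordanceIndex
    from nltk.tokenize import word_tokenize

    def concordance_phrases(raw_text, keyword, width=5):
        # Build the index once, then slice out the tokens around every
        # occurrence of the keyword and return them as phrases.
        tokens = word_tokenize(raw_text)
        index = ConcordanceIndex(tokens, key=lambda s: s.lower())
        phrases = []
        for offset in index.offsets(keyword.lower()):
            left = max(0, offset - width)
            phrases.append(" ".join(tokens[left:offset + width + 1]))
        return phrases

    text = "Topic maps merge subjects. Subjects recur across topic maps."
    print(concordance_phrases(text, "maps"))

Unlike Text.concordance(), the result comes back as data you can filter, sort, or feed into the next stage of a pipeline.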

    Text mining is a very common part of topic map construction so tools that help with that task are always welcome.

    To be honest, I am citing this because it may become part of several small tools for processing standards drafts. Concordance software is not rare but a full concordance of a document seems to frighten some proof readers.

The current thinking is that if only the "important" terms are highlighted in context, some proof readers will be more likely to use the work product.

The same principle applies to the authoring of topic maps as well.

    April 14, 2014

    tagtog: interactive and text-mining-assisted annotation…

    Filed under: Annotation,Biomedical,Genomics,Text Mining — Patrick Durusau @ 8:55 am

    tagtog: interactive and text-mining-assisted annotation of gene mentions in PLOS full-text articles by Juan Miguel Cejuela, et al.

    Abstract:

    The breadth and depth of biomedical literature are increasing year upon year. To keep abreast of these increases, FlyBase, a database for Drosophila genomic and genetic information, is constantly exploring new ways to mine the published literature to increase the efficiency and accuracy of manual curation and to automate some aspects, such as triaging and entity extraction. Toward this end, we present the ‘tagtog’ system, a web-based annotation framework that can be used to mark up biological entities (such as genes) and concepts (such as Gene Ontology terms) in full-text articles. tagtog leverages manual user annotation in combination with automatic machine-learned annotation to provide accurate identification of gene symbols and gene names. As part of the BioCreative IV Interactive Annotation Task, FlyBase has used tagtog to identify and extract mentions of Drosophila melanogaster gene symbols and names in full-text biomedical articles from the PLOS stable of journals. We show here the results of three experiments with different sized corpora and assess gene recognition performance and curation speed. We conclude that tagtog-named entity recognition improves with a larger corpus and that tagtog-assisted curation is quicker than manual curation.

    Database URL: www.tagtog.net, www.flybase.org.

    Encouraging because the “tagging” is not wholly automated nor is it wholly hand-authored. Rather the goal is to create an interface that draws on the strengths of automated processing as moderated by human expertise.

    Annotation remains at a document level, which consigns subsequent users to mining full text but this is definitely a step in the right direction.

    March 18, 2014

    eMOP Early Modern OCR Project

    Filed under: OCR,Text Mining — Patrick Durusau @ 9:06 pm

    eMOP Early Modern OCR Project

    From the webpage:

    The Early Modern OCR Project is an effort, on the one hand, to make access to texts more transparent and, on the other, to preserve a literary cultural heritage. The printing process in the hand-press period (roughly 1475-1800), while systematized to a certain extent, nonetheless produced texts with fluctuating baselines, mixed fonts, and varied concentrations of ink (among many other variables). Combining these factors with the poor quality of the images in which many of these books have been preserved (in EEBO and, to a lesser extent, ECCO), creates a problem for Optical Character Recognition (OCR) software that is trying to translate the images of these pages into archiveable, mineable texts. By using innovative applications of OCR technology and crowd-sourced corrections, eMOP will solve this OCR problem.

    I first saw this project at: Automatic bulk OCR and full-text search for digital collections using Tesseract and Solr by Chris Adams.

I find it exciting because of the progress the project is making for texts between 1475-1800. For the texts in that time period for sure, but I am also hoping those techniques can be adapted to older materials.

    Say older by several thousand years.

Despite pretensions to the contrary, "web scale" is not very much when compared to data feeds from modern science colliders, telescopes, gene sequencing, etc., or to the vast store of historical texts that remain off-line. To say nothing of the need for secondary analysis of those texts.

    Every text that becomes available enriches a semantic tapestry that only humans can enjoy.
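For the bulk OCR step that Chris Adams describes, the scripting side can be as small as the sketch below. This is an assumption on my part: it uses Tesseract through the pytesseract wrapper and a flat directory of page images, while eMOP's real pipeline adds period-specific font training and crowd-sourced correction.

    import glob
    import pytesseract
    from PIL import Image

    # OCR every page image in a directory and keep the raw text
    # alongside it, ready for indexing or later correction.
    for path in sorted(glob.glob("page_images/*.png")):
        text = pytesseract.image_to_string(Image.open(path))
        with open(path + ".txt", "w", encoding="utf-8") as out:
            out.write(text)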

    March 15, 2014

    Words as Tags?

    Filed under: Linguistics,Text Mining,Texts,Word Meaning — Patrick Durusau @ 8:46 pm

    Wordcounts are amazing. by Ted Underwood.

    From the post:

    People new to text mining are often disillusioned when they figure out how it’s actually done — which is still, in large part, by counting words. They’re willing to believe that computers have developed some clever strategy for finding patterns in language — but think “surely it’s something better than that?“

    Uneasiness with mere word-counting remains strong even in researchers familiar with statistical methods, and makes us search restlessly for something better than “words” on which to apply them. Maybe if we stemmed words to make them more like concepts? Or parsed sentences? In my case, this impulse made me spend a lot of time mining two- and three-word phrases. Nothing wrong with any of that. These are all good ideas, but they may not be quite as essential as we imagine.

    Working with text is like working with a video where every element of every frame has already been tagged, not only with nouns but with attributes and actions. If we actually had those tags on an actual video collection, I think we’d recognize it as an enormously valuable archive. The opportunities for statistical analysis are obvious! We have trouble recognizing the same opportunities when they present themselves in text, because we take the strengths of text for granted and only notice what gets lost in the analysis. So we ignore all those free tags on every page and ask ourselves, “How will we know which tags are connected? And how will we know which clauses are subjunctive?”
    ….

    What a delightful insight!

When we say text is "unstructured," what we really mean is that something as dumb as a computer sees no structure in the text.

A human reader, even a 5 or 6 year old reader of a text, sees lots of structure, and meaning too.

    Rather than trying to “teach” computers to read, perhaps we should use computers to facilitate reading by those who already can.

    Yes?

    I first saw this in a tweet by Matthew Brook O’Donnell.

    February 24, 2014

    Word Storms:…

    Filed under: Text Analytics,Text Mining,Visualization,Word Cloud — Patrick Durusau @ 1:58 pm

    Word Storms: Multiples of Word Clouds for Visual Comparison of Documents by Quim Castellà and Charles Sutton.

    Abstract:

Word clouds are popular for visualizing documents, but are not as useful for comparing documents, because identical words are not presented consistently across different clouds. We introduce the concept of word storms, a visualization tool for analyzing corpora of documents. A word storm is a group of word clouds, in which each cloud represents a single document, juxtaposed to allow the viewer to compare and contrast the documents. We present a novel algorithm that creates a coordinated word storm, in which words that appear in multiple documents are placed in the same location, using the same color and orientation, across clouds. This ensures that similar documents are represented by similar-looking word clouds, making them easier to compare and contrast visually. We evaluate the algorithm using an automatic evaluation based on document classification, and a user study. The results confirm that a coordinated word storm allows for better visual comparison of documents.

I have never cared for word clouds all that much, but word storms as presented by the authors look quite useful.

    The paper examines the use of word storms at a corpus, document and single document level.

You will find Word Storms: Multiples of Word Clouds for Visual Comparison of Documents (website) of particular interest, including its link to Github for the source code used in this project.

    Of particular interest to topic mappers is the observation:

    similar documents should be represented by visually similar clouds (emphasis in original)

    Now imagine for a moment visualizing topics and associations with “similar” appearances. Even if limited to colors that are easy to distinguish, that could be a very powerful display/discovery tool for topic maps.

    Not the paper’s use case but one that comes to mind with regard to display/discovery in a heterogeneous data set (such as a corpus of documents).
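
    To make the coordination idea concrete, here is a rough sketch: give every word a deterministic color and orientation, so it looks the same in every cloud of the storm. (The hashing trick, names, and sample texts are mine; the authors’ algorithm also coordinates word placement across clouds, which this sketch does not attempt.)

        import hashlib
        from collections import Counter

        PALETTE = ["red", "blue", "green", "orange", "purple", "brown"]

        def shared_style(word):
            # Deterministic color/angle per word, so the same word looks the same
            # in every cloud of the storm (the coordination idea, crudely approximated).
            h = int(hashlib.md5(word.encode()).hexdigest(), 16)
            return {"color": PALETTE[h % len(PALETTE)], "angle": (h % 3) * 30}

        def cloud_spec(text):
            # Top words of one document, each tagged with its storm-wide style.
            counts = Counter(text.lower().split())
            return [{"word": w, "size": c, **shared_style(w)}
                    for w, c in counts.most_common(10)]

        docs = ["topic maps merge subjects across different sources",
                "topic maps identify subjects across different vocabularies"]
        for spec in (cloud_spec(d) for d in docs):
            print(spec)

    Swap in topics and associations for documents and words and you have the display/discovery idea above.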

    February 1, 2014

    Introduction to Computational Linguistics (Scala too!)

    Filed under: Computational Linguistics,Scala,Text Mining — Patrick Durusau @ 9:07 pm

    Introduction to Computational Linguistics by Jason Baldridge.

    From the webpage:

    Advances in computational linguistics have not only led to industrial applications of language technology; they can also provide useful tools for linguistic investigations of large online collections of text and speech, or for the validation of linguistic theories.

    Introduction to Computational Linguistics introduces the most important data structures and algorithmic techniques underlying computational linguistics: regular expressions and finite-state methods, categorial grammars and parsing, feature structures and unification, meaning representations and compositional semantics. The linguistic levels covered are morphology, syntax, and semantics. While the focus is on the symbolic basis underlying computational linguistics, a high-level overview of statistical techniques in computational linguistics will also be given. We will apply the techniques in actual programming exercises, using the programming language Scala. Practical programming techniques, tips and tricks, including version control systems, will also be discussed.
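
    If “regular expressions and finite-state methods” sounds abstract, here is a toy taste of the morphology side of that material, sketched in Python rather than the course’s Scala (the suffix list and names are mine, purely illustrative):

        import re

        # A toy analyzer that strips a few English verbal suffixes.
        # Real finite-state morphology handles far more than this.
        SUFFIXES = [("ing", "PROG"), ("ed", "PAST"), ("s", "3SG")]

        def analyze(word):
            for suffix, tag in SUFFIXES:
                m = re.match(rf"^(\w+?){suffix}$", word)
                if m:
                    return m.group(1), tag
            return word, "BASE"

        for w in ["walks", "walked", "walking", "walk"]:
            print(w, analyze(w))

    Finite-state morphology proper goes far beyond suffix stripping, but the flavor is the same.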

    Jason has created a page of links, which includes a twelve-part tutorial on Scala.

    If you want to walk through the course on your own, see the schedule.

    Enjoy!

    January 9, 2014

    Getting Into Overview

    Filed under: Data Mining,Document Management,News,Reporting,Text Mining — Patrick Durusau @ 7:09 pm

    Getting your documents into Overview — the complete guide by Jonathan Stray.

    From the post:

    The first and most common question from Overview users is how do I get my documents in? The answer varies depending on the format of your material. There are three basic paths to get documents into Overview: as multiple PDFs, from a single CSV file, and via DocumentCloud. But there are several other tricks you might need, depending on your situation.

    Great coverage of the first step towards using Overview.
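
    If your material is already plain text files, a few lines of Python will bundle them into the single-CSV path Jonathan describes. (The column headers below are my guess; check his guide for the exact names Overview expects.)

        import csv
        import pathlib

        # Bundle a folder of .txt files into one CSV, one document per row.
        # Column names are illustrative; confirm the headers Overview expects.
        with open("documents.csv", "w", newline="", encoding="utf-8") as out:
            writer = csv.writer(out)
            writer.writerow(["title", "text"])
            for path in sorted(pathlib.Path("docs").glob("*.txt")):
                writer.writerow([path.stem, path.read_text(encoding="utf-8")])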

    Just in case you are not familiar with Overview (from the about page):

    Overview is an open-source tool to help journalists find stories in large numbers of documents, by automatically sorting them according to topic and providing a fast visualization and reading interface. Whether from government transparency initiatives, leaks or Freedom of Information requests, journalists are drowning in more documents than they can ever hope to read.

    There are good tools for searching within large document sets for names and keywords, but that doesn’t help find the stories you’re not specifically looking for. Overview visualizes the relationships among topics, people, and places to help journalists to answer the question, “What’s in there?”

    Overview is designed specifically for text documents where the interesting content is all in narrative form — that is, plain English (or other languages) as opposed to a table of numbers. It also works great for analyzing social media data, to find and understand the conversations around a particular topic.

    It’s an interactive system where the computer reads every word of every document to create a visualization of topics and sub-topics, while a human guides the exploration. There is no installation required — just use the free web application. Or you can run this open-source software on your own server for extra security. The goal is to make advanced document mining capability available to anyone who needs it.

    Examples of people using Overview? See Completed Stories for a sampling.

    Overview is a good response to government “disclosures” that attempt to hide wheat in lots of chaff.

    January 6, 2014

    Why the Feds (U.S.) Need Topic Maps

    Filed under: Data Mining,Project Management,Relevance,Text Mining — Patrick Durusau @ 7:29 pm

    Earlier today I saw this offer to “license” technology for commercial development:

    ORNL’s Piranha & Raptor Text Mining Technology

    From the post:

    UT-Battelle, LLC, acting under its Prime Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy (DOE) for the management and operation of the Oak Ridge National Laboratory (ORNL), is seeking a commercialization partner for the Piranha/Raptor text mining technologies. The ORNL Technology Transfer Office will accept licensing applications through January 31, 2014.

    ORNL’s Piranha and Raptor text mining technology solves the challenge most users face: finding a way to sift through large amounts of data that provide accurate and relevant information. This requires software that can quickly filter, relate, and show documents and relationships. Piranha is JavaScript search, analysis, storage, and retrieval software for uncertain, vague, or complex information retrieval from multiple sources such as the Internet. With the Piranha suite, researchers have pioneered an agent approach to text analysis that uses a large number of agents distributed over very large computer clusters. Piranha is faster than conventional software and provides the capability to cluster massive amounts of textual information relatively quickly due to the scalability of the agent architecture.

    While computers can analyze massive amounts of data, the sheer volume of data makes the most promising approaches impractical. Piranha works on hundreds of raw data formats, and can process data extremely fast, on typical computers. The technology enables advanced textual analysis to be accomplished with unprecedented accuracy on very large and dynamic data. For data already acquired, this design allows discovery of new opportunities or new areas of concern. Piranha has been vetted in the scientific community as well as in a number of real-world applications.

    The Raptor technology enables Piranha to run on SharePoint and MS SQL servers and can also operate as a filter for Piranha to make processing more efficient for larger volumes of text. The Raptor technology uses a set of documents as seed documents to recommend documents of interest from a large, target set of documents. The computer code provides results that show the recommended documents with the highest similarity to the seed documents.

    Gee, that sounds so very hard. Using seed documents to recommend documents “…from a large, target set of documents”?

    There are many ways to do that, but just searching for “Latent Dirichlet Allocation” in “.gov” domains, my total is 14,000 “hits.”
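
    To make the point concrete, here is one of those many ways, sketched with off-the-shelf scikit-learn (TF-IDF plus cosine similarity, a garden-variety baseline, not ORNL’s actual method; the toy documents are mine):

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        # Toy corpora; in practice these would be the seed and target document sets.
        seeds = ["nuclear fuel cycle safety report",
                 "reactor containment inspection findings"]
        targets = ["annual highway maintenance budget",
                   "reactor coolant system inspection summary",
                   "school lunch nutrition guidelines"]

        vectorizer = TfidfVectorizer(stop_words="english")
        matrix = vectorizer.fit_transform(seeds + targets)
        seed_vecs, target_vecs = matrix[:len(seeds)], matrix[len(seeds):]

        # Score each target by its best similarity to any seed document.
        scores = cosine_similarity(target_vecs, seed_vecs).max(axis=1)
        for doc, score in sorted(zip(targets, scores), key=lambda p: -p[1]):
            print(f"{score:.2f}  {doc}")

    Raptor no doubt wraps considerably more engineering around the problem, but the core recommendation step is hardly exotic.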

    If you were paying for search technology to be developed, how many times would you pay to develop the same technology?

    Just curious.

    In order to have a sensible technology development process, the government needs a topic map to track its development efforts, not only to track them but to prevent duplicate development.

    Imagine if every web project had to develop its own httpd server, instead of the vast majority of them using Apache HTTPD.

    With a common server base, a community has developed to maintain and extend that base product. That can’t happen where the same technology is contracted for over and over again.

    Suggestions on what might be an incentive for the Feds to change their acquisition processes?
