Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

March 22, 2011

Disease Named Entity Recognition

Filed under: Entity Extraction,Machine Learning — Patrick Durusau @ 7:02 pm

Disease named entity recognition using semisupervised learning and conditional random fields.

Suakkaphong, N., Zhang, Z., & Chen, H. (2011). Disease named entity recognition using semisupervised learning and conditional random fields. Journal of the American Society for Information Science & Technology, 62(4), 727-737.

Abstract:

Information extraction is an important text-mining task that aims at extracting prespecified types of information from large text collections and making them available in structured representations such as databases. In the biomedical domain, information extraction can be applied to help biologists make the most use of their digital-literature archives. Currently, there are large amounts of biomedical literature that contain rich information about biomedical substances. Extracting such knowledge requires a good named entity recognition technique. In this article, we combine conditional random fields (CRFs), a state-of-the-art sequence-labeling algorithm, with two semisupervised learning techniques, bootstrapping and feature sampling, to recognize disease names from biomedical literature. Two data-processing strategies for each technique also were analyzed: one sequentially processing unlabeled data partitions and another one processing unlabeled data partitions in a round-robin fashion. The experimental results showed the advantage of semisupervised learning techniques given limited labeled training data. Specifically, CRFs with bootstrapping implemented in sequential fashion outperformed strictly supervised CRFs for disease name recognition.
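To make the bootstrapping idea concrete, here is a minimal sketch I put together with the sklearn-crfsuite package. This is my own illustration, not the paper's code: the toy features, the confidence threshold, and the helper names are all stand-ins for the much richer setup the authors describe.

```python
# Minimal bootstrapping-with-CRF sketch (NOT the paper's implementation).
# Assumes: pip install sklearn-crfsuite
# labeled_X: featurized sentences (lists of feature dicts), labeled_y: BIO label sequences,
# unlabeled_partitions: lists of tokenized (but unlabeled) sentences.
import sklearn_crfsuite

def token_features(sent, i):
    """Toy feature set; a real disease NER system would use far richer features."""
    w = sent[i]
    return {
        "lower": w.lower(),
        "is_title": w.istitle(),
        "has_digit": any(c.isdigit() for c in w),
        "suffix3": w[-3:],
    }

def featurize(sent):
    return [token_features(sent, i) for i in range(len(sent))]

def bootstrap(labeled_X, labeled_y, unlabeled_partitions, threshold=0.9):
    """Sequentially fold high-confidence predictions on unlabeled partitions
    back into the training set, retraining the CRF each round."""
    X, y = list(labeled_X), list(labeled_y)
    for partition in unlabeled_partitions:          # sequential strategy
        crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
        crf.fit(X, y)
        feats = [featurize(s) for s in partition]
        preds = crf.predict(feats)
        marginals = crf.predict_marginals(feats)    # per-token label probabilities
        for f, p, m in zip(feats, preds, marginals):
            confidence = min((max(dist.values()) for dist in m), default=0.0)
            if confidence >= threshold:              # keep only confident sentences
                X.append(f)
                y.append(p)
    final = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
    final.fit(X, y)
    return final
```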

Not to take anything away from this technique, which would serve topic map construction well, but I am left feeling that it stops short of the mark.

In other words, say I am happy with the results of its recognition: how do I share them with someone else who has another set of identified subjects, perhaps drawn from the same data?

Or, for that matter, how do I combine them with subjects I have extracted myself from the same data?

I can’t very well ask the software why it “recognized” one name or another, can I?

It seems I would have to add what appears to me to be useful information to the name in order to reuse it with other data.

Starting to sound like a topic map, isn’t it?

February 7, 2011

KEA: keyphrase extraction algorithm

Filed under: Entity Extraction,Natural Language Processing — Patrick Durusau @ 7:59 am

KEA: keyphrase extraction algorithm

From the website:

Keywords and keyphrases (multi-word units) are widely used in large document collections. They describe the content of single documents and provide a kind of semantic metadata that is useful for a wide variety of purposes. The task of assigning keyphrases to a document is called keyphrase indexing. For example, academic papers are often accompanied by a set of keyphrases freely chosen by the author. In libraries professional indexers select keyphrases from a controlled vocabulary (also called Subject Headings) according to defined cataloguing rules. On the Internet, digital libraries, or any depositories of data (flickr, del.icio.us, blog articles etc.) also use keyphrases (or here called content tags or content labels) to organize and provide a thematic access to their data.

KEA is an algorithm for extracting keyphrases from text documents. It can be either used for free indexing or for indexing with a controlled vocabulary.
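KEA itself is a Java tool that scores candidate phrases with features such as TF×IDF and first occurrence position, feeding them to a Naive Bayes model. The toy Python sketch below only captures the flavor of that candidate-phrase scoring; it is not KEA's code or API, and the scoring formula is my own simplification.

```python
# Toy keyphrase scoring in the spirit of KEA (NOT the KEA code, which is Java).
# Scores candidate n-grams by TF x IDF and rewards phrases that appear early
# in the document, two of KEA's core features.
import math
import re
from collections import Counter

def candidates(text, max_len=3):
    words = re.findall(r"[a-z][a-z-]+", text.lower())
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n]), i / max(len(words), 1)

def keyphrases(doc, corpus, top_k=5):
    doc_counts = Counter(phrase for phrase, _ in candidates(doc))
    first_pos = {}
    for phrase, pos in candidates(doc):
        first_pos.setdefault(phrase, pos)        # earliest relative position
    scored = {}
    for phrase, tf in doc_counts.items():
        df = sum(1 for other in corpus if phrase in other.lower())
        idf = math.log((1 + len(corpus)) / (1 + df))
        scored[phrase] = tf * idf * (1.0 - first_pos[phrase])   # earlier => higher
    return sorted(scored, key=scored.get, reverse=True)[:top_k]

docs = ["Topic maps merge subject identities across document collections.",
        "Keyphrase indexing assigns descriptive phrases to large document collections."]
print(keyphrases(docs[0], docs))
```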

Given the indexing roots of topic maps, this software is definitely a contender for use in topic map construction.

GATE: General Architecture for Text Engineering

Filed under: Entity Extraction,Natural Language Processing — Patrick Durusau @ 7:06 am

GATE: General Architecture for Text Engineering

From the website:

GATE is…

  • open source software capable of solving almost any text processing problem
  • a mature and extensive community of developers, users, educators, students and scientists
  • a defined and repeatable process for creating robust and maintainable text processing workflows
  • in active use for all sorts of language processing tasks and applications, including: voice of the customer; cancer research; drug research; decision support; recruitment; web mining; information extraction; semantic annotation
  • the result of a multi-million euro R&D programme running since 1995, funded by commercial users, the EC, BBSRC, EPSRC, AHRC, JISC, etc.
  • used by corporations, SMEs, research labs and Universities worldwide
  • the Eclipse of Natural Language Engineering, the Lucene of Information Extraction, the ISO 9001 of Text Mining
  • a world-class team of language processing developers

If you need to solve a problem with text analysis or human language processing you’re in the right place.

I suppose there is something to be said for an abundance of confidence. 😉

Seriously, this is a very complex and impressive effort.

I will be covering specific tools and aspects of this effort as they relate to topic maps.

February 3, 2011

PyBrain: The Python Machine Learning Library

PyBrain: The Python Machine Learning Library

From the website:

PyBrain is a modular Machine Learning Library for Python. Its goal is to offer flexible, easy-to-use yet still powerful algorithms for Machine Learning Tasks and a variety of predefined environments to test and compare your algorithms.

PyBrain is short for Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library. In fact, we came up with the name first and later reverse-engineered this quite descriptive “Backronym”.

How is PyBrain different?

While there are a few machine learning libraries out there, PyBrain aims to be a very easy-to-use modular library that can be used by entry-level students but still offers the flexibility and algorithms for state-of-the-art research. We are constantly working on more and faster algorithms, developing new environments and improving usability.

What PyBrain can do

PyBrain, as its written-out name already suggests, contains algorithms for neural networks, for reinforcement learning (and the combination of the two), for unsupervised learning, and evolution. Since most of the current problems deal with continuous state and action spaces, function approximators (like neural networks) must be used to cope with the large dimensionality. Our library is built around neural networks in the kernel and all of the training methods accept a neural network as the to-be-trained instance. This makes PyBrain a powerful tool for real-life tasks.
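For a feel of the library, here is the classic XOR example as a minimal sketch. It follows PyBrain's documented shortcuts (buildNetwork, SupervisedDataSet, BackpropTrainer), but PyBrain targets older Python environments, so treat the details as illustrative rather than copy-and-paste ready.

```python
# Minimal PyBrain sketch: train a small feed-forward network on XOR.
from pybrain.tools.shortcuts import buildNetwork
from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer

net = buildNetwork(2, 3, 1)            # 2 inputs, one hidden layer of 3, 1 output
ds = SupervisedDataSet(2, 1)           # XOR truth table as training data
for inputs, target in [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]:
    ds.addSample(inputs, target)

trainer = BackpropTrainer(net, ds)
for _ in range(100):                   # a few epochs of backpropagation
    trainer.train()

print(net.activate((1, 0)))            # output should drift toward 1.0
```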

Another tool kit to assist in the construction of topic maps.

And another likely contender for the Topic Map Competition!

MALLET: MAchine Learning for LanguagE Toolkit – Topic Map Competition (TMC) Contender?

MALLET: MAchine Learning for LanguagE Toolkit

From the website:

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

MALLET includes sophisticated tools for document classification: efficient routines for converting text to “features”, a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several commonly used metrics.

In addition to classification, MALLET includes tools for sequence tagging for applications such as named-entity extraction from text. Algorithms include Hidden Markov Models, Maximum Entropy Markov Models, and Conditional Random Fields. These methods are implemented in an extensible system for finite state transducers.

Topic models are useful for analyzing large collections of unlabeled text. The MALLET topic modeling toolkit contains efficient, sampling-based implementations of Latent Dirichlet Allocation, Pachinko Allocation, and Hierarchical LDA.

Many of the algorithms in MALLET depend on numerical optimization. MALLET includes an efficient implementation of Limited Memory BFGS, among many other optimization methods.

In addition to sophisticated Machine Learning applications, MALLET includes routines for transforming text documents into numerical representations that can then be processed efficiently. This process is implemented through a flexible system of “pipes”, which handle distinct tasks such as tokenizing strings, removing stopwords, and converting sequences into count vectors.

An add-on package to MALLET, called GRMM, contains support for inference in general graphical models, and training of CRFs with arbitrary graphical structure.

Another tool to assist in the authoring of a topic map from a large data set.
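As a concrete starting point, MALLET's topic modeling is usually driven from its command line. Here is a minimal sketch, wrapped in Python, assuming the mallet launcher is on your PATH and a corpus/ directory of plain-text files; the file names and the topic count are my own placeholders.

```python
# Minimal sketch: drive MALLET's topic modeling from Python via its command line.
# Assumes the `mallet` script is on PATH and ./corpus/ holds one text file per document.
import subprocess

subprocess.run(
    ["mallet", "import-dir",
     "--input", "corpus",
     "--output", "corpus.mallet",
     "--keep-sequence",                # required for topic modeling
     "--remove-stopwords"],
    check=True,
)
subprocess.run(
    ["mallet", "train-topics",
     "--input", "corpus.mallet",
     "--num-topics", "20",
     "--output-topic-keys", "topic_keys.txt",   # top words per topic
     "--output-doc-topics", "doc_topics.txt"],  # per-document topic mixtures
    check=True,
)
```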

It would be interesting, though beyond the scope of the topic maps class, to organize a competition around several of the natural language processing packages.

There would be a common data set, released on X date, with topic maps due, say, within 24 hours (there is a TV show with that in its title, or so I am told).

Will have to give that some thought.

Could be both interesting and entertaining.

January 11, 2011

Dynamic Semantic Publishing for any Blog (Part 1 + 2) – Post(s)

Filed under: Entity Extraction,Semantic Web,Semantics — Patrick Durusau @ 5:10 pm

Dynamic Semantic Publishing for any Blog (Part 1)

Benjamin Nowack outlines how he would replicate the dynamic semantic publishing approach used by the BBC in their coverage of the 2010 World Cup.

Dynamic Semantic Publishing for any Blog (Part 2) will disappoint anyone interested in developing dynamic semantic publishing solutions.

A block-level overview that repeats what anyone interested in semantic technologies already knows.

An extended infomercial.

Save your time and look elsewhere for substantive content on semantic publishing.

Linked Data Extraction with Zemanta and OpenCalais

Filed under: Entity Extraction,Linked Data — Patrick Durusau @ 1:53 pm

Linked Data Extraction with Zemanta and OpenCalais

Benjamin Nowack’s review at BNODE of Named Entity Extraction APIs by Zemanta and OpenCalais.

You can brew your own entity extraction routines and likely will for specialized domains. For more general work, or just to become familiar with entity extraction and its limitations, the APIs Benjamin reviews are a good starting place.

January 9, 2011

Apache UIMA

Apache UIMA

From the website:

Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at.

UIMA enables applications to be decomposed into components, for example “language identification” => “language specific segmentation” => “sentence boundary detection” => “entity detection (person/place names etc.)”. Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages.

UIMA additionally provides capabilities to wrap components as network services, and can scale to very large volumes by replicating processing pipelines over a cluster of networked nodes.

The UIMA project offers a number of annotators that produce structured information from unstructured texts.
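To make the component decomposition concrete, here is a framework-free Python sketch of the same annotator-pipeline pattern. It is not UIMA's API (UIMA components are written in Java or C++ with XML descriptors); the dictionary standing in for the Common Analysis Structure and the toy entity rule are my own simplifications.

```python
# Framework-free sketch of the annotator-pipeline pattern UIMA describes (NOT the UIMA API).
# Each "annotator" reads the shared analysis structure, adds annotations, and passes it on.
import re

def sentence_splitter(cas):
    cas["sentences"] = re.split(r"(?<=[.!?])\s+", cas["text"])
    return cas

def entity_detector(cas):
    # Toy rule: treat capitalized, non-sentence-initial tokens as candidate entities.
    entities = []
    for sent in cas["sentences"]:
        tokens = sent.split()
        entities += [t.strip(".,") for t in tokens[1:] if t[:1].isupper()]
    cas["entities"] = entities
    return cas

def run_pipeline(text, components):
    cas = {"text": text}               # stand-in for UIMA's Common Analysis Structure
    for component in components:
        cas = component(cas)
    return cas

result = run_pipeline("Alan Turing worked at Bletchley Park. He was born in London.",
                      [sentence_splitter, entity_detector])
print(result["entities"])              # ['Turing', 'Bletchley', 'Park', 'London']
```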

If you are using UIMA as a framework for development of topic maps, please post concerning your experiences with UIMA. What works, what doesn’t, etc.

Center for Computational Analysis of Social and Organizational Systems (CASOS)

Center for Computational Analysis of Social and Organizational Systems (CASOS)

Home of both ORA and AutoMap, but I thought it merited an entry of its own.

Directed by Dr. Kathleen Carley:

CASOS brings together computer science, dynamic network analysis and the empirical study of complex socio-technical systems. Computational and social network techniques are combined to develop a better understanding of the fundamental principles of organizing, coordinating, managing and destabilizing systems of intelligent adaptive agents (human and artificial) engaged in real tasks at the team, organizational or social level. Whether the research involves the development of metrics, theories, computer simulations, toolkits, or new data analysis techniques advances in computer science are combined with a deep understanding of the underlying cognitive, social, political, business and policy issues.

CASOS is a university-wide center drawing on a group of world-class faculty, students and research and administrative staff in multiple departments at Carnegie Mellon. CASOS fosters multi-disciplinary research in which students and faculty work with students and faculty in other universities as well as scientists and practitioners in industry and government. CASOS research leads the way in examining network dynamics and in linking social networks to other types of networks such as knowledge networks. This work has led to the development of new statistical toolkits for the collection and analysis of network data (Ora and AutoMap). Additionally, a number of validated multi-agent network models in areas as diverse as network evolution, bio-terrorism, covert networks, and organizational adaptation have been developed and used to increase our understanding of real socio-technical systems.

CASOS research spans multiple disciplines and technologies. Social networks, dynamic networks, agent based models, complex systems, link analysis, entity extraction, link extraction, anomaly detection, and machine learning are among the methodologies used by members of CASOS to tackle real world problems.

Definitely a group that bears watching by anyone interested in topic maps!

AutoMap – Extracting Topic Maps from Texts?

Filed under: Authoring Topic Maps,Entity Extraction,Networks,Semantics,Software — Patrick Durusau @ 10:59 am

AutoMap: Extract, Analyze and Represent Relational Data from Texts (according to its webpage).

From the webpage:

AutoMap is a text mining tool, developed by CASOS at Carnegie Mellon, that enables the extraction of network data from texts. AutoMap can extract content analytic data (words and frequencies), semantic networks, and meta-networks from unstructured texts. Pre-processors for handling PDFs and other text formats exist. Post-processors for linking to gazetteers and belief inference also exist. The main functions of AutoMap are to extract, analyze, and compare texts in terms of concepts, themes, sentiment, semantic networks and the meta-networks extracted from the texts. AutoMap exports data in DyNetML and can be used interoperably with *ORA.

AutoMap uses part-of-speech tagging and proximity analysis to do computer-assisted Network Text Analysis (NTA). NTA encodes the links among words in a text and constructs a network of the linked words.

AutoMap subsumes classical Content Analysis by analyzing the existence, frequencies, and covariance of terms and themes.
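The proximity-linking step behind NTA is easy to illustrate. The sketch below is not AutoMap; it simply links words that co-occur within a small sliding window, with networkx as an assumed dependency and a window size I chose arbitrarily.

```python
# Sketch of the proximity-linking step behind Network Text Analysis (NOT AutoMap).
# Words that co-occur within a sliding window become linked nodes in a graph.
import re
import networkx as nx

def text_network(text, window=3):
    words = re.findall(r"[a-z]+", text.lower())
    graph = nx.Graph()
    for i, w in enumerate(words):
        for other in words[i + 1:i + window]:   # link to the next (window - 1) words
            if other != w:
                graph.add_edge(w, other)
    return graph

g = text_network("The committee amended the bill and the court later struck the amendment down.")
print(sorted(g.edges())[:5])
```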

For a rough cut at a topic map from a text, AutoMap looks like a useful tool.

In addition to the software, training material and other information are available.

My primary interest is the application of such a tool to legislative debates, legislation and court decisions.

None of those occur in a vacuum, and topic maps could help provide a context for understanding such material.

December 3, 2010

S4

S4

From the website:

S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.

Just in case you were wondering if topic maps are limited to being bounded objects composed of syntax. No.
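A toy illustration of the point (not S4 itself, which is a distributed Java platform): process events from an unbounded stream incrementally, keeping only running state, rather than treating the input as a bounded document. The event shape and the 1,000-event slice below are purely for demonstration.

```python
# Toy illustration of unbounded-stream processing (NOT S4): consume events one
# at a time and keep running counts of the subjects mentioned, never
# materializing the whole stream.
import itertools
from collections import Counter

def event_stream():
    """Stand-in for an unbounded source such as a message queue or log tail."""
    for i in itertools.count():
        yield {"id": i, "subject": "gene" if i % 3 else "disease"}

counts = Counter()
for event in itertools.islice(event_stream(), 1000):   # finite slice for the demo
    counts[event["subject"]] += 1
print(counts.most_common())
```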

Questions:

  1. Specify three sources of unbounded streams of data. (3 pages, citations)
  2. What subjects would you want to identify and on what basis in any one of them? (3-5 pages, citations)
  3. What other information about those subjects would you want to bind to the information in #2? What subject identity tests are used for those subjects in other sources? (5-10 pages, citations)

November 29, 2010

TREC Entity Track: Plans for Entity 2011

Filed under: Conferences,Entity Extraction,TREC — Patrick Durusau @ 8:50 am

TREC Entity Track: Plans for Entity 2011

Plans for Entity 2011.

Known datasets of interest: ClueWeb09, DBPedia Ontology, Billion Triple Dataset

It’s not too early to get started for next year!

TREC Entity Track: Report from TREC 2010

Filed under: Conferences,Entity Extraction,TREC — Patrick Durusau @ 8:17 am

TREC Entity Track: Report from TREC 2010

A summary of the results for the TREC Entity Track (related entity finding (REF) search task on the WWW) for 2010.

October 15, 2010

T-Rex Information Extraction

Filed under: Document Classification,Entity Extraction,EU,Relation Extraction — Patrick Durusau @ 6:16 am

T-Rex (Trainable Relation Extraction).

Tools for document classification, entity and relation (read association) extraction.

Topic maps of any size are going to be constructed by mining “data,” and in a lot of cases that will mean “documents” (to the extent that the distinction is meaningful).

An interesting toolkit for that purpose, but it apparently is no longer maintained. It is parked at SourceForge after having been funded by the EU.

Does anyone have a status update on this project?

October 6, 2010

Mining Historic Query Trails to Label Long and Rare Search Engine Queries

Filed under: Authoring Topic Maps,Data Mining,Entity Extraction,Search Engines,Searching — Patrick Durusau @ 7:05 am

Mining Historic Query Trails to Label Long and Rare Search Engine Queries

Authors: Peter Bailey, Ryen W. White, Han Liu, Giridhar Kumaran

Keywords: long queries, query labeling

Abstract:

Web search engines can perform poorly for long queries (i.e., those containing four or more terms), in part because of their high level of query specificity. The automatic assignment of labels to long queries can capture aspects of a user’s search intent that may not be apparent from the terms in the query. This affords search result matching or reranking based on queries and labels rather than the query text alone. Query labels can be derived from interaction logs generated from many users’ search result clicks or from query trails comprising the chain of URLs visited following query submission. However, since long queries are typically rare, they are difficult to label in this way because little or no historic log data exists for them. A subset of these queries may be amenable to labeling by detecting similarities between parts of a long and rare query and the queries which appear in logs. In this article, we present the comparison of four similarity algorithms for the automatic assignment of Open Directory Project category labels to long and rare queries, based solely on matching against similar satisfied query trails extracted from log data. Our findings show that although the similarity-matching algorithms we investigated have tradeoffs in terms of coverage and accuracy, one algorithm that bases similarity on a popular search result ranking function (effectively regarding potentially-similar queries as “documents”) outperforms the others. We find that it is possible to correctly predict the top label better than one in five times, even when no past query trail exactly matches the long and rare query. We show that these labels can be used to reorder top-ranked search results leading to a significant improvement in retrieval performance over baselines that do not utilize query labeling, but instead rank results using content-matching or click-through logs. The outcomes of our research have implications for search providers attempting to provide users with highly-relevant search results for long queries.

(Apologies for repeating the long abstract but this needs wider notice.)

What the authors call “label prediction algorithms” is a step toward mining data for subjects.

The research may also improve search results through the use of labels for ranking.
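Here is a minimal sketch of the label-transfer idea: treat logged queries as “documents,” find the one most similar to the long, rare query, and inherit its category label. I use TF-IDF cosine similarity from scikit-learn as a stand-in for the ranking-function-based matcher the paper found to work best, and the query log and labels below are invented for illustration.

```python
# Sketch of label transfer for long, rare queries (not the paper's system):
# match the query against logged queries and inherit the best match's label.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

query_log = {                                    # invented log data for illustration
    "treatment options for type 2 diabetes": "Health/Conditions",
    "python list comprehension examples": "Computers/Programming",
    "cheap flights to reykjavik in winter": "Recreation/Travel",
}

def predict_label(long_query):
    logged = list(query_log)
    vectorizer = TfidfVectorizer().fit(logged + [long_query])
    sims = cosine_similarity(
        vectorizer.transform([long_query]),
        vectorizer.transform(logged),
    )[0]
    best = sims.argmax()
    return query_log[logged[best]], sims[best]

print(predict_label("best oral medication for newly diagnosed type 2 diabetes patients"))
```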

September 29, 2010

LingPipe

Filed under: Classification,Clustering,Entity Extraction,Full-Text Search,Searching — Patrick Durusau @ 7:06 am

LingPipe.

The tutorial listing for LingPipe is the best summary of its capabilities.

Its sandbox is another “must see” location.

There may be better introductions to linguistic processing but I haven’t seen them.

September 23, 2010

HUGO Gene Nomenclature Committee

Filed under: Bioinformatics,Biomedical,Data Mining,Entity Extraction,Indexing,Software — Patrick Durusau @ 8:32 am

HUGO Gene Nomenclature Committee, the committee that assigns unique approved names and symbols to human genes.

Become familiar with the HUGO site, then read: The success (or not) of HUGO nomenclature (Genome Biology, 2006).

Now read: Moara: a Java library for extracting and normalizing gene and protein mentions (BMC Bioinformatics 2010)

Q: How would you apply the techniques in the Moara article to build a topic map? Would you keep or discard the normalization?

PS: Moara Project (software, etc.)
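Not Moara (which is a Java library), but the normalization step it performs can be sketched as a synonym-dictionary lookup against HGNC-style approved symbols. The alias table below is illustrative only; a real system would load the full HGNC data. For a topic map, one might keep both the surface form and the approved symbol, treating the alias as an additional name for the same subject.

```python
# Sketch of gene-mention normalization (NOT the Moara library): map surface
# variants found in text to an approved HGNC-style symbol via a small synonym
# dictionary. The alias table here is illustrative only.
import re

ALIASES = {
    "p53": "TP53",
    "tp53": "TP53",
    "tumor protein p53": "TP53",
    "brca2": "BRCA2",
    "fancd1": "BRCA2",
}

def normalize_mentions(text):
    """Return (surface form, normalized symbol) pairs found in the text."""
    found = []
    for surface, symbol in ALIASES.items():
        if re.search(r"\b" + re.escape(surface) + r"\b", text, flags=re.IGNORECASE):
            found.append((surface, symbol))
    return found

print(normalize_mentions("Mutations in p53 and FANCD1 were observed."))
```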
