Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

December 26, 2012

Semantic Assistants Wiki-NLP Integration

Filed under: Natural Language Processing,Semantics,Wiki — Patrick Durusau @ 3:27 pm

Natural Language Processing for MediaWiki: First major release of the Semantic Assistants Wiki-NLP Integration

From the post:

We are happy to announce the first major release of our Semantic Assistants Wiki-NLP integration. This is the first comprehensive open source solution for bringing Natural Language Processing (NLP) to wiki users, in particular for wikis based on the well-known MediaWiki engine and its Semantic MediaWiki (SMW) extension. It can run any NLP pipeline deployed in the General Architecture for Text Engineering (GATE), brokered as web services through the Semantic Assistants server. This allows you to bring novel text mining assistants to wiki users, e.g., for automatically structuring wiki pages, answering questions in natural language, quality assurance, entity detection, summarization, among others. The results of the NLP analysis are written back to the wiki, allowing humans and AI to work collaboratively on wiki content. Additionally, semantic markup understood by the SMW extension can be automatically generated from NLP output, providing semantic search and query functionalities.

Features:

  • Light-weight MediaWiki Extension
  • NLP Pipeline Independent Architecture
  • Flexible Wiki Input Handling
  • Flexible NLP Result Handling
  • Semantic Markup Generation
  • Wiki-independent Architecture

A promising direction for the creation of author-curated text!
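If you want a feel for the plumbing such an integration automates, here is a minimal sketch of the round trip: pull page text over the standard MediaWiki API, hand it to an NLP service, and report the annotations. The wiki URL, the NLP endpoint, and its JSON shape are hypothetical stand-ins, not the Semantic Assistants API itself.

```python
# Sketch of the wiki -> NLP -> annotations round trip. The MediaWiki
# query API is real; NLP_SERVICE and its JSON response are hypothetical.
import requests

WIKI_API = "https://wiki.example.org/w/api.php"   # hypothetical wiki
NLP_SERVICE = "https://nlp.example.org/annotate"  # hypothetical endpoint

def fetch_page(title):
    """Fetch raw wikitext for a page via the standard MediaWiki API."""
    params = {
        "action": "query", "prop": "revisions", "rvprop": "content",
        "titles": title, "format": "json", "formatversion": "2",
    }
    data = requests.get(WIKI_API, params=params).json()
    return data["query"]["pages"][0]["revisions"][0]["content"]

def annotate(text):
    """Send text to the (hypothetical) NLP service; expect entity spans."""
    return requests.post(NLP_SERVICE, json={"text": text}).json()

if __name__ == "__main__":
    text = fetch_page("Main Page")
    for entity in annotate(text).get("entities", []):
        print(entity)
```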

December 19, 2012

TSD 2013: 16th International Conference on Text, Speech and Dialogue

Filed under: Conferences,Natural Language Processing,Texts — Patrick Durusau @ 10:50 am

TSD 2013: 16th International Conference on Text, Speech and Dialogue

Important Dates:

  • When: Sep 1, 2013 – Sep 5, 2013
  • Where: Plzen (Pilsen), Czech Republic
  • Submission deadline: Mar 31, 2013
  • Notification due: May 12, 2013
  • Final version due: Jun 9, 2013

Subjects for submissions:

  • Speech Recognition
    —multilingual, continuous, emotional speech, handicapped speaker, out-of-vocabulary words, alternative way of feature extraction, new models for acoustic and language modelling,
  • Corpora and Language Resources
    —monolingual, multilingual, text, and spoken corpora, large web corpora, disambiguation, specialized lexicons, dictionaries,
  • Speech and Spoken Language Generation
    —multilingual, high fidelity speech synthesis, computer singing,
  • Tagging, Classification and Parsing of Text and Speech
    —multilingual processing, sentiment analysis, credibility analysis, automatic text labeling, summarization, authorship attribution,
  • Semantic Processing of Text and Speech
    —information extraction, information retrieval, data mining, semantic web, knowledge representation, inference, ontologies, sense disambiguation, plagiarism detection,
  • Integrating Applications of Text and Speech Processing
    —machine translation, natural language understanding, question-answering strategies, assistive technologies,
  • Automatic Dialogue Systems
    —self-learning, multilingual, question-answering systems, dialogue strategies, prosody in dialogues,
  • Multimodal Techniques and Modelling
    —video processing, facial animation, visual speech synthesis, user modelling, emotion and personality modelling.

It was at TSD 2012 that I found Ruslan Mitkov’s presentation: Coreference Resolution: to What Extent Does it Help NLP Applications? So, highly recommended!

December 18, 2012

Coreference Resolution Tools : A first look

Filed under: Coreference Resolution,Natural Language Processing — Patrick Durusau @ 2:10 pm

Coreference Resolution Tools : A first look by Sharmila G Sivakumar.

From the post:

Coreference is where two or more noun phrases refer to the same entity. This is an integral part of natural languages to avoid repetition, demonstrate possession/relation etc.

Eg: Harry wouldn’t bother to read “Hogwarts: A History” as long as Hermione is around. He knows she knows the book by heart.

The different types of coreference include:

  • Noun phrases: Hogwarts: A History <- the book
  • Pronouns: Harry <- He
  • Possessives: her, his, their
  • Demonstratives: This boy

Coreference resolution (or anaphora resolution) is determining what an entity is referring to. This has profound applications in NLP tasks such as semantic analysis, text summarisation, sentiment analysis, etc.

In spite of extensive research, the number of tools available for CR and their level of maturity lag well behind more established NLP tasks such as parsing. This is due to the inherent ambiguities in resolution.

A bit dated (2010) now but a useful starting point for updating. (Specific to medical records, see: Evaluating the state of the art in coreference resolution for electronic medical records. Other references you would recommend?)

Sharmila goes on to compare the results of using the tools on a set text so you can get a feel for the tools.
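To make the resolution task concrete, here is a toy sketch: link each pronoun to the most recent gender-compatible mention. Real resolvers use parses, features, and learned models; the two-name gender table below is purely illustrative.

```python
# A toy illustration of the decision a coreference resolver must make:
# link each pronoun to the most recent compatible mention. Naive
# gender matching only; real systems learn these decisions.
PRONOUN_GENDER = {"he": "m", "him": "m", "his": "m",
                  "she": "f", "her": "f", "hers": "f"}
MENTION_GENDER = {"Harry": "m", "Hermione": "f"}

def resolve(tokens):
    """Return (pronoun_index, antecedent_index) pairs."""
    links, mentions = [], []   # mentions: (index, gender)
    for i, tok in enumerate(tokens):
        if tok in MENTION_GENDER:
            mentions.append((i, MENTION_GENDER[tok]))
        elif tok.lower() in PRONOUN_GENDER:
            want = PRONOUN_GENDER[tok.lower()]
            for j, g in reversed(mentions):
                if g == want:   # most recent gender-compatible mention
                    links.append((i, j))
                    break
    return links

tokens = "Harry said Hermione knows the book ; he knows she knows it".split()
for p, a in resolve(tokens):
    print(f"{tokens[p]!r} -> {tokens[a]!r}")
# 'he' -> 'Harry', 'she' -> 'Hermione'
```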

Capturing/Defining/Interchanging Coreference Resolutions (Topic Maps!)

Filed under: Coreference Resolution,Natural Language Processing — Patrick Durusau @ 1:46 pm

While listening to Ruslan Mitkov’s presentation: Coreference Resolution: to What Extent Does it Help NLP Applications?, the thought occurred to me that coreference resolution lies at the core of topic maps.

A topic map can:

  • Capture a coreference resolution in one representative by merging it with another representative that “pick out the same referent.”
  • Define a coreference resolution by defining representatives that “pick out the same referent.”
  • Interchange coreference resolutions by defining the representation of referents that “pick out the same referent.”

Not to denigrate associations or occurrences, but they depend upon the presence of topics, that is, representatives that “pick out a referent.”

Merged topics are two or more topics that individually “picked out the same referent,” perhaps using different means of identification.

Rather than starting every coreference resolution application at zero, to test its algorithmic prowess, a topic map could easily prime the pump, as it were, with known coreference resolutions.

Enabling coreference resolution systems to accumulate resolutions, much as human users do.*

*This may be useful because coreference resolution is a recognized area of research in computational linguistics, unlike topic maps.
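A minimal sketch of the merging mechanics, with illustrative identifiers: topics are merged (union-find style) whenever they share a subject identifier, so known resolutions become merged topics before any resolver runs.

```python
# Minimal sketch of "merging representatives that pick out the same
# referent": topics sharing a subject identifier are merged via
# union-find. Identifiers below are illustrative.
class TopicStore:
    def __init__(self):
        self.parent = {}          # identifier -> representative identifier

    def _find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def add_topic(self, identifiers):
        """A topic asserts that all its identifiers pick out one referent."""
        first = identifiers[0]
        for ident in identifiers[1:]:
            ra, rb = self._find(first), self._find(ident)
            self.parent[rb] = ra   # merge the two representatives

    def same_referent(self, a, b):
        return self._find(a) == self._find(b)

store = TopicStore()
store.add_topic(["wikipedia:Mark_Twain", "viaf:50566653"])
store.add_topic(["viaf:50566653", "gutenberg:Samuel_Clemens"])
print(store.same_referent("wikipedia:Mark_Twain", "gutenberg:Samuel_Clemens"))
# True: two identifiers, one referent
```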

Coreference Resolution: to What Extent Does it Help NLP Applications?

Coreference Resolution: to What Extent Does it Help NLP Applications? by Ruslan Mitkov. (presentation – audio only)

The paper from the same conference:

Coreference Resolution: To What Extent Does It Help NLP Applications? by Ruslan Mitkov, Richard Evans, Constantin Orăsan, Iustin Dornescu, Miguel Rios. (Text, Speech and Dialogue, 15th International Conference, TSD 2012, Brno, Czech Republic, September 3-7, 2012. Proceedings, pp. 16-27)

Abstract:

This paper describes a study of the impact of coreference resolution on NLP applications. Further to our previous study [1], in which we investigated whether anaphora resolution could be beneficial to NLP applications, we now seek to establish whether a different, but related task—that of coreference resolution, could improve the performance of three NLP applications: text summarisation, recognising textual entailment and text classification. The study discusses experiments in which the aforementioned applications were implemented in two versions, one in which the BART coreference resolution system was integrated and one in which it was not, and then tested in processing input text. The paper discusses the results obtained.

In the presentation and in the paper, Mitkov distinguishes between anaphora and coreference resolution (from the paper):

While some authors use the terms coreference (resolution) and anaphora (resolution) interchangeably, it is worth noting that they are completely distinct terms or tasks [3]. Anaphora is cohesion which points back to some previous item, with the ‘pointing back’ word or phrase called an anaphor, and the entity to which it refers, or for which it stands, its antecedent. Coreference is the act of picking out the same referent in the real world. A specific anaphor and more than one of the preceding (or following) noun phrases may be coreferential, thus forming a coreferential chain of entities which have the same referent.

I am not sure why the “real world” is necessary in: “Coreference is the act of picking out the same referent in the real world.”

For topic maps, I would shorten it to: Coreference is the act of picking out the same referent. (full stop)

The paper is a useful review of coreference systems and quite unusually, reports a negative result:

This study sought to establish whether or not coreference resolution could have a positive impact on NLP applications, in particular on text summarisation, recognising textual entailment, and text categorisation. The evaluation results presented in Section 6 are in line with previous experiments conducted both by the present authors and other researchers: there is no statistically significant benefit brought by automatic coreference resolution to these applications. In this specific study, the employment of the coreference resolution system distributed in the BART toolkit generally evokes slight but not significant increases in performance and in some cases it even evokes a slight deterioration in the performance results of these applications. We conjecture that the lack of a positive impact is due to the success rate of the BART coreference resolution system which appears to be insufficient to boost performance of the aforementioned applications.

My conjecture is topic maps can boost coreference resolution enough to improve performance of NLP applications, including text summarisation, recognising textual entailment, and text categorisation.

What do you think?

How would you suggest testing that conjecture?

November 25, 2012

A first failed attempt at Natural Language Processing

Filed under: Natural Language Processing,Requirements — Patrick Durusau @ 1:40 pm

A first failed attempt at Natural Language Processing by Mark Needham

From the post:

One of the things I find fascinating about dating websites is that the profiles of people are almost identical so I thought it would be an interesting exercise to grab some of the free text that people write about themselves and prove the similarity.

I’d been talking to Matt Biddulph about some Natural Language Processing (NLP) stuff he’d been working on and he wrote up a bunch of libraries, articles and books that he’d found useful.

I started out by plugging the text into one of the many NLP libraries that Matt listed with the vague idea that it would come back with something useful.

I’m not sure exactly what I was expecting the result to be but after 5/6 hours of playing around with different libraries I’d got nowhere and parked the problem not really knowing where I’d gone wrong.

Last week I came across a paper titled “That’s What She Said: Double Entendre Identification” whose authors wanted to work out when a sentence could legitimately be followed by the phrase “that’s what she said”.

While the subject matter is a bit risque I found that reading about the way the authors went about solving their problem was very interesting and it allowed me to see some mistakes I’d made.

Vague problem statement

Unfortunately I didn’t do a good job of working out exactly what problem I wanted to solve – my problem statement was too general.

Question: How do you teach people how to create useful problem statements?

Pointers, suggestions?

November 18, 2012

European Parliament Proceedings Parallel Corpus 1996-2011

European Parliament Proceedings Parallel Corpus 1996-2011

From the webpage:

For a detailed description of this corpus, please read:

Europarl: A Parallel Corpus for Statistical Machine Translation, Philipp Koehn, MT Summit 2005, pdf.

Please cite the paper, if you use this corpus in your work. See also the extended (but earlier) version of the report (ps, pdf).

The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romanic (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavik (Bulgarian, Czech, Polish, Slovak, Slovene), Finni-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.

The goal of the extraction and processing was to generate sentence aligned text for statistical machine translation systems. For this purpose we extracted matching items and labeled them with corresponding document IDs. Using a preprocessor we identified sentence boundaries. We sentence aligned the data using a tool based on the Church and Gale algorithm.

Version 7, released in May of 2012, has around 60 million words per language.

Just in case you need a corpus for the EU.

I would be mindful of its parliamentary context. Semantic equivalence or similarity there may not hold true in other contexts.
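Getting started is mostly file handling; a minimal sketch, assuming the v7 distribution’s line-aligned plain-text files (e.g., europarl-v7.de-en.en and europarl-v7.de-en.de), where corresponding lines are aligned sentences.

```python
# Read sentence pairs from a (presumed) line-aligned Europarl language
# pair: line N of the source file translates line N of the target file.
import itertools

def aligned_pairs(src_path, tgt_path, limit=5):
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        for s, t in itertools.islice(zip(src, tgt), limit):
            yield s.strip(), t.strip()

for en, de in aligned_pairs("europarl-v7.de-en.en", "europarl-v7.de-en.de"):
    print(f"EN: {en}\nDE: {de}\n")
```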

A Language-Independent Approach to Keyphrase Extraction and Evaluation

A Language-Independent Approach to Keyphrase Extraction and Evaluation (2010) by Mari-sanna Paukkeri, Ilari T. Nieminen, Matti Pöllä and Timo Honkela.

Abstract:

We present Likey, a language-independent keyphrase extraction method based on statistical analysis and the use of a reference corpus. Likey has a very light-weight preprocessing phase and no parameters to be tuned. Thus, it is not restricted to any single language or language family. We test Likey having exactly the same configuration with 11 European languages. Furthermore, we present an automatic evaluation method based on Wikipedia intra-linking.

Useful approach for developing a rough-cut of keywords in documents. Keywords that may indicate a need for topics to represent subjects.

Interesting that:

Phrases occurring only once in the document cannot be selected as keyphrases.

I would have thought unique phrases would automatically qualify as keyphrases. But the ranking of phrases is calculated as a ratio between the reference corpus and the text, and a phrase occurring only once yields no ratio to rank, so unique phrases are excluded.

That sounds like a bug and not a feature to me.

My reasoning: phrases unique to an author are unique identifications of subjects. Certainly grist for a topic map mill.
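For the curious, a much-simplified sketch of Likey-style ranking: frequency-rank ratios against a reference corpus, where the toy “reference” below stands in for a real corpus. Note how a phrase absent from either ranking simply gets no score.

```python
# Simplified Likey-style keyness: the ratio of a word's frequency rank
# in the document to its rank in a reference corpus. Low ratio means
# unusually frequent here, hence key.
from collections import Counter

def freq_ranks(tokens):
    """Map each token to its frequency rank (1 = most frequent)."""
    counts = Counter(tokens)
    return {w: r for r, (w, _) in enumerate(counts.most_common(), start=1)}

def likey_scores(doc_tokens, ref_tokens):
    doc_rank, ref_rank = freq_ranks(doc_tokens), freq_ranks(ref_tokens)
    return sorted(
        ((w, doc_rank[w] / ref_rank[w]) for w in doc_rank if w in ref_rank),
        key=lambda pair: pair[1],
    )

doc = "topic maps merge topics topics identify subjects".split()
ref = ("the of and a to in is was he for it with as his on be at by i "
       "topics subjects merge identify maps topic").split()
for word, score in likey_scores(doc, ref)[:5]:
    print(f"{score:.2f}  {word}")
```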

Web based demonstration: http://cog.hut.fi/likeydemo/.

Mari-Sanna Paukkeri: Contact details and publications.

November 11, 2012

Python interface to Stanford Core NLP tools v1.3.3

Filed under: Natural Language Processing,Python,Stanford NLP — Patrick Durusau @ 5:25 am

Python interface to Stanford Core NLP tools v1.3.3

From the README.md:

This is a Python wrapper for Stanford University’s NLP group’s Java-based CoreNLP tools. It can either be imported as a module or run as a JSON-RPC server. Because it uses many large trained models (requiring 3GB RAM on 64-bit machines and usually a few minutes loading time), most applications will probably want to run it as a server.

  • Python interface to Stanford CoreNLP tools: tagging, phrase-structure parsing, dependency parsing, named entity resolution, and coreference resolution.
  • Runs a JSON-RPC server that wraps the Java server and outputs JSON.
  • Outputs parse trees which can be used by nltk.

It requires pexpect and (optionally) unidecode to handle non-ASCII text. This script includes and uses code from jsonrpc and python-progressbar.

It runs the Stanford CoreNLP jar in a separate process, communicates with the java process using its command-line interface, and makes assumptions about the output of the parser in order to parse it into a Python dict object and transfer it using JSON. The parser will break if the output changes significantly, but it has been tested on Core NLP tools version 1.3.3 released 2012-07-09.

If you have NLP requirements and work in Python, this may be of interest.
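A minimal usage sketch, assuming the module-level interface the README describes (a StanfordCoreNLP class whose parse() returns a JSON string); field names like “parsetree” and “coref” may differ between wrapper versions, so treat this as a shape, not a spec.

```python
# Sketch of module-level use of the wrapper; assumes the CoreNLP 1.3.3
# jars are available and parse() returns JSON, per the README.
import json
from corenlp import StanfordCoreNLP  # loads large models; this is slow

nlp = StanfordCoreNLP()
result = json.loads(nlp.parse("Hermione knows the book. She knows it."))

for sentence in result.get("sentences", []):
    print(sentence.get("parsetree"))   # field name per wrapper's README
print(result.get("coref"))             # coreference chains, if produced
```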

October 26, 2012

Open Source Natural Language Spell-Checker [Disambiguation at the point of origin.]

Automattic Open Sources Natural Language Spell-Checker After the Deadline by Jolie O’Dell.

I am sure the original headline made sense to its author, but I wonder how a natural language processor would react to it?

My reaction, being innocent of any prior knowledge of the actors or the software, was: What deadline? I read it as a report of a missed deadline.

It is almost a “who’s on first” type headline. The software’s name is “After the Deadline.”

That confusion resolved, I read:

Matt Mullenweg has just announced on his blog that WordPress parent company Automattic is open sourcing After the Deadline, a natural-language spell-checking plugin for WordPress and TinyMCE that was only recently ushered into the Automattic fold.

Scarcely seven weeks after its acquisition was announced, After the Deadline’s core technology is being released under the GPL. Moreover, writes Mullenweg, “There’s also a new jQuery API that makes it easy to integrate with any text area.”

Interested parties can check out this demo or read the tech overview and grab the source code here.

I can use spelling/grammar suggestions. Particularly since I make the same mistakes over and over again.

Does that also mean I talk about the same subjects/entities over and over again? Or at least a limited range of subjects/entities?

Imagine a user-configurable subject/entity “checker” that annotated recognized subjects/entities with an <a> element, enabling the user to accept/reject the annotation.

Disambiguation at the point of origin.
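A toy sketch of that checker, with a hand-made entity table standing in for a real entity detector and the accept/reject step stubbed out:

```python
# Wrap recognized entities in <a> elements; the user would accept or
# reject each annotation. The entity table is illustrative only.
import re

ENTITIES = {
    "Automattic": "http://automattic.com/",
    "After the Deadline": "http://www.afterthedeadline.com/",
}

def annotate(text, confirm=lambda name: True):
    for name, url in ENTITIES.items():
        if confirm(name):   # the accept/reject step, stubbed out here
            link = f'<a href="{url}">{name}</a>'
            text = re.sub(re.escape(name), link, text)
    return text

headline = ("Automattic Open Sources Natural Language Spell-Checker "
            "After the Deadline")
print(annotate(headline))
```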

The title of the original article could become:

“<a href="http://automattic.com/">Automattic</a> Open Sources Natural Language Spell-Checker <a href="http://www.afterthedeadline.com/">After the Deadline</a>”

Seems less ambiguous to me.

Certainly less ambiguous to a search engine.

You?

October 19, 2012

Ngram Viewer 2.0 [String Usage != Semantic Usage]

Filed under: GoogleBooks,Natural Language Processing,Ngram Viewer — Patrick Durusau @ 3:32 pm

Ngram Viewer 2.0 by Jon Orwant.

From the post:

Since launching the Google Books Ngram Viewer, we’ve been overjoyed by the public reception. Co-creator Will Brockman and I hoped that the ability to track the usage of phrases across time would be of interest to professional linguists, historians, and bibliophiles. What we didn’t expect was its popularity among casual users. Since the launch in 2010, the Ngram Viewer has been used about 50 times every minute to explore how phrases have been used in books spanning the centuries. That’s over 45 million graphs created, each one a glimpse into the history of the written word. For instance, comparing flapper, hippie, and yuppie, you can see when each word peaked:

(graphic omitted)

Meanwhile, Google Books reached a milestone, having scanned 20 million books. That’s approximately one-seventh of all the books published since Gutenberg invented the printing press. We’ve updated the Ngram Viewer datasets to include a lot of those new books we’ve scanned, as well as improvements our engineers made in OCR and in hammering out inconsistencies between library and publisher metadata. (We’ve kept the old dataset around for scientists pursuing empirical, replicable language experiments such as the ones Jean-Baptiste Michel and Erez Lieberman Aiden conducted for our Science paper.)

Tracking the usage of phrases through time is no mean feat, but tracking their semantics would be far more useful.

For example, “freedom of speech” did not have the same “semantics” in the early history of the United States that it does today. Otherwise, how would you explain criminal statutes against blasphemy and their enforcement after the ratification of the US Constitution? (I have not verified this, but Wikipedia, Blasphemy Law in the United States, reports a person being jailed for blasphemy in the 1830s.)

Or the guarantee of “freedom of speech” in Article 125 of the 1936 Constitution of the USSR.

Those three usages, current United States, early United States, USSR 1936 (English translation), don’t have the same semantics to me.

You?

October 14, 2012

Tech That Protects The President, Part 1: Data Mining

Filed under: Data Mining,Natural Language Processing,Semantics — Patrick Durusau @ 3:41 pm

Tech That Protects The President, Part 1: Data Mining by Alex Popescu.

From the post:

President Obama’s appearance at the Democratic National Convention in September took place amid a rat’s nest of perils. But the local Charlotte, North Carolina, police weren’t entirely on their own. They were aided by a sophisticated data mining system that helped them identify threats and react to them quickly. (Part 1 of a 3-part series about the technology behind presidential security.)

The Charlotte-Mecklenburg police used software from IxReveal to monitor the Internet for associations between Obama, the DNC, and potential threats. The company’s program, known as uReveal, combs news articles, status updates, blog posts, and discussion forum comments. But it doesn’t simply search for keywords. It works on concepts defined by the user and uses natural language processing to analyze plain English based on meaning and context, taking into account slang and sentiment. If it detects something amiss, the system sends real-time alerts.

“We are able to read and alert almost as fast as [information] comes on the Web, as opposed to other systems where it takes hours,” said Bickford, vice president of operations of IxReveal.

In the past, this kind of task would have required large numbers of people searching and then reading huge volumes of information and manually highlighting relevant references. “Normally you have to take information like an email and shove it into a database,” Bickford explained. “Someone has to physically read it or do a keyword search.”

uReveal, on the other hand, lets machines do the reading, tracking, and analysis. “If you apply our patented technology and natural language processing capability, you can actually monitor that information for specific keywords and phrases based on meaning and context,” he says. The software can differentiate between a Volkswagen bug, a computer bug and an insect bug, Bickford explained – or, more to the point, between a reference to fire from a gun barrel and one to fire in a fireplace.

Bickford says the days of people slaving over sifting through piles of data, or ETL (extract, transform and load) data processing capabilities are over. “It’s just not supportable.”
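The bug example is classic word-sense disambiguation. For a feel of the baseline, here is a minimal sketch using NLTK’s simplified Lesk implementation (requires the WordNet data via nltk.download("wordnet"); Lesk is a weak baseline and may well pick a wrong sense).

```python
# Disambiguate "bug" by context using NLTK's simplified Lesk: pick the
# WordNet sense whose gloss best overlaps the surrounding words.
from nltk.wsd import lesk

for sentence in ("The exterminator found a bug under the sink",
                 "The patch fixed a bug in the parser"):
    tokens = sentence.lower().split()
    sense = lesk(tokens, "bug", pos="n")
    if sense is not None:
        print(sentence)
        print("  ->", sense.name(), "-", sense.definition())
```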

I understand product promotion but do you think potential assassins are publishing letters to the editor, blogging or tweeting about their plans or operational details?

Granted, contract killers in Georgia are caught when someone tries to hire an undercover police officer as a “hit” man.

Does that expectation of dumbness apply in other cases as well?

Or, is searching large amounts of data like the drunk looking for his keys under the street light?

A case of “the light is better here?”

October 9, 2012

A Semantic Look at the Presidential Debates

Filed under: Debate,Natural Language Processing,Politics,Semantics — Patrick Durusau @ 3:30 pm

A Semantic Look at the Presidential Debates

Warning: For entertainment purposes only.*

Angela Guess reports:

Luca Scagliarini of Expert System reports, “This week’s presidential debate is being analyzed across the web on a number of fronts, from a factual analysis of what was said, to the number of tweets it prompted. Instead, we used our Cogito semantic engine to analyze the transcript of the debate through a semantic and linguistic lens. Cogito extracted the responses by question, breaking sentences down to their granular detail. This analysis allows us to look at the individual language elements to better understand what was said, as well as how the combined effect of word choice, sentence structure and sentence length might be interpreted by the audience.”

The full post: Presidential Debates 2012: Semantically speaking

*I don’t doubt the performance of the Cogito engine, just the semantics, if any, of the target content. 😉

October 8, 2012

Are Expert Semantic Rules so 1980’s?

In The Geometry of Constrained Structured Prediction: Applications to Inference and Learning of Natural Language Syntax, André Martins proposes advances in inference and learning for NLP processing. And it is important work for that reason.

But in his introduction to recent (and rapid) progress in language technologies, the following text caught my eye:

So, what is the driving force behind the aforementioned progress? Essentially, it is the alliance of two important factors: the massive amount of data that became available with the advent of the Web, and the success of machine learning techniques to extract statistical models from the data (Mitchell, 1997; Manning and Schütze, 1999; Schölkopf and Smola, 2002; Bishop, 2006; Smith, 2011). As a consequence, a new paradigm has emerged in the last couple of decades, which directs attention to the data itself, as opposed to the explicit representation of knowledge (Abney, 1996; Pereira, 2000; Halevy et al., 2009). This data-centric paradigm has been extremely fruitful in natural language processing (NLP), and came to replace the classic knowledge representation methodology which was prevalent until the 1980s, based on symbolic rules written by experts. (emphasis added)

Are RDF, Linked Data, topic maps, and other semantic technologies caught in a 1980’s “symbolic rules” paradigm?

Are we ready to make the same break that NLP did, what, thirty (30) years ago now?

To get started on the literature, consider André’s sources:

Abney, S. (1996). Statistical methods and linguistics. In The balancing act: Combining symbolic and statistical approaches to language, pages 1–26. MIT Press, Cambridge, MA.

A more complete citation: Steven Abney. Statistical Methods and Linguistics. In: Judith Klavans and Philip Resnik (eds.), The Balancing Act: Combining Symbolic and Statistical Approaches to Language. The MIT Press, Cambridge, MA. 1996. (Link is to PDF of Abney’s paper.)

Pereira, F. (2000). Formal grammar and information theory: together again? Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, 358(1769):1239–1253.

I added a pointer to the Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences abstract for the article. You can see it at: Formal grammar and information theory: together again? (PDF file).

Halevy, A., Norvig, P., and Pereira, F. (2009). The unreasonable effectiveness of data. Intelligent Systems, IEEE, 24(2):8–12.

I added a pointer to the Intelligent Systems, IEEE abstract for the article. You can see it at: The unreasonable effectiveness of data (PDF file).

The Halevy article doesn’t have an abstract per se but the ACM reports one as:

Problems that involve interacting with humans, such as natural language understanding, have not proven to be solvable by concise, neat formulas like F = ma. Instead, the best approach appears to be to embrace the complexity of the domain and address it by harnessing the power of data: if other humans engage in the tasks and generate large amounts of unlabeled, noisy data, new algorithms can be used to build high-quality models from the data. [ACM]

That sounds like a challenge to me. You?

PS: I saw the pointer to this thesis at Christophe Lalanne’s A bag of tweets / September 2012

October 6, 2012

ReFr: A New Open-Source Framework for Building Reranking Models

Filed under: Natural Language Processing,Ranking — Patrick Durusau @ 1:09 pm

ReFr: A New Open-Source Framework for Building Reranking Models by Dan Bikel and Keith Hall.

From the post:

We are pleased to announce the release of an open source, general-purpose framework designed for reranking problems, ReFr (Reranker Framework), now available at: http://code.google.com/p/refr/.

Many types of systems capable of processing speech and human language text produce multiple hypothesized outputs for a given input, each with a score. In the case of machine translation systems, these hypotheses correspond to possible translations from some sentence in a source language to a target language. In the case of speech recognition, the hypotheses are possible word sequences of what was said derived from the input audio. The goal of such systems is usually to produce a single output for a given input, and so they almost always just pick the highest-scoring hypothesis.

A reranker is a system that uses a trained model to rerank these scored hypotheses, possibly inducing a different ranked order. The goal is that by employing a second model after the fact, one can make use of additional information not available to the original model, and produce better overall results. This approach has been shown to be useful for a wide variety of speech and natural language processing problems, and was the subject of one of the groups at the 2011 summer workshop at Johns Hopkins’ Center for Language and Speech Processing. At that workshop, led by Professor Brian Roark of Oregon Health & Science University, we began building a general-purpose framework for training and using reranking models. The result of all this work is ReFr.

An interesting software package and you are going to pick up some coding experience as well.
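The core idea fits in a few lines; a minimal sketch (with made-up feature weights) of a second-pass model overturning the first-pass winner:

```python
# Reranking in miniature: a first-pass system emits scored hypotheses;
# a second model, with extra features, rescores them and may change
# the winner. Weights here are invented for illustration.
def rerank(hypotheses, weights):
    """hypotheses: list of (text, base_score, feature_dict)."""
    def second_pass(hyp):
        text, base, feats = hyp
        return base + sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return sorted(hypotheses, key=second_pass, reverse=True)

nbest = [
    ("recognize speech",   -1.2, {"lm_bigram": 0.8, "oov": 0}),
    ("wreck a nice beach", -1.1, {"lm_bigram": 0.1, "oov": 1}),
]
weights = {"lm_bigram": 2.0, "oov": -1.5}   # learned, in a real reranker
best, *_ = rerank(nbest, weights)
print(best[0])   # "recognize speech" overtakes the higher base score
```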

October 4, 2012

GATE, NLTK: Basic components of Machine Learning (ML) System

Filed under: Machine Learning,Natural Language Processing,NLTK — Patrick Durusau @ 4:03 pm

GATE, NLTK: Basic components of Machine Learning (ML) System by Krishna Prasad.

From the post:

I am currently building a Machine Learning system. In this blog I want to capture the elements of a machine learning system.

My definition of a Machine Learning System is to take voice or text inputs from a user and provide relevant information. And, over a period of time, learn the user’s behavior and provide better information. Let us hold on to this comment and dissect each element.

In the below example, we will consider only text input. Let us also assume that the text input will be a freeflowing English text.

  • As a 1st step, when someone enters a freeflowing text, we need to understand what is the noun, what is the verb, what is the subject and what is the predicate. For doing this we need a Parts of Speech analyzer (POS), for example “I want a Phone”. One of the components of Natural Language Processing (NLP) is POS.
  • For associating relationships between a noun and a number, like “Phone greater than 20 dollars”, we need to run the sentence through a rule engine. The terminology used for this is Semantic Rule Engine
  • The 3rd aspect is the Ontology, wherein each noun needs to translate to a specific product or a place. For example, if someone says “I want a Bike” it should translate as “I want a Bicycle” and it should interpret that the company that manufactures a bicycle is BSA, or a Trac. We typically need to build a Product Ontology
  • Finally if you have buying pattern of a user and his friends in the system, we need a Recommendation Engine to give the user a proper recommendation
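The POS step in the outline above can be tried directly with NLTK (a current NLTK needs nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")):

```python
# Part-of-speech tagging of free-flowing text with NLTK's default tagger.
import nltk

tokens = nltk.word_tokenize("I want a phone greater than 20 dollars")
print(nltk.pos_tag(tokens))
# e.g. [('I', 'PRP'), ('want', 'VBP'), ('a', 'DT'), ('phone', 'NN'), ...]
```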

What would you add (or take away) to make the outlined system suitable as a topic map authoring assistant?

Feel free to add more specific requirements/capabilities.

I first saw this at DZone.

September 26, 2012

KONVENS2012: The 11th Conference on Natural Language Processing (proceedings)

Filed under: Natural Language Processing — Patrick Durusau @ 4:05 pm

KONVENS2012: The 11th Conference on Natural Language Processing (proceedings) Vienna, September 19-21, 2012

As is usually the case, one find (corpus analysis) leads to another.

In this case a very interesting set of conference proceedings on natural language processing.

Just scanning the titles I see several that will be of interest to topic mappers.

Enjoy!

September 25, 2012

conceptClassifier for SharePoint 2010

Filed under: Natural Language Processing,Searching,SharePoint — Patrick Durusau @ 2:12 pm

conceptClassifier for SharePoint 2010 (PDF – White paper on conceptClassifier)

I encountered this white paper in a post at Beyond Search: Concept Searching Enrolls University of California.

Comparison of Sharepoint 2010 to FAST Search and conceptClassifier:

(comparison graphics omitted)

A comparison to other Sharepoint enhancement tools would be more useful.

Did you see anything particularly remarkable in the listed capabilities?

May not be common for Sharepoint users but auto-tagging of content has been a mainstay of NLP projects for decades.

September 24, 2012

Alpinism & Natural Language Processing

Filed under: Natural Language Processing,Topic Maps — Patrick Durusau @ 3:53 pm

Alpinism & Natural Language Processing

You will find a quote in this posting that reads:

“In linguistics and cultural studies, the change of language use over time, special terminology and cultural shifts are of interest. The “speaking” about mountains is characterised by cultural, historical and social factors; therefore, language use can be viewed as a mirror of these factors. The extra-linguistic world, the essence of a culture, can be reconstructed through analyzing language use within alpine literature in terms of temporal and local specifics that emerged from this typical use of language (Bubenhofer, 2009). For instance, frequent use of personal pronouns and specific intensifiers in texts between 1930 and 1950 can be interpreted as a shift to a more subjective, personal role that mountaineering played in society. In contrary, between 1880 and 1900, the language surface shows less emotionality which probably is a mirror of a period when the mountain pioneers claimed more seriousness (Bubenhofer and Schröter, 2010).”

I thought this might prove interesting to topic map friends who live in areas where mountains and mountain climbing are common.

August 19, 2012

Concept Annotation in the CRAFT corpus

Filed under: Bioinformatics,Biomedical,Corpora,Natural Language Processing — Patrick Durusau @ 4:47 pm

Concept Annotation in the CRAFT corpus by Michael Bada, Miriam Eckert, Donald Evans, Kristin Garcia, Krista Shipley, Dmitry Sitnikov, William A. Baumgartner, K. Bretonnel Cohen, Karin Verspoor, Judith A. Blake and Lawrence E. Hunter. BMC Bioinformatics 2012, 13:161, doi:10.1186/1471-2105-13-161.

Abstract:

Background

Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text.

Results

This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement.

Conclusions

As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.

Lessons on what it takes to create a “gold standard” corpus to advance NLP application development.

What do you think the odds are of “high inter[author] agreement” in the absence of such planning and effort?

Sorry, I meant “high interannotator agreement.”

Guess we have to plan for “low inter[author] agreement.”

Suggestions?
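For the record, interannotator (or inter[author]) agreement is usually quantified with a chance-corrected statistic such as Cohen’s kappa; a minimal sketch for two annotators over the same mentions, with made-up labels:

```python
# Cohen's kappa: observed agreement corrected for the agreement two
# annotators would reach by chance, given their label distributions.
from collections import Counter

def cohens_kappa(a, b):
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

ann1 = ["gene", "gene", "cell", "chem", "gene", "cell"]
ann2 = ["gene", "cell", "cell", "chem", "gene", "cell"]
print(f"kappa = {cohens_kappa(ann1, ann2):.2f}")
```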

Gold Standard (or Bronze, Tin?)

A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools by Karin M Verspoor, Kevin B Cohen, Arrick Lanfranchi, Colin Warner, Helen L Johnson, Christophe Roeder, Jinho D Choi, Christopher Funk, Yuriy Malenkiy, Miriam Eckert, Nianwen Xue, William A Baumgartner, Michael Bada, Martha Palmer and Lawrence E Hunter. BMC Bioinformatics 2012, 13:207 doi:10.1186/1471-2105-13-207.

Abstract:

Background

We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus.

Results

Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data.

Conclusions

The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full text publications.

This is the article that I discovered and then worked my way to it from BioNLP.

Important as a deeply annotated text corpus.

But also a reminder that human annotators created the “gold standard,” against which other efforts are judged.

If you are ill, do you want gold standard research into the medical literature (which involves librarians)? Or is bronze or tin standard research good enough?

PS: I will be going back to pickup the other resources as appropriate.

CRAFT: THE COLORADO RICHLY ANNOTATED FULL TEXT CORPUS

Filed under: Bioinformatics,Biomedical,Corpora,Natural Language Processing — Patrick Durusau @ 3:41 pm

CRAFT: THE COLORADO RICHLY ANNOTATED FULL TEXT CORPUS

From the Quick Facts:

  • 67 full text articles
  • >560,000 Tokens
  • >21,000 Sentences
  • ~100,000 concept annotations to 7 different biomedical ontologies/terminologies
    • Chemical Entities of Biological Interest (ChEBI)
    • Cell Type Ontology (CL)
    • Entrez Gene
    • Gene Ontology (biological process, cellular component, and molecular function)
    • NCBI Taxonomy
    • Protein Ontology
    • Sequence Ontology
  • Penn Treebank markup for each sentence
  • Multiple output formats available

Let’s see: 67 articles resulted in 100,000 concept annotations, or about 1,493 per article for seven (7) ontologies/terminologies.

Ready to test this mapping out in your topic map application?

finding names in common crawl

Filed under: Common Crawl,Natural Language Processing — Patrick Durusau @ 1:34 pm

finding names in common crawl by Mat Kelcey.

From the post:

the central offering from common crawl is the raw bytes they’ve downloaded and, though this is useful for some people, a lot of us just want the visible text of web pages. luckily they’ve done this extraction as a part of post processing the crawl and it’s freely available too!

If you don’t know “common crawl,” now would be a good time to meet the project.

From their webpage:

Common Crawl is a non-profit foundation dedicated to building and maintaining an open crawl of the web, thereby enabling a new wave of innovation, education and research.

Mat gets you started by looking for names in the common crawl data set.
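Mat’s post has the real details; for flavor, here is the crudest possible name finder over visible text (consecutive capitalized tokens), of the sort trained NER systems replace:

```python
# Naive proper-name candidates: runs of two or more Capitalized tokens,
# skipping tokens that immediately follow sentence-ending punctuation.
import re
from collections import Counter

def candidate_names(text):
    pattern = r"(?<![.!?]\s)\b([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)\b"
    return re.findall(pattern, text)

text = ("Common Crawl was discussed by Mat Kelcey. "
        "The talk mentioned Lisa Green and the foundation.")
print(Counter(candidate_names(text)).most_common())
```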

August 12, 2012

Calais Release 4.6 is Available for Beta Testing [Through 23rd of August]

Filed under: Natural Language Processing,OpenCalais — Patrick Durusau @ 12:16 pm

Calais Release 4.6 is Available for Beta Testing [Through 23rd of August]

From the post:

As we mentioned in our prior post, Version 4.6 of OpenCalais is now available for beta testing. While we should have 100% backward compatibility – it’s always a good idea to run a set of transactions through and make sure there are no issues.

You’ll see a number of new things in this release:

  • Under the covers we’ve upgraded our core processing engine. While this won’t directly affect you as an end user – it does set the stage for further improvements in the future.
  • We’ve improved the quality of the Company and Person extraction. Not surprisingly, these are two of our most frequently used concepts and we want them to be insanely great – we’re getting there.
  • We’ve updated and refreshed our Social Tags feature. If you haven’t had a chance to experiment with Social Tags in the past, give it a try. This is a great way to immediately improve the “findability” of your content.
  • We’ve introduced six new concepts that we’ll discuss below.

  • PersonParty extracts information about the affiliation of a person with a political party.
  • CandidatePosition extracts information on past, current and aspirational political positions for a candidate.
  • ArmedAttack extracts information regarding an attack by a person or organization on a country, organization or political figure.
  • MilitaryAction extracts references to non-combative military actions such as troop deployments or movements.
  • ArmsPurchaseSale extracts information on planned, proposed or consummated arms sales.
  • PersonLocation extracts information on where a person lives or is traveling.

So, it’s the Politics and Conflict pack – always popular topics.

More details at the post (including release notes).

Get your comments in early! Planned end of beta test: 23rd of August 2012.
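If you are beta testing, here is a sketch of pushing the same text through the REST interface so you can diff 4.6 output against 4.5. The endpoint and header names follow the OpenCalais documentation as I recall it; verify them against the release notes before relying on this.

```python
# Sketch of tagging text via the OpenCalais REST interface. Endpoint,
# header names, and response shape are as I recall the docs; verify.
import requests

CALAIS_URL = "http://api.opencalais.com/tag/rs/enrich"

def tag(text, api_key):
    headers = {
        "x-calais-licenseID": api_key,   # your API key
        "Content-Type": "text/raw",
        "Accept": "application/json",
    }
    return requests.post(CALAIS_URL, data=text.encode("utf-8"),
                         headers=headers).json()

result = tag("The senator announced an arms sale to the republic.",
             api_key="YOUR_KEY")
for uri, node in result.items():
    if isinstance(node, dict) and node.get("_type"):
        print(node["_type"], "-", node.get("name"))
```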

August 10, 2012

MedLingMap

Filed under: Medical Informatics,Natural Language Processing — Patrick Durusau @ 9:27 am

MedLingMap

From the “welcome” entry:

MedLingMap is a growing resource providing a map of NLP systems and research in the Medical Domain. The site is being developed as part of the NLP Systems in the Medical Domain course in Brandeis University’s Computational Linguistics Master’s Program, taught by Dr. Marie Meteer. Learn more about the students doing the work.

MedLingMap brings together the many different references, resources, organizations, and people in this very diverse domain. By using a faceted indexing approach to organizing the materials, MedLingMap can capture not only the content, but also the context by including facets such as the applications of the technology, the research or development group it was done by, and the techniques and algorithms that were utilized in developing the technology.

Not a lot of resources listed but every project has to start somewhere.

Capturing the use of specific techniques and algorithms will make this a particularly useful resource.
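A faceted index is, at bottom, a map from facet/value pairs to resources; a minimal sketch (the entries are illustrative, not MedLingMap’s data):

```python
# Minimal faceted index: (facet, value) pairs map to resource ids, and
# a query intersects the sets for every requested pair.
from collections import defaultdict

class FacetedIndex:
    def __init__(self):
        self.index = defaultdict(set)   # (facet, value) -> resource ids

    def add(self, resource, **facets):
        for facet, value in facets.items():
            self.index[(facet, value)].add(resource)

    def query(self, **facets):
        """Resources matching every given facet/value pair."""
        sets = [self.index[(f, v)] for f, v in facets.items()]
        return set.intersection(*sets) if sets else set()

idx = FacetedIndex()
idx.add("MetaMap", application="concept mapping", technique="symbolic")
idx.add("cTAKES", application="information extraction",
        technique="statistical")
print(idx.query(technique="symbolic"))   # {'MetaMap'}
```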

August 5, 2012

> 4,000 Ways to say “You’re OK” [Breast Cancer Diagnosis]

The feasibility of using natural language processing to extract clinical information from breast pathology reports by Julliette M. Buckley, et al.

Abstract:

Objective: The opportunity to integrate clinical decision support systems into clinical practice is limited due to the lack of structured, machine readable data in the current format of the electronic health record. Natural language processing has been designed to convert free text into machine readable data. The aim of the current study was to ascertain the feasibility of using natural language processing to extract clinical information from >76,000 breast pathology reports.

Approach and Procedure: Breast pathology reports from three institutions were analyzed using natural language processing software (Clearforest, Waltham, MA) to extract information on a variety of pathologic diagnoses of interest. Data tables were created from the extracted information according to date of surgery, side of surgery, and medical record number. The variety of ways in which each diagnosis could be represented was recorded, as a means of demonstrating the complexity of machine interpretation of free text.

Results: There was widespread variation in how pathologists reported common pathologic diagnoses. We report, for example, 124 ways of saying invasive ductal carcinoma and 95 ways of saying invasive lobular carcinoma. There were >4000 ways of saying invasive ductal carcinoma was not present. Natural language processor sensitivity and specificity were 99.1% and 96.5% when compared to expert human coders.

Conclusion: We have demonstrated how a large body of free text medical information such as seen in breast pathology reports, can be converted to a machine readable format using natural language processing, and described the inherent complexities of the task.

The advantages of using current language practices include:

  • No new vocabulary needs to be developed.
  • No adoption curve for a new vocabulary.
  • No training required for users to introduce the new vocabulary.
  • Works with historical data.

and I am sure there are others.

Add natural language usage to your topic map for immediately useful results for your clients.
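A minimal sketch of that last point: map observed surface forms onto one canonical subject, so reports using different spellings merge. The variant table is illustrative; the paper counts 124 surface forms for this one diagnosis.

```python
# Normalize diagnosis surface forms to canonical concepts, so "IDC",
# "invasive ductal carcinoma", etc. all pick out the same subject.
# The variant table is illustrative, not from the paper.
import re

VARIANTS = {
    "invasive ductal carcinoma": "DX:INVASIVE_DUCTAL_CARCINOMA",
    "infiltrating ductal carcinoma": "DX:INVASIVE_DUCTAL_CARCINOMA",
    "idc": "DX:INVASIVE_DUCTAL_CARCINOMA",
    "invasive lobular carcinoma": "DX:INVASIVE_LOBULAR_CARCINOMA",
}

def normalize(report_text):
    found = set()
    text = report_text.lower()
    for surface, concept in VARIANTS.items():
        if re.search(r"\b" + re.escape(surface) + r"\b", text):
            found.add(concept)
    return found

print(normalize("Final diagnosis: Infiltrating ductal carcinoma, grade 2."))
# {'DX:INVASIVE_DUCTAL_CARCINOMA'}
```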

July 7, 2012

Natural Language Processing | Hub

Natural Language Processing | Hub

From the “about” page:

NLP|Hub is an aggregator of news about Natural Language Processing and other related topics, such as Text Mining, Information Retrieval, Linguistics or Machine Learning.

NLP|Hub finds, collects and arranges related news from different sites, from academic webs to company blogs.

NLP|Hub is a product of Cilenis, a company specialized in Natural Language Processing.

If you have interesting posts for NLP|Hub, or if you do not want NLP|Hub indexing your text, please contact us at info@cilenis.com

Definitely going on my short list of sites to check!

On the origin of long-range correlations in texts

Filed under: Natural Language Processing,Text Analytics — Patrick Durusau @ 2:53 pm

On the origin of long-range correlations in texts by Eduardo G. Altmann, Giampaolo Cristadoro, and Mirko Degli Esposti.

Abstract:

The complexity of human interactions with social and natural phenomena is mirrored in the way we describe our experiences through natural language. In order to retain and convey such a high dimensional information, the statistical properties of our linguistic output has to be highly correlated in time. An example are the robust observations, still largely not understood, of correlations on arbitrary long scales in literary texts. In this paper we explain how long-range correlations flow from highly structured linguistic levels down to the building blocks of a text (words, letters, etc..). By combining calculations and data analysis we show that correlations take form of a bursty sequence of events once we approach the semantically relevant topics of the text. The mechanisms we identify are fairly general and can be equally applied to other hierarchical settings.

Another area of arXiv.org, Physics > Data Analysis, Statistics and Probability, to monitor. 😉

The authors used ten (10) novels from Project Gutenberg:

  • Alice’s Adventures in Wonderland
  • The Adventures of Tom Sawyer
  • Pride and Prejudice
  • Life on the Mississippi
  • The Jungle
  • The Voyage of the Beagle
  • Moby Dick; or The Whale
  • Ulysses
  • Don Quixote
  • War and Peace

Interesting research that will take a while to digest, but I have to wonder: why these ten (10) novels?

Or perhaps better, in an age of “big data,” why only ten (10)?

Why not the entire corpus of Project Gutenberg?

Or perhaps the texts of Wikipedia in its multitude of languages?

Reasoning that if the results represent an insight about natural language, they should be applicable beyond English. Yes?

If this is your area, comments and suggestions would be most welcome.
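If you want to poke at the measurement itself, the core is simple: turn a word’s occurrences into a 0/1 series over a text’s tokens and compute its autocorrelation at increasing lags. A minimal numpy sketch over any plain-text Gutenberg file (the filename here is a placeholder):

```python
# Autocorrelation of a word's occurrence series: long-range correlation
# shows up as slowly decaying values across large lags.
import re
import numpy as np

def occurrence_series(path, word):
    text = open(path, encoding="utf-8").read().lower()
    tokens = re.findall(r"[a-z']+", text)
    return np.array([t == word for t in tokens], dtype=float)

def autocorrelation(x, lags):
    x = x - x.mean()
    var = (x * x).sum()
    return [float((x[:-lag] * x[lag:]).sum() / var) for lag in lags]

series = occurrence_series("moby_dick.txt", "whale")  # any Gutenberg text
print(autocorrelation(series, lags=[1, 10, 100, 1000]))
```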

June 5, 2012

Dominic Widdows

Filed under: Data Mining,Natural Language Processing,Researchers,Visualization — Patrick Durusau @ 7:57 pm

While tracking references, I ran across the homepage of Dominic Widdows at Google.

Actually I found the Papers and Publications page for Dominic Widdows and then found his homepage. 😉

There is much to be read here.

DBLP page for Dominic Widdows.

June 3, 2012

Reconcile – Coreference Resolution Engine

Filed under: Coreference Resolution,Natural Language Processing — Patrick Durusau @ 3:36 pm

Reconcile – Coreference Resolution Engine

While we are on the topic of NLP tools:

Reconcile is an automatic coreference resolution system that was developed to provide a stable test-bed for researchers to implement new ideas quickly and reliably. It achieves roughly state of the art performance on many of the most common coreference resolution test sets, such as MUC-6, MUC-7, and ACE. Reconcile comes ready out of the box to train and test on these common data sets (though the data sets are not provided) as well as the ability to run on unlabeled texts. Reconcile utilizes supervised machine learning classifiers from the Weka toolkit, as well as other language processing tools such as the Berkeley Parser and Stanford Named Entity Recognition system.

The source language is Java, and it is freely available under the GPL.

Just in case you want to tune/tweak your coreference resolution against your data sets.

