Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

June 3, 2012

FreeLing 3.0 – An Open Source Suite of Language Analyzers

FreeLing 3.0 – An Open Source Suite of Language Analyzers

Features:

Main services offered by FreeLing library:

  • Text tokenization
  • Sentence splitting
  • Morphological analysis
  • Suffix treatment, retokenization of clitic pronouns
  • Flexible multiword recognition
  • Contraction splitting
  • Probabilistic prediction of unknown word categories
  • Named entity detection
  • Recognition of dates, numbers, ratios, currency, and physical magnitudes (speed, weight, temperature, density, etc.)
  • PoS tagging
  • Chart-based shallow parsing
  • Named entity classification
  • WordNet based sense annotation and disambiguation
  • Rule-based dependency parsing
  • Nominal coreference resolution

[Not all features are supported for all languages, see Supported Languages.]

TOC for the user manual.

Something for your topic map authoring toolkit!

(Source: Jack Park)

Creating a Semantic Graph from Wikipedia

Creating a Semantic Graph from Wikipedia by Ryan Tanner, Trinity University.

Abstract:

With the continued need to organize and automate the use of data, solutions are needed to transform unstructured text into structured information. By treating dependency grammar functions as programming language functions, this process produces “property maps” which connect entities (people, places, events) with snippets of information. These maps are used to construct a semantic graph. By inputting Wikipedia, a large graph of information is produced representing a section of history. The resulting graph allows a user to quickly browse a topic and view the interconnections between entities across history.

Of particular interest is Ryan’s approach to the problem:

Most approaches to this problem rely on extracting as much information as possible from a given input. My approach comes at the problem from the opposite direction and tries to extract a little bit of information very quickly but over an extremely large input set. My hypothesis is that by doing so a large collection of texts can be quickly processed while still yielding useful output.

A refreshing change from semantic orthodoxy, and one with a happy result.

Printing the thesis now for a close read.

(Source: Jack Park)

May 27, 2012

Mihai Surdeanu

Filed under: Natural Language Processing,Researchers — Patrick Durusau @ 7:23 pm

I ran across Mihai Surdeanu’s publication page while hunting down an NLP article.

There are pages for software and other resources as well.

Enjoy!

LREC Conferences

Filed under: Conferences,Natural Language Processing — Patrick Durusau @ 10:28 am

LREC Conferences

From the webpage:

The International Conference on Language Resources and Evaluation is organised by ELRA biennially with the support of institutions and organisations involved in HLT.

LREC Conferences bring together a large number of people working and interested in HLT.

Full proceedings, including workshops, tutorials, papers, etc., are available from 2002 forward!

I almost forgot to hit “save” for this post because I was reading a tutorial on Arabic parsing. 😉

You really owe it to yourself to see this resource.

Hundreds of papers at each conference on issues relevant to your processing of texts.

Getting a paper accepted here should be your goal after seeing the prior proceedings!

Once you get excited about the prior proceedings and perhaps attending in the future, here is my question:

How do you make the proceedings from prior conferences effectively available?

Subject to indexing/search over the WWW now but that isn’t what I mean.

How do you trace the development of techniques or ideas across conferences or papers, without having to read each and every paper?

Moreover, can you save those who follow you the time/trouble of reading every paper to duplicate your results?

May 18, 2012

Using BerkeleyDB to Create a Large N-gram Table

Filed under: BerkeleyDB,N-Gram,Natural Language Processing,Wikipedia — Patrick Durusau @ 3:16 pm

Using BerkeleyDB to Create a Large N-gram Table by Richard Marsden.

From the post:

Previously, I showed you how to create N-Gram frequency tables from large text datasets. Unfortunately, when used on very large datasets such as the English language Wikipedia and Gutenberg corpora, memory limitations limited these scripts to unigrams. Here, I show you how to use the BerkeleyDB database to create N-gram tables of these large datasets.

Large datasets such as the Wikipedia and Gutenberg English language corpora cannot be used to create N-gram frequency tables using the previous script due to the script’s large in-memory requirements. The solution is to create the frequency table as a disk-based dataset. For this, the BerkeleyDB database in key-value mode is ideal. This is an open source “NoSQL” library which supports a disk based database and in-memory caching. BerkeleyDB can be downloaded from the Oracle website, and also ships with a number of Linux distributions, including Ubuntu. To use BerkeleyDB from Python, you will need the bsddb3 package. This is included with Python 2.* but is an additional download for Python 3 installations.
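To make the idea concrete, here is a minimal sketch of a disk-based bigram frequency table using bsddb3. This is my illustration, not Richard’s script (which he will post on his blog); the corpus filename is just a placeholder.

```python
# A minimal sketch, not Richard Marsden's script: count bigrams into a
# disk-based BerkeleyDB B-tree so memory use stays flat regardless of
# corpus size. "corpus.txt" is a placeholder for your plain-text corpus.
import re
import bsddb3

db = bsddb3.btopen("bigrams.db", "c")  # created if it does not exist

with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        tokens = re.findall(r"[a-z']+", line.lower())
        for first, second in zip(tokens, tokens[1:]):
            key = ("%s %s" % (first, second)).encode("utf-8")
            count = int(db[key].decode()) if key in db else 0
            db[key] = str(count + 1).encode("utf-8")

db.sync()
db.close()
```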

Richard promises to make the resulting data sets available as an Azure service. Sample code, etc., will be posted to his blog.

Another Wikipedia based analysis.

Interannotator Agreement for Chunking Tasks like Named Entities and Phrases

Filed under: Annotation,LingPipe,Natural Language Processing — Patrick Durusau @ 2:40 pm

Interannotator Agreement for Chunking Tasks like Named Entities and Phrases

Bob Carpenter writes:

Krishna writes,

I have a question about using the chunking evaluation class for inter annotation agreement : how can you use it when the annotators might have missing chunks I.e., if one of the files contains more chunks than the other.

The answer’s not immediately obvious because the usual application of interannotator agreement statistics is to classification tasks (including things like part-of-speech tagging) that have a fixed number of items being annotated.

An issue that is likely to come up in crowd sourcing analysis/annotation of text as well.
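Until you read Bob’s full answer, here is a minimal sketch of one common workaround (my illustration, not LingPipe’s chunking evaluation class): treat one annotator as the reference, the other as the response, and score exact span matches, so missing chunks simply lower recall.

```python
# A minimal sketch (not LingPipe's chunking evaluation class): agreement
# between two chunk annotations as precision/recall/F1 over exact spans.
def chunk_agreement(reference, response):
    """Each argument is a set of (start, end, label) tuples."""
    matched = len(reference & response)
    precision = matched / len(response) if response else 0.0
    recall = matched / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: annotator B missed one chunk that annotator A marked.
a = {(0, 5, "PER"), (12, 20, "LOC"), (30, 34, "ORG")}
b = {(0, 5, "PER"), (12, 20, "LOC")}
print(chunk_agreement(a, b))  # (1.0, 0.666..., 0.8)
```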

May 15, 2012

Natural Language Processing – Nearly Universal Case?

Filed under: Natural Language Processing — Patrick Durusau @ 1:54 pm

I was reading a paper on natural language processing (NLP) when it occurred to me to ask:

When is parsing of any data not natural language processing?

I hear the phrase, “natural language processing,” applied to a corpus of emails, blog posts, web pages, electronic texts, transcripts of international phone calls and the like.

Other than following others out of habit, why do we say those are subject to “natural language processing?”

As opposed to say a database schema?

When we “process” the column headers in a database schema, aren’t we engaged in “natural language processing?” What about SGML/XML schemas or instances they govern?

Being mindful of semantics, synonymy and polysemy, it’s hard to think of examples that are not “natural language processing.”

At least for data that would be meaningful if read by a person. Streams of numbers perhaps not, but the symbolism that defines their processing, I would argue, falls under natural language processing.

Thoughts?

May 13, 2012

Multilingual Natural Language Processing Applications: From Theory to Practice

Filed under: Multilingual,Natural Language Processing — Patrick Durusau @ 12:35 pm

Multilingual Natural Language Processing Applications: From Theory to Practice by Daniel Bikel and Imed Zitouni.

From the description:

Multilingual Natural Language Processing Applications is the first comprehensive single-source guide to building robust and accurate multilingual NLP systems. Edited by two leading experts, it integrates cutting-edge advances with practical solutions drawn from extensive field experience.

Part I introduces the core concepts and theoretical foundations of modern multilingual natural language processing, presenting today’s best practices for understanding word and document structure, analyzing syntax, modeling language, recognizing entailment, and detecting redundancy.

Part II thoroughly addresses the practical considerations associated with building real-world applications, including information extraction, machine translation, information retrieval/search, summarization, question answering, distillation, processing pipelines, and more.

This book contains important new contributions from leading researchers at IBM, Google, Microsoft, Thomson Reuters, BBN, CMU, University of Edinburgh, University of Washington, University of North Texas, and others.

Coverage includes

Core NLP problems, and today’s best algorithms for attacking them

  • Processing the diverse morphologies present in the world’s languages
  • Uncovering syntactical structure, parsing semantics, using semantic role labeling, and scoring grammaticality
  • Recognizing inferences, subjectivity, and opinion polarity
  • Managing key algorithmic and design tradeoffs in real-world applications
  • Extracting information via mention detection, coreference resolution, and events
  • Building large-scale systems for machine translation, information retrieval, and summarization
  • Answering complex questions through distillation and other advanced techniques
  • Creating dialog systems that leverage advances in speech recognition, synthesis, and dialog management
  • Constructing common infrastructure for multiple multilingual text processing applications

This book will be invaluable for all engineers, software developers, researchers, and graduate students who want to process large quantities of text in multiple languages, in any environment: government, corporate, or academic.

I could not bring myself to buy it for Carol (Mother’s Day) so I will have to wait for Father’s Day (June). 😉

If you get it before then, comments welcome!

May 2, 2012

Natural Language Processing (almost) from Scratch

Filed under: Artificial Intelligence,Natural Language Processing,Neural Networks,SENNA — Patrick Durusau @ 2:18 pm

Natural Language Processing (almost) from Scratch by Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa.

Abstract:

We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including: part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.

In the introduction the authors remark:

The overwhelming majority of these state-of-the-art systems address a benchmark task by applying linear statistical models to ad-hoc features. In other words, the researchers themselves discover intermediate representations by engineering task-specific features. These features are often derived from the output of preexisting systems, leading to complex runtime dependencies. This approach is effective because researchers leverage a large body of linguistic knowledge. On the other hand, there is a great temptation to optimize the performance of a system for a specific benchmark. Although such performance improvements can be very useful in practice, they teach us little about the means to progress toward the broader goals of natural language understanding and the elusive goals of Artificial Intelligence.

I am not an AI enthusiast but I agree that pre-judging linguistic behavior (based on our own) in a data set will find no more (or less) linguistic behavior than our judgment allows. Reliance on the research of others just adds more opinions to our own. Have you ever wondered on what basis we accept the judgments of others?

A very deep and annotated dive into NLP approaches (the authors’ and others’) with pointers to implementations, data sets and literature.

In case you are interested, the source code is available at: SENNA (Semantic/syntactic Extraction using a Neural Network Architecture)
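For a feel of the “window approach” the paper describes, here is a highly simplified sketch in NumPy. This is my illustration only, nothing like SENNA’s C implementation, and training of the network is omitted entirely.

```python
# A highly simplified sketch (pure NumPy, not SENNA): look up embeddings for a
# word and its neighbors, feed the concatenated window through one hidden
# layer, and score each tag. Weights are random; training is omitted.
import numpy as np

rng = np.random.default_rng(0)
vocab = {"<pad>": 0, "the": 1, "cat": 2, "sat": 3}
tags = ["DET", "NOUN", "VERB"]

emb_dim, window, hidden = 8, 3, 16
E = rng.normal(size=(len(vocab), emb_dim))        # word embeddings
W1 = rng.normal(size=(window * emb_dim, hidden))  # hidden layer
W2 = rng.normal(size=(hidden, len(tags)))         # output layer

def tag_scores(sentence, position):
    # Collect word ids for the window around `position`, padding at the edges.
    ids = []
    for offset in range(-(window // 2), window // 2 + 1):
        i = position + offset
        word = sentence[i] if 0 <= i < len(sentence) else "<pad>"
        ids.append(vocab.get(word, vocab["<pad>"]))
    x = E[ids].reshape(-1)   # concatenated window embeddings
    h = np.tanh(x @ W1)      # learned internal representation
    return h @ W2            # one score per tag

sentence = ["the", "cat", "sat"]
print(tags[int(np.argmax(tag_scores(sentence, 1)))])  # tag for "cat" (untrained)
```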

April 29, 2012

Text Analytics Summit Europe – highlights and reflections

Filed under: Analytics,Natural Language Processing,Text Analytics — Patrick Durusau @ 2:01 pm

Text Analytics Summit Europe – highlights and reflections by Tony Russell-Rose.

Earlier this week I had the privilege of attending the Text Analytics Summit Europe at the Royal Garden Hotel in Kensington. Some of you may of course recognise this hotel as the base for Justin Bieber’s recent visit to London, but sadly (or is that fortunately?) he didn’t join us. Next time, maybe…

Ranking reasons to attend:

  • #1 Text Analytics Summit Europe – meet other attendees, presentations
  • #2 Kensington Gardens and Hyde Park (been there, it is more impressive than you can imagine)
  • #N +1 Justin Bieber being in London (or any other location)

I was disappointed by the lack of links to slides or videos of the presentations.

Tony’s post does have pointers to people and resources you may have missed.

Question: Do you think “text analytics” and “data mining” are different? If so, how?

April 11, 2012

Reviews and Natural Language Processing: Clustering

Filed under: Clustering,Natural Language Processing — Patrick Durusau @ 6:15 pm

Reviews and Natural Language Processing: Clustering

From the post:

This quote initiated a Natural Language investigation into the HomeAway Review corpus: do the Traveler reviews (of properties) adhere to some set of standards? Reviews contain text and a “star” rating; does the text align with the rating? Analyzing its various corpora with Natural Language Processing tools allows HomeAway to better listen to – and better serve – its customers.

Interesting. HomeAway is a vacation rental marketplace and so has a pressing interest in the analysis of reviews.

Promises to be a very good grounding in NLP as applied to reviews. Worth watching closely.

March 23, 2012

Excellent Papers for 2011 (Google)

Filed under: HCIR,Machine Learning,Multimedia,Natural Language Processing — Patrick Durusau @ 7:23 pm

Excellent Papers for 2011 (Google)

Corinna Cortes and Alfred Spector of Google Research have collected up great papers published by Googlers in 2011.

To be sure there are the obligatory papers on searching and natural language processing but there are also papers on audio processing, human-computer interfaces, multimedia, systems and other topics.

Many of these will be the subjects of separate posts in the future. For now, peruse at your leisure and sing out when you see one of special interest.

March 11, 2012

Challenges of Chinese Natural Language Processing

Filed under: Chinese,Homographs,Natural Language Processing,Segmentation — Patrick Durusau @ 8:10 pm

Thinkudo Labs is posting a series on Chinese natural language processing.

I will be gathering those posts here for ease of reference.

Challenges of Chinese Natural Language Processing – Segmentation

Challenges of Chinese Natural Language Processing – Homograph
(If you are betting this was the post that caught my attention, you are right in one.)

You will need the assistance of a native Chinese speaker for serious Chinese language processing, but understanding some of the issues ahead of time won’t hurt.

March 6, 2012

Stanford – Delayed Classes – Enroll Now!

If you have been waiting for notices about the delayed Stanford courses for Spring 2012, your wait is over!

Even if you signed up for more information, you must register at the course webpage to take the course.

Details as I have them on 6 March 2012 (check course pages for official information):

Cryptography Starts March 12th.

Design and Analysis of Algorithms Part 1 Starts March 12th.

Game Theory Starts March 19th.

Natural Language Processing Starts March 12th.

Probabilistic Graphical Models Starts March 19th.

You may be asking yourself, “Are all these courses useful for topic maps?”

I would answer by pointing out that librarians and indexers rely on a broad knowledge of the world to make information more accessible to users.

By way of contrast, “big data” and Google have made it less accessible.

Something to think about while you are registering for one or more of these courses!

January 28, 2012

Mavuno: Hadoop-Based Text Mining Toolkit

Filed under: Mahout,Natural Language Processing — Patrick Durusau @ 10:54 pm

Mavuno: A Hadoop-Based Text Mining Toolkit

From the webpage:

Mavuno is an open source, modular, scalable text mining toolkit built upon Hadoop. It supports basic natural language processing tasks (e.g., part of speech tagging, chunking, parsing, named entity recognition), is capable of large-scale distributional similarity computations (e.g., synonym, paraphrase, and lexical variant mining), and has information extraction capabilities (e.g., instance and semantic relation mining). It can easily be adapted to new input formats and text mining tasks.

Just glancing at the documentation I am intrigued by the support for Java regular expressions. More on that this coming week.

I first saw this at myNoSQL.

January 14, 2012

Faster reading through math

Filed under: Data Mining,Natural Language Processing,Searching — Patrick Durusau @ 7:39 pm

Faster reading through math

John Johnson writes:

Let’s face it, there is a lot of content on the web, and one thing I hate worse is reading halfway through an article and realizing that the title and first paragraph indicate little about the rest of the article. In effect, I check out the quick content first (usually after a link), and am disappointed.

My strategy now is to use automatic summaries, which are now a lot more accessible than they used to be. The algorithm has been around since 1958 (!) by H. P. Luhn and is described in books such as Mining the Social Web by Matthew Russell (where a Python implementation is given). With a little work, you can create a program that scrapes text from a blog, provides short and long summaries, and links to the original post, and packages it up in a neat HTML page.

Or you can use the cute interface in Safari, if you care to switch.
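If you want to roll your own, here is a minimal sketch of a Luhn-style summarizer (my illustration, not the Mining the Social Web implementation): score sentences by the frequency of their significant words and keep the top few.

```python
# A minimal sketch of Luhn-style summarization: rank sentences by the average
# frequency of their non-stopword terms, then return the top sentences in
# their original order. The stopword list is a tiny placeholder.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that", "for"}

def summarize(text, n_sentences=3):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        tokens = [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    ranked = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return " ".join(s for s in sentences if s in ranked)
```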

The woes of ambiguity!

I jumped to John’s post thinking it had some clever way to read math faster. 😉 Some of the articles I am reading take a lot longer than others. I have one on homology that I am just getting comfortable enough with to post about.

Any reading assistant tools that you would care to recommend?

Of particular interest would be software that I could feed a list of URLs that resolve to PDF files (possibly with authentication, although I could log in to start it off) and have it produce a single HTML page summary.

December 29, 2011

International Journal on Natural Language Computing (IJNLC)

Filed under: Natural Language Processing — Patrick Durusau @ 9:17 pm

International Journal on Natural Language Computing (IJNLC)

Dates:

Submission deadline: 25 January 2012
Acceptance notification: 25 February 2012
Final manuscript due: 28 February 2012
Publication date: determined by the Editor-in-Chief

Natural Language Processing is a programmed approach to analyze text that is based on both a set of theories and a set of technologies. This forum aims to bring together researchers who have designed and build software that will analyze, understand, and generate languages that humans use naturally to address computers.

Apparently the first issue of a new journal on natural language processing. Every journal has a first issue, so please consider contributing something strong to this one.

December 24, 2011

Natural

Filed under: Lexical Analyzer,Natural Language Processing,node-js — Patrick Durusau @ 4:44 pm

Natural

From the webpage:

“Natural” is a general natural language facility for nodejs. Tokenizing, stemming, classification, phonetics, tf-idf, WordNet, and some inflection are currently supported.

It’s still in the early stages, and am very interested in bug reports, contributions and the like.

Note that many algorithms from Rob Ellis’s node-nltools are being merged in to this project and will be maintained here going forward.

At the moment most algorithms are English-specific but long-term some diversity is in order.

Aside from this README the only current documentation is here on my blog.

Just in case you are looking for natural language processing capabilities with nodejs.

December 13, 2011

Orev: The Apache OpenRelevance Viewer

Filed under: Crowd Sourcing,Natural Language Processing,Relevance — Patrick Durusau @ 9:50 pm

Orev: The Apache OpenRelevance Viewer

From the webpage:

The OpenRelevance project is an Apache project, aimed at making materials for doing relevance testing for information retrieval (IR), Machine Learning and Natural Language Processing (NLP). Think TREC, but open-source.

These materials require a lot of managing work and many human hours to be put into collecting corpora and topics, and then judging them. Without going into too many details here about the actual process, it essentially means crowd-sourcing a lot of work, and that is assuming the OpenRelevance project had the proper tools to offer the people recruited for the work.

Having no such tool, the Viewer – Orev – is meant for being exactly that, and so to minimize the overhead required from both the project managers and the people who will be doing the actual work. By providing nice and easy facilities to add new Topics and Corpora, and to feed documents into a corpus, it will make it very easy to manage the surrounding infrastructure. And with a nice web UI to be judging documents with, the work of the recruits is going to be very easy to grok.

It focuses on judging at the document level, but that is a common granularity for relevance work these days.

I don’t know of anything more granular but if you find such a tool, please sing out!

December 11, 2011

tokenising the visible english text of common crawl

Filed under: Cloud Computing,Dataset,Natural Language Processing — Patrick Durusau @ 10:20 pm

tokenising the visible english text of common crawl by Mat Kelcey.

From the post:

Common crawl is a publicly available 30TB web crawl taken between September 2009 and September 2010. As a small project I decided to extract and tokenise the visible text of the web pages in this dataset. All the code to do this is on github.

Well, 30TB of data, that certainly sounds like a small project. 😉

What small amount of data are you using for your next project?

November 29, 2011

Apache OpenNLP 1.5.2-incubating

Filed under: Natural Language Processing,OpenNLP — Patrick Durusau @ 9:08 pm

From the announcement of the release of Apache OpenNLP 1.5.2-incubating:

The Apache OpenNLP team is pleased to announce the release of version 1.5.2-incubating of Apache OpenNLP.

The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution.

The OpenNLP 1.5.2-incubating binary and source distributions are available for download from our download page: http://incubator.apache.org/opennlp/download.cgi

The OpenNLP library is distributed by Maven Central as well. See the Maven Dependency page for more details: http://incubator.apache.org/opennlp/maven-dependency.html

This release contains a couple of new features, improvements and bug fixes. The maxent trainer can now run in multiple threads to utilize multi-core CPUs, configurable feature generation was added to the name finder, the perceptron trainer was refactored and improved, machine learners can now be configured with much more options via a parameter file, evaluators can print out detailed evaluation information.

Additionally the release contains the following noteworthy changes:

  • Improved the white space handling in the Sentence Detector and its training code
  • Added more cross validator command line tools
  • Command line handling code has been refactored
  • Fixed problems with the new build
  • Now uses fast token class feature generation code by default
  • Added support for BioNLP/NLPBA 2004 shared task data
  • Removal of old and deprecated code
  • Dictionary case sensitivity support is now done properly
  • Support for OSGi

For a complete list of fixed bugs and improvements please see the RELEASE_NOTES file included in the distribution.

November 25, 2011

Natural Language Processing

Filed under: Natural Language Processing — Patrick Durusau @ 4:24 pm

Natural Language Processing with Christopher Manning and Dan Jurafsky.

From the webpage:

We are offering this course on Natural Language Processing free and online to students worldwide, January 23rd – March 18th 2012, continuing Stanford’s exciting forays into large scale online instruction. Students have access to screencast lecture videos, are given quiz questions, assignments and exams, receive regular feedback on progress, and can participate in a discussion forum. Those who successfully complete the course will receive a statement of accomplishment. Taught by Professors Jurafsky and Manning, the curriculum draws from Stanford’s courses in Natural Language Processing. You will need a decent internet connection for accessing course materials, but should be able to watch the videos on your smartphone.

Course Description

The course covers a broad range of topics in natural language processing, including word and sentence tokenization, text classification and sentiment analysis, spelling correction, information extraction, parsing, meaning extraction, and question answering. We will also introduce the underlying theory from probability, statistics, and machine learning that are crucial for the field, and cover fundamental algorithms like n-gram language modeling, naive bayes and maxent classifiers, sequence models like Hidden Markov Models, probabilistic dependency and constituent parsing, and vector-space models of meaning.

The class will start January 23 2012, and will last approximately 8 weeks.
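As a taste of the “fundamental algorithms” the course promises, here is a toy multinomial naive Bayes classifier with add-one smoothing. My illustration only, not course material.

```python
# A toy multinomial naive Bayes text classifier with add-one smoothing:
# pick the label maximizing log P(label) + sum of log P(word | label).
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, docs, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for doc, label in zip(docs, labels):
            for word in doc.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, doc):
        best_label, best_score = None, float("-inf")
        total_docs = sum(self.label_counts.values())
        for label in self.label_counts:
            score = math.log(self.label_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in doc.lower().split():
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

nb = NaiveBayes().fit(["great movie", "awful film", "great fun"], ["pos", "neg", "pos"])
print(nb.predict("great film"))  # "pos"
```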

If you don’t know any more about natural language processing at the end of March 2012 than you did at New Years, whose fault is that? 😉

November 7, 2011

Stanford NLP

Filed under: Natural Language Processing,Stanford NLP — Patrick Durusau @ 7:29 pm

Stanford NLP

Usually a reference to the Stanford NLP parser but I have put in the link to “The Stanford Natural Language Processing Group.”

From its webpage:

The Natural Language Processing Group at Stanford University is a team of faculty, research scientists, postdocs, programmers and students who work together on algorithms that allow computers to process and understand human languages. Our work ranges from basic research in computational linguistics to key applications in human language technology, and covers areas such as sentence understanding, machine translation, probabilistic parsing and tagging, biomedical information extraction, grammar induction, word sense disambiguation, and automatic question answering.

A distinguishing feature of the Stanford NLP Group is our effective combination of sophisticated and deep linguistic modeling and data analysis with innovative probabilistic and machine learning approaches to NLP. Our research has resulted in state-of-the-art technology for robust, broad-coverage natural-language processing in many languages. These technologies include our part-of-speech tagger, which currently has the best published performance in the world; a high performance probabilistic parser; a competition-winning biological named entity recognition system; and algorithms for processing Arabic, Chinese, and German text.

The Stanford NLP Group includes members of both the Linguistics Department and the Computer Science Department, and is affiliated with the Stanford AI Lab and the Stanford InfoLab.

Quick link to Stanford NLP Software page.

Using Lucene and Cascalog for Fast Text Processing at Scale

Filed under: Cascalog,Clojure,LingPipe,Lucene,Natural Language Processing,OpenNLP,Stanford NLP — Patrick Durusau @ 7:29 pm

Using Lucene and Cascalog for Fast Text Processing at Scale

From the post:

Here at Yieldbot we do a lot of text processing of analytics data. In order to accomplish this in a reasonable amount of time, we use Cascalog, a data processing and querying library for Hadoop; written in Clojure. Since Cascalog is Clojure, you can develop and test queries right inside of the Clojure REPL. This allows you to iteratively develop processing workflows with extreme speed. Because Cascalog queries are just Clojure code, you can access everything Clojure has to offer, without having to implement any domain specific APIs or interfaces for custom processing functions. When combined with Clojure’s awesome Java Interop, you can do quite complex things very simply and succinctly.

Many great Java libraries already exist for text processing, e.g., Lucene, OpenNLP, LingPipe, Stanford NLP. Using Cascalog allows you take advantage of these existing libraries with very little effort, leading to much shorter development cycles.

By way of example, I will show how easy it is to combine Lucene and Cascalog to do some (simple) text processing. You can find the entire code used in the examples over on Github.  

The world of text exploration just gets better all the time!

November 6, 2011

End-to-end NLP packages

Filed under: Natural Language Processing — Patrick Durusau @ 5:42 pm

End-to-end NLP packages

From the post:

What freely available end-to-end natural language processing (NLP) systems are out there, that start with raw text, and output parses and semantic structures? Lots of NLP research focuses on single tasks at a time, and thus produces software that does a single task at a time. But for various applications, it is nicer to have a full end-to-end system that just runs on whatever text you give it.

If you believe this is a worthwhile goal (see caveat at bottom), I will postulate there aren’t a ton of such end-to-end, multilevel systems. Here are ones I can think of. Corrections and clarifications welcome.

Brendan O’Connor provides a nice listing of end-to-end NLP packages. One or more may be useful in the creation of topic maps based on large amounts of textual data.

November 1, 2011

Natural Language Processing from Scratch

Filed under: Natural Language Processing,Neural Networks — Patrick Durusau @ 3:32 pm

Natural Language Processing from Scratch

From the post:

Ronan's masterpiece, "Natural Language Processing (Almost) from Scratch", has been published in JMLR. This paper describes how to use a unified neural network architecture to solve a collection of natural language processing tasks with near state-of-the-art accuracies and ridiculously fast processing speed. A couple thousand lines of C code processes English sentences at more than 10000 words per second and outputs part-of-speech tags, named entity tags, chunk boundaries, semantic role labeling tags, and, in the latest version, syntactic parse trees. Download SENNA!

This looks very cool! Check out the paper along with the software!

October 26, 2011

LingPipe and Text Processing Books

Filed under: Java,LingPipe,Natural Language Processing — Patrick Durusau @ 6:57 pm

LingPipe and Text Processing Books

From the website:

We’ve decided to split what used to be the monolithic LingPipe book in two. As they’re written, we’ll be putting up drafts here.

NLP with LingPipe

You can download the PDF of the LingPipe book here:

Carpenter, Bob and Breck Baldwin. 2011. Natural Language Processing with LingPipe 4. Draft 0.5. June 2011. [Download: lingpipe-book-0.5.pdf]

Text Processing with Java

The PDF of the book on text in Java is here:

Carpenter, Bob, Mitzi Morris, and Breck Baldwin. 2011. Text Processing with Java 6. Draft 0.5. June 2011. [Download: java-text-book-0.5.pdf]

The pages are 7 inches by 10 inches, so if you print, you have the choice of large margins (no scaling) or large print (print fit to page).

Source code is also available.

October 25, 2011

collocations in wikipedia, part 1

Filed under: Collocation,Natural Language Processing — Patrick Durusau @ 7:34 pm

collocations in wikipedia, part 1

From the post:

collocations are combinations of terms that occur together more frequently than you’d expect by chance.

they can include

  • proper noun phrases like ‘Darth Vader’
  • stock/colloquial phrases like ‘flora and fauna’ or ‘old as the hills’
  • common adjectives/noun pairs (notice how ‘strong coffee’ sounds ok but ‘powerful coffee’ doesn’t?)

let’s go through a couple of techniques for finding collocations taken from the exceptional nlp text “foundations of statistical natural language processing” by manning and schutze.
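As a teaser, here is a minimal sketch of one such measure, pointwise mutual information over bigrams. My illustration, assuming a plain-text corpus rather than the Wikipedia dump the series works from.

```python
# A minimal sketch: score bigrams by pointwise mutual information, one of the
# collocation measures discussed in Manning & Schutze. Low-frequency bigrams
# are filtered out, since PMI badly overrates rare pairs.
import math
import re
from collections import Counter

def pmi_collocations(text, min_count=5, top_n=20):
    tokens = re.findall(r"[a-z']+", text.lower())
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        pmi = math.log2((count / n) / ((unigrams[w1] / n) * (unigrams[w2] / n)))
        scored.append((pmi, w1, w2, count))
    return sorted(scored, reverse=True)[:top_n]
```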

Looks like the start of a very interesting series on (statistical) collocation in Wikipedia, which is a serious data set for training purposes.

BTW, don’t miss the homepage. Lots of interesting resources.


Update: 18 November 2011

See also:

collocations in wikipedia, part 2

finding phrases with mutual information [collocations, part 3]

I am making a separate blog post on parts 2 and 3 but just in case you come here first…. Enjoy!

October 22, 2011

Mining Wikipedia with Hadoop and Pig for Natural Language Processing

Filed under: Hadoop,Natural Language Processing,Pig — Patrick Durusau @ 3:16 pm

Mining Wikipedia with Hadoop and Pig for Natural Language Processing

One problem with after-the-fact assignment of semantics to text is that the volume of text involved (usually) is too great for manual annotation.

This post walks you through the alternative of using automated annotation based upon Wikipedia content.

From the post:

Instead of manually annotating text, one should try to benefit from an existing annotated and publicly available text corpus that deals with a wide range of topics, namely Wikipedia.

Our approach is rather simple: the text body of Wikipedia articles is rich in internal links pointing to other Wikipedia articles. Some of those articles are referring to the entity classes we are interested in (e.g. person, countries, cities, …). Hence we just need to find a way to convert those links into entity class annotations on text sentences (without the Wikimarkup formatting syntax).

This is also an opportunity to try out cloud based computing if you are so inclined.

October 15, 2011

RadioVision: FMA Melds w Echo Nest’s Musical Brain

Filed under: Data Mining,Machine Learning,Natural Language Processing — Patrick Durusau @ 4:28 pm

RadioVision: FMA Melds w Echo Nest’s Musical Brain

From the post:

The Echo Nest has indexed the Free Music Archive catalog, integrating the most incredible music intelligence platform with the finest collection of free music.

The Echo Nest has been called “the most important music company on Earth” for good reason: 12 years of research at UC Berkeley, Columbia and MIT factored into the development of their “musical brain.” The platform combines large-scale data mining, natural language processing, acoustic analysis and machine learning to automatically understand how the online world describes every artist, extract musical attributes like tempo and time signature, learn about music trends (see: “hotttnesss“), and a whole lot more. Echo Nest then shares all of this data through a free and open API. [read more here]

Add music to your topic map!
