Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

September 15, 2011

Statistical machine learning for text classification

Filed under: Natural Language Processing,NLTK,Python — Patrick Durusau @ 7:51 pm

Statistical machine learning for text classification with scikit-learn and NLTK by Olivier Grisel. (PyCon 2011)

The goal of this talk is to give a state-of-the-art overview of machine learning algorithms applied to text classification tasks ranging from language and topic detection in tweets and web pages to sentiment analysis in consumer product reviews.

The first third is a review of basic NLP, followed by a review of the basic functions of scikit-learn, and the same for NLTK. It also covers, briefly, the Google Prediction API.

It compares all three on the movie review database, and discusses analysis of newsgroups (for topics) and identifying the language of webpages.

I would not say “state-of-the-art” as much as “an intro to text classification and its potential.”
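If you want to try the scikit-learn side of the talk yourself, here is a minimal sketch of the pipeline style it demonstrates: bag-of-words features feeding a linear classifier. The toy corpus and labels are mine, not from the talk, which uses the movie review dataset.

```python
# A toy text classification pipeline: TF-IDF features + linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

train_texts = ["a gripping, well acted film", "dull, overlong and predictable"]
train_labels = ["pos", "neg"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # unigram + bigram features
    ("svm", SGDClassifier(loss="hinge")),            # linear SVM trained by SGD
])
clf.fit(train_texts, train_labels)
print(clf.predict(["surprisingly good acting"]))
```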

August 17, 2011

Recent Advances in Literature Based Discovery

Recent Advances in Literature Based Discovery

Abstract:

Literature Based Discovery (LBD) is a process that searches for hidden and important connections among information embedded in published literature. Employing techniques from Information Retrieval and Natural Language Processing, LBD has potential for widespread application yet is currently implemented primarily in the medical domain. This article examines several published LBD systems, comparing their descriptions of domain and input data, techniques to locate important concepts from text, models of discovery, experimental results, visualizations, and evaluation of the results. Since there is no comprehensive “gold standard,” or consistent formal evaluation methodology for LBD systems, the development and usage of effective metrics for such systems is also discussed, providing several options. Also, since LBD is currently often time-intensive, requiring human input at one or more points, a fully-automated system will enhance the efficiency of the process. Therefore, this article considers methods for automated systems based on data mining.

Not “recent” now, since the paper dates from 2006, but it is a good overview of Literature Based Discovery (LBD) at the time.

July 3, 2011

SwiftRiver/Ushahidi

Filed under: Filters,Linguistics,Natural Language Processing,NoSQL,Python — Patrick Durusau @ 7:34 pm

SwiftRiver

From the Get Started page:

The mission of the SwiftRiver initiative is to democratize access to the tools used to make sense of data.

To achieve this goal we’ve taken two approaches, apps and APIs. Apps are user facing and should be tools that are easy to understand, deploy and use. APIs are machine facing and extract meta-context that other machines (apps) use to convey information to the end user.

SwiftRiver is an opensource platform that aims to allow users to do three things well: 1) structure unstructured data feeds, 2) filter and prioritize information conditionally and 3) add context to content. Doing these things well allows users to pull in real-time content from Twitter, SMS, Email or the Web and to make sense of data on the fly.

The Ushahidi logo at the top will take you to a common wiki for Ushahidi and SwiftRiver.

The Ushahidi link in the text takes you to Ushahidi:

We are a non-profit tech company that develops free and open source software for information collection, visualization and interactive mapping.

Home of:

  • Ushahidi Platform: We built the Ushahidi platform as a tool to easily crowdsource information using multiple channels, including SMS, email, Twitter and the web.
  • SwiftRiver: SwiftRiver is an open source platform that aims to democratize access to tools for filtering & making sense of real-time information.
  • Crowdmap: When you need to get the Ushahidi platform up in 2 minutes to crowdsource information, Crowdmap will do it for you. It’s our hosted version of the Ushahidi platform.
It occurs to me that mapping email feeds would fit right into my example in Marketing What Users Want…And An Example.

    June 17, 2011

    Natural Language Processing with Hadoop and Python

    Filed under: Hadoop,Natural Language Processing,Python — Patrick Durusau @ 7:19 pm

    Natural Language Processing with Hadoop and Python

    From the post:

    If you listen to analysts talk about complex data, they all agree, it’s growing, and faster than anything else before. Complex data can mean a lot of things, but to our research group, ever increasing volumes of naturally occurring human text and speech—from blogs to YouTube videos—enable new and novel questions for Natural Language Processing (NLP). The dominating characteristic of these new questions involves making sense of lots of data in different forms, and extracting useful insights.

Now that I think about it, a lot of the input from various intelligence operations consists of “naturally occurring human text and speech….” Anyone can crunch lots of text/speech; the question is being a good enough analyst to extract something useful.
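For the curious, the Hadoop Streaming pattern behind posts like this one is plain Python scripts that read stdin and write stdout. A minimal token-count sketch (the file names are mine):

```python
#!/usr/bin/env python
# mapper.py: emit one (token, 1) pair per whitespace-separated token.
import sys

for line in sys.stdin:
    for token in line.lower().split():
        print("%s\t%d" % (token, 1))
```

```python
#!/usr/bin/env python
# reducer.py: sum counts per token; Hadoop delivers mapper output
# sorted by key, so equal tokens arrive consecutively.
import sys

current, count = None, 0
for line in sys.stdin:
    token, n = line.rsplit("\t", 1)
    if token != current:
        if current is not None:
            print("%s\t%d" % (current, count))
        current, count = token, 0
    count += int(n)
if current is not None:
    print("%s\t%d" % (current, count))
```

Hadoop would run the pair with the streaming jar, along the lines of hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py plus -input and -output paths.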

    May 19, 2011

    Kill Math

    Filed under: Interface Research/Design,Language,Natural Language Processing,Semantics — Patrick Durusau @ 3:27 pm

    Kill Math

    Bret Victor writes:

    The power to understand and predict the quantities of the world should not be restricted to those with a freakish knack for manipulating abstract symbols.

    When most people speak of Math, what they have in mind is more its mechanism than its essence. This “Math” consists of assigning meaning to a set of symbols, blindly shuffling around these symbols according to arcane rules, and then interpreting a meaning from the shuffled result. The process is not unlike casting lots.

    This mechanism of math evolved for a reason: it was the most efficient means of modeling quantitative systems given the constraints of pencil and paper. Unfortunately, most people are not comfortable with bundling up meaning into abstract symbols and making them dance. Thus, the power of math beyond arithmetic is generally reserved for a clergy of scientists and engineers (many of whom struggle with symbolic abstractions more than they’ll actually admit).

    We are no longer constrained by pencil and paper. The symbolic shuffle should no longer be taken for granted as the fundamental mechanism for understanding quantity and change. Math needs a new interface.

A deeply interesting post that argues that Math needs a new interface, one more accessible to more people, since computers can present mathematical concepts and operations in visual representations.

It is ironic that the same computers gave rise to impoverished and difficult to use (for most people) representations of semantics, moving away from the widely adopted, easy to use and flexible representations of semantics in natural languages.

    Do we need an old interface for semantics?

    May 5, 2011

    The 5th International Joint Conference on Natural Language Processing (IJCNLP2011)

    Filed under: Conferences,Natural Language Processing — Patrick Durusau @ 1:50 pm

    The 5th International Joint Conference on Natural Language Processing (IJCNLP2011)

    May 20, 2011, submission deadline

    From the announcement:

    The 5th International Joint Conference on Natural Language Processing, organized by the Asian Federation of Natural Language Processing will be held in Chiang Mai, Thailand on November 8-13, 2011. The conference will cover a broad spectrum of technical areas related to natural language and computation. IJCNLP 2011 will include full papers, short papers, oral presentations, poster presentations, demonstrations, tutorials, and workshops.

    April 30, 2011

    CS 533: Natural Language Processing

    Filed under: Legal Informatics,Natural Language Processing — Patrick Durusau @ 10:20 am

    CS 533: Natural Language Processing

Don’t be misled by the title! This isn’t just another NLP course.

    From the Legal Informatics Blog:

    Professor Dr. L. Thorne McCarty of the Rutgers University Department of Computer Science has posted lecture videos and other materials in connection with his recent graduate course on Natural Language Processing. The course uses examples from a judicial decision: Carter v. Exxon Company USA, 177 F.3d 197 (3d Cir. 1999).

    From Professor McCarty’s post on LinkedIn:

    To access most of the files, you will need the username: cs533 and the password: shrdlu. To access the videos, use the same password: shrdlu. Comments are welcome!

    NLP and legal materials? Now there is a winning combination!

    I must confess that years of practicing law and even now continuing to read legal materials in some areas may be influencing my estimate of the appeal of this course. 😉 I will watch the lectures and get back to you.

    Natural Language Processing for the Working Programmer

    Filed under: Haskell,Natural Language Processing — Patrick Durusau @ 10:18 am

    Natural Language Processing for the Working Programmer

    Daniël de Kok and Harm Brouwer have started a book on natural language processing using Haskell.

    Functional programming meets NLP!

    A work in progress so I am sure the authors would appreciate comments, suggestions, etc.

BTW, there is a blog, Try Haskell in Linguistics, with posts working through the book.

Artificial Intelligence | Natural Language Processing – Videos

    Filed under: Artificial Intelligence,Natural Language Processing — Patrick Durusau @ 10:17 am

    I first blogged about Christopher Manning’s Artificial Intelligence | Natural Language Processing back in February, 2011.

Videos of the lectures are now online and I thought that merited a separate update to that entry.

    I have also edited that entry to point to the videos.

    Enjoy!

    April 24, 2011

    It’s All Semantic With the New Text-Processing API

    Filed under: Natural Language Processing,Text Analytics — Patrick Durusau @ 5:34 pm

    It’s All Semantic With the New Text-Processing API

    From the post at ProgrammableWeb:

    Now I don’t have a master’s degree in Natural language processing, and you just might need one to get your hands dirty with this API. I see the text-processing.com API as offering a mid-level utility for incorporation in a web app. You might take text samples from your source, feed them through the Text-Processing API and analyze those results a bit further before presenting anything to your user.

    This offering appears to be the result of a one man effort. Jacob Perkins designed his API as RESTful with JSON responses. It’s free and open for the meantime, but it sounds like Perkins may polish the service a bit and start charging for access. There could be a real market here since only a handful of the 58 semantic APIs in our directory offer results at the technical level.

    Interesting and not all that surprising.

    Could be a good way to see if you are interested in going further with natural language processing.
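If you do want to get your hands dirty, a sketch of calling the API from Python follows. The /api/sentiment/ endpoint and the “text” form field follow the text-processing.com documentation as I read it; verify both before relying on them.

```python
# A hedged sketch of a sentiment call against the Text-Processing API.
import requests

resp = requests.post("http://text-processing.com/api/sentiment/",
                     data={"text": "This movie was surprisingly good."})
resp.raise_for_status()
result = resp.json()  # e.g. {"label": "pos", "probability": {...}}
print(result["label"])
```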

    April 5, 2011

    The role of Natural Language Processing in Information Retrieval: Searching for Meaning in Text

    Filed under: Information Retrieval,Natural Language Processing — Patrick Durusau @ 4:29 pm

    The role of Natural Language Processing in Information Retrieval: Searching for Meaning in Text by Tony Russell-Rose.

    Abstract:

Here are the slides from the talk I gave at City University last week, as a guest lecture to their Information Science MSc students. It’s based on the chapter of the same name which I co-authored with Mark Stevenson of Sheffield University and appears in the book called “Information Retrieval: Searching in the 21st Century”. The session was scheduled for 3 hours, and to my amazement, required all of that (thanks largely to an enthusiastic group who asked lots of questions). And no, I didn’t present 3 hours of Powerpoint – the material was punctuated with practical exercises and demos to illustrate the learning points and allow people to explore the key concepts for themselves. These exercises aren’t included in the Slideshare version, but I am happy to make them available to folks who want to enjoy the full experience.

    If you don’t look at another presentation slide deck this week, do yourself a favor and look at this one. Very well done.

I’m going to write and ask for the exercises. Comments to follow.

    March 14, 2011

    Graph-based Algorithms….

    Filed under: Graphs,Information Retrieval,Natural Language Processing — Patrick Durusau @ 7:50 am

    Graph-based Algorithms for Information Retrieval and Natural Language Processing

    Tutorial at HLT/NAACL 2006 (June 4, 2006)

    Rada Mihalcea and Dragomir Radev

    From the slides:

    • Motivation
      • Graph-theory is a well studied discipline
      • So are the fields of Information Retrieval (IR) and Natural Language Processing (NLP)
      • Often perceived as completely different disciplines
• Goal of the tutorial: provide an overview of methods and applications in IR and NLP that rely on graph-based algorithms, e.g.
      • Graph-based algorithms: graph traversal, min-cut algorithms, random walks
      • Applied to: Web search, text understanding, text summarization, keyword extraction, text clustering

    Nice introduction to graph-theory and why we should care. A lot.
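To give the flavor of the random walk methods the tutorial covers, here is a minimal PageRank-style sketch over a toy word graph. The graph, damping factor, and iteration count are all illustrative, not from the tutorial.

```python
# Unnormalized PageRank as used in graph-based keyword ranking:
# repeatedly redistribute scores along edges until they settle.
graph = {
    "graph": ["theory", "algorithms"],
    "theory": ["graph"],
    "algorithms": ["graph", "text"],
    "text": ["algorithms"],
}

def rank(graph, damping=0.85, iterations=50):
    scores = {node: 1.0 for node in graph}
    for _ in range(iterations):
        new = {}
        for node in graph:
            # sum of scores flowing in from nodes that link to this one
            incoming = sum(scores[m] / len(graph[m])
                           for m in graph if node in graph[m])
            new[node] = (1 - damping) + damping * incoming
        scores = new
    return scores

print(sorted(rank(graph).items(), key=lambda kv: -kv[1]))
```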

    TextGraphs-6: Graph-based Methods for Natural Language Processing

    Filed under: Conferences,Graphs,Natural Language Processing — Patrick Durusau @ 7:47 am

    TextGraphs-6: Graph-based Methods for Natural Language Processing

    From the website:

TextGraphs is at its SIXTH edition! This shows that two seemingly distinct disciplines, graph theoretic models and computational linguistics, are in fact intimately connected, with a large variety of Natural Language Processing (NLP) applications adopting efficient and elegant solutions from the graph-theoretical framework. The TextGraphs workshop series addresses a broad spectrum of research areas and brings together specialists working on graph-based models and algorithms for NLP and computational linguistics, as well as on the theoretical foundations of related graph-based methods. This workshop series is aimed at fostering an exchange of ideas by facilitating a discussion about both the techniques and the theoretical justification of the empirical results among the NLP community members.

    Special Theme: “Graphs in Structured Input/Output Learning”

    Recent work in machine learning has provided interesting approaches to globally represent and process structures, e.g.:

    • graphical models, which encode observations, labels and their dependencies as nodes and edges of graphs
• kernel-based machines, which can encode graphs with structural kernels in the learning algorithms
• SVM-struct and other max-margin methods, and the structured perceptron, which allow for outputting entire structures, for example graphs

    Important dates:

    April 1, 2011 Submission deadline
    April 25th, 2011 Notification of acceptance
    May 6th, 2011 Camera-ready copies due
June 23rd, 2011 Textgraphs workshop at ACL-HLT 2011

    As if Neo4J and Gremlin weren’t enough of an incentive to be interested in graph approaches. 😉

    Sixth International Conference on Knowledge Capture – K-Cap 2011

    Sixth International Conference on Knowledge Capture – K-Cap 2011

    From the website:

In today’s knowledge-driven world, effective access to and use of information is a key enabler for progress. Modern technologies not only are themselves knowledge-intensive technologies, but also produce enormous amounts of new information that we must process and aggregate. These technologies require knowledge capture, which involves the extraction of useful knowledge from vast and diverse sources of information as well as its acquisition directly from users. Driven by the demands for knowledge-based applications and the unprecedented availability of information on the Web, the study of knowledge capture has a renewed importance.

    Researchers that work in the area of knowledge capture traditionally belong to several distinct research communities, including knowledge engineering, machine learning, natural language processing, human-computer interaction, artificial intelligence, social networks and the Semantic Web. K-CAP 2011 will provide a forum that brings together members of disparate research communities that are interested in efficiently capturing knowledge from a variety of sources and in creating representations that can be useful for reasoning, analysis, and other forms of machine processing. We solicit high-quality research papers for publication and presentation at our conference. Our aim is to promote multidisciplinary research that could lead to a new generation of tools and methodologies for knowledge capture.

    Conference:

    25 – 29 June 2011
    Banff Conference Centre
    Banff, Alberta, Canada

    Call for papers has closed. Will try to post a note about the conference earlier next year.

    Proceedings from previous conferences available through the ACM Digital Library – Knowledge Capture.

Let me know if you have trouble with the ACM link. I sometimes don’t get removal of all the tracking cruft off of URLs correct. There really should be a “clean” URL option for sites like the ACM.

    March 4, 2011

    Metaoptimize Q+A

    Metaoptimize Q+A is one of the Q/A sites I just stumbled across.

    From the website:

    A community of scientists interested in machine learning, natural language processing, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization, as well as adjacent topics.

    Looks like an interesting place to hang out.

    February 13, 2011

    How It Works – The “Musical Brain”

    Filed under: Data Source,Music Retrieval,Natural Language Processing — Patrick Durusau @ 1:46 pm

    How It Works – The “Musical Brain”

    I found this following the links in the Million Song Dataset post.

    One aspect, among others, that I found interesting, was the support for multiple ID spaces.

I am curious about the claim that it works by:

    Analyzing every song on the web to extract key, tempo, rhythm and timbre and other attributes — understanding every song in the same way a musician would describe it

    Leaving aside the ambitious claims about NLP processing made elsewhere on that page, I find it curious that there is a uniform method for describing music.

    Or perhaps they mean that the “Musical Brain” uses only one description uniformly across the music it evaluates. I can buy that. And it could well be a useful exercise.

At least from the perspective of generating raw data that could then be mapped to other nomenclatures used by musicians.

I wonder if the Rolling Stone uses the same nomenclature as the “Musical Brain”? Will have to check.

    Suggestions for other music description languages? Mappings to the one used by the “Musical Brain?”

    BTW, before I forget, the “Musical Brain” offers a free API (for non-commercial use) to its data.

    Would appreciate hearing about your experiences with the API.

    February 11, 2011

    MILK: Machine Learning in Python

    Filed under: Natural Language Processing,Software — Patrick Durusau @ 1:12 pm

    MILK: Machine Learning in Python

    From the website:

    Milk is a machine learning toolkit in Python.

    Its focus is on supervised classification with several classifiers available: SVMs (based on libsvm), k-NN, random forests, decision trees. It also performs feature selection. These classifiers can be combined in many ways to form different classification systems.

    For unsupervised learning, milk supports k-means clustering and affinity propagation.

Milk is flexible about its inputs. It is optimised for numpy arrays, but can often handle anything (for example, for SVMs, you can use any datatype and any kernel and it does the right thing).

There is a strong emphasis on speed and low memory usage. Therefore, most of the performance sensitive code is in C++, behind Python-based interfaces for convenience.

    Another NLP tool for your topic map construction toolkit.

    I need to work on creating a listing for such tools by features and capacity, to make it easier to find the tool necessary for some particular project.
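In the meantime, a minimal sketch of milk in use, following the example in its documentation (treat the call names as assumptions if your version differs):

```python
# Train milk's default classifier on random toy data.
import numpy as np
import milk

features = np.random.rand(100, 10)  # 100 examples, 10 features each
labels = np.zeros(100)
labels[50:] = 1                     # two classes

learner = milk.defaultclassifier()      # per the docs: feature selection + SVM
model = learner.train(features, labels)
print(model.apply(np.random.rand(10)))  # classify one new example
```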

    February 7, 2011

    KEA: keyphrase extraction algorithm

    Filed under: Entity Extraction,Natural Language Processing — Patrick Durusau @ 7:59 am

    KEA: keyphrase extraction algorithm

    From the website:

    Keywords and keyphrases (multi-word units) are widely used in large document collections. They describe the content of single documents and provide a kind of semantic metadata that is useful for a wide variety of purposes. The task of assigning keyphrases to a document is called keyphrase indexing. For example, academic papers are often accompanied by a set of keyphrases freely chosen by the author. In libraries professional indexers select keyphrases from a controlled vocabulary (also called Subject Headings) according to defined cataloguing rules. On the Internet, digital libraries, or any depositories of data (flickr, del.icio.us, blog articles etc.) also use keyphrases (or here called content tags or content labels) to organize and provide a thematic access to their data.

    KEA is an algorithm for extracting keyphrases from text documents. It can be either used for free indexing or for indexing with a controlled vocabulary.

    Given the indexing roots of topic maps, this software is definitely a contender for use in topic map construction.
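KEA itself is Java and couples candidate extraction with a trained Naive Bayes model, but the core intuition is easy to sketch: generate n-gram candidates and score them, here by a simple TF×IDF. Everything below is my illustration, not KEA code.

```python
# A crude keyphrase ranker: n-gram candidates scored by TF x IDF.
import math
from collections import Counter

def candidates(text, max_len=3):
    """Yield every 1- to max_len-word phrase in the text."""
    words = text.lower().split()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

def keyphrases(doc, corpus, top=5):
    """Rank candidate phrases in doc by TF x IDF over the corpus."""
    tf = Counter(candidates(doc))
    def idf(phrase):
        containing = sum(1 for d in corpus if phrase in d.lower())
        return math.log(len(corpus) / (1 + containing))
    scores = {p: tf[p] * idf(p) for p in tf}
    return sorted(scores, key=scores.get, reverse=True)[:top]

docs = ["keyphrase indexing assigns keyphrases to documents",
        "professional indexers select keyphrases from a controlled vocabulary",
        "topic maps merge information about subjects"]
print(keyphrases(docs[0], docs))
```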

    Weka – Data Mining

    Filed under: Mahout,Natural Language Processing — Patrick Durusau @ 7:10 am

    Weka

    From the website:

    Weka 3: Data Mining Software in Java

    Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.

    I would say it is under active development/use since the mailing list archives have an average of about 315 posts per month.

Yes, approximately 315 posts per month.

    Another tool for your topic map toolbox!

    GATE: General Architecture for Text Engineering

    Filed under: Entity Extraction,Natural Language Processing — Patrick Durusau @ 7:06 am

    GATE: General Architecture for Text Engineering

    From the website:

    GATE is…

    • open source software capable of solving almost any text processing problem
    • a mature and extensive community of developers, users, educators, students and scientists
    • a defined and repeatable process for creating robust and maintainable text processing workflows
    • in active use for all sorts of language processing tasks and applications, including: voice of the customer; cancer research; drug research; decision support; recruitment; web mining; information extraction; semantic annotation
• the result of a multi-million euro R&D programme running since 1995, funded by commercial users, the EC, BBSRC, EPSRC, AHRC, JISC, etc.
    • used by corporations, SMEs, research labs and Universities worldwide
    • the Eclipse of Natural Language Engineering, the Lucene of Information Extraction, the ISO 9001 of Text Mining
    • a world-class team of language processing developers

    If you need to solve a problem with text analysis or human language processing you’re in the right place.

    I suppose there is something to be said for an abundance of confidence. 😉

    Seriously, this is a very complex and impressive effort.

    I will be covering specific tools and aspects of this effort as they relate to topic maps.

    February 4, 2011

    Artificial Intelligence | Natural Language Processing (Topic Maps by Problem Solving)

    Filed under: Natural Language Processing,Topic Maps — Patrick Durusau @ 10:27 am

    Artificial Intelligence | Natural Language Processing Stanford course with Christopher D. Manning.

    From the website:

This course is designed to introduce students to the fundamental concepts and ideas in natural language processing (NLP), and to get them up to speed with current research in the area. It develops an in-depth understanding of both the algorithms available for the processing of linguistic information and the underlying computational properties of natural languages. Word-level, syntactic, and semantic processing from both a linguistic and an algorithmic perspective are considered. The focus is on modern quantitative techniques in NLP: using large corpora, statistical models for acquisition, disambiguation, and parsing. Also, it examines and constructs representative systems.

Only the lecture notes, quizzes, etc. are available. Update: 29 April 2011 – Lecture notes, quizzes, and videos of the lectures are online.

    Still, quite an interesting resource.

    I am particularly interested in Manning’s approach of not building the class around an edifice to be mastered but rather around problems to be solved.

As primarily a theorist, I find that rather disturbing, but at the same time, strangely attractive.

I wonder what a topic map class would look like that started with two or even three related but distinct data sets?

The sort of data sets that lead to topic maps, walking through what problems we want to solve and unfolding topic maps along the way.

    Would be an opportunity to use other software, indexing software for example, to see how they compare with topic maps or can be used in their construction.

    Thoughts, suggestions, comments?

    February 3, 2011

    MALLET: MAchine Learning for LanguagE Toolkit
    Topic Map Competition (TMC) Contender?

    MALLET: MAchine Learning for LanguagE Toolkit

    From the website:

    MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

    MALLET includes sophisticated tools for document classification: efficient routines for converting text to “features”, a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several commonly used metrics.

    In addition to classification, MALLET includes tools for sequence tagging for applications such as named-entity extraction from text. Algorithms include Hidden Markov Models, Maximum Entropy Markov Models, and Conditional Random Fields. These methods are implemented in an extensible system for finite state transducers.

    Topic models are useful for analyzing large collections of unlabeled text. The MALLET topic modeling toolkit contains efficient, sampling-based implementations of Latent Dirichlet Allocation, Pachinko Allocation, and Hierarchical LDA.

    Many of the algorithms in MALLET depend on numerical optimization. MALLET includes an efficient implementation of Limited Memory BFGS, among many other optimization methods.

    In addition to sophisticated Machine Learning applications, MALLET includes routines for transforming text documents into numerical representations that can then be processed efficiently. This process is implemented through a flexible system of “pipes”, which handle distinct tasks such as tokenizing strings, removing stopwords, and converting sequences into count vectors.

    An add-on package to MALLET, called GRMM, contains support for inference in general graphical models, and training of CRFs with arbitrary graphical structure.
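MALLET is Java, but the “pipes” idea described above is worth a sketch in Python: each stage transforms the data and the stages compose into a single pipeline. The stages below are my toy examples, not MALLET’s.

```python
# A toy "pipes" pipeline: tokenize -> drop stopwords -> count vectors.
from collections import Counter

STOPWORDS = {"the", "a", "and", "of"}

def tokenize(text):
    return text.lower().split()

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def to_counts(tokens):
    return Counter(tokens)

def run_pipes(data, pipes):
    for pipe in pipes:  # each pipe transforms the previous output
        data = pipe(data)
    return data

print(run_pipes("The cat and the hat", [tokenize, remove_stopwords, to_counts]))
```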

    Another tool to assist in the authoring of a topic map from a large data set.

It would be interesting, though beyond the scope of the topic maps class, to organize a competition around several of the natural language processing packages.

There would be a common data set, to be released on X date, with topic maps due, say, within 24 hours (there is a TV show with that in the title, or so I am told).

    Will have to give that some thought.

    Could be both interesting and entertaining.

    February 2, 2011

Mapping Wikileaks’ Cablegate using Python, mongoDB and Gephi – Saturday, 5 February 2011

    Filed under: Gephi,MongoDB,Natural Language Processing — Patrick Durusau @ 10:34 am

    Mapping Wikileaks’ Cablegate using Python, mongoDB and Gephi

    From the website:

    Text analysis and graph visualization on the Wikileaks Cablegate dataset.

We propose to present a complete work-flow of textual data analysis, from acquisition to visual exploration of a complex network. Through the presentation of a simple software tool specifically developed for this talk, we will cover a set of productive and widely used software packages and libraries in text analysis, then introduce some features of Gephi, an open-source network visualization & analysis software, using the data collected and transformed with cablegate-semnet.

    See: cablegate-semnet
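A hedged sketch of the kind of work-flow being presented: read documents from mongoDB, build a co-occurrence network, and export GEXF for Gephi. The database, collection, and field names are my assumptions, not the cablegate-semnet schema.

```python
# Build a tag co-occurrence graph from mongoDB documents and save as GEXF.
import itertools
import networkx as nx
from pymongo import MongoClient

docs = MongoClient()["cablegate"]["cables"]  # hypothetical db/collection names
G = nx.Graph()
for doc in docs.find({}, {"tags": 1}):
    # every pair of tags appearing in the same document becomes an edge
    for a, b in itertools.combinations(sorted(doc.get("tags", [])), 2):
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)

nx.write_gexf(G, "cablegate.gexf")  # open this file in Gephi
```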

    If you are in (or can be) Brussels, Belgium this coming Saturday and Sunday, don’t miss this presentation!

    There will be many others worthy of your attention as well.

    January 28, 2011

    NLP (Natural Language Processing) tools

    Filed under: Authoring Topic Maps,Natural Language Processing,Topic Models — Patrick Durusau @ 7:50 am

    Statistical natural language processing and corpus-based computational linguistics: An annotated list of resources

    From Stanford University.

It may not list every NLP resource but it is the place to start if you are looking for a new tool.

    This should give you an idea of the range of tools that could be applied to the AF war diaries for example.

    January 7, 2011

    openNLP

    Filed under: Data Mining,Natural Language Processing — Patrick Durusau @ 2:13 pm

    openNLP

    From the website:

    OpenNLP is an organizational center for open source projects related to natural language processing. Its primary role is to encourage and facilitate the collaboration of researchers and developers on such projects.

    OpenNLP also hosts a variety of java-based NLP tools which perform sentence detection, tokenization, pos-tagging, chunking and parsing, named-entity detection, and coreference using the OpenNLP Maxent machine learning package.

    OpenNLP is incubating at the Apache Software Foundation (ASF).

    Another set of NLP tools for topic map authors.

    January 3, 2011

    Processing Tweets with LingPipe #3: Near duplicate detection and evaluation – Post

    Filed under: Duplicates,Natural Language Processing,Similarity,String Matching — Patrick Durusau @ 3:03 pm

    Processing Tweets with LingPipe #3: Near duplicate detection and evaluation

    Good coverage of tokenization of tweets and the use of the Jaccard Distance measure to determine similarity.

    Of course, for a topic map, similarity may not lead to being discarded but trigger other operations instead.
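The core of the technique fits in a few lines. A minimal sketch (the whitespace tokenization and the threshold are illustrative; the post does both with more care):

```python
# Flag tweet pairs whose Jaccard distance falls below a threshold.
def jaccard_distance(a, b):
    """One minus intersection-over-union of the token sets."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

t1 = "free iphone click here now".split()
t2 = "free iphone click here".split()
if jaccard_distance(t1, t2) < 0.3:  # threshold is illustrative
    print("near duplicates")
```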

    December 31, 2010

    ACL Anthology: A Digital Archive of Research Papers in Computational Linguistics

    Filed under: Natural Language Processing — Patrick Durusau @ 5:35 am

    ACL Anthology: A Digital Archive of Research Papers in Computational Linguistics

    Association for Computational Linguistics collection of over 19,100 (as of 2010-12-31) papers on computational linguistics.

    Questions:

    Assume that you have this resource, CiteSeer, DBLP and several others.

    1. If copying of the underlying data isn’t possible/feasible, how would you search across these resources?
2. What steps would you take to gather at least similar if not the same materials together? (Hint: The correct answer is not “use a topic map.” Or at least it isn’t a complete answer. What techniques would you use with a topic map?)
    3. What steps would you take to make improvements to mappings between resources or concepts available to other users?

    RANLP 2011: Recent Advances In Natural Language Processing

    Filed under: Conferences,Natural Language Processing — Patrick Durusau @ 4:27 am

RANLP 2011: Recent Advances In Natural Language Processing, Augusta SPA Hotel, Hissar, Bulgaria, September 10-16, 2011

    Call for Papers:

    We invite papers reporting on recent advances in all aspects of Natural Language Processing (NLP). We encourage the representation of a broad range of areas including but not limited to the following: pragmatics, discourse, semantics, syntax, and the lexicon; phonetics, phonology, and morphology; mathematical models and complexity; text understanding and generation; multilingual NLP; machine translation, machine-aided translation, translation memory systems, translation aids and tools; corpus-based language processing; POS tagging; parsing; electronic dictionaries; knowledge acquisition; terminology; word-sense disambiguation; anaphora resolution; information retrieval; information extraction; text summarisation; term recognition; text categorisation; question answering; textual entailment; visualisation; dialogue systems; speech processing; computer-aided language learning; language resources; evaluation; and theoretical and application-oriented papers related to NLP.

    Important Dates:

    • Conference paper submission notification: 11 April 2011
    • Conference paper submission deadline: 18 April 2011
    • Conference paper acceptance notification: 15 June 2011
    • Camera-ready versions of the conference papers: 20 July 2011

    The proceedings from RANLP 2009 are typical for the conference.

    December 27, 2010

Python Text Processing with NLTK 2.0 Cookbook – Review Forthcoming!

    Filed under: Classification,Data Analysis,Data Mining,Natural Language Processing — Patrick Durusau @ 2:25 pm

Just a quick placeholder to say that I am reviewing Python Text Processing with NLTK 2.0 Cookbook.


    I should have the review done in the next couple of weeks.

    In the longer term I will be developing a set of notes on the construction of topic maps using this toolkit.

    While you wait for the review, you might enjoy reading: Chapter No.3 – Creating Custom Corpora (free download).
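In the spirit of that chapter, a minimal custom corpus from a directory of plain text files (the path is mine, a placeholder):

```python
# Wrap a directory of .txt files as an NLTK corpus.
from nltk.corpus import PlaintextCorpusReader

corpus = PlaintextCorpusReader("/path/to/texts", r".*\.txt")
print(corpus.fileids())
print(corpus.words(corpus.fileids()[0])[:20])  # first 20 tokens of first file
```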

