Archive for the ‘WordNet’ Category

WordNet RDF

Wednesday, April 16th, 2014

WordNet RDF

From the webpage:

WordNet is supported by the National Science Foundation under Grant Number 0855157. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the creators of WordNet and do not necessarily reflect the views of the National Science Foundation.

This is the RDF version of WordNet, created by mapping the existing WordNet data into RDF. The data is structured according to the lemon model. In addition, links have been added from the following sources:

These links increase the usefulness of the WordNet data. If you would like to contribute extra linking to WordNet please Contact us.

Curious if you find it easier to integrate WordNet RDF with other data or the more traditional WordNet?

I first saw this in a tweet by Bob DuCharme.

NLTK 2.3 – Working with Wordnet

Friday, April 12th, 2013

NLTK 2.3 – Working with Wordnet by Vsevolod Dyomkin.

From the post:

I’m a little bit behind my schedule of implementing NLTK examples in Lisp with no posts on topic in March. It doesn’t mean that work on CL-NLP has stopped – I’ve just had an unexpected vacation and also worked on parts, related to writing programs for the excellent Natural Language Processing by Michael Collins Coursera course.

Today we’ll start looking at Chapter 2, but we’ll do it from the end, first exploring the topic of Wordnet.

Vsevolod more than makes up for his absence with his post on Wordnet.

As a sample, consider this graphic of the potential of Wordnet:

Wordnet schema

Pay particular attention to the coverage of similarity measures.

Enjoy!

A New Representation of WordNet® using Graph Databases

Monday, March 4th, 2013

A New Representation of WordNet® using Graph Databases by Khaled Nagi.

Abstract:

WordNet® is one of the most important resources in computation linguistics. The semantically related database of English terms is widely used in text analysis and retrieval domains, which constitute typical features, employed by social networks and other modern Web 2.0 applications. Under the hood, WordNet® can be seen as a sort of read-only social network relating its language terms. In our work, we implement a new storage technique for WordNet® based on graph databases. Graph databases are a major pillar of the NoSQL movement with lots of emerging products, such as Neo4j. In this paper, we present two Neo4j graph storage representations for the WordNet® dictionary. We analyze their performance and compare them to other traditional storage models. With this contribution, we also validate the applicability of modern graph databases in new areas beside the typical large-scale social networks with several hundreds of millions of nodes.

Finally, a paper that covers “moderate size databases!”

Think about the average graph database you see on this blog. Not really in the “moderate” range, even though a majority of users work in the moderate range.

Compare the number of Facebook size enterprises with the number of enterprises generally.

Not dissing super-sized graph databases or research on same. I enjoy both a lot.

But for your average customer, experience with “moderate size databases” may be more immediately relevant.

I first saw this in a tweet from Peter Neubauer.

FreeLing 3.0 – An Open Source Suite of Language Analyzers

Sunday, June 3rd, 2012

FreeLing 3.0 – An Open Source Suite of Language Analyzers

Features:

Main services offered by FreeLing library:

  • Text tokenization
  • Sentence splitting
  • Morphological analysis
  • Suffix treatment, retokenization of clitic pronouns
  • Flexible multiword recognition
  • Contraction splitting
  • Probabilistic prediction of unkown word categories
  • Named entity detection
  • Recognition of dates, numbers, ratios, currency, and physical magnitudes (speed, weight, temperature, density, etc.)
  • PoS tagging
  • Chart-based shallow parsing
  • Named entity classification
  • WordNet based sense annotation and disambiguation
  • Rule-based dependency parsing
  • Nominal correference resolution

[Not all features are supported for all languages, see Supported Languages.]

TOC for the user manual.

Something for your topic map authoring toolkit!

(Source: Jack Park)

Semantic Web – Sweet Spot(s) and ‘Gold Standards’

Monday, January 23rd, 2012

Mike Bergman posted a two-part series on how to make the Semantic Web work:

Seeking a Semantic Web Sweet Spot

In Search of ‘Gold Standards’ for the Semantic Web

Both are worth your time to read but the second sets the bar for “Gold Standards” for the Semantic Web as:

The need for gold standards for the semantic Web is particularly acute. First, by definition, the scope of the semantic Web is all things and all concepts and all entities. Second, because it embraces human knowledge, it also embraces all human languages with the nuances and varieties thereof. There is an immense gulf in referenceability from the starting languages of the semantic Web in RDF, RDFS and OWL to this full scope. This gulf is chiefly one of vocabulary (or lack thereof). We know how to construct our grammars, but we have few words with understood relationships between them to put in the slots.

The types of gold standards useful to the semantic Web are similar to those useful to our analogy of human languages. We need guidance on structure (syntax and grammar), plus reference vocabularies that encompass the scope of the semantic Web (that is, everything). Like human languages, the vocabulary references should have analogs to dictionaries, thesauri and encyclopedias. We want our references to deal with the specific demands of the semantic Web in capturing the lexical basis of human languages and the connectedness (or not) of things. We also want bases by which all of this information can be related to different human languages.

To capture these criteria, then, I submit we should consider a basic starting set of gold standards:

  • RDF/RDFS/OWL — the data model and basic building blocks for the languages
  • Wikipedia — the standard reference vocabulary of things, concepts and entities, plus other structural guidances
  • WordNet — lexical language references as an aid to natural language processing, and
  • UMBEL — the structural reference for the connectedness of things for basic coherence and inference, plus a vocabulary for mapping amongst reference structures and things.

Each of these potential gold standards is next discussed in turn. The majority of discussion centers on Wikipedia and UMBEL.

There is one criteria that Mike leaves out: Choice of a majority of users.

Use by a majority of users is a sweet spot that brooks no argument.

WordNet Data > 10.3 Billion Unique Values

Saturday, August 20th, 2011

WordNet Data > 10.3 Billion Unique Values

Wanted to draw your attention to some WordNet data files.

From the readme.TXT file in the directory:

As of August 19, 2011 pairwise measures for all nouns using the path measure are available. This file is named WordNet-noun-noun-path-pairs.tar. It is approximately 120 GB compressed. In this file you will find 146,312 files, one for each noun sense. Each file consists of 146,313 lines, where each line (except the first) contains a WordNet noun sense and the similarity to the sense featured in that particular file. Doing the math here, you find that each .tar file contains
about 21,000,000,000 pairwise similarity values. Note that these are symmetric (sim (A,B) = sim (B,A)) so you have around 10 billion unique values.

We are currently running wup, res, and lesk, but do not have an estimated date of availability yet.

BTW, on verb data:

These files were created with WordNet::Similarity version 2.05 using WordNet 3.0. They show all the pairwise verb-verb similarities found in WordNet according to the path, wup, lch, lin, res, and jcn measures. The path, wup, and lch are path-based, while res, lin, and jcn are based on information content.

As of March 15, 2011 pairwise measures for all verbs using the six measures above are availble, each in their own .tar file. Each *.tar file is named as WordNet-verb-verb-MEASURE-pairs.tar, and is approx 2.0 – 2.4 GB compressed. In each of these .tar files you will find 25,047 files, one for each verb sense. Each file consists of 25,048 lines, where each line (except the first) contains a WordNet verb sense and the similarity to the sense featured in that particular file. Doing
the math here, you find that each .tar file contains about 625,000,000 pairwise similarity values. Note that these are symmetric (sim (A,B) = sim (B,A)) so you have a bit more than 300 million unique values.