Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

August 20, 2011

WordNet Data > 10.3 Billion Unique Values

Filed under: Dataset,Linguistics,WordNet — Patrick Durusau @ 8:08 pm

I wanted to draw your attention to some WordNet data files.

From the readme.TXT file in the directory:

As of August 19, 2011 pairwise measures for all nouns using the path measure are available. This file is named WordNet-noun-noun-path-pairs.tar. It is approximately 120 GB compressed. In this file you will find 146,312 files, one for each noun sense. Each file consists of 146,313 lines, where each line (except the first) contains a WordNet noun sense and the similarity to the sense featured in that particular file. Doing the math here, you find that each .tar file contains about 21,000,000,000 pairwise similarity values. Note that these are symmetric (sim (A,B) = sim (B,A)) so you have around 10 billion unique values.

We are currently running wup, res, and lesk, but do not have an estimated date of availability yet.
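The noun arithmetic is easy to check for yourself. Here is a minimal sketch in Python. The function name pair_counts is mine, not anything from the readme, and it assumes that each file lists every sense once (including the featured sense itself), so the unique count treats self-pairs as values too.

```python
def pair_counts(n_senses: int) -> tuple[int, int]:
    """Return (total, unique) similarity values for n_senses senses.

    Assumption: one file per sense, each with one value line per sense,
    so the total is n * n. Because sim(A, B) == sim(B, A), the unique
    values are the unordered pairs, counted here with self-pairs.
    """
    total = n_senses * n_senses
    unique = n_senses * (n_senses + 1) // 2
    return total, unique


# 146,312 noun senses, the file count given in the readme.
noun_total, noun_unique = pair_counts(146_312)
print(f"{noun_total:,} total, {noun_unique:,} unique")
# prints "21,407,201,344 total, 10,703,673,828 unique"
```

That lands a little above the rounded figures in the readme: about 21.4 billion values in all and about 10.7 billion unique ones.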

BTW, on verb data:

These files were created with WordNet::Similarity version 2.05 using WordNet 3.0. They show all the pairwise verb-verb similarities found in WordNet according to the path, wup, lch, lin, res, and jcn measures. The path, wup, and lch are path-based, while res, lin, and jcn are based on information content.

As of March 15, 2011 pairwise measures for all verbs using the six measures above are available, each in their own .tar file. Each *.tar file is named as WordNet-verb-verb-MEASURE-pairs.tar, and is approx 2.0 – 2.4 GB compressed. In each of these .tar files you will find 25,047 files, one for each verb sense. Each file consists of 25,048 lines, where each line (except the first) contains a WordNet verb sense and the similarity to the sense featured in that particular file. Doing the math here, you find that each .tar file contains about 625,000,000 pairwise similarity values. Note that these are symmetric (sim (A,B) = sim (B,A)) so you have a bit more than 300 million unique values.
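Since sim(A,B) = sim(B,A), anyone loading these values only needs to keep one copy of each pair. The sketch below shows one way that might look in Python. The class SymmetricSimilarity is hypothetical, the sense labels are only illustrative strings, and the score is a made-up number, not a value taken from the data files.

```python
class SymmetricSimilarity:
    """Keep one value per unordered pair of senses.

    Hypothetical helper: it relies only on the symmetry noted in the
    readme, sim(A, B) == sim(B, A), and knows nothing about the files.
    """

    def __init__(self) -> None:
        self._values: dict[tuple[str, str], float] = {}

    @staticmethod
    def _key(a: str, b: str) -> tuple[str, str]:
        # Sort the two labels so (A, B) and (B, A) share one key.
        return (a, b) if a <= b else (b, a)

    def set(self, a: str, b: str, value: float) -> None:
        self._values[self._key(a, b)] = value

    def get(self, a: str, b: str) -> float:
        return self._values[self._key(a, b)]

    def __len__(self) -> int:
        return len(self._values)


store = SymmetricSimilarity()
# Illustrative labels and a made-up score, not real data.
store.set("run#v#1", "walk#v#1", 0.25)
store.set("walk#v#1", "run#v#1", 0.25)  # same pair, no second entry
print(store.get("walk#v#1", "run#v#1"), len(store))
```

A plain dictionary like this is fine for experiments, but at 300 million pairs per measure you would want something more compact.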
