Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 22, 2013

PPDB: The Paraphrase Database

Filed under: Computational Linguistics,Linguistics,Natural Language Processing — Patrick Durusau @ 2:46 pm

PPDB: The Paraphrase Database by Juri Ganitkevitch, Benjamin Van Durme and Chris Callison-Burch.

Abstract:

We present the 1.0 release of our paraphrase database, PPDB. Its English portion, PPDB:Eng, contains over 220 million paraphrase pairs, consisting of 73 million phrasal and 8 million lexical paraphrases, as well as 140 million paraphrase patterns, which capture many meaning-preserving syntactic transformations. The paraphrases are extracted from bilingual parallel corpora totaling over 100 million sentence pairs and over 2 billion English words. We also release PPDB:Spa, a collection of 196 million Spanish paraphrases. Each paraphrase pair in PPDB contains a set of associated scores, including paraphrase probabilities derived from the bitext data and a variety of monolingual distributional similarity scores computed from the Google n-grams and the Annotated Gigaword corpus. Our release includes pruning tools that allow users to determine their own precision/recall tradeoff.

A resource that should improve your subject identification!

PPDB data sets range from 424 MB (6.8M rules) to 5.7 GB (86.4M rules). Download PPDB data sets.
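The abstract mentions per-pair scores and pruning tools for choosing your own precision/recall tradeoff. To give a feel for what that kind of pruning amounts to, here is a minimal sketch in Python. It is not the pruning tool shipped with PPDB. It assumes each rule sits on one line with fields separated by `|||` and a fourth field of space-separated `name=value` scores, and that lower values are better (as with negative log probabilities). The feature name `cost`, the sample lines and their values, and the helpers `parse_rule` and `prune` are all hypothetical, so check the layout against the files you actually download.

```python
from typing import Dict, Iterable, Iterator, NamedTuple


class Rule(NamedTuple):
    lhs: str                    # syntactic label, e.g. "[NN]"
    phrase: str                 # source phrase
    paraphrase: str             # candidate paraphrase
    features: Dict[str, float]  # score name -> value


def parse_rule(line: str) -> Rule:
    """Parse one rule line in the ASSUMED 'a ||| b ||| c ||| k=v k=v ...' layout."""
    fields = [field.strip() for field in line.split("|||")]
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 fields, got {len(fields)}")
    lhs, phrase, paraphrase, feature_field = fields[:4]
    features: Dict[str, float] = {}
    for item in feature_field.split():
        name, sep, value = item.partition("=")
        if not sep:
            continue  # skip tokens that are not name=value pairs
        try:
            features[name] = float(value)
        except ValueError:
            continue  # skip non-numeric values
    return Rule(lhs, phrase, paraphrase, features)


def prune(lines: Iterable[str], feature: str, max_cost: float) -> Iterator[Rule]:
    """Yield rules whose chosen score is present and at most max_cost.

    A lower max_cost keeps fewer, more confident pairs (higher precision);
    a higher max_cost keeps more pairs (higher recall).
    """
    for line in lines:
        if not line.strip():
            continue
        rule = parse_rule(line)
        cost = rule.features.get(feature)
        if cost is not None and cost <= max_cost:
            yield rule


# Illustrative lines only; the scores are made up, not taken from PPDB.
SAMPLE = [
    "[NN] ||| car ||| automobile ||| cost=1.2 other=0.5 ||| 0-0",
    "[NN] ||| car ||| vehicle ||| cost=3.4 other=0.1 ||| 0-0",
]

if __name__ == "__main__":
    for rule in prune(SAMPLE, "cost", 2.0):
        print(rule.phrase, "->", rule.paraphrase)
```

For topic map work, the surviving pairs could serve as candidate alternate names for a subject, with the threshold controlling how aggressively you are willing to merge.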
