Archive for the ‘European Parliament Proceedings Corpus’ Category

European Parliament Proceedings Parallel Corpus 1996-2011

Sunday, November 18th, 2012

From the webpage:

For a detailed description of this corpus, please read:

Europarl: A Parallel Corpus for Statistical Machine Translation, Philipp Koehn, MT Summit 2005, pdf.

Please cite the paper, if you use this corpus in your work. See also the extended (but earlier) version of the report (ps, pdf).

The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romanic (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.

The goal of the extraction and processing was to generate sentence-aligned text for statistical machine translation systems. For this purpose we extracted matching items and labeled them with corresponding document IDs. Using a preprocessor, we identified sentence boundaries. We sentence-aligned the data using a tool based on the Gale and Church algorithm.
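To give a rough feel for what length-based sentence alignment does, here is a minimal Python sketch. It is not the tool used to build Europarl, and it is not the full algorithm from the paper: it swaps the paper's probabilistic length model for a plain log-ratio of character lengths, and the bead penalties in `BEADS` are values I made up for illustration. The function names `length_cost` and `align` are likewise hypothetical.

```python
import math

# Allowed alignment "beads": (source sentences used, target sentences used).
# The penalties are illustrative assumptions, not values from the paper.
BEADS = {(1, 1): 0.0, (1, 0): 4.0, (0, 1): 4.0, (2, 1): 2.0, (1, 2): 2.0}


def length_cost(src_len, tgt_len):
    """Penalise beads whose two sides differ a lot in character length."""
    if src_len == 0 or tgt_len == 0:
        return 0.0  # insertions/deletions pay only the fixed bead penalty
    return abs(math.log(src_len / tgt_len))


def align(src, tgt):
    """Return the cheapest bead sequence as (source_indices, target_indices) pairs."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward dynamic programme over prefixes of both sentence lists.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + penalty + length_cost(s_len, t_len)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)

    # Walk the back-pointers from the end to recover the beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Calling `align(source_sentences, target_sentences)` on two lists of strings returns index groups, so a pair such as `([0, 1], [0])` means the first two source sentences were matched to the first target sentence. Character length is the only signal used, which is why this kind of aligner needs no dictionary.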

Version 7, released in May of 2012, has around 60 million words per language.

Just in case you need a corpus for the EU.

I would be mindful of its parliamentary context. Semantic equivalence or similarity found there may not hold in other contexts.