Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 7, 2011

Boilerpipe

Filed under: Data Mining,Java — Patrick Durusau @ 4:15 pm

Boilerpipe

From the webpage:

The boilerpipe library provides algorithms to detect and remove the surplus “clutter” (boilerplate, templates) around the main textual content of a web page.

The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.

Extracting content is very fast (milliseconds), just needs the input document (no global or site-level information required) and is usually quite accurate.

Boilerpipe is a Java library written by Christian Kohlschütter. It is released under the Apache License 2.0.

It should save you some time when harvesting data from web pages.
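To give a feel for what "removing clutter" can mean in practice, here is a minimal sketch of one common heuristic: keep a block of text only if it is reasonably long and not mostly link text. This is not boilerpipe's code or API. The class `BoilerplateSketch`, its methods, the regex-based block splitting, and the thresholds are all assumptions made up for illustration, and a real page needs a proper HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy class: NOT part of boilerpipe, illustration only.
public class BoilerplateSketch {
    // Non-nested block elements; regexes are a shortcut here, not a real HTML parser.
    private static final Pattern BLOCK =
        Pattern.compile("(?is)<(p|li|h[1-6])\\b[^>]*>(.*?)</\\1>");
    private static final Pattern ANCHOR =
        Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a>");
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");

    // Number of whitespace-separated words in plain text.
    static int countWords(String text) {
        String t = text.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
    }

    // Replace every tag with a space and collapse runs of whitespace.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    // Fraction of a block's words that sit inside <a>...</a>; empty blocks count as all links.
    static double linkDensity(String blockHtml) {
        int total = countWords(stripTags(blockHtml));
        if (total == 0) {
            return 1.0;
        }
        int linked = 0;
        Matcher m = ANCHOR.matcher(blockHtml);
        while (m.find()) {
            linked += countWords(stripTags(m.group(1)));
        }
        return (double) linked / total;
    }

    // Keep blocks with at least minWords words and link density at most maxLinkDensity.
    static List<String> extractContent(String html, int minWords, double maxLinkDensity) {
        List<String> kept = new ArrayList<>();
        Matcher m = BLOCK.matcher(html);
        while (m.find()) {
            String inner = m.group(2);
            String text = stripTags(inner);
            if (countWords(text) >= minWords && linkDensity(inner) <= maxLinkDensity) {
                kept.add(text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String html =
            "<ul><li><a href=\"/\">Home</a></li><li><a href=\"/about\">About us</a></li></ul>"
            + "<p>The boilerpipe library provides algorithms to detect and remove the surplus "
            + "clutter around the main textual content of a web page.</p>"
            + "<p><a href=\"/next\">Next post</a></p>";
        // Thresholds (10 words, 30% link text) are arbitrary values chosen for this example.
        for (String block : extractContent(html, 10, 0.3)) {
            System.out.println(block);
        }
    }
}
```

Like the library as described above, the sketch looks only at the input document, with no site-level information. The navigation items and the "Next post" link fail both tests, while the long paragraph passes.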
