Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

June 15, 2013

Indexing web sites in Solr with Python

Filed under: Indexing,Python,Solr — Patrick Durusau @ 3:44 pm

Indexing web sites in Solr with Python by Martijn Koster.

From the post:

In this post I will show a simple yet effective way of indexing web sites into a Solr index, using Scrapy and Python.

We see a lot of advanced Solr-based applications, with sophisticated custom data pipelines that combine data from multiple sources, or that have large scale requirements. Equally we often see people who want to start implementing search in a minimally-invasive way, using existing websites as integration points rather than implementing a deep integration with particular CMSes or databases which may be maintained by other groups in an organisation.

While crawling websites sounds fairly basic, you soon find that there are gotchas, with the mechanics of crawling, but more importantly, with the structure of websites. If you simply parse the HTML and index the text, you will index a lot of text that is not actually relevant to the page: navigation sections, headers and footers, ads, links to related pages.

Trying to clean that up afterwards is often not effective; you’re much better off preventing that cruft going into the index in the first place. That involves parsing the content of the web page, and extracting information intelligently.

And there’s a great tool for doing this: Scrapy. In this post I will give a simple example of its use. See Scrapy’s tutorial for an introduction and further information.
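To make the "keep the cruft out of the index" point concrete, here is a minimal sketch that uses only the Python standard library. It is not Scrapy and not the code from Koster's post. The set of tags to skip, the field names (id, title, text), and the core name "mycore" are all assumptions for illustration; a real site will need its own extraction rules and a schema to match.

```python
import json
import urllib.request
from html.parser import HTMLParser

# Assumption: these tags hold page chrome (navigation, ads, boilerplate),
# not page content. Adjust for the site being crawled.
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class ContentExtractor(HTMLParser):
    """Collect the <title> and visible text, ignoring text inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside any skipped element
        self.in_title = False
        self.title = ""
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_title:
            self.title += text
        elif self.skip_depth == 0:
            self.chunks.append(text)


def to_solr_doc(url, html):
    """Turn one fetched page into a dict shaped like a Solr document.

    The field names are hypothetical; they must match your Solr schema.
    """
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return {"id": url, "title": parser.title, "text": " ".join(parser.chunks)}


def build_update_request(
    docs, solr_url="http://localhost:8983/solr/mycore/update?commit=true"
):
    """Build (but do not send) a JSON update request for Solr.

    The URL, including the core name "mycore", is an assumption.
    Pass the result to urllib.request.urlopen() to actually index.
    """
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        solr_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Calling to_solr_doc(url, html) on each crawled page and passing the resulting list to build_update_request() gives a request you can send with urllib.request.urlopen(). Scrapy replaces the hand-rolled parser here with a crawler plus XPath/CSS selectors, which is usually the better choice once a site has more than a handful of page layouts.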

Good practice with Solr. And when you run your own crawler and index, your search activities are yours to keep private if you like. 😉
