Understanding Information Retrieval by Using Apache Lucene and Tika, Part 1
Understanding Information Retrieval by Using Apache Lucene and Tika, Part 2
Understanding Information Retrieval by Using Apache Lucene and Tika, Part 3
by Ana-Maria Mihalceanu.
From part 1:
In this tutorial, the Apache Lucene and Apache Tika frameworks will be explained through their core concepts (e.g. parsing, mime detection, content analysis, indexing, scoring, boosting) via illustrative examples that should be applicable to not only seasoned software developers but to beginners to content analysis and programming as well. We assume you have a working knowledge of the Java™ programming language and plenty of content to analyze.
Throughout this tutorial, you will learn:
- how to use Apache Tika’s API and its most relevant functions
- how to develop code with Apache Lucene API and its most important modules
- how to integrate Apache Lucene and Apache Tika in order to build your own piece of software that stores and retrieves information efficiently. (project code is available for download)
Part 1 introduces you to Apache Lucene and Apache Tika and concludes by covering automatic extraction of metadata from files with Apache Tika.
Part 2 covers extracting and indexing content, along with stemming, boosting and scoring. (If any of that sounds unfamiliar, this isn’t the best tutorial for you.)
Part 3 details the highlighting of fragments when they match a search query.
A good tutorial on Apache Lucene and Apache Tika, at least for the parts of them that are covered, but there is no coverage of information retrieval itself. For example, part 3 talks about increasing search “efficiency” without any consideration of what “efficiency” might mean in a particular search context.
A tutorial that illuminated issues in information retrieval using Apache Lucene and Tika, rather than coding up an indexing/searching application with no discussion of the potential choices and tradeoffs, would be much better.
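To make the point about “efficiency” concrete, here is a minimal sketch in plain Java (no Lucene or Tika calls) of one possible reading: retrieval quality measured as precision and recall. The class name `RetrievalMetrics`, the document IDs and the relevance judgments are all made up for illustration; nothing here comes from the tutorial, and speed or index size would be equally valid readings of “efficiency”.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Lucene or Tika.
final class RetrievalMetrics {

    // Fraction of the retrieved documents that are actually relevant.
    static double precision(Set<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) {
            return 0.0; // convention chosen for this sketch
        }
        return (double) overlap(retrieved, relevant) / retrieved.size();
    }

    // Fraction of the relevant documents that were actually retrieved.
    static double recall(Set<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0; // convention chosen for this sketch
        }
        return (double) overlap(retrieved, relevant) / relevant.size();
    }

    // Number of documents that are both retrieved and relevant.
    private static int overlap(Set<String> retrieved, Set<String> relevant) {
        Set<String> common = new HashSet<>(retrieved);
        common.retainAll(relevant);
        return common.size();
    }

    public static void main(String[] args) {
        // Invented relevance judgments for a single query.
        Set<String> relevant = Set.of("doc1", "doc2", "doc3", "doc4");

        // A cautious search returns few results, all of them relevant.
        Set<String> narrow = Set.of("doc1", "doc2");

        // A generous search finds everything relevant, plus noise.
        Set<String> broad = Set.of("doc1", "doc2", "doc3", "doc4",
                                   "doc5", "doc6", "doc7", "doc8");

        System.out.println("narrow: precision=" + precision(narrow, relevant)
                + " recall=" + recall(narrow, relevant));
        System.out.println("broad:  precision=" + precision(broad, relevant)
                + " recall=" + recall(broad, relevant));
    }
}
```

With these invented judgments, the narrow search has perfect precision but only half the recall, and the broad search is the reverse. Which of the two counts as more “efficient” depends entirely on the search context, which is exactly the kind of tradeoff the tutorial never discusses.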