Indexing data in Solr from disparate sources using Camel by Bilgin Ibryam.
From the post:
Apache Solr is ‘the popular, blazing fast open source enterprise search platform’ built on top of Lucene. In order to do a search (and find results), there is the initial requirement of data ingestion, usually from disparate sources like content management systems, relational databases, legacy systems, you name it… Then there is also the challenge of keeping the index up to date by adding new data, updating existing records, and removing obsolete data. The new sources of data could be the same as the initial ones, but could also be sources like Twitter, AWS, or REST endpoints.
Solr can understand different file formats and provides a fair number of options for data indexing:
- Direct HTTP and remote streaming – let you interact with Solr over HTTP, either by posting a file for direct indexing or by posting the path to a file for remote streaming.
- DataImportHandler – a module that enables both full and incremental (delta) imports from relational databases or the file system.
- SolrJ – a Java client for accessing Solr, built on Apache Commons HTTP Client (see the sketch after this list).
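To make the SolrJ option concrete, here is a minimal sketch of my own (not from Bilgin's post) that indexes a single document. The URL, core name, and field names are placeholders, and the Builder API shown is from newer SolrJ releases; older versions used HttpSolrServer on Commons HTTP Client instead.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SolrJIndexExample {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and core name; point this at your own Solr instance.
        SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/collection1").build();

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");                  // unique key field
        doc.addField("title", "Indexing with SolrJ"); // assumes a 'title' field in the schema

        client.add(doc);  // send the document to Solr
        client.commit();  // make it visible to searches
        client.close();
    }
}
```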
But in real life, indexing data from different sources, with millions of documents, dozens of transformations, filtering, content enrichment, replication, and parallel processing, requires much more than that. One way to cope with such a challenge is to reinvent the wheel: write a few custom applications, combine them with some scripts, and run cron jobs. Another approach is to use a tool that is flexible, designed to be configurable and pluggable, and that can help you scale and distribute the load with ease. Such a tool is Apache Camel, which now also has a Solr connector.
(…)
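To give the Camel angle some shape: a route can pull from just about any endpoint and push into Solr through the camel-solr component. Here is a minimal sketch of mine (not Bilgin's); the drop directory, Solr URL, and core name are assumptions, and the SolrOperation header follows the camel-solr convention in the Camel 2.x/3.x lines.

```java
import org.apache.camel.CamelContext;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

public class CamelSolrIndexing {
    public static void main(String[] args) throws Exception {
        CamelContext context = new DefaultCamelContext();
        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() {
                // Watch a drop directory and index each file into Solr.
                // 'data/inbox' and the solr:// endpoint are placeholders.
                from("file:data/inbox?noop=true")
                    .setHeader("SolrOperation", constant("INSERT"))
                    .to("solr://localhost:8983/solr/collection1")
                    // a separate COMMIT message makes the updates searchable
                    .setHeader("SolrOperation", constant("COMMIT"))
                    .to("solr://localhost:8983/solr/collection1");
            }
        });
        context.start();
        Thread.sleep(10_000); // let the route poll for a while, then shut down
        context.stop();
    }
}
```

The appeal is that the same route shape holds for other feeds: swapping file: for jms:, twitter:, or aws-s3: is what makes the "disparate sources" part cheap, and transformations or filters slot in as extra steps on the route.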
On the plus side:

- Avoid reinventing the wheel
- Robust software
- Name recognition of Lucene/Solr
- Name recognition of Camel
Do you see any negatives?
BTW, the examples that round out Bilgin’s post are quite useful!