Archive for the ‘Solr’ Category

10 Reasons to Choose Apache Solr Over Elasticsearch

Saturday, November 12th, 2016

10 Reasons to Choose Apache Solr Over Elasticsearch by Grant Ingersoll.

From the post:

Hey, clickbait title aside, I get it, Elasticsearch has been growing. Kudos to the project for tapping into a new set of search users and use cases like logging, where they are making inroads against the likes of Splunk in the IT logging market. However, there is another open source, Lucene-based search engine out there that is quite mature, more widely deployed and still growing, granted without a huge marketing budget behind it: Apache Solr. Despite what others would have you believe, Solr is quite alive and well, thank you very much. And I’m not just saying that because I make a living off of Solr (which I’m happy to declare up front), but because the facts support it.

For instance, in the Google Trends arena (see below or try the query yourself), Solr continues to hold a steady recurring level of interest even while Elasticsearch has grown. Dissection of these trends (which are admittedly easy to game, so I’ve tried to keep them simple), show Elasticsearch is strongest in Europe and Russia while Solr is strongest in the US, China, India, Brazil and Australia. On the DB-Engines ranking site, which factors in Google trends and other job/social metrics, you’ll see both Elasticsearch and Solr are top 15 projects, beating out a number of other databases like HBase and Hive. Solr’s mailing list is quite active (~280 msgs per week compared to ~170 per week for Elasticsearch) and it continues to show strong download numbers via Maven repository statistics. Solr as a codebase continues to innovate (which I’ll cover below) as well as provide regular, stable releases. Finally, Lucene/Solr Revolution, the conference my company puts on every year, continues to set record attendance numbers.

Not so much an “us versus them” piece as tantalizing facts about Solr 6 that will leave you wanting to know more.

Grant invites you to explore the Solr Quick Start if one or more of his ten points capture your interest.

Timely because with a new presidential administration about to take over in Washington, D.C., there will be:

  • Data leaks as agencies vie with each other
  • Data leaks due to inexperienced staffers
  • Data leaks to damage one side or in retaliation
  • Data leaks from foundations and corporations
  • others

If 2016 was the year of “false news” then 2017 is going to be the year of the “government data leak.”

Left unexplored, except for headline-suitable quips found with grep, leaks may not be significant.

On the other hand, using Solr 6 can enable you to weave a coherent narrative from diverse resources.

But you will have to learn Solr 6 to know for sure.

Enjoy!

Apache Lucene 6.2.1 and Apache Solr 6.2.1 Available [Presidential Data Leaks]

Thursday, September 22nd, 2016

Lucene can be downloaded from http://www.apache.org/dyn/closer.lua/lucene/java/6.2.1

Solr can be downloaded from http://www.apache.org/dyn/closer.lua/lucene/solr/6.2.1

If you aren’t using Lucene/Solr 6.2, here’s your chance to grab the latest bug fixes as well!

Data leaks will accelerate as the US presidential election draws to a close.

What’s your favorite tool for analysis and delivery of data dumps?

Enjoy!

The Iraq Inquiry (Chilcot Report) [4.5x longer than War and Peace]

Wednesday, July 6th, 2016

The Iraq Inquiry

To give a rough sense of the depth of the Chilcot Report, the executive summary runs 150 pages. The report appears in twelve (12) volumes, not including video testimony, witness transcripts, documentary evidence, contributions and the like.

Cory Doctorow reports a Guardian project to crowd source collecting facts from the 2.6 million word report. The Guardian observes the Chilcot report is “…almost four-and-a-half times as long as War and Peace.”

Manual reading of the Chilcot report is doable, but unlikely to yield all of the connections that exist between participants, witnesses, evidence, etc.

How would you go about making the Chilcot report and its supporting evidence more amenable to navigation and analysis?

The Report

The Evidence

Other Material

Unfortunately, sections within volumes were not numbered according to their volume. In other words, volume 2 starts with section 3.3 and ends with 3.5, whereas volume 4 only contains sections beginning with “4.,” while volume 5 starts with section 5 but also contains sections 6.1 and 6.2. Nothing can be done for it but be aware that section numbers don’t correspond to volume numbers.

Lucene/Solr 6.0 Hits The Streets! (There goes the weekend!)

Friday, April 8th, 2016

From the Lucene PMC:

The Lucene PMC is pleased to announce the release of Apache Lucene 6.0.0 and Apache Solr 6.0.0

Lucene can be downloaded from http://www.apache.org/dyn/closer.lua/lucene/java/6.0.0
and Solr can be downloaded from http://www.apache.org/dyn/closer.lua/lucene/solr/6.0.0

Highlights of this Lucene release include:

  • Java 8 is the minimum Java version required.
  • Dimensional points, replacing legacy numeric fields, provides fast and space-efficient support for both single- and multi-dimension range and shape filtering. This includes numeric (int, float, long, double), InetAddress, BigInteger and binary range filtering, as well as geo-spatial shape search over indexed 2D LatLonPoints. See this blog post for details. Dependent classes and modules (e.g., MemoryIndex, Spatial Strategies, Join module) have been refactored to use new point types.
  • Lucene classification module now works on Lucene Documents using a KNearestNeighborClassifier or SimpleNaiveBayesClassifier.
  • The spatial module no longer depends on third-party libraries. Previous spatial classes have been moved to a new spatial-extras module.
  • Spatial4j has been updated to a new 0.6 version hosted by locationtech.
  • TermsQuery performance boost by a more aggressive default query caching policy.
  • IndexSearcher’s default Similarity is now changed to BM25Similarity.
  • Easier method of defining custom CharTokenizer instances.
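
The move to BM25Similarity changes how term frequency and document length feed into a score. As a rough illustration only, here is the textbook BM25 term score in Python with the commonly cited defaults k1 = 1.2 and b = 0.75. The function name and the sample numbers are invented for this sketch, and Lucene's own implementation differs in details (for example, in how it stores document lengths), so treat it as a way to build intuition rather than a reproduction of Lucene's scores.

```python
import math


def bm25_term_score(freq, doc_len, avg_doc_len, doc_count, doc_freq,
                    k1=1.2, b=0.75):
    """Textbook BM25 score of one term in one document (illustrative only)."""
    # Rare terms (small doc_freq) get a larger inverse document frequency.
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: documents longer than average are penalized.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * freq * (k1 + 1) / (freq + norm)


# Hypothetical index: 1000 documents, the term occurs in 10 of them.
short_doc = bm25_term_score(freq=2, doc_len=50, avg_doc_len=100,
                            doc_count=1000, doc_freq=10)
long_doc = bm25_term_score(freq=2, doc_len=400, avg_doc_len=100,
                           doc_count=1000, doc_freq=10)
heavy_use = bm25_term_score(freq=200, doc_len=400, avg_doc_len=100,
                            doc_count=1000, doc_freq=10)
# Term frequency saturates: no frequency can push the score past this bound.
upper_bound = math.log(1 + (1000 - 10 + 0.5) / (10 + 0.5)) * (1.2 + 1)
```

With the same term frequency, the shorter document scores higher, and repeating a term two hundred times cannot push the score past the saturation bound. Both behaviors differ from classic TF-IDF scoring.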

Highlights of this Solr release include:

  • Improved defaults for “Similarity” used in Solr, in order to provide better default experience for new users.
  • Improved “Similarity” defaults for users upgrading: DefaultSimilarityFactory has been removed, implicit default Similarity has been changed to SchemaSimilarityFactory, and SchemaSimilarityFactory has been modified to use BM25Similarity as the default for field types that do not explicitly declare a Similarity.
  • Deprecated GET methods for schema are now accessible through the bulk API. The output has less details and is not backward compatible.
  • Users should set useDocValuesAsStored="false" to preserve sort order on multi-valued fields that have both stored="true" and docValues="true".
  • Formatted date-times are more consistent with ISO-8601. BC dates are now better supported since they are now formatted with a leading ‘-’. AD years after 9999 have a leading ‘+’. Parse exceptions have been improved.
  • Deprecated SolrServer and subclasses have been removed, use SolrClient instead.
  • The deprecated configuration in solrconfig.xml has been removed. Users must remove it from solrconfig.xml.
  • SolrClient.shutdown() has been removed, use SolrClient.close() instead.
  • The deprecated zkCredientialsProvider element in solrcloud section of solr.xml is now removed. Use the correct spelling (zkCredentialsProvider) instead.
  • Added support for executing Parallel SQL queries across SolrCloud collections. Includes StreamExpression support and a new JDBC Driver for the SQL Interface.
  • New features and capabilities added to the streaming API.
  • Added support for SELECT DISTINCT queries to the SQL interface.
  • New GraphQuery to enable graph traversal as a query operator.
  • New support for Cross Data Center Replication consisting of active/passive replication for separate SolrClouds hosted in separate data centers.
  • Filter support added to Real-time get.
  • Column alias support added to the Parallel SQL Interface.
  • New command added to switch between non/secure mode in zookeeper.
  • Now possible to use IP fragments in replica placement rules.
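
The Parallel SQL items above are easier to picture with a request in hand. The sketch below only builds (and does not send) an HTTP request of the shape I understand the Solr 6 SQL interface to accept: a POST to a collection's /sql handler with the statement in a stmt parameter and an optional aggregationMode. The host, the collection name gettingstarted and the field author are placeholders, and you should check the handler path and parameter names against the reference guide for your Solr version.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sql_request(base_url, collection, stmt, aggregation_mode="map_reduce"):
    """Build, without sending, a POST request for a Solr SQL statement."""
    # Assumed endpoint layout: <base_url>/<collection>/sql
    url = f"{base_url}/{collection}/sql?" + urlencode(
        {"aggregationMode": aggregation_mode}
    )
    # The SQL text travels form-encoded in the request body.
    body = urlencode({"stmt": stmt}).encode("utf-8")
    return Request(url, data=body, method="POST")


# Placeholder host, collection and field names.
req = build_sql_request(
    "http://localhost:8983/solr",
    "gettingstarted",
    "SELECT DISTINCT author FROM gettingstarted",
    aggregation_mode="facet",
)
```

Passing req to urllib.request.urlopen against a running SolrCloud instance would execute the statement. Building the request separately makes it easy to inspect before anything goes over the wire.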

For features new to Solr 6.0, be sure to consult the unreleased Solr reference manual. (unreleased as of 8 April 2016)

Happy searching!

Apache Lucene 5.3.1, Solr 5.3.1 Available

Thursday, September 24th, 2015

Apache Lucene 5.3.1, Solr 5.3.1 Available

From the post:

The Lucene PMC is pleased to announce the release of Apache Lucene 5.3.1 and Apache Solr 5.3.1

Lucene can be downloaded from http://www.apache.org/dyn/closer.lua/lucene/java/5.3.1
and Solr can be downloaded from http://www.apache.org/dyn/closer.lua/lucene/solr/5.3.1

Highlights of this Lucene release include:

Bug Fixes

  • Remove classloader hack in MorfologikFilter
  • UsageTrackingQueryCachingPolicy no longer caches trivial queries like MatchAllDocsQuery
  • Fixed BoostingQuery to rewrite wrapped queries

Highlights of this Solr release include:

Bug Fixes

  • security.json is not loaded on server start
  • RuleBasedAuthorization plugin does not work for the collection-admin-edit permission
  • VelocityResponseWriter template encoding issue. Templates must be UTF-8 encoded
  • SimplePostTool (also bin/post) -filetypes "*" now works properly in ‘web’ mode
  • example/files update-script.js to be Java 7 and 8 compatible.
  • SolrJ could not make requests to handlers with ‘/admin/’ prefix
  • Use of timeAllowed can cause incomplete filters to be cached and incorrect results to be returned on subsequent requests
  • VelocityResponseWriter’s $resource.get(key,baseName,locale) to use specified locale.
  • Resolve XSS issue in Admin UI stats page

Time to upgrade!

Enjoy!

Apache Lucene 5.2, Solr 5.2 Available

Tuesday, June 9th, 2015

From the news:

Lucene can be downloaded from http://www.apache.org/dyn/closer.cgi/lucene/java/5.2.0 and Solr can be downloaded from http://www.apache.org/dyn/closer.cgi/lucene/solr/5.2.0

See the Lucene CHANGES.txt and Solr CHANGES.txt files included with the release for a full list of details.

Enjoy!

PS: Also see the Reference Guide for Solr 5.2.

Side by Side with Elasticsearch and Solr: Performance and Scalability

Tuesday, June 2nd, 2015

Side by Side with Elasticsearch and Solr: Performance and Scalability by Mick Emmett.

From the post:

Back by popular demand! Sematext engineers Radu Gheorghe and Rafal Kuc returned to Berlin Buzzwords on Tuesday, June 2, with the second installment of their “Side by Side with Elasticsearch and Solr” talk. (You can check out Part 1 here.)

Elasticsearch and Solr Performance and Scalability

This brand new talk, which included a live demo, a video demo and slides, dove deeper into how Elasticsearch and Solr scale and perform. And, of course, they took into account all the goodies that came with these search platforms since last year. Radu and Rafal showed attendees how to tune Elasticsearch and Solr for two common use-cases: logging and product search. Then they showed what numbers they got after tuning. There was also some sharing of best practices for scaling out massive Elasticsearch and Solr clusters; for example, how to divide data into shards and indices/collections that account for growth, when to use routing, and how to make sure that coordinated nodes don’t become unresponsive.

Video is coming soon, and in the meantime please enjoy the slides:

After you see the presentation and slides (parts 1 and 2), you will understand the “popular demand” for these authors.

The best comparison of Elasticsearch and Solr that you will see this year. (Unless the presenters update their presentation before the end of the year.)

Relevant Search

Tuesday, June 2nd, 2015

Relevant Search – With examples using Elasticsearch and Solr by Doug Turnbull and John Berryman.

From the webpage:

Users expect search to be simple: They enter a few terms and expect perfectly-organized, relevant results instantly. But behind this simple user experience, complex machinery is at work. Whether using Solr, Elasticsearch, or another search technology, the solution is never one size fits all. Returning the right search results requires conveying domain knowledge and business rules in the search engine’s data structures, text analytics, and results ranking capabilities.

Relevant Search demystifies relevance work. Using Elasticsearch, it teaches you how to return engaging search results to your users, helping you understand and leverage the internals of Lucene-based search engines. Relevant Search walks through several real-world problems using a cohesive philosophy that combines text analysis, query building, and score shaping to express business ranking rules to the search engine. It outlines how to guide the engineering process by monitoring search user behavior and shifting the enterprise to a search-first culture focused on humans, not computers. You’ll see how the search engine provides a deeply pluggable platform for integrating search ranking with machine learning, ontologies, personalization, domain-specific expertise, and other enriching sources.

  • Creating a foundation for Lucene-based search (Solr, Elasticsearch) relevance internals
  • Bridging the field of Information Retrieval and real-world search problems
  • Building your toolbelt for relevance work
  • Solving search ranking problems by combining text analysis, query building, and score shaping
  • Providing users relevance feedback so that they can better interact with search
  • Integrating test-driven relevance techniques based on A/B testing and content expertise
  • Exploring advanced relevance solutions through custom plug-ins and machine learning

Now imagine relevancy searching where a topic map contains multiple subject identifications for a single subject, from different perspectives.

Relevant Search is in early release but the sooner you participate, the fewer errata there will be in the final version.

Apache Lucene 5.1.0, Solr 5.1.0 Available

Thursday, May 7th, 2015

From the news:

Lucene can be downloaded from http://www.apache.org/dyn/closer.cgi/lucene/java/5.1.0 and Solr can be downloaded from http://www.apache.org/dyn/closer.cgi/lucene/solr/5.1.0

Both releases contain a number of new features, bug fixes, and optimizations.

See the Lucene CHANGES.txt and Solr CHANGES.txt files included with the release for a full list of details.

See also Solr 5.1 Features by Yonik Seeley.

Of particular interest, Streaming Aggregation For SolrCloud (new in Solr 5.1) by Joel Bernstein.

Enjoy!

Apache Solr 5.0 Highlights

Monday, February 23rd, 2015

Apache Solr 5.0 Highlights by Anshum Gupta.

From the post:

Usability improvements

A lot of effort has gone into making Solr more usable, mostly along the lines of introducing APIs and hiding implementation details for users who don’t need to know. Solr 4.10 was released with scripts to start, stop and restart Solr instance, 5.0 takes it further in terms of what can be done with those. The scripts now for instance, copy a configset on collection creation so that the original isn’t changed. There’s also a script to index documents as well as the ability to delete collections in Solr. As an example, this is all you need to do to start SolrCloud, index lucidworks.com, browse through what’s been indexed, and clean up the collection.

bin/solr start -e cloud -noprompt
bin/post -c gettingstarted http://lucidworks.com
open http://localhost:8983/solr/gettingstarted/browse
bin/solr delete -c gettingstarted

Another important thing to note for new users is that Solr no longer has the default collection1 and instead comes with multiple example config-sets and data.

That is just a tiny part of Gupta’s post, which itself covers only the highlights of Apache Solr 5.0.

Download Apache Solr 5.0 to experience the improvements for yourself!

Download the improved Solr 5.0 Reference Manual as well!

Apache Solr 5.0.0 and Reference Guide for 5.0 available

Sunday, February 22nd, 2015

Apache Solr 5.0.0 and Reference Guide for 5.0 available

For the impatient:

http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Changes.txt

https://s.apache.org/Solr-Ref-Guide-PDF

From the post:

Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world’s largest internet sites.

Solr 5.0 is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

See the CHANGES.txt file included with the release for a full list of details.

Solr 5.0 Release Highlights:

  • Usability improvements that include improved bin scripts and new and restructured examples.
  • Scripts to support installing and running Solr as a service on Linux.
  • Distributed IDF is now supported and can be enabled via the config. Currently, there are four supported implementations for the same:

    • LocalStatsCache: Local document stats.
    • ExactStatsCache: One time use aggregation
    • ExactSharedStatsCache: Stats shared across requests
    • LRUStatsCache: Stats shared in an LRU cache across requests
  • Solr will no longer ship a war file and instead be a downloadable application.
  • SolrJ now has first class support for Collections API.
  • Implicit registration of replication,get and admin handlers.
  • Config API that supports paramsets for easily configuring solr parameters and configuring fields. This API also supports managing of pre-existing request handlers and editing common solrconfig.xml via overlay.
  • API for managing blobs allows uploading request handler jars and registering them via config API.
  • BALANCESHARDUNIQUE Collection API that allows for even distribution of custom replica properties.
  • There’s now an option to not shuffle the nodeSet provided during collection creation.
  • Option to configure bandwidth usage by Replication handler to prevent it from using up all the bandwidth.
  • Splitting of clusterstate to per-collection enables scalability improvement in SolrCloud. This is also the default format for new Collections that would be created going forward.
  • timeAllowed is now used to prematurely terminate requests during query expansion and SolrClient request retry.
  • pivot.facet results can now include nested stats.field results constrained by those pivots.
  • stats.field can be used to generate stats over the results of arbitrary numeric functions.
    It also allows for requesting for statistics for pivot facets using tags.
  • A new DateRangeField has been added for indexing date ranges, especially multi-valued ones.
  • Spatial fields that used to require units=degrees now take distanceUnits=degrees/kilometers/miles instead.
  • MoreLikeThis query parser allows requesting for documents similar to an existing document and also works in SolrCloud mode.
  • Logging improvements:

    • Transaction log replay status is now logged
    • Optional logging of slow requests.

Solr 5.0 also includes many other new features as well as numerous optimizations and bugfixes of the corresponding Apache Lucene release. Also available is the Solr Reference Guide for Solr 5.0. This 535 page PDF serves as the definitive user’s manual for Solr 5.0. It can be downloaded from the Apache mirror network: https://s.apache.org/Solr-Ref-Guide-PDF

This is the beginning of a great week!

Enjoy!

Solr 5.0 RC3!

Thursday, February 19th, 2015

Yes, Solr 5.0 RC3 has dropped!

http://lucene.apache.org/solr/mirrors-solr-latest-redir.html which will toss you out at the 4.10.3 release.

Let it take you to the suggested site and then move up and then down into the 5.0 directory.

Download.

Enjoy!

PS: For Lucene, follow the same directions but going to the Lucene download page.

Solr 5.0 Will See Another RC – But Docs Are Available

Friday, February 13th, 2015

I saw a tweet from Anshum Gupta today saying:

Though the vote passed, seems like there’s need for another RC for #Apache #Lucene / #Solr 5.0. Hopefully we’d be third time lucky.

To brighten your weekend prospects, the Apache Solr Reference Guide for Solr 5.0 is available.

With another Solr RC on the horizon, now would be a good time to spend some time with the reference guide, both to learn the new features and to smooth out any infelicities in the documentation.

Draft Lucene 5.0 Release Highlights

Friday, January 23rd, 2015

Draft Lucene 5.0 Release Highlights

Just a draft of Lucene 5.0 release notes but it is a signal that the release is getting closer!

Or as the guy said in Star Wars, “…almost there!” Hopefully with happier results.

Update: My bad, I forgot to include the Solr 5.0 draft release notes as well!

http://wiki.apache.org/solr/ReleaseNote50

Solr 2014: A Year in Review

Friday, January 9th, 2015

Solr 2014: A Year in Review by Anshum Gupta.

If you aren’t already excited about Solr 5, targeted for later this month, perhaps these section headings from Anshum’s post will capture your interest:

Usability – Ease of use and management

SolrCloud and Collection APIs

Scalability and optimizations

CursorMark: Distributed deep paging

TTL: Auto-expiration for documents

Distributed Pivot Faceting

Query Parsers

Distributed IDF

Solr Scale Toolkit

Testing

No more war

Solr 5

Community

That is a lot of improvement for a single year! See Anshum’s post and you will be excited about Solr 5 too!
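
Of the headings above, CursorMark is the easiest to show in a few lines. The client loop is simple: start with cursorMark=*, sort on something that ends in the unique key, and keep sending back the nextCursorMark from each response until it stops changing. The sketch below runs that loop against a small in-memory stand-in, not a real Solr server. The stand-in function, its last-id cursor format and the sample ids are invented here, and real Solr cursors are opaque strings.

```python
def iterate_with_cursor(fetch_page, query, sort="id asc", rows=2):
    """Yield every matching document using the cursorMark paging loop."""
    cursor = "*"  # "*" asks for the first page
    while True:
        response = fetch_page(
            {"q": query, "sort": sort, "rows": rows, "cursorMark": cursor}
        )
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:  # an unchanged cursor means no more results
            break
        cursor = next_cursor


def make_fake_solr(ids):
    """In-memory stand-in for a Solr select handler (hypothetical)."""
    ordered = sorted(ids)

    def fetch_page(params):
        cursor = params["cursorMark"]
        # This fake cursor is simply the last id already returned.
        remaining = ordered if cursor == "*" else [i for i in ordered if i > cursor]
        page = remaining[: params["rows"]]
        next_cursor = page[-1] if page else cursor
        return {
            "response": {"docs": [{"id": i} for i in page]},
            "nextCursorMark": next_cursor,
        }

    return fetch_page


fetch = make_fake_solr(["c", "a", "e", "b", "d"])
all_ids = [doc["id"] for doc in iterate_with_cursor(fetch, "*:*")]
```

Against a real server, fetch_page would issue the HTTP request and parse the JSON response, while the loop itself would stay the same.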

Solr 5 Preview (Podcast) [Update on Solr 5 Release Target Date]

Tuesday, January 6th, 2015

Solr 5 Preview with Anshum Gupta and Tim Potter

Description:

Solr committers Anshum Gupta and Tim Potter tell us about the upcoming Solr 5 release. We discuss making Solr “easy to start, easy to finish” while continuing to add improvements and stability for experienced users. Hear more about SolrCloud hardening, clusterstate improvements, the schema and solrconfig APIs, easier ZooKeeper management, improved flexible and schemaless indexing, and overall ease-of-use improvements.

Some notes:

Focus in Solr 5 development has been on ease of use. Directory layout of Solr install has been changed. 5.0 gets rid of the war file. Stand alone application. Don’t have to add parts to it. Don’t need Tomcat. Distributed IDF management. (Documents used to score differently based on shard where they reside. Not so in 5.0 (SOLR-1632)) API access to config files. Not schema-less so much but smarter about doing reasonable things by default.
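
The distributed IDF note is worth a worked example. The sketch below uses the classic TF-IDF inverse document frequency, 1 + ln(numDocs / (docFreq + 1)), which is what Lucene's default similarity used at the time as best I recall. It shows how the same term gets a different weight on each shard when only local statistics are consulted. The shard names and counts are made up.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Classic TF-IDF inverse document frequency (illustrative formula)."""
    return 1 + math.log(num_docs / (doc_freq + 1))


# Hypothetical two-shard collection: the term is rare on shard1 and
# common on shard2, although both shards hold the same number of documents.
shards = {
    "shard1": {"num_docs": 1000, "doc_freq": 5},
    "shard2": {"num_docs": 1000, "doc_freq": 400},
}

# Local statistics: each shard weighs the term by its own counts.
local_idf = {
    name: classic_idf(stats["doc_freq"], stats["num_docs"])
    for name, stats in shards.items()
}

# Global statistics: counts are summed across shards before weighing.
global_idf = classic_idf(
    sum(stats["doc_freq"] for stats in shards.values()),
    sum(stats["num_docs"] for stats in shards.values()),
)
```

With local statistics, a document on shard1 gets a much larger boost for the term than an otherwise identical document on shard2. With the summed statistics both documents are weighed the same, which is the behavior the notes describe for 5.0.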

The one missing question?

What is the anticipated release date for Solr 5?

I did look at the roadmap for 5.0, “No release date.” As of today, 228 of 313 issues have been resolved.

Here’s an open issue that may interest some of you: Create a shippable tutorial integrated with running Solr instance. That’s SOLR-6808 for those following in your hymn books.

Enjoy!


Update: Solr 5 is targeted for late January 2015! Hot damn!

The Big Book of PostgreSQL

Sunday, November 30th, 2014

The Big Book of PostgreSQL by Thom Brown.

From the post:

Documentation is crucial to the success of any software program, particularly open source software (OSS), where new features and functionality are added by many contributors. Like any OSS, PostgreSQL needs to produce accurate, consistent and reliable documentation to guide contributors’ work and reflect the functionality of every new contribution. Documentation is also an important source of information for developers, administrators and other end users as they will take actions or base their work on the functionality described in the documentation. Typically, the author of a new feature provides the relevant documentation changes to the project, and that person can be anyone in any role in IT. So it can really come from anywhere.

Postgres documentation is extensive (you can check out the latest 9.4 documentation here). In fact, the U.S. community PDF document is 2,700 pages long. It would be a mighty volume and pretty unwieldy if published as a physical book. The Postgres community is keenly aware that the quality of documentation can make or break an open source project, and thus regularly updates and improves our documentation, a process I’ve appreciated being able to take part in.

A recent podcast, Solr Usability with Steve Rowe & Tim Potter, goes to some lengths to describe efforts to improve Solr documentation.

If you know anyone in the Solr community, consider this a shout out that PostgreSQL documentation isn’t a bad example to follow.

Solr vs. Elasticsearch – Case by Case

Saturday, November 22nd, 2014

Solr vs. Elasticsearch – Case by Case by Alexandre Rafalovitch.

From the description:

A presentation given at the Lucene/Solr Revolution 2014 conference to show Solr and Elasticsearch features side by side. The presentation time was only 30 minutes, so only the core usability features were compared. The full video is coming later.

Just the highlights and those from an admitted ElasticSearch user.

One very telling piece of advice for Solr:

Solr – needs to buckle down and focus on the onboarding experience

Solr is getting better (e.g. listen to SolrCluster podcast of October 24, 2014)

Just in case you don’t know the term: onboarding.

And SolrCluster podcast of October 24, 2014: Solr Usability with Steve Rowe & Tim Potter

From the description:

In this episode, Lucene/Solr Committers Steve Rowe and Tim Potter join the SolrCluster team to discuss how Lucidworks and the community are making changes and improvements to Solr to increase usability and add ease to the getting started experience. Steve and Tim discuss new features such as data-driven schema, start-up scripts, launching SolrCloud, and more. (length 33:29)

Paraphrasing:

…focusing on the first five minutes of the Solr experience…hard to explore if you can’t get it started…can be a little bit scary at first…has lacked a focus on accessibility by ordinary users…need usability addressed throughout the lifecycle of the product…want to improve kicking the tires on Solr…lowering mental barriers for new users…do now have start scripts…bakes in a lot of best practices…scripts for SolrCloud…hide all the weird stuff…data driven schemas…throw data at Solr and it creates an index without creating a schema…working on improving tutorials and documentation…moving towards consolidating information…will include use cases…walk throughs…will point to different data sets…making it easier to query Solr and understand the query URLs…bringing full collections API support to the admin UI…Rest interface…components report possible configuration…plus a form to interact with it directly…forms that render in the browser…will have a continued focus on usability…not a one time push…new users need to submit any problems they encounter….

Great podcast!

Very encouraging on issues of documentation and accessibility in Solr.

Solr/Lucene 5.0 (December, 2014)

Wednesday, November 12th, 2014

Just so you know, email traffic suggests a release candidate for Solr/Lucene 5.0 will appear in December, 2014.

If you are curious, see the unreleased Solr Reference Guide (for Solr 5.0).

If you are even more curious, see the issues targeted for Solr 5.0.

OK, I have to admit that not everyone uses Solr so see also the issues targeted for Lucene 5.0.

Nothing like a pre-holiday software drop to provide leisure activities for the holidays!

Solr’s New Website [with comments]

Wednesday, November 12th, 2014

Solr

Solr has a snazzy new website!

A couple of comments though:

Features

The Features page starts with impressive SVG icons, but they aren’t hyperlinks to more information on the page or elsewhere. Seems like a wasted opportunity to navigate to deeper information about each feature.

Further down on Features there are large bands that headline “detailed features,” which don’t correspond to the features named in the SVG icons, although in addition to brief text, they offer hyperlinks to the Solr Ref Guide.

Inserting Solr Ref Guide links for the more detailed SVG icons would accord with my expectations for such a page. You?

Would you still need the no particular order “detailed features?”

Resources

The Resources page cites very high quality materials but it seems a bit sparse considering the wide usage of Solr.

Moreover, I’m not sure the search links to Slideshare, Lucene/Solr Revolution, YouTube and Vimeo are as useful as possible.

The Search *** for Solr links with comments:

Search Slideshare for Solr:

Varying results. The URL http://www.slideshare.net/search/slideshow?&q=solr returns 5769 “hits” consistently. However, if you substitute the entity reference for &, that is &amp;, in the string between the “?” and “q”, the results are 4243 “hits” consistently.

I discovered the difference because I used the resolved entity reference in the URL for this post and checking the link gave a different answer than the URL at the Solr page.
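
One plausible explanation, and it is only a guess since I have no insight into how Slideshare parses its URLs: a literal &amp; in a URL is not an ampersand followed by q, so a server that splits the query string on & sees a parameter named amp;q instead of q. The small Python sketch below mimics such a strict splitter. The function is made up for illustration and is not Slideshare's code.

```python
def split_query(query):
    """Split a raw query string on '&' only, as a strict server might."""
    params = {}
    for piece in query.split("&"):
        if not piece:  # skip the empty piece before a leading '&'
            continue
        key, _, value = piece.partition("=")
        params[key] = value
    return params


plain = split_query("&q=solr")        # query string as on the Solr page
escaped = split_query("&amp;q=solr")  # query string with the entity reference
```

In the second case there is no q parameter at all, so whatever the search falls back to could easily produce a different hit count.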

The general search results are in no particular date order. Add a date to your “Solr” search string to narrow the results down. Adding 2012 to the search string gives one thousand one hundred and seventy-five (1,175) “hits.” Not that I would want to search that many presentations for one relevant to a particular issue. Curated indexing would make a vast difference in the usefulness of Slideshare.

Lucene/Solr Revolution Videos from Past Events

Prime content for Lucene/Solr and yearly organization helps you guess at which Lucene/Solr version is likely to be covered. Eight conferences and there is no index across the years by concept, issue, etc. Happy hunting!

Search YouTube for Solr

Ten thousand and six hundred (10,600) “hits” where the first “hit” is four years old. Yeah. Searching YouTube is like flushing a toilet and hoping something interesting comes into view.

At a minimum, use the YouTube Solr search sorted by date link, which gives you the videos sorted by upload date. YouTube does have within N time but it only goes to one year and no ranges.

Search Vimeo for Solr

Two hundred and seventy-two (272) “hits” with the top ones being two (2) and five (5) years ago. Certainly not in date order.

At a minimum, use: Vimeo Solr search sorted by date link instead.

Be aware that slides and videos resources tend to overlap so you are likely to have to dedupe your results with every use.

A deduped and curated index of Lucene/Solr resources would be a real boon to developers/users.


Update: November 16, 2014. Apparently other people shared my concerns over the homepage and it is now substantially better than I reported above. Alas, the search links I mention remain as reported.

Apache Lucene and Solr 4.10

Sunday, September 21st, 2014

Apache Lucene and Solr 4.10

From the post:

Today Apache Lucene and Solr PMC announced another version of Apache Lucene library and Apache Solr search server numbered 4.10. This is a next release continuing the 4th version of both Apache Lucene and Apache Solr.

Here are some of the changes that were made comparing to the 4.9:

Lucene

  • Simplified Version handling for analyzers
  • TermAutomatonQuery was added
  • Optimizations and bug fixes

Solr

  • Ability to automatically add replicas in SolrCloud mode in HDFS
  • Ability to export full results set
  • Distributed support for facet.pivot
  • Optimizations and bugfixes from Lucene 4.9

Full changes list for Lucene can be found at http://wiki.apache.org/lucene-java/ReleaseNote410. Full list of changes in Solr 4.10 can be found at: http://wiki.apache.org/solr/ReleaseNote410.

Apache Lucene 4.10 library can be downloaded from the following address: http://www.apache.org/dyn/closer.cgi/lucene/java/. Apache Solr 4.10 can be downloaded at the following URL address: http://www.apache.org/dyn/closer.cgi/lucene/solr/. Please remember that the mirrors are just starting to update so not all of them will contain the 4.10 version of Lucene and Solr.

A belated note about Apache Lucene and Solr 4.10.

I must have been distracted by the continued fumbling with the Ebola crisis. I no longer wonder how the international community would respond to an actual worldwide threat. In a word, ineffectively.

Introducing Splainer…

Monday, August 25th, 2014

Introducing Splainer — The Open Source Search Sandbox That Tells You Why by Doug Turnbull.

From the post:

Splainer is a step towards addressing two problems:

  • Collaboration: At OpenSource Connections, we believe that collaboration with non-techies is the secret ingredient of search relevancy. We need to arm business analysts and content experts with a human readable version of the explain information so they can inform the search tuning process.
  • Usability: I want to paste a Solr URL, full of query parameters and all, and go! Then, once I see more helpful explain information, I want to tweak (and tweak and tweak) until I get the search results I want. Much like some of my favorite regex tools. Get out of the way and let me tune!
  • ….

    We hope you’ll give it a spin and let us know how it can be improved. We welcome your bugs, feedback, and pull requests. And if you want to try the Splainer experience over multiple queries, with diffing, results grading, a development history, and more — give Quepid a spin for free!
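The explain information the post mentions comes from Solr itself when a request carries `debugQuery=true`. As a rough sketch (not Splainer's code), the helper below takes a pasted Solr URL and rewrites it to ask for debug output as JSON. The sample URL, collection, and field names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def with_explain(solr_url):
    """Return the same Solr query URL with debugQuery=true and wt=json set."""
    parts = urlsplit(solr_url)
    # Drop any existing values for the two parameters we are about to set.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ("debugQuery", "wt")
    ]
    params += [("debugQuery", "true"), ("wt", "json")]
    return urlunsplit(parts._replace(query=urlencode(params)))

# Hypothetical pasted URL.
pasted = "http://localhost:8983/solr/collection1/select?q=title:solr&fl=id,title"
explained = with_explain(pasted)
print(explained)
```

Fetching the rewritten URL should return a `debug` section alongside the results, with the per-document scoring explanations that tools like Splainer make readable.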

Improving the information content of the tokens you are searching is another way to improve search results.

Solr-Wikipedia

Tuesday, August 19th, 2014

Solr-Wikipedia

From the webpage:

A collection of utilities for parsing WikiMedia XML dumps with the intent of indexing the content in Solr.

I haven’t tried this, yet, but utilities for major data sources are always welcome!

Side by side with Elasticsearch and Solr

Sunday, August 3rd, 2014

Side by side with Elasticsearch and Solr by Rafał Kuć and Radu Gheorghe.

Abstract:

We all know that Solr and Elasticsearch are different, but what those differences are and which solution is the best fit for a particular use case is a frequent question. We will try to make those differences clear, not by showing slides and compare them, but by showing online demo of both Elasticsearch and Solr:

  • Set up and start both search servers. See what you need to prepare and launch Solr and Elasticsearch
  • Index data right after the server was started using the “schemaless” mode
  • Create index structure and modify it using the provided API
  • Explore different query use cases
  • Scale by adding and removing nodes from the cluster, creating indices and managing shards. See how that affects data indexing and querying.
  • Monitor and administer clusters. See what metrics can be seen out of the box, how to get them and what tools can provide you with the graphical view of all the goodies that each search server can provide.

Slides

Very impressive split-screen comparison of Elasticsearch and Solr by two presenters on the same data set.

I first saw this at: Side-By-Side with Solr and Elasticsearch : A Comparison by Charles Ditzel.

Solr’s New AnalyticsQuery API

Tuesday, July 29th, 2014

Solr’s New AnalyticsQuery API by Joel Bernstein.

From the post:

In Solr 4.9 there is a new AnalyticsQuery API that allows developers to plug custom analytic logic into Solr. The AnalyticsQuery class provides a clean and simple API that gives developers access to all the rich functionality in Lucene and is strategically placed within Solr’s distributed search architecture. Using this API you can harness all the power of Lucene and Solr to write custom analytic logic.

Not all the detail you are going to want but a good start towards using the new AnalyticsQuery API in Solr 4.9.

The AnalyticsQuery API is an example of why I wonder about projects with custom search solutions (read: not Lucene-based).

If you have any doubts, default to a Lucene-based search solution.

Flax Clade PoC

Monday, July 14th, 2014

Flax Clade PoC by Tom Mortimer.

From the webpage:

Flax Clade PoC is a proof-of-concept open source taxonomy management and document classification system, based on Apache Solr. In its current state it should be considered pre-alpha. As open-source software you are welcome to try, use, copy and modify Clade as you like. We would love to hear any constructive suggestions you might have.

Tom Mortimer tom@flax.co.uk


Taxonomies and document classification

Clade taxonomies have a tree structure, with a single top-level category (e.g. in the example data, “Social Psychology”). There is no distinction between parent and child nodes (except that the former has children) and the hierarchical structure of the taxonomy is completely orthogonal from the node data. The structure may be freely edited.

Each node represents a category, which is represented by a set of “keywords” (words or phrases) which should be present in a document belonging to that category. Not all the keywords have to be present – they are joined with Boolean OR rather than AND. A document may belong to multiple categories, which are ranked according to standard Solr (TF-IDF) scoring. It is also possible to exclude certain keywords from categories.

Clade will also suggest keywords to add to a category, based on the content of the documents already in the category. This feature is currently slow as it uses the standard Solr MoreLikeThis component to analyse a large number of documents. We plan to improve this for a future release by writing a custom Solr plugin.

Documents are stored in a standard Solr index and are categorised dynamically as taxonomy nodes are selected. There is currently no way of writing the categorisation results to the documents in SOLR, but see below for how to export the document categorisation to an XML or CSV file.
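The keyword matching described in the quoted passage maps naturally onto a Lucene/Solr query string: keywords ORed together, with excluded keywords negated. The sketch below is my own illustration of that idea, not Clade's code. The field name `text`, the function names, and the sample keywords are all assumptions.

```python
def quote_term(term):
    """Wrap a word or phrase in double quotes, escaping backslashes and quotes."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

def category_query(keywords, excluded=(), field="text"):
    """Build a query that matches any keyword and rejects excluded ones."""
    if not keywords:
        raise ValueError("a category needs at least one keyword")
    included = " OR ".join(quote_term(k) for k in keywords)
    query = f"{field}:({included})"
    if excluded:
        blocked = " OR ".join(quote_term(k) for k in excluded)
        query += f" -{field}:({blocked})"
    return query

# Hypothetical category keywords.
q = category_query(["attitude change", "persuasion"], excluded=["advertising"])
print(q)  # text:("attitude change" OR "persuasion") -text:("advertising")
```

Because the terms are ORed, a document matching more of the keywords would normally score higher under standard Solr scoring, which fits the ranking the passage describes.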

A very interesting project!

I am particularly interested in the dynamic categorisation when nodes are selected.

Fess

Monday, June 30th, 2014

Fess: Open Source Enterprise Search Server

From the homepage:

Fess is very powerful and easily deployable Enterprise Search Server. You can install and run Fess quickly on any platforms, which have Java runtime environment. Fess is provided under Apache license.

[image omitted]

Fess is Solr based search server, but knowledge/experience about Solr is NOT needed because of All-in-One Enterprise Search Server. Fess provides Administration GUI to configure the system on your browser. Fess also contains a crawler, which can crawl documents on Web/File System/DB and support many file formats, such as MS Office, pdf and zip.

Features

  • Very Easy Installation/Configuration
  • Apache License (OSS)
  • OS-independent (Runs on Java)
  • Crawl documents on Web/File System/DB/Windows Shared Folder
  • Support many document types, such as MS Office, PDF, Zip archive,…
  • Support a web page for BASIC/DIGEST/NTLM authentication
  • Contain Apache Solr as a search engine
  • Provide UI as a responsive web design
  • Provide a browser based administrative page
  • Support a role authentication
  • Support XML/JSON/JSONP format
  • Provide a search/click log and statistics
  • Provide auto-complete(suggest)

Sounds interesting enough.

I don’t have a feel for the trade-offs between a traditional Solr/Tomcat install and what appears to be a Solr-out-of-the-box solution. At least not today.

I recently built a Solr/Tomcat install on a VM so this could be a good comparison to that process.

Any experience with Fess?

Apache Lucene/Solr 4.9.0 Released!

Saturday, June 28th, 2014

From the announcement:

The Lucene PMC is pleased to announce the availability of Apache Lucene 4.9.0 and Apache Solr 4.9.0.

Lucene can be downloaded from http://lucene.apache.org/core/mirrors-core-latest-redir.html and Solr can be downloaded from http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

See the Lucene CHANGES.txt and Solr CHANGES.txt files included with the release for a full list of details.

Combined Solr and Lucene Javadoc 4.8.0

Sunday, June 8th, 2014

Combined Solr and Lucene Javadoc 4.8.0

A resource built by Solr Start using …., you guessed it, Solr. 😉

From the Solr Start homepage:

Welcome to the collection of resources to make Apache Solr more comprehensible to beginner and intermediate users. While Solr is very easy to start with, tuning it is – like for any search engine – fairly complex. This website will try to make this simpler by compiling information and creating tools to accelerate learning Solr. The currently available resources are linked in the menubar above. More resources will be coming shortly.

If you would like to be notified of such new resources, get early access and receive exclusive discounts on commercial tools, join the mailing list below:

I’m curious.

This Javadocs resource will be very useful but obviously Javadocs are missing something or else there would be fewer presentations, papers, blogs, etc., on issues covered by the Javadocs.

Yes?

While I applaud the combined index of Lucene and Solr Javadocs, what would an index have to cover beyond the Javadocs to be really useful to you?

I first saw this in a tweet by SolrStart.

Crawling With Nutch

Tuesday, May 27th, 2014

Crawling With Nutch by Elizabeth Haubert.

From the post:

Recently, I had a client using LucidWorks search engine who needed to integrate with the Nutch crawler. This sounds simple as both products have been around for a while and are officially integrated. Even better, there are some great “getting started in x minutes” tutorials already out there for both Nutch, Solr and LucidWorks. But there were a few gotchas that kept those tutorials from working for me out of the box. This blog post documents my process of getting Nutch up and running on a Ubuntu server.
….

I know exactly what Elizabeth means. I have yet to find a Nutch/Solr tutorial that isn’t incomplete in some way.

What is really amusing is to try to set up Tomcat 7, Solr and Nutch.

I need to write up that experience sometime fairly soon. But no promises if you vary from the releases I document.