Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 24, 2013

Name Search in Solr

Filed under: Searching,Solr — Patrick Durusau @ 6:46 pm

Name Search in Solr by Doug Turnbull.

From the post:

Searching names is a pretty common requirement for many applications. Searching by book authors, for example, is a pretty crucial component to a book store. And as it turns out, names are actually a surprisingly hard thing to get perfect. Regardless, we can get something pretty good working in Solr, at least for the vast majority of Anglicized representations.

We can start with the assumption that, aside from all the diversity in human names, a name in our Authors field is likely going to be a small handful of tokens in a single field. We’ll avoid breaking these names up by first, last, and middle names (if these are even appropriate in all cultural contexts). Let’s start by looking at some sample names in our “Authors” field:

Doug has a photo of library shelves in his post with the caption:

Remember the good ole days of “Alpha by Author”?

True, but books listed their authors in various forms. Librarians were the ones who imposed a canonical representation on author names.

Doug goes through basic Solr techniques for matching author names when you don’t have the benefit of librarians.
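
A minimal sketch of the kind of loose name matching the post works toward, assuming a local Solr 4.x instance with a core and an “authors” field (both hypothetical):

  import requests

  # Hypothetical local core and field names; adjust to your setup.
  SOLR = "http://localhost:8983/solr/books/select"

  # A phrase query with slop tolerates token-order differences
  # ("Turnbull, Doug" vs. "Doug Turnbull"); a fuzzy term catches
  # minor misspellings of a single name token.
  params = {
      "q": 'authors:"doug turnbull"~2 OR authors:turnbul~1',
      "wt": "json",
  }
  print(requests.get(SOLR, params=params).json()["response"]["docs"])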

Agenda for Lucene/Solr Revolution EU! [Closes September 9, 2013]

Filed under: Conferences,Lucene,Solr — Patrick Durusau @ 6:34 pm

Help Us Set the Agenda for Lucene/Solr Revolution EU! by Laura Whalen.

From the post:

Thanks to all of you who submitted an abstract for the Lucene/Solr Revolution EU 2013 conference in Dublin. We had an overwhelming response to the Call for Papers, and narrowing the topics from the many great submissions was a difficult task for the Conference Committee. Now we need your help in making the final selections!

Vote now! Community voting will close September 9, 2013.

The Lucene/Solr Revolution free voting system allows you to vote on your favorite topics. The sessions that receive the highest number of votes will be automatically added to the Lucene/Solr Revolution EU 2013 agenda. The remaining sessions will be selected by a committee of industry experts who will take into account the community’s votes as well as their own expertise in the area. Click here to start voting for your favorites.

Your chance to influence the Lucene/Solr Revolution agenda for Dublin! (November 4-7)

PS: As of August 24, 2013, about 11:33 UTC, I was getting a server error from the voting link. Maybe overload of voters?

August 21, 2013

Techniques To Improve Your Solr Search Results

Filed under: Drupal,Solr — Patrick Durusau @ 4:28 pm

Techniques To Improve Your Solr Search Results by Chris Johnson.

From the post:

Solr is a tremendously popular option for providing search functionality in Drupal. While Solr provides pretty good default results, making search results great requires analysis of what your users search for, consideration of which data is sent to Solr, and tuning of Solr’s ‘boosting’. In this post, I will show you a few techniques that can help you leverage Solr to produce great results. I will specifically be covering the Apache Solr Search module. Similar concepts exist in the Search API Solr Search module, but with different methods of configuring boosting and altering data sent to Solr.

If you are interested in using Solr in a Drupal environment, here is a starting place for you.
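
For a taste of the query-time “boosting” Chris covers, here is a minimal sketch against a plain Solr instance (the core, fields, and boost values are hypothetical; in Drupal, the module builds the equivalent request from its configuration):

  import requests

  # Hypothetical core and fields; Drupal's Apache Solr Search module
  # assembles a request like this from its boost settings.
  params = {
      "q": "solr search tips",
      "defType": "edismax",
      "qf": "title^5.0 body^1.0 tags^2.0",  # weight title matches over body
      "wt": "json",
  }
  r = requests.get("http://localhost:8983/solr/drupal/select", params=params)
  print(r.json()["response"]["numFound"])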

Wrap-up of the Solr Usability Contest

Filed under: Solr,Usability — Patrick Durusau @ 4:20 pm

Wrap-up of the Solr Usability Contest by Alexandre Rafalovitch.

From the post:

The Solr Usability Contest has finished. It ran for four weeks, received 29 suggestions, 113 votes, and more than 300 visits. People from several different Solr communities participated.

See Alexandre’s post for the 29 suggestions.

Six (6) of them, including the #1 suggestion, concern documentation.

August 20, 2013

Solr and available query parsers

Filed under: Parsers,Solr — Patrick Durusau @ 4:20 pm

Solr and available query parsers

From the post:

Every now and then a question appears on the mailing list – what type of query parsers are available in Solr? So we decided to make such a list with a short description of each of the query parsers available. If you are interested in seeing what Solr has to offer, please read the rest of this post.

I count eighteen (18) query parsers available for Solr.

If you can’t name each one and give a brief description of its use, you need to read this post.
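
If you want to experiment as you read, switching parsers is a one-parameter change. A minimal sketch (core name hypothetical):

  import requests

  SOLR = "http://localhost:8983/solr/collection1/select"

  # The same instance queried through different parsers: the default
  # Lucene parser, edismax, and a local-params override picked inline.
  queries = [
      {"q": "title:solr"},                                   # lucene (default)
      {"q": "query parsers", "defType": "edismax", "qf": "title body"},
      {"q": "{!prefix f=title}sol"},                         # local params pick the parser
  ]
  for extra in queries:
      params = {"wt": "json", **extra}
      r = requests.get(SOLR, params=params)
      print(extra["q"], "->", r.json()["response"]["numFound"])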

Solr Tutorial [No Ads]

Filed under: Search Engines,Searching,Solr — Patrick Durusau @ 2:38 pm

Solr Tutorial from the Apache Software Foundation.

A great tutorial on Solr that is different from most of the Solr tutorials you will ever see.

There are no ads, popup or otherwise. 😉

Should be the first tutorial that you recommend for anyone new to Solr!

PS: You do not have to give your email address, phone number, etc. to view the tutorial.

August 17, 2013

Creating a Solr <search> HTML element…

Filed under: Interface Research/Design,Javascript,Solr — Patrick Durusau @ 4:07 pm

Creating a Solr <search> HTML element with AngularJS! by John Berryman.

From the post:

Of late we’ve been playing around with EmberJS for putting together slick client-side apps. But one thing that bothers me is how heavy-weight it feels. Another thing that concerns me is that AngularJS is really getting a lot of good attention and I want to make sure I’m not missing the boat! Here, look, just check out the emberjs/angularjs Google Trends plot:

It’s great to have a rocking search, topic map, or other retrieval application.

However, to make any sales, it needs to also deliver content to users.

I know, pain in the ass, but people who pay for things want a result on the screen, intangible though it may be. 😉

August 13, 2013

Of collapsing in Solr

Filed under: Search Engines,Searching,Solr,Topic Maps — Patrick Durusau @ 4:35 pm

Of collapsing in Solr by Paul Masurel.

From the post:

This post is about the inner workings of one of the two most popular open source search engines: Solr. I noticed that many questions (one or two every day) on the solr-user mailing list were about Solr’s collapsing functionality.

I thought it would be a good idea to explain how Solr’s collapsing works, because its documentation is very sparse, and because a search engine is the kind of car you need to take a peek under the hood of to make sure you’ll drive it right.

The Solr documentation at Apache refers to field collapsing and result grouping as being “different ways to think about the same Solr feature.”

I read the post along with the Solr documentation.

BTW, note from “Known Limitations” in the Solr documentation:

Support for grouping on a multi-valued field has not yet been implemented.

That would be really nice to have, since subjectIdentifier and subjectLocator have the potential to be sets of values.
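
A minimal sketch of the grouping the post dissects, assuming a single-valued “author” field (the multi-valued case is exactly the known limitation quoted above):

  import requests

  # Collapse results on a single-valued field; per the known limitation,
  # grouping on a multi-valued field is not yet implemented.
  params = {
      "q": "*:*",
      "group": "true",
      "group.field": "author",   # hypothetical single-valued field
      "group.limit": 1,          # keep one top document per group
      "wt": "json",
  }
  r = requests.get("http://localhost:8983/solr/collection1/select", params=params)
  print(r.json()["grouped"]["author"]["matches"])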

Solr as an Analytics Platform

Filed under: Analytics,Search Engines,Solr — Patrick Durusau @ 4:19 pm

Solr as an Analytics Platform by Chris Becker.

From the post:

Here at Shutterstock we love digging into data. We collect large amounts of it, and want a simple, fast way to access it. One of the tools we use to do this is Apache Solr.

Most users of Solr will know it for its power as a full-text search engine. Its text analyzers, sorting, filtering, and faceting components provide an ample toolset for many search applications. A single instance can scale to hundreds of millions of documents (depending on your hardware), and it can scale even further through sharding. Modern web search applications also need to be fast, and Solr can deliver in this area as well.

The needs of a data analytics platform aren’t much different. It too requires a platform that can scale to support large volumes of data. It requires speed, and depends heavily on a system that can scale horizontally through sharding as well. And some of the main operations of data analytics – counting, slicing, and grouping — can be implemented using Solr’s filtering and faceting options.
(…)

A good introduction to obtaining useful results with Solr with a minimum of effort.

Certainly a good way to show ROI when you are convincing your manager to sponsor you for a Solr conference and/or training.
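
The counting/slicing/grouping pattern Chris describes boils down to filters plus facets with rows=0, so you only pay for counts, never document retrieval. A minimal sketch (core and field names hypothetical):

  import requests

  params = {
      "q": "*:*",
      "fq": "date:[2013-01-01T00:00:00Z TO 2013-06-30T23:59:59Z]",  # the slice
      "rows": 0,                  # counts only, no documents fetched
      "facet": "true",
      "facet.field": "category",  # hypothetical field to count by
      "wt": "json",
  }
  r = requests.get("http://localhost:8983/solr/analytics/select", params=params)
  print(r.json()["facet_counts"]["facet_fields"]["category"])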

August 11, 2013

Embedding Concepts in text for smarter searching with Solr4

Filed under: Concept Detection,Indexing,Searching,Solr — Patrick Durusau @ 7:08 pm

Embedding Concepts in text for smarter searching with Solr4 by Sujit Pal.

From the post:

Storing the concept map for a document in a payload field works well for queries that can treat the document as a bag of concepts. However, if you want to consider the concept’s position(s) in the document, then you are out of luck. For queries that resolve to multiple concepts, it makes sense to rank documents with these concepts close together higher than those which had these concepts far apart, or even drop them from the results altogether.

We handle this requirement by analyzing each document against our medical taxonomy, and annotating recognized words and phrases with the appropriate concept ID before it is sent to the index. At index time, a custom token filter similar to the SynonymTokenFilter (described in the LIA2 Book) places the concept ID at the start position of the recognized word or phrase. Resolved multi-word phrases are retained as single tokens – for example, the phrase “breast cancer” becomes “breast0cancer”. This allows us to rewrite queries such as “breast cancer radiotherapy”~5 as “2790981 2791965”~5.

One obvious advantage is that synonymy is implicitly supported with the rewrite. Medical literature is rich with synonyms and acronyms – for example, “breast cancer” can be variously called “breast neoplasm”, “breast CA”, etc. Once we rewrite the query, 2790981 will match against a document annotation that is identical for each of these various synonyms.

Another advantage is the increase of precision since we are dealing with concepts rather than groups of words. For example, “radiotherapy for breast cancer patients” would not match our query since “breast cancer patient” is a different concept than “breast cancer” and we choose the longest subsequence to annotate.

Yet another advantage of this approach is that it can support mixed queries. Assume that a query can only be partially resolved to concepts. You can still issue the partially resolved query against the index, and it would pick up the records where the pattern of concept IDs and words appear.

Finally, since this is just a (slightly rewritten) Solr query, all the features of standard Lucene/Solr proximity searches are available to you.

In this post, I describe the search side components that I built to support this approach. It involves a custom TokenFilter and a custom Analyzer that wraps it, along with a few lines of configuration code. The code is in Scala and targets Solr 4.3.0.
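
A toy Python sketch of the query-rewrite idea (Sujit’s actual implementation is the Scala token filter described above; the taxonomy dict here is hypothetical, while the concept IDs come from his example):

  import re

  # Toy taxonomy mapping phrases to concept IDs (IDs from the post).
  TAXONOMY = {
      "breast cancer": "2790981",
      "radiotherapy": "2791965",
  }

  def rewrite(query):
      """Replace recognized phrases with concept IDs, longest phrase first."""
      for phrase in sorted(TAXONOMY, key=len, reverse=True):
          query = re.sub(re.escape(phrase), TAXONOMY[phrase], query)
      return query

  # '"breast cancer radiotherapy"~5' becomes '"2790981 2791965"~5'
  print(rewrite('"breast cancer radiotherapy"~5'))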

So if Solr4 can make documents smarter, can the same be said about topics?

Recalling that “document” for Solr is defined by your indexing, not some arbitrary byte count.

As we are indexing topics we could add information to topics to make merging more robust.

One possible topic map flow being:

Index -> addToTopics -> Query -> Results -> Merge for Display.

Yes?

August 2, 2013

Named Entity Recognition (NER) in Solr

Filed under: Entity Extraction,Entity Resolution,Named Entity Mining,Solr — Patrick Durusau @ 2:43 pm

Named Entity Recognition (NER) in Solr

From the post:

Named Entity Recognition, or NER for short, is a powerful paradigm which causes entities to be recognized within text. Typically these objects can be places, organizations or people. For example, given the phrase “Jon works at Searchbox”, a good NER would return that Jon is a person and Searchbox is an organization. Why is this powerful, especially in Solr? Using this information we can not only propose better suggestions for users searching for things, but using Solr faceting capability we’ll have the ability to facet directly on organizations (or people) without having to manually identify them in all of the documents.

In this blog post, extending from our two previous slideshares on how to develop search components and request handlers, we’ll teach you how to directly embed Stanford’s NER library into a production ready plugin which provides all of the mentioned benefits. We of course provide the full source code packaging here.

Very nice walk through on entity recognition with Solr.
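
The payoff, sketched minimally: once a recognizer has attached entities to documents as fields, faceting on them is ordinary Solr. The document below fakes the NER step (in the post it is Stanford’s NER library embedded in a plugin); core and field names are hypothetical:

  import requests

  # Fake the NER output for one document; the post's plugin would
  # produce these entity fields automatically at index time.
  doc = {
      "id": "1",
      "text": "Jon works at Searchbox",
      "person_ss": ["Jon"],              # entities a recognizer would emit
      "organization_ss": ["Searchbox"],
  }
  requests.post("http://localhost:8983/solr/collection1/update?commit=true",
                json=[doc])

  # Facet directly on organizations, as the post promises.
  params = {"q": "*:*", "rows": 0, "facet": "true",
            "facet.field": "organization_ss", "wt": "json"}
  r = requests.get("http://localhost:8983/solr/collection1/select", params=params)
  print(r.json()["facet_counts"]["facet_fields"]["organization_ss"])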

The thought occurs to me that every instance of an entity that is recognized could be presented to a user as occurrences of that entity, plugging that search result into a topic that represents the subject.

So there is some static aspect to the topic map, the topic for that subject and a dynamic aspect, being the search results presented as occurrences.

You could enter information or relationships you discover in the occurrences on the static side of the map. Let software manage metadata from the document containing the occurrence.

August 1, 2013

Open Source Search FTW!

Filed under: Lucene,Search Engines,Searching,Solr — Patrick Durusau @ 4:40 pm

Open Source Search FTW! by Grant Ingersoll.

Abstract:

Apache Lucene and Solr are the most widely deployed search technology on the planet, powering sites like Twitter, Wikipedia, Zappos and countless applications across a large array of domains. They are also free, open source, extensible and extremely scalable. Lucene and Solr also contain a large number of features for solving common information retrieval problems ranging from pluggable posting list compression and scoring algorithms to faceting and spell checking. Increasingly, Lucene and Solr also are being (ab)used to power applications going way beyond the search box. In this talk, we’ll explore the features and capabilities of Lucene and Solr 4.x, as well as look at how to (ab)use your search engine technology for fun and profit.

If you aren’t already studying search engines, perhaps these slides will convince you to do so.

When you think about it, search precedes all other computer processing.

July 26, 2013

Announcing Solr Usability contest

Filed under: Solr,Usability — Patrick Durusau @ 1:02 pm

Announcing Solr Usability contest by Alexandre Rafalovitch.

From the post:

In collaboration with Packt Publishing and to celebrate the release of my new book Instant Apache Solr for Indexing Data How-to, we are organizing a contest to collect Solr Usability ideas.

I have written about the reasons behind the book before and the contest builds on that idea. Basically, I feel that a lot of people are able to start with Solr and get a basic setup running, either directly or as part of other projects Solr is embedded in. But then, they get stuck at a local maximum of their understanding and have difficulty moving forward because they don’t fully comprehend how their configuration actually works or which of the parameters can be tuned to get results. And the difficulty is even greater when the initial Solr configuration is generated automatically behind the scenes by an external system, such as Nutch, Drupal or SiteCore.

The contest will run for 4 weeks (until mid-August 2013) and the people suggesting the five ideas with the most votes will get free electronic copies of my book. Of course, if you want to get the book now, feel free. I’ll make sure you get rewarded in some other way, such as through advance access to upcoming Solr tools like SolrLint.

The results of the contest will be analyzed and fed into Solr improvement by better documentation, focused articles or feature requests on issue trackers. The end goal is not to give away a couple of books. There are much easier ways to do that. The goal is to improve Solr with specific focus on learning curve and easy adoption and integration.

Only five (5) suggestions so far?

Solr must have better tuning documentation than I have found. 😉

Do you have a suggestion?

Lucene/Solr Revolution EU 2013 – Reminder

Filed under: Conferences,Lucene,Solr — Patrick Durusau @ 11:34 am

Lucene/Solr Revolution EU 2013 – Reminder

The deadline for submitting an abstract is August 2, 2013.

Key Dates:

June 3, 2013: CFP opens
August 2, 2013: CFP closes
August 12, 2013: Community voting begins
September 1, 2013: Community voting ends
September 22, 2013: All speakers notified of submission status

Top Five Reasons to Attend (according to conference organizers):

  • Learn:  Meet, socialize, collaborate, and network with fellow Lucene/Solr enthusiasts.
  • Innovate:  From field-collapsing to flexible indexing to integration with NoSQL technologies, you get the freshest thinking on solving the deepest, most interesting problems in open source search and big data.
  • Connect: The power of open source is demolishing traditional barriers and forging new opportunity for killer code and new search apps.
  • Enjoy:  We’ve scheduled fun into the conference! Networking breaks, Stump-the-Chump, Lightning talks and a big conference party!
  • Save:  Take advantage of packaged deals on accelerated two-day training workshops, coupled with conference sessions on real-world implementations presented by Solr/Lucene experts.

Let’s be honest. The real reason to attend is Dublin, Ireland in early November. (On average, 22 rainy days in November.) 😉

Take an umbrella, extra sweater or coat and enjoy!

July 24, 2013

Building a Real-time, Big Data Analytics Platform with Solr

Filed under: Analytics,BigData,Solr — Patrick Durusau @ 6:50 pm

Building a Real-time, Big Data Analytics Platform with Solr by Trey Grainger.

Description:

Having “big data” is great, but turning that data into actionable intelligence is where the real value lies. This talk will demonstrate how you can use Solr to build a highly scalable data analytics engine to enable customers to engage in lightning fast, real-time knowledge discovery.

At CareerBuilder, we utilize these techniques to report the supply and demand of the labor force, compensation trends, customer performance metrics, and many live internal platform analytics. You will walk away from this talk with an advanced understanding of faceting, including pivot-faceting, geo/radius faceting, time-series faceting, function faceting, and multi-select faceting. You’ll also get a sneak peek at some new faceting capabilities just wrapping up development, including distributed pivot facets and percentile/stats faceting, which will be open-sourced.

The presentation will be a technical tutorial, along with real-world use-cases and data visualizations. After this talk, you’ll never see Solr as just a text search engine again.

Trey proposes a paradigm shift away from document retrieval with Solr and towards returning aggregated information as the result of Solr searches.

Brief overview of faceting with examples of different visual presentations of returned facet data.
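
For instance, pivot faceting returns nested counts in a single request. A minimal sketch (core and field names hypothetical):

  import requests

  # Nested counts: job titles within each city, in one request.
  params = {
      "q": "*:*",
      "rows": 0,
      "facet": "true",
      "facet.pivot": "city,job_title",   # hypothetical fields
      "wt": "json",
  }
  r = requests.get("http://localhost:8983/solr/jobs/select", params=params)
  print(r.json()["facet_counts"]["facet_pivot"]["city,job_title"])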

Rocking demonstration of the power of facets to drive analytics! (Caveat: Yes, facets can do that, but good graphics is another skill entirely.)

Every customer has their own Solr index. (That’s a good idea.)

Implemented A/B testing using Solr. (And shows how he created it.)

This is a great presentation!

BTW, Trey is co-authoring: Solr in Action.

Improve search relevancy…

Filed under: Relevance,Searching,Solr — Patrick Durusau @ 4:05 pm

Improve search relevancy by telling Solr exactly what you want by Doug Turnbull.

From the post:

To be successful, (e)dismax relies on avoiding a tricky problem with its scoring strategy. As we’ve discussed, dismax scores documents by taking the maximum score of all the fields that match a query. This is problematic as one field’s scores can’t easily be related to another’s. A good “text” match might have a score of 2, while a bad “title” score might be 10. Dismax doesn’t have a notion that “10” is bad for title, it only knows 10 > 2, so title matches dominate the final search results.

The best case for dismax is that there’s only one field that matches a query, so the resulting scoring reflects the consistency within that field. In short, dismax thrives with needle-in-a-haystack problems and does poorly with hay-in-a-haystack problems.

We need a different strategy for documents that have fields with a large amount of overlap. We’re trying to tell the difference between very similar pieces of hay. The task is similar to needing to find a good candidate for a job. If we wanted to query a search index of job candidates for “Solr Java Developer”, we’ll clearly match many different sections of our candidates’ resumes. Because of problems with dismax, we may end up with search results heavily sorted on the “objective” field.

(…)

Not unlike my comments yesterday about the similarity of searching and playing the lottery. The more you invest in the search, the more likely you are to get good results.

Doug analyzes what criteria data should meet in order to be a “good” result.

For a topic map, I would analyze what data a subject needs in order to be found by a typical request.

Both address the same problem, search, but from very different perspectives.
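
A minimal sketch of the contrast, assuming a hypothetical candidates core: the first request lets edismax take the max score across fields; the second spells out which field each requirement must match, so one field’s inflated scores cannot dominate:

  import requests

  SOLR = "http://localhost:8983/solr/candidates/select"

  # (e)dismax: one query string, scores are the max across fields.
  dismax = {"q": "Solr Java Developer", "defType": "edismax",
            "qf": "objective skills experience", "wt": "json"}

  # Explicit: per-field clauses say exactly what we want, and where.
  explicit = {"q": "skills:solr AND skills:java AND objective:developer",
              "wt": "json"}

  for params in (dismax, explicit):
      r = requests.get(SOLR, params=params)
      print(r.json()["response"]["numFound"])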

Apache Lucene 4.4 and Apache Solr™ 4.4 available

Filed under: Lucene,Solr — Patrick Durusau @ 3:54 pm

Lucene: http://lucene.apache.org/core/mirrors-core-latest-redir.html

Lucene CHANGES.txt

Solr: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr CHANGES.txt

If you follow Lucene/Solr you have probably already heard the news.

There are nineteen (19) new features in Lucene and twenty (20) in Solr so don’t neglect the release notes.

Spend some time with both releases. I don’t think you will be disappointed.

July 14, 2013

Solr vs ElasticSearch

Filed under: ElasticSearch,Lucene,Solr — Patrick Durusau @ 7:14 pm

Solr vs ElasticSearch by Ryan Tabora.

Ryan evaluates Solr and ElasticSearch (both based on Lucene) in these categories:

  1. Foundations
  2. Coordination
  3. Shard Splitting
  4. Automatic Shard Rebalancing
  5. Schema
  6. Schema Creation
  7. Nested Typing
  8. Queries
  9. Distributed Group By
  10. Percolation Queries
  11. Community
  12. Vendor Support

As Ryan points out, making a choice between Solr and ElasticSearch requires detailed knowledge of your requirements.

If you are a developer, I would suggest following Lucene, as well as Solr and ElasticSearch.

No one tool is going to be the right tool for every job.

July 8, 2013

Advanced autocomplete with Solr Ngrams

Filed under: AutoComplete,AutoSuggestion,Solr — Patrick Durusau @ 6:54 pm

Advanced autocomplete with Solr Ngrams by Peter Tyrrell.

From the post:

The following approach is a good one if you require:

  • phrase suggestions, not just words
  • the ability to match user input against multiple fields
  • multiple fields returned
  • multiple field values to make up a unique suggestion
  • suggestion results collapsed (grouped) on a field or fields
  • the ability to filter the query
  • images with suggestions

I needed a typeahead suggestion (autocomplete) solution for a textbox that searches titles. In my case, I have a lot of magazines that are broken down so that each page is a document in the Solr index, and has metadata that describes its parentage. For example, page 1 of Dungeon Magazine 100 has a title: “Dungeon 100”; a collection: “Dungeon Magazine”; and a universe: “Dungeons and Dragons”. (Yes, all the material in my index is related to RPG in some way.) A magazine like this might consist of 70 pages or so, whereas a sourcebook like the Core Rulebook for Pathfinder, a D&D variant, boasts 578, so title suggestions have to group on title and ignore counts. Further, the Warhammer 40k game Dark Heresy also has a Core Rulebook, so title suggestions have to differentiate between them.

(…)

Topic map interfaces with autosuggest/complete could ease users into searching and authoring topic maps.
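
A minimal sketch of the suggestion query this approach ends in: match user input against an edge-ngrammed copy of the title, and group on title so the many page-level documents collapse into one suggestion each (core and field names hypothetical):

  import requests

  # Typeahead: "dun" should suggest "Dungeon 100", "Dungeon Magazine", ...
  params = {
      "q": "title_ngram:dun",   # hypothetical edge-ngram copy of title
      "group": "true",
      "group.field": "title",   # collapse the ~70 page docs per magazine
      "group.limit": 1,
      "fl": "title,collection,universe",
      "wt": "json",
  }
  r = requests.get("http://localhost:8983/solr/rpg/select", params=params)
  for g in r.json()["grouped"]["title"]["groups"]:
      print(g["groupValue"])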

Better synonym handling in Solr

Filed under: Search Engines,Solr,Synonymy — Patrick Durusau @ 6:45 pm

Better synonym handling in Solr by Nolan Lawson.

From the post:

It’s a pretty common scenario when working with a Solr-powered search engine: you have a list of synonyms, and you want user queries to match documents with synonymous terms. Sounds easy, right? Why shouldn’t queries for “dog” also match documents containing “hound” and “pooch”? Or even “Rover” and “canis familiaris”?

(image omitted)

As it turns out, though, Solr doesn’t make synonym expansion as easy as you might like. And there are lots of good ways to shoot yourself in the foot.

Deep review of the handling of synonyms in Solr and a patch to improve its handling of synonyms.

The issue is now SOLR-4381 and is set for SOLR 4.4.

Interesting discussion continues under the SOLR issue.
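
For context, the scenario in the post maps onto a synonyms.txt entry like this (a sketch; the multi-word entry “canis familiaris” is exactly where the stock query-time handling falls apart, which is what Nolan’s patch addresses):

  dog, hound, pooch, Rover, canis familiaris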

July 7, 2013

Spatial Search With Apache Solr and Google Maps

Filed under: Google Maps,JQuery,Solr — Patrick Durusau @ 2:48 pm

Spatial Search With Apache Solr and Google Maps by Wern Ancheta.

From the post:

In this tutorial I’m going to show you how to set up spatial search in Apache Solr, then we’re going to create an application which uses spatial searching with Google Maps.

You will also learn about geocoding and JQuery as part of this tutorial.

For the purposes of this tutorial we’re going to use spatial search to find the locations which are near the place that we specify.

If you have cellphone contract or geolocation data, you can find who lives nearby. 😉

Assuming you have that kind of data.
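
A minimal sketch of the “find locations near a point” query behind the tutorial (core name, field name, and coordinates all hypothetical):

  import requests

  # Filter to documents within 5 km of a lat,lon point and sort by distance.
  params = {
      "q": "*:*",
      "fq": "{!geofilt pt=53.3478,-6.2597 sfield=coords d=5}",
      "sort": "geodist(coords,53.3478,-6.2597) asc",
      "wt": "json",
  }
  r = requests.get("http://localhost:8983/solr/places/select", params=params)
  print(r.json()["response"]["numFound"])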

July 4, 2013

Clustering Search with Carrot2

Filed under: Clustering,Solr — Patrick Durusau @ 2:43 pm

Clustering Search with Carrot2 by Ian Milligan.

From the post:

My work is taking me to larger and larger datasets, so finding relevant information has become a real challenge – I’ve dealt with this before, noting DevonTHINK as an alternative to something slow and cumbersome like OS X’s Spotlight. As datasets scale, keyword searching and n-gram counting have also shown some limitations.

One approach that I’ve been taking is to try to implement a clustering algorithm on my sources, as well as indexing them for easy retrieval. I wanted to give you a quick sense of my workflow in this post.

Brief but useful tutorial on using Solr and Carrot2.
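
If you want to try the Solr side of this, the stock example configs wire Carrot2 in behind a clustering handler. A minimal sketch of calling it (the handler path, core, and response shape follow the Solr 4 examples as I recall them, so treat the details as assumptions):

  import requests

  # Cluster the top results for a query with the Carrot2-backed handler.
  params = {
      "q": "solr",
      "rows": 50,             # cluster the top 50 results
      "clustering": "true",
      "wt": "json",
  }
  r = requests.get("http://localhost:8983/solr/collection1/clustering",
                   params=params)
  for cluster in r.json().get("clusters", []):
      print(cluster["labels"], len(cluster["docs"]))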

June 30, 2013

Solr Authors, A Suggestion

Filed under: Indexing,Lucene,Solr — Patrick Durusau @ 3:01 pm

I am working my way through a recent Solr publication. It reproduces some, but not all, of the output of queries.

But it remains true that the output of queries is a sizeable portion of the text.

Suggestion: Could the queries be embedded in PDF text as hyperlinks?

Thus: http://localhost:8983/solr/select?q=*:*&indent=yes.

If I have Solr running, etc., the full results show up in my browser, saving page space. Perhaps resulting in room for more analysis or examples.

There may be a very good reason to not follow my suggestion so it truly is a suggestion.

If there is a question of verifying the user’s results, perhaps a separate PDF of results keyed to the text?

That could be fuller results and at the same time allow the text to focus on substantive material.

June 29, 2013

Indexing data in Solr…

Filed under: Apache Camel,Indexing,Solr — Patrick Durusau @ 12:44 pm

Indexing data in Solr from disparate sources using Camel by Bilgin Ibryam.

From the post:

Apache Solr is ‘the popular, blazing fast open source enterprise search platform’ built on top of Lucene. In order to do a search (and find results) there is the initial requirement of data ingestion usually from disparate sources like content management systems, relational databases, legacy systems, you name it… Then there is also the challenge of keeping the index up to date by adding new data, updating existing records, removing obsolete data. The new sources of data could be the same as the initial ones, but could also be sources like twitter, AWS or rest endpoints.

Solr can understand different file formats and provides a fair amount of options for data indexing:

  1. Direct HTTP and remote streaming – allows you to interact with Solr over HTTP by posting a file for direct indexing or the path to the file for remote streaming.
  2. DataImportHandler – is a module that enables both full and incremental delta imports from relational databases or file system.
  3. SolrJ – a java client to access Solr using Apache Commons HTTP Client.

But in real life, indexing data from different sources with millions of documents, dozens of transformations, filtering, content enriching, replication, and parallel processing requires much more than that. One way to cope with such a challenge is by reinventing the wheel: write a few custom applications, combine them with some scripts or run cronjobs. Another approach would be to use a tool that is flexible and designed to be configurable and pluggable, that can help you to scale and distribute the load with ease. Such a tool is Apache Camel, which now also has a Solr connector.

(…)
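For scale, option 1 from the list above is nearly a one-liner; Camel earns its keep once the transformations, filtering, and routing the post lists come into play. A minimal direct-HTTP sketch (core name hypothetical):

  import requests

  # Direct HTTP indexing: post documents to the update handler and commit.
  docs = [{"id": "42", "title": "Indexing data in Solr from disparate sources"}]
  r = requests.post("http://localhost:8983/solr/collection1/update?commit=true",
                    json=docs)
  print(r.status_code)  # 200 on success
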

Avoid reinventing the wheel: check mark

Robust software: check mark

Name recognition of Lucene/Solr: check mark

Name recognition of Camel: check mark

Do you see any negatives?

BTW, the examples that round out Bilgin’s post are quite useful!

June 28, 2013

Poor man’s “entity” extraction with Solr

Filed under: Entity Extraction,Solr — Patrick Durusau @ 3:38 pm

Poor man’s “entity” extraction with Solr by Erik Hatcher.

From the post:

My work at LucidWorks primarily involves helping customers build their desired solutions. Recently, more than one customer has inquired about doing “entity extraction”. Entity extraction, as defined on Wikipedia, “seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.” When drilling down into the specifics of the requirements from our customers, it turns out that many of them have straightforward solutions using built-in (Solr 4.x) components, such as:

  • Acronyms as facets
  • Key words or phrases, from a fixed list, as facets
  • Lat/long mentions as geospatial points

This article will describe and demonstrate how to do these, and as a bonus we’ll also extract URLs found in text too. Let’s start with an example input and the corresponding output that all of the described techniques provide.

If you have been thinking about experimenting with Solr, Erik touches on some of its features by example.
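
A minimal sketch of the “poor man’s” idea: pull acronyms, lat/long pairs, and URLs out with regular expressions, then feed each list into its own facet (or geospatial) field at index time. The patterns are illustrative, not Erik’s:

  import re

  text = ("NASA released data at 37.7749,-122.4194; "
          "details at http://example.com/report")

  # Illustrative patterns, not Erik's actual Solr configuration.
  acronyms = re.findall(r"\b[A-Z]{2,}\b", text)
  latlongs = re.findall(r"-?\d{1,3}\.\d+,\s*-?\d{1,3}\.\d+", text)
  urls = re.findall(r"https?://\S+", text)

  # Each list would populate a separate field on the indexed document.
  print(acronyms, latlongs, urls)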

June 27, 2013

Apache Solr volume 1 -….

Filed under: Lucene,Search Engines,Solr — Patrick Durusau @ 1:13 pm

Apache Solr V[olume] 1 – Introduction, Features, Recency Ranking and Popularity Ranking by Ramzi Alqrainy.

I amended the title to expand “v” to “volume.” Just seeing the “v” made me think “version.” Not true in this case.

Nothing new or earthshaking but a nice overview of Solr.

It is a “read along” slide deck so the absence of a presenter won’t impair its usefulness.

June 26, 2013

Scaling Through Partitioning and Shard Splitting in Solr 4 (Webinar)

Filed under: Indexing,Search Engines,Solr — Patrick Durusau @ 3:28 pm

Scaling Through Partitioning and Shard Splitting in Solr 4 by Timothy Potter.

Date: Thursday, July 18, 2013
Time: 10:00am Pacific Time

From the post:

Over the past several months, Solr has reached a critical milestone of being able to elastically scale-out to handle indexes reaching into the hundreds of millions of documents. At Dachis Group, we’ve scaled our largest Solr 4 index to nearly 900M documents and growing. As our index grows, so does our need to manage this growth.

In practice, it’s common for indexes to continue to grow as organizations acquire new data. Over time, even the best designed Solr cluster will reach a point where individual shards are too large to maintain query performance. In this Webinar, you’ll learn about new features in Solr to help manage large-scale clusters. Specifically, we’ll cover data partitioning and shard splitting.

Partitioning helps you organize subsets of data based on data contained in your documents, such as a date or customer ID. We’ll see how to use custom hashing to route documents to specific shards during indexing. Shard splitting allows you to split a large shard into 2 smaller shards to increase parallelism during query execution.

Attendees will come away from this presentation with a real-world use case that proves Solr 4 is elastically scalable, stable, and is production ready.

Just in time for when you finish your current Solr reading! 😉

Definitely on the calendar!
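
The shard-splitting half of the webinar corresponds to a single Collections API call in Solr 4.3+; a minimal sketch (collection and shard names hypothetical):

  import requests

  # Split shard1 of "mycoll" into two smaller shards (SolrCloud, Solr 4.3+).
  params = {"action": "SPLITSHARD", "collection": "mycoll",
            "shard": "shard1", "wt": "json"}
  r = requests.get("http://localhost:8983/solr/admin/collections", params=params)
  print(r.json())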

Apache Bigtop: The “Fedora of Hadoop”…

Filed under: Bigtop,Crunch,DataFu,Flume,Giraph,HBase,HCatalog,Hive,Hue,Mahout,Oozie,Pig,Solr,Sqoop,Zookeeper — Patrick Durusau @ 10:45 am

Apache Bigtop: The “Fedora of Hadoop” is Now Built on Hadoop 2.x by Roman Shaposhnik.

From the post:

Just in time for Hadoop Summit 2013, the Apache Bigtop team is very pleased to announce the release of Bigtop 0.6.0: The very first release of a fully integrated Big Data management distribution built on the currently most advanced Hadoop 2.x, Hadoop 2.0.5-alpha.

Bigtop, as many of you might already know, is a project aimed at creating a 100% open source and community-driven Big Data management distribution based on Apache Hadoop. (You can learn more about it by reading one of our previous blog posts on Apache Blogs.) Bigtop also plays an important role in CDH, which utilizes its packaging code from Bigtop — Cloudera takes pride in developing open source packaging code and contributing the same back to the community.

The very astute readers of this blog will notice that given our quarterly release schedule, Bigtop 0.6.0 should have been called Bigtop 0.7.0. It is true that we skipped a quarter. Our excuse is that we spent all this extra time helping the Hadoop community stabilize the Hadoop 2.x code line and making it a robust kernel for all the applications that are now part of the Bigtop distribution.

And speaking of applications, we haven’t forgotten to grow the Bigtop family: Bigtop 0.6.0 adds Apache HCatalog and Apache Giraph to the mix. The full list of Hadoop applications available as part of the Bigtop 0.6.0 release is:

  • Apache Zookeeper 3.4.5
  • Apache Flume 1.3.1
  • Apache HBase 0.94.5
  • Apache Pig 0.11.1
  • Apache Hive 0.10.0
  • Apache Sqoop 2 (AKA 1.99.2)
  • Apache Oozie 3.3.2
  • Apache Whirr 0.8.2
  • Apache Mahout 0.7
  • Apache Solr (SolrCloud) 4.2.1
  • Apache Crunch (incubating) 0.5.0
  • Apache HCatalog 0.5.0
  • Apache Giraph 1.0.0
  • LinkedIn DataFu 0.0.6
  • Cloudera Hue 2.3.0

And we were just talking about YARN and applications weren’t we? 😉

Enjoy!

(Participate if you can but at least send a note of appreciation to Cloudera.)

June 25, 2013

Apache Solr Reference Guide (Solr v4.3)

Filed under: Searching,Solr — Patrick Durusau @ 5:31 pm

Apache Solr Reference Guide (Solr v4.3) by Cassandra Targett.

From the TOC page:

Getting Started: This section guides you through the installation and setup of Solr.

Using the Solr Administration User Interface: This section introduces the Solr Web-based user interface. From your browser you can view configuration files, submit queries, view logfile settings and Java environment settings, and monitor and control distributed configurations.

Documents, Fields, and Schema Design: This section describes how Solr organizes its data for indexing. It explains how a Solr schema defines the fields and field types which Solr uses to organize data within the document files it indexes.

Understanding Analyzers, Tokenizers, and Filters: This section explains how Solr prepares text for indexing and searching. Analyzers parse text and produce a stream of tokens, lexical units used for indexing and searching. Tokenizers break field data down into tokens. Filters perform other transformational or selective work on token streams.

Indexing and Basic Data Operations: This section describes the indexing process and basic index operations, such as commit, optimize, and rollback.

Searching: This section presents an overview of the search process in Solr. It describes the main components used in searches, including request handlers, query parsers, and response writers. It lists the query parameters that can be passed to Solr, and it describes features such as boosting and faceting, which can be used to fine-tune search results.

The Well-Configured Solr Instance: This section discusses performance tuning for Solr. It begins with an overview of the solrconfig.xml file, then tells you how to configure cores with solr.xml, how to configure the Lucene index writer, and more.

Managing Solr: This section discusses important topics for running and monitoring Solr. It describes running Solr in the Apache Tomcat servlet runner and Web server. Other topics include how to back up a Solr instance, and how to run Solr with Java Management Extensions (JMX).

SolrCloud: This section describes the newest and most exciting of Solr’s new features, SolrCloud, which provides comprehensive distributed capabilities.

Legacy Scaling and Distribution: This section tells you how to grow a Solr distribution by dividing a large index into sections called shards, which are then distributed across multiple servers, or by replicating a single index across multiple servers.

Client APIs: This section tells you how to access Solr through various client APIs, including JavaScript, JSON, and Ruby.

Well, I know what I am going to be reading in the immediate future. 😉

June 22, 2013

Lucene/Solr Revolution EU 2013

Filed under: Conferences,Lucene,LucidWorks,Solr — Patrick Durusau @ 4:49 pm

Lucene/Solr Revolution EU 2013

November 4 -7, 2013
Dublin, Ireland

Abstract Deadline: August 2, 2013.

From the webpage:

LucidWorks is proud to present Lucene/Solr Revolution EU 2013, the biggest open source conference dedicated to Apache Lucene/Solr.

The conference, held in Dublin, Ireland on November 4-7, will be packed with technical sessions, developer content, user case studies, and panels. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology.

From the call for papers:

The Call for Papers for Lucene/Solr Revolution EU 2013 is now open.

Lucene/Solr Revolution is the biggest open source conference dedicated to Apache Lucene/Solr. The great content delivered by speakers like you is the heart of the conference. If you are a practitioner, business leader, architect, data scientist or developer and have something important to share, we welcome your submission.

We are particularly interested in compelling use cases and success stories, best practices, and technology insights.

Don’t be shy!
