Videos from Lucene/Solr Revolution 2013 San Diego (April 29th – May 2nd, 2013)
Sorted by author, duplicates removed, etc.
These videos merit far more views than they have today. Pass this list along.
Work through the videos and related docs. There are governments out there that want useful search results.
James Atherton, Search Team Lead, 7digital Implementing Search with Solr at 7digital
A usage/case study, describing our journey as we implemented Lucene/Solr, the lessons we learned along the way and where we hope to go in the future.How we implemented our instant search/search suggest. How we handle trying to index 400 million tracks and metadata for over 40 countries, comprising over 300GB of data, and about 70GB of indexes. Finally where we hope to go in the future.
Ben Brown, Software Architect, Cerner Corporation Brahe – Mass scale flexible indexing
Our team made their first foray into Solr building out Chart Search, an offering on top of Cerner's primary EMR to help make search over a patient's chart smarter and easier. After bringing on over 100 client hospitals and indexing many tens of billions of clinical documents and discrete results we've (thankfully) learned a couple of things.
The traditional hashed document ID over many shards and no easily accessible source of truth doesn't make for a flexible index.
Learn the finer points of the strategy where we shifted our source of truth to HBase. How we deploy new indexes with the click of a button, take an existing index and expand the number of shards on the fly, and several other fancy features we enabled.
Paul Doscher, CEO LucidWorks Lucene Revolution 2013, Opening Remarks – Paul Doscher, CEO LucidWorks
Ted Dunning, Chief Application Architect, MapR & Grant Ingersoll, Chief Technology Officer, LucidWorks Crowd-sourced intelligence built into Search over Hadoop
Search has quickly evolved from being an extension of the data warehouse to being run as a real time decision processing system. Search is increasingly being used to gather intelligence on multi-structured data leveraging distributed platforms such as Hadoop in the background. This session will provide details on how search engines can be abused to use not text, but mathematically derived tokens to build models that implement reflected intelligence. In such a system, intelligent or trend-setting behavior of some users is reflected back at other users. More importantly, the mathematics of evaluating these models can be hidden in a conventional search engine like SolR, making the system easy to build and deploy. The session will describe how to integrate Apache Solr/Lucene with Hadoop. Then we will show how crowd-sourced search behavior can be looped back into analysis and how constantly self-correcting models can be created and deployed. Finally, we will show how these models can respond with intelligent behavior in realtime.
Stephane Gamard, Chief Technology Officer, Searchbox How to make a simple cheap high-availability self-healing Solr cluster
In this presentation we aim to show how to make a high availability Solr cloud with 4.1 using only Solr and a few bash scripts. The goal is to present an infrastructure which is self healing using only cheap instances based on ephemeral storage. We will start by providing a comprehensive overview of the relation between collections, Solr cores, shardes, and cluster nodes. We continue by an introduction to Solr 4.x clustering using zookeeper with a particular emphasis on cluster state status/monitoring and solr collection configuration. The core of our presentation will be demonstrated using a live cluster.
We will show how to use cron and bash to monitor the state of the cluster and the state of its nodes. We will then show how we can extend our monitoring to auto generate new nodes, attach them to the cluster, and assign them shardes (selecting between missing shardes or replication for HA). We will show that using a high replication factor it is possible to use ephemeral storage for shards without the risk of data loss, greatly reducing the cost and management of the architecture. Future work discussions, which might be engaged using an open source effort, include monitoring activity of individual nodes as to scale the cluster according to traffic and usage.
Trey Grainger, Search Technology Development Manager, CareerBuilder Building a Real-time, Big Data Analytics Platform with Solr
Having "big data" is great, but turning that data into actionable intelligence is where the real value lies. This talk will demonstrate how you can use Solr to build a highly scalable data analytics engine to enable customers to engage in lightning fast, real-time knowledge discovery.
At CareerBuilder, we utilize these techniques to report the supply and demand of the labor force, compensation trends, customer performance metrics, and many live internal platform analytics. You will walk away from this talk with an advanced understanding of faceting, including pivot-faceting, geo/radius faceting, time-series faceting, function faceting, and multi-select faceting. You'll also get a sneak peak at some new faceting capabilities just wrapping up development including distributed pivot facets and percentile/stats faceting, which will be open-sourced.
The presentation will be a technical tutorial, along with real-world use-cases and data visualizations. After this talk, you'll never see Solr as just a text search engine again.
Chris Hostetter (aka Hoss) Stump The Chump: Get On The Spot Solutions To Your Real Life Lucene/Solr Challenges
Got a tough problem with your Solr or Lucene application? Facing
challenges that you'd like some advice on? Looking for new approaches to
overcome a Lucene/Solr issue? Not sure how to get the results you
expected? Don't know where to get started? Then this session is for you.
Now, you can get your questions answered live, in front of an audience of
hundreds of Lucene Revolution attendees! Back again by popular demand,
"Stump the Chump" at Lucene Revolution 2013 puts Chris Hostetter (aka Hoss) in the hot seat to tackle questions live.
All you need to do is send in your questions to us here at
firstname.lastname@example.org. You can ask anything you like, but consider
topics in areas like: Data modelling Query parsing Tricky faceting Text analysis Scalability
You can email your questions to email@example.com. Please
describe in detail the challenge you have faced and possible approach you
have taken to solve the problem. Anything related to Solr/Lucene is fair game.
Our moderator, Steve Rowe, will will read the questions, and Hoss have to formulate a solution on the spot. A panel of judges will decide if he has provided an effective answer. Prizes will be awarded by the panel for the best question – and for those deemed to have "Stumped the Chump".
Rahul Jain, System Analyst (Software Engineer), IVY Comptech Pvt Ltd Building a Near Real-time Search Engine and Analytics for logs using Solr
Consolidation and Indexing of logs to search them in real time poses an array of challenges when you have hundreds of servers producing terabytes of logs every day. Since the log events mostly have a small size of around 200 bytes to few KBs, makes it more difficult to handle because lesser the size of a log event, more the number of documents to index. In this session, we will discuss the challenges faced by us and solutions developed to overcome them. The list of items that will be covered in the talk are as follows.
Methods to collect logs in real time.
How Lucene was tuned to achieve an indexing rate of 1 GB in 46 seconds
Tips and techniques incorporated/used to manage distributed index generation and search on multiple shards
How choosing a layer based partition strategy helped us to bring down the search response times.
Log analysis and generation of analytics using Solr.
Design and architecture used to build the search platform.
Mikhail Khludnev, eCommerce Search Platform, Grid Dynamics Concept Search for eCommerce with Solr
This talk describes our experience in eCommerce Search: challenges which we've faced and the chosen approaches. It's not indented to be a full description of implementation, because too many details need to be touched. This talk is more like problem statement and general solutions description, which have a number of points for technical or even academic discussion. It's focused on text search use-case, structures (or scoped) search is out of agenda as well as faceted navigation.
Hilary Mason, Chief Scientist, bitly Search is not a solved problem.
Remi Mikalsen, Search Engineer, The Norwegian Centre for ICT in Education Multi-faceted responsive search, autocomplete, feeds engine and logging
Learn how utdanning.no leverages open source technologies to deliver a blazing fast multi-faceted responsive search experience and a flexible and efficient feeds engine on top of Solr 3.6. Among the key open source projects that will be covered are Solr, Ajax-Solr, SolrPHPClient, Bootstrap, jQuery and Drupal. Notable highlights are ajaxified pivot facets, multiple parents hierarchical facets, ajax autocomplete with edge-n-gram and grouping, integrating our search widgets on any external website, custom Solr logging and using Solr to deliver Atom feeds. utdanning.no is a governmental website that collects, normalizes and publishes study information for related to secondary school and higher education in Norway. With 1.2 million visitors each year and 12.000 indexed documents we focus on precise information and a high degree of usability for students, potential students and counselors.
Mark Miller, Software Engineer, Cloudera SolrCloud: the 'Search First' NoSQL database
As the NoSQL ecosystem looks to integrate great search, great search is naturally beginning to expose many NoSQL features. Will these Goliath's collide? Or will they remain specialized while intermingling — two sides of the same coin.
Come learn about where SolrCloud fits into the NoSQL landscape. What can it do? What will it do? And how will the big data, NoSQL, Search ecosystem evolve. If you are interested in Big Data, NoSQL, distributed systems, CAP theorem and other hype filled terms, than this talk may be for you.
Dragan Milosevic, Senior Architect, zanox Analytics in OLAP with Lucene and Hadoop
Analytics powered by Hadoop is powerful tool and this talk addresses its application in OLAP built on top of Lucene. Many applications use Lucene indexes also for storing data to alleviate challenges concerned with external data sources. The analyses of queries can reveal stored fields that are in most cases accessed together. If one binary compressed field replaces those fields, amount of data to be loaded is reduced and processing of queries is boosted. Furthermore, documents that are frequently loaded together can be identified. If those documents are saved in almost successive positions in Lucene stored files, benefits from file-system caches are improved and loading of documents is noticeably faster.
Large-scale searching applications typically deploy sharding and partition documents by hashing. The implemented OLAP has shown that such hash-based partitioning is not always an optimal one. An alternative partitioning, supported by analytics, has been developed. It places documents that are frequently used together in same shards, which maximizes the amount of work that can be locally done and reduces the communication overhead among searchers. As an extra bonus, it also identifies slow queries that typically point to emerging trends, and suggests the addition of optimized searchers for handling similar queries.
Christian Moen, Software Engineer, Atilika Inc. Language support and linguistics in Lucene/Solr and its eco-system
In search, language handling is often key to getting a good search experience. This talk gives an overview of language handling and linguistics functionality in Lucene/Solr and best-practices for using them to handle Western, Asian and multi-language deployments. Pointers and references within the open source and commercial eco-systems for more advanced linguistics and their applications are also discussed.
The presentation is mix of overview and hands-on best-practices the audience can benefit immediately from in their Lucene/Solr deployments. The eco-system part is meant to inspire how more advanced functionality can be developed by means of the available open source technologies within the Apache eco-system (predominantly) while also highlighting some of the commercial options available.
Chandra Mouleeswaran, Co Chair at Intellifest.org, ThreatMetrix Rapid pruning of search space through hierarchical matching
This talk will present our experiences in using Lucene/Solr to the classification of user and device data. On a daily basis, ThreatMetrix, Inc., handles a huge volume of volatile data. The primary challenge is rapidly and precisely classifying each incoming transaction, by searching a huge index within a very strict latency specification. The audience will be taken through the various design choices and the lessons learned. Details on introducing a hierarchical search procedure that systematically divides the search space into manageable partitions, yet maintaining precision, will be presented.
Kathy Phillips, Enterprise Search Services Manager/VP, Wells Fargo & Co. & Tom Lutmer, eBusiness Systems Consultant, Enterprise Search Services team, Wells Fargo & Co Beyond simple search — adding business value in the enterprise
What is enterprise search? Is it a single search box that spans all enterprise resources or is it much more than that? Explore how enterprise search applications can move beyond simple keyword search to add unique business value. Attendees will learn about the benefits and challenges to different types of search applications such as site search, interactive search, search as business intelligence, and niche search applications. Join the discussion about the possibilities and future direction of new business applications within the enterprise.
David Piraino and Daniel Palmer, Chief Imaging Information Officers, Imaging Institute Cleveland Clinic, Cleveland Clinic Next Generation Electronic Medical Records and Search: A Test Implementation in Radiology
Most patient specifc medical information is document oriented with varying amounts of associated meta-data. Most of pateint medical information is textual and semi-structured. Electronic Medical Record Systems (EMR) are not optimized to present the textual information to users in the most understandable ways. Present EMRs show information to the user in a reverse time oriented patient specific manner only. This talk discribes the construction and use of Solr search technologies to provide relevant historical information at the point of care while intepreting radiology images.
Radiology reports over a 4 year period were extracted from our Radiology Information System (RIS) and passed through a text processing engine to extract the results, impression, exam description, location, history, and date. Fifteen cases reported during clinical practice were used as test cases to determine if ""similar"" historical cases were found . The results were evaluated by the number of searches that returned any result in less than 3 seconds and the number of cases that illustrated the questioned diagnosis in the top 10 results returned as determined by a bone and joint radiologist. Also methods to better optimize the search results were reviewed.
An average of 7.8 out of the 10 highest rated reports showed a similar case highly related to the present case. The best search showed 10 out of 10 cases that were good examples and the lowest match search showed 2 out of 10 cases that were good examples.The talk will highlight this specific use case and the issues and advances of using Solr search technology in medicine with focus on point of care applications.
Timothy Potter, Architect, Big Data Analytics, Dachis Group Scaling up Solr 4.1 to Power Big Search in Social Media Analytics
My presentation focuses on how we implemented Solr 4.1 to be the cornerstone of our social marketing analytics platform. Our platform analyzes relationships, behaviors, and conversations between 30,000 brands and 100M social accounts every 15 minutes. Combined with our Hadoop cluster, we have achieved throughput rates greater than 8,000 documents per second. Our index currently contains more than 500,000,000 documents and is growing by 3 to 4 million documents per day.
The presentation will include details about:
Designing a Solr Cloud cluster for scalability and high-availability using sharding and replication with Zookeeper
Operations concerns like how to handle a failed node and monitoring
How we deal with indexing big data from Pig/Hadoop as an example of using the CloudSolrServer in SolrJ and managing searchers for high indexing throughput
Example uses of key features like real-time gets, atomic updates, custom hashing, and distributed facets. Attendees will come away from this presentation with a real-world use case that proves Solr 4.1 is scalable, stable, and is production ready. (note: we are in production on 18 nodes in EC2 with a recent nightly build off the branch_4x).
Ingo Renner, Software Engineer, Infield Design CMS Integration of Apache Solr – How we did it.
TYPO3 is an Open Source Content Management System that is very popular in Europe, especially in the German market, and gaining traction in the U.S., too.
TYPO3 is a good example of how to integrate Solr with a CMS. The challenges we faced are typical of any CMS integration. We came up with solutions and ideas to these challenges and our hope is that they might be of help for other CMS integrations as well.
That includes content indexing, file indexing, keeping track of content changes, handling multi-language sites, search and facetting, access restrictions, result presentation, and how to keep all these things flexible and re-usable for many different sites.
For all these things we used a couple additional Apache projects and we would like to show how we use them and how we contributed back to them while building our Solr integration.
David Smiley, Software Systems Engineer, Lead, MITRE Lucene / Solr 4 Spatial Deep Dive
Lucene's former spatial contrib is gone and in its place is an entirely new spatial module developed by several well-known names in the Lucene/Solr spatial community. The heart of this module is an approach in which spatial geometries are indexed using edge-ngram tokenized geohashes searched with a prefix-tree/trie recursive algorithm. It sounds cool and it is! In this presentation, you'll see how it works, why it's fast, and what new things you can do with it. Key features are support for multi-valued fields, and indexing shapes with area — even polygons, and support for various spatial predicates like "Within". You'll see a live demonstration and a visual representation of geohash indexed shapes. Finally, the session will conclude with a look at the future direction of the module.
David Smiley, Software Systems Engineer, Lead, MITRE Text Tagging with Finite State Transducers
OpenSextant is an unstructured-text geotagger. A core component of OpenSextant is a general-purpose text tagger that scans a text document for matching multi-word based substrings from a large dictionary. Harnessing the power of Lucene's state-of-the-art finite state transducer (FST) technology, the text tagger was able to save over 40x the amount of memory estimated for a leading in-memory alternative. Lucene's FSTs are elusive due to their technical complexity but overcoming the learning curve can pay off handsomely.
Marc Sturlese, Architect, Backend engineer, Trovit Batch Indexing and Near Real Time, keeping things fast
In this talk I will explain how we combine a mixed architecture using Hadoop for batch indexing and Storm, HBase and Zookeeper to keep our indexes updated in near real time.Will talk about why we didn't choose just a default Solr Cloud and it's real time feature (mainly to avoid hitting merges while serving queries on the slaves) and the advantages and complexities of having a mixed architecture. Both parts of the infrastucture and how they are coordinated will be explained with details.Finally will mention future lines, how we plan to use Lucene real time feature.
Tyler Tate, Cofounder, TwigKit Designing the Search Experience
Search is not just a box and ten blue links. Search is a journey: an exploration where what we encounter along the way changes what we seek. But in order to guide people along this journey, we must understand both the art and science of search.In this talk Tyler Tate, cofounder of TwigKit and coauthor of the new book Designing the Search Experience, weaves together the theories of information seeking with the practice of user interface design, providing a comprehensive guide to designing search.Pulling from a wealth of research conducted over the last 30 years, Tyler begins by establishing a framework of search and discovery. He outlines cognitive attributes of users—including their level of expertise, cognitive style, and learning style; describes models of information seeking and how they've been shaped by theories such as information foraging and sensemaking; and reviews the role that task, physical, social, and environmental context plays in the search process.
Tyler then moves from theory to practice, drawing on his experience of designing 50+ search user interfaces to provide practical guidance for common search requirements. He describes best practices and demonstrates reams of examples for everything from entering the query (including the search box, as-you-type suggestions, advanced search, and non-textual input), to the layout of search results (such as lists, grids, maps, augmented reality, and voice), to result manipulation (e.g. pagination and sorting) and, last but not least, the ins-and-outs of faceted navigation. Through it all, Tyler also addresses mobile interface design and how responsive design techniques can be used to achieve cross-platform search.This intensive talk will enable you to create better search experiences by equipping you with a well-rounded understanding of the theories of information seeking, and providing you with a sweeping survey of search user interface best practices.
Troy Thomas, Senior Manager, Internet Enabled Services, Synopsys & Koorosh Vakhshoori, Software Architect,Synopsys Make your GUI Shine with AJAX-Solr
Adrian Trenaman, Senior Software Engineer, Gilt Groupe Personalized Search on the Largest Flash Sale Site in America
Gilt Groupe is an innovative online shopping destination offering its members special access to the most inspiring merchandise, culinary offerings, and experiences every day, many at insider prices. Every day new merchandising is offered for sale at discounts of up to 70%. Sales start at 12 noon EST resulting in an avalanche of hits to the site, so delivering a rich user experience requires substantial technical innovation.
Implementing search for a flash-sales business, where inventory is limited and changes rapidly as our sales go live to a stampede of members every noon, poses a number of technical challenges. For example, with small numbers of fast moving inventory we want to be sure that search results reflect those products we still have available for sale. Also, personalizing search — where search listings may contain exclusive items that are available only to certain users — was also a big challenge
Gilt has built out keyword search using Scala, Play Framework and Apache Solr / Lucene. The solution, which involves less than 4,000 lines of code, comfortably provides search results to members in under 40ms. In this talk, we'll give a tour of the logical and physical architecture of the solution, the approach to schema definition for the search index, and how we use custom filters to perform personalization and enforce product availability windows. We'll discuss lessons learnt, and describe how we plan to adopt Solr to power sale, brand, category and search listings throughout all of Gilt's estate.
Doug Turnbull, Search and Big Data Architect, OpenSource Connections State Decoded: Empowering The Masses with Open Source State Law Search
The Law has traditionally been a topic dominated by an elite group of experts. Watch how State Decoded has transformed the law from a scary, academic topic to a friendly resource that empowers everyone using Apache Solr. This talk is a call to action for discovery and design to break open ivory towers of expertise by baking rich discovery into your UI and data structures.