Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

September 27, 2013

Google Alters Search… [Pushy Suggestions]

Filed under: Google Knowledge Graph,Search Engines,Searching — Patrick Durusau @ 4:27 pm

Google Alters Search to Handle More Complex Queries by Claire Cain Miller.

From the post:

Google on Thursday announced one of the biggest changes to its search engine, a rewriting of its algorithm to handle more complex queries that affects 90 percent of all searches.

The change, which represents a new approach to search for Google, required the biggest changes to the company’s search algorithm since 2000. Now, Google, the world’s most popular search engine, will focus more on trying to understand the meanings of and relationships among things, as opposed to its original strategy of matching keywords.

The company made the changes, executives said, because Google users are asking increasingly long and complex questions and are searching Google more often on mobile phones with voice search.

“They said, ‘Let’s go back and basically replace the engine of a 1950s car,’ ” said Danny Sullivan, founding editor of Search Engine Land, an industry blog. “It’s fair to say the general public seemed not to have noticed that Google ripped out its engine while driving down the road and replaced it with something else.”

One of the “other” changes is “pushy suggestions.”

In the last month I have noticed that if my search query is short, I get Google’s suggested completion rather than my search request.

How short? It just has to be shorter than the completion suggested by Google.

A simple return means Google adopts its suggestion and not your request.

You don’t believe me?

OK, type in:

charter

Note the autocompletion to:

charter.com

That’s OK if I am searching for the cable company, but not if I am searching for “charter” as in a charter for technical work.

I am required to actively avoid Google’s suggestion.

I can avoid Google’s “pushy suggestions” by hitting the space bar.

But like many people, I toss off Google searches without ever looking at the search or URL box. I don’t look up until I have the results. And now sometimes the wrong results.

I would rather have a search engine execute my search by default and its suggestions only when asked.

How about you?

September 26, 2013

Explore Your Data with Elasticsearch

Filed under: ElasticSearch,Search Engines — Patrick Durusau @ 2:36 pm

From the description:

As Honza Kral puts it, “Elasticsearch is a very buzz-word compliant piece of software.” By this he means, it’s open source, it can do REST, JSON, HTTP, it has real time, and even Lucene is somewhere in there. What does this all really mean? Well, simply, Elasticsearch is a distributed data store that’s very good at searching and analyzing data.

Honza, a Python programmer and Django core developer, visits SF Python, to show off what this powerful tool can do. He uses real data to demonstrate how Elasticsearch’s real-time analytics and visualizations tools can help you make sense of your application.

Follow along with Honza’s slides: http://crcl.to/6tdvs

There are clients for ElasticSearch, so don’t worry about the deeply nested brackets in the examples. 😉
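For instance, a minimal sketch with the 0.90-era Java TransportClient (the host, index and field names are hypothetical); the client builds the nested JSON query DSL for you:

```java
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.index.query.QueryBuilders;

public class ExploreSketch {
    public static void main(String[] args) {
        // Connect to a local node (hypothetical host/port).
        Client client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

        // One method chain instead of hand-written nested JSON.
        SearchResponse response = client.prepareSearch("articles")
                .setQuery(QueryBuilders.matchQuery("body", "elasticsearch analytics"))
                .setSize(10)
                .execute()
                .actionGet();

        System.out.println("hits: " + response.getHits().getTotalHits());
        client.close();
    }
}
```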

A very good presentation on exploring data with ElasticSearch.

September 24, 2013

Three exciting Lucene features in one day

Filed under: Lucene,Search Engines — Patrick Durusau @ 4:21 pm

Three exciting Lucene features in one day by Mike McCandless.

From the post:

The first feature, committed yesterday, is the new expressions module. This allows you to define a dynamic field for sorting, using an arbitrary String expression. There is builtin support for parsing JavaScript, but the parser is pluggable if you want to create your own syntax.

The second feature, also committed yesterday, is updateable numeric doc-values fields, letting you change previously indexed numeric values using the new updateNumericDocValue method on IndexWriter. It works fine with near-real-time readers, so you can update the numeric values for a few documents and then re-open a new near-real-time reader to see the changes.

Finally, the third feature is a new suggester implementation, FreeTextSuggester. It is a very different suggester than the existing ones: rather than suggest from a finite universe of pre-built suggestions, it uses a simple ngram language model to predict the “long tail” of possible suggestions based on the 1 or 2 previous tokens.
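The second feature is the easiest to picture in code. A hypothetical sketch of the new method on IndexWriter (the “id” field, document and value are mine, not from the post):

```java
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class DocValuesUpdateSketch {
    // Change a previously indexed numeric doc-values field in place,
    // without re-indexing the whole document.
    static void reprice(IndexWriter writer) throws Exception {
        // The Term selects the document(s); then field name and new value.
        writer.updateNumericDocValue(new Term("id", "doc-42"), "price", 1295L);
        // Commit, or re-open a near-real-time reader, to see the change.
        writer.commit();
    }
}
```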

By anybody’s count, that was an extraordinary day!

Drop by Mike’s post for the details.

September 22, 2013

…Introducing … Infringing Content Online

Filed under: Intellectual Property (IP),Search Engines,Searching — Patrick Durusau @ 12:43 pm

New Study Finds Search Engines Play Critical Role in Introducing Audiences To Infringing Content Online

From the summary at Full Text Reports:

Today, MPAA Chairman Senator Chris Dodd joined Representatives Howard Coble, Adam Schiff, Marsha Blackburn and Judy Chu on Capitol Hill to release the results of a new study that found that search engines play a significant role in introducing audiences to infringing movies and TV shows online. Infringing content is a TV show or movie that has been stolen and illegally distributed online without any compensation to the show or film’s owner.

The study found that search is a major gateway to the initial discovery of infringing content online, even in cases when the consumer was not looking for infringing content. 74% of consumers surveyed cited using a search engine as a navigational tool the first time they arrived at a site with infringing content. And the majority of searches (58%) that led to infringing content contained only general keywords — such as the titles of recent films or TV shows, or phrases related to watching films or TV online — and not specific keywords aimed at finding illegitimate content.

I rag on search engines fairly often about the quality of their results, so in light of this report I wanted to give them a shout-out: Well done!

They may not be good at the sophisticated content discovery that I find useful, but on the other hand, when sweat hogs are looking for entertainment, search can fill the bill.

On the other hand, knowing that infringing content can be found may be good for PR purposes but not much more. Search results don’t capture (read: identify) enough subjects to enable mining patterns of infringement and other data analysis relevant to opposing infringement.

Infringing content is easy to find, so the business case for topic maps lies with content providers, who need more detail (read: subjects and associations) than a search engine can provide.

New Study Finds Search Engines Play Critical Role in Introducing Audiences To Infringing Content Online (PDF of the news release)


Update: Understanding the Role of Search in Online Piracy. The full report. Additional detail but no links to the data.

September 16, 2013

Building better search tools: problems and solutions

Filed under: Search Behavior,Search Engines,Searching — Patrick Durusau @ 4:38 pm

Building better search tools: problems and solutions by Vincent Granville

From the post:

Have you ever done a Google search for mining data? It returns the same results as for data mining. Yet these are two very different keywords: mining data usually means data about mining. And if you search for data about mining you still get the same results anyway.

(graphic omitted)

Yet Google has one of the best search algorithms. Imagine an e-store selling products, allowing users to search for products via a catalog powered with search capabilities, but returning irrelevant results 20% of the time. What a loss of money! Indeed, if you were an investor looking on Amazon to purchase a report on mining data, all you will find are books on data mining and you won’t buy anything: possibly a $500 loss for Amazon. Repeat this million times a year, and the opportunity cost is in billions of dollars.

There are a few issues that make this problem difficult to fix. While the problem is straightforward for decision makers, CTO’s or CEO’s to notice, understand and assess the opportunity cost (just run 200 high value random search queries, see how many return irrelevant results), the communication between the analytic teams and business people is faulty: there is a short somewhere.

There might be multiple analytics teams working as silos – computer scientists, statisticians, engineers – sometimes aggressively defending their own turfs and having conflicting opinions. What the decision makers eventually hear is a lot of noise and lots of technicalities, and they don’t know how to start, how much it will cost to fix it, how complex the issue is, and who should fix it.

Here I discuss the solution and explain it in very simple terms, to help any business having a search engine and an analytic team, easily fix the issue.

Vincent has some clever insights into this particular type of search problem, but I think it falls short of being “easily” fixed.

Read his original post and see if you think the solution is an “easy” one.

September 5, 2013

Introducing Cloudera Search

Filed under: Cloudera,Hadoop,Search Engines — Patrick Durusau @ 6:15 pm

Introducing Cloudera Search

Cloudera Search 1.0 has hit the streets!

Download

Prior coverage of Cloudera Search: Hadoop for Everyone: Inside Cloudera Search.

Enjoy!

August 28, 2013

Building a distributed search system

Filed under: Distributed Computing,Hadoop,Lucene,Search Engines — Patrick Durusau @ 2:13 pm

Building a distributed search system with Apache Hadoop and Lucene by Mirko Calvaresi.

From the preface:

This work analyses the problem coming from the so-called Big Data scenario, which can be defined as the technological challenge to manage and administer quantity of information with global dimension in the order of Terabyte (10^12 bytes) or Petabyte (10^15 bytes) and with an exponential growth rate. We’ll explore a technological and algorithmic approach to handle and calculate these amounts of data that exceed the limit of computation of a traditional architecture based on real-time request processing: in particular we’ll analyze a singular open source technology, called Apache Hadoop, which implements the approach described as Map and Reduce.

We’ll describe also how to distribute a cluster of common server to create a Virtual File System and use this environment to populate a centralized search index (realized using another open source technology, called Apache Lucene). The practical implementation will be a web based application which offers to the user a unified searching interface against a collection of technical papers. The scope is to demonstrate that a performant search system can be obtained pre-processing the data using the Map and Reduce paradigm, in order to obtain a real time response, which is independent to the underlying amount of data. Finally we’ll compare this solutions to different approaches based on clusterization or No SQL solutions, with the scope to describe the characteristics of concrete scenarios, which suggest the adoption of those technologies.
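As a rough sketch of the index-building step the thesis describes (not code from the thesis), a Hadoop reducer can build one Lucene index shard per reduce task, assuming mappers emit (docId, documentText) pairs; the paths and field names are hypothetical:

```java
import java.io.File;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Each reducer builds one Lucene index shard from its share of the
// (docId, documentText) pairs produced by the map phase.
public class IndexShardReducer extends Reducer<Text, Text, Text, Text> {
    private IndexWriter writer;

    @Override
    protected void setup(Context ctx) throws IOException {
        File shardDir = new File("/tmp/shard-" + ctx.getTaskAttemptID().getTaskID().getId());
        writer = new IndexWriter(FSDirectory.open(shardDir),
                new IndexWriterConfig(Version.LUCENE_44, new StandardAnalyzer(Version.LUCENE_44)));
    }

    @Override
    protected void reduce(Text docId, Iterable<Text> bodies, Context ctx)
            throws IOException, InterruptedException {
        for (Text body : bodies) {
            Document doc = new Document();
            doc.add(new StringField("id", docId.toString(), Field.Store.YES));
            doc.add(new TextField("body", body.toString(), Field.Store.NO));
            writer.addDocument(doc);
        }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException {
        writer.close(); // shards can then be merged or searched in parallel
    }
}
```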

Fairly complete (75 pages) report on a project indexing academic papers with Lucene and Hadoop.

I would like to see treatment of the voiced demand for “real-time processing” versus the need for “real-time processing.”

When I started using research tools, indexes like the Readers’ Guide to Periodical Literature were at a minimum two (2) weeks behind popular journals.

Academic indexes ran that far behind if not a good bit longer.

The timeliness of indexing journal articles is now nearly simultaneous with publication.

Has the quality of our research improved due to faster access?

I can imagine use cases, drug interactions for example, the discovery of which should be streamed out as soon as practical.

But drug interactions are not the average case.

It would be very helpful to see research on what factors favor “real-time” solutions and which are quite sufficient with “non-real-time” solutions.

August 22, 2013

Antepedia…

Filed under: Programming,Search Engines,Software — Patrick Durusau @ 6:18 pm

Antepedia Open Source Project Search Engine

From the “more” information link on the homepage:

Antepedia is the largest knowledge base of open source components with over 2 million current projects, and 1,000 more added daily. Antepedia continuously aggregates data from various directories that include Google Code, Apache, GitHub, Maven, and many more. These directories allow Antepedia to consistently grow as the world’s largest knowledge base of open source components.

Antepedia helps companies protect and secure their software assets, by providing a multi-source tracking solution that assists them in their management of open source governance. This implementation of Antepedia allows an organization to reduce licensing risks and security vulnerabilities in your open source component integration.

Antepedia is a public site that provides a way for anyone to search for an open source project. In cases where a project is not currently indexed in the knowledge base, you can manually submit that project, and help build upon the Antepedia knowledge base. These various benefits allow Antepedia to grow and offer the necessary functionalities, which provide the information you need, when you need it. With Antepedia you can assure that you have the newest & most relevant information for all your open source management and detection projects.

See also: Antepedia Reporter Free Edition for tracking open source projects.

If you like open source projects, take a look at: http://www.antelink.com/ (sponsor of Antepedia).

Do navigate on and off the Antelink homepage and watch the Antepedia counter increment to the same number. 😉 I’m sure the total changes day to day, but it was funny to see it reach the same number more than twice.

August 21, 2013

SuggestStopFilter carefully removes stop words for suggesters

Filed under: Lucene,Search Engines — Patrick Durusau @ 6:07 pm

SuggestStopFilter carefully removes stop words for suggesters by Michael McCandless.

Michael has tamed the overly “aggressive” StopFilter with SuggestStopFilter.

From the post:

Finally, you could use the new SuggestStopFilter at lookup time: this filter is just like StopFilter except when the token is the very last token, it checks the offset for that token and if the offset indicates that the token has ended without any further non-token characters, then the token is preserved. The token is also marked as a keyword, so that any later stem filters won’t change it. This way a query “a” can find “apple”, but a query “a ” (with a trailing space) will find nothing because the “a” will be removed.

I’ve pushed SuggestStopFilter to jirasearch.mikemccandless.com and it seems to be working well so far!
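A hypothetical sketch of a suggester analyzer using the new filter (Lucene 4.4-era analysis API; nothing below is from Mike’s post):

```java
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.search.suggest.analyzing.SuggestStopFilter;
import org.apache.lucene.util.Version;

public class SuggestAnalyzerSketch extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String field, Reader reader) {
        Tokenizer tokens = new StandardTokenizer(Version.LUCENE_44, reader);
        // Drops stop words, except a trailing partially-typed token:
        // "a" is kept (it can still complete to "apple"), "a " is dropped.
        TokenStream filtered = new SuggestStopFilter(tokens, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new TokenStreamComponents(tokens, filtered);
    }
}
```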

Have you noticed how quickly improvements for Lucene and Solr emerge?

August 20, 2013

Solr Tutorial [No Ads]

Filed under: Search Engines,Searching,Solr — Patrick Durusau @ 2:38 pm

Solr Tutorial from the Apache Software Foundation.

A great tutorial on Solr that is different from most of the Solr tutorials you will ever see.

There are no ads, popup or otherwise. 😉

Should be the first tutorial that you recommend for anyone new to Solr!

PS: You do not have to give your email address, phone number, etc. to view the tutorial.

August 13, 2013

Of collapsing in Solr

Filed under: Search Engines,Searching,Solr,Topic Maps — Patrick Durusau @ 4:35 pm

Of collapsing in Solr by Paul Masurel.

From the post:

This post is about the inner workings of one of the two most popular open source search engines: Solr. I noticed that many questions (one or two every day) on the solr-user mailing list were about Solr’s collapsing functionality.

I thought it would be a good idea to explain how Solr’s collapsing works, because its documentation is very sparse, and because a search engine is the kind of car you want to take a peek under the hood of to make sure you’ll drive it right.

The Solr documentation at Apache refers to field collapsing and result grouping being “different ways to think about the same Solr feature.”

I read the post along with the Solr documentation.

BTW, note from “Known Limitations” in the Solr documentation:

Support for grouping on a multi-valued field has not yet been implemented.

Support for grouping on multi-valued fields would be really nice, with subjectIdentifier and subjectLocator having the potential to be sets of values.

Solr as an Analytics Platform

Filed under: Analytics,Search Engines,Solr — Patrick Durusau @ 4:19 pm

Solr as an Analytics Platform by Chris Becker.

From the post:

Here at Shutterstock we love digging into data. We collect large amounts of it, and want a simple, fast way to access it. One of the tools we use to do this is Apache Solr.

Most users of Solr will know it for its power as a full-text search engine. Its text analyzers, sorting, filtering, and faceting components provide an ample toolset for many search applications. A single instance can scale to hundreds of millions of documents (depending on your hardware), and it can scale even further through sharding. Modern web search applications also need to be fast, and Solr can deliver in this area as well.

The needs of a data analytics platform aren’t much different. It too requires a platform that can scale to support large volumes of data. It requires speed, and depends heavily on a system that can scale horizontally through sharding as well. And some of the main operations of data analytics – counting, slicing, and grouping — can be implemented using Solr’s filtering and faceting options.
(…)
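The flavor of it: a single facet query does the counting and grouping that would otherwise be a reporting job. A hypothetical example (core, fields and host are mine; the spaces in the date range would be URL-encoded in practice):

```
# Count documents per category, sliced by a date filter:
http://localhost:8983/solr/photos/select?q=*:*&fq=upload_date:[2013-01-01T00:00:00Z TO *]&rows=0&facet=true&facet.field=category
```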

A good introduction to obtaining useful results from Solr with a minimum of effort.

Certainly a good way to show ROI when you are convincing your manager to sponsor you for a Solr conference and/or training.

August 2, 2013

Norch – a search engine for node.js

Filed under: JSON,leveldb,node-js,Search Engines — Patrick Durusau @ 3:01 pm

Norch – a search engine for node.js by Fergus McDowall.

From the post:

Norch is a search engine written for Node.js. Norch uses the Node search-index module which is in turn written using the super fast levelDB library that Google open-sourced in 2011.

The aim of Norch is to make a simple, fast search server, that requires minimal configuration to set up. Norch sacrifices complex functionality for a limited robust feature set, that can be used to set up a freetext search engine for most enterprise scenarios.

Currently Norch features

  • Full text search
  • Stopword removal
  • Faceting
  • Filtering
  • Relevance weighting (tf-idf)
  • Field weighting
  • Paging (offset and resultset length)

Norch can index any data that is marked up in the appropriate JSON format.

Download the first release of Norch (0.2.1) here

Not every feature possible, but it looks like Norch covers the most popular ones.

August 1, 2013

Open Source Search FTW!

Filed under: Lucene,Search Engines,Searching,Solr — Patrick Durusau @ 4:40 pm

Open Source Search FTW! by Grant Ingersoll.

Abstract:

Apache Lucene and Solr are the most widely deployed search technology on the planet, powering sites like Twitter, Wikipedia, Zappos and countless applications across a large array of domains. They are also free, open source, extensible and extremely scalable. Lucene and Solr also contain a large number of features for solving common information retrieval problems ranging from pluggable posting list compression and scoring algorithms to faceting and spell checking. Increasingly, Lucene and Solr also are being (ab)used to power applications going way beyond the search box. In this talk, we’ll explore the features and capabilities of Lucene and Solr 4.x, as well as look at how to (ab)use your search engine technology for fun and profit.

If you aren’t already studying search engines, perhaps these slides will convince you to do so.

When you think about it, search precedes all other computer processing.

July 24, 2013

Exploring ElasticSearch

Filed under: ElasticSearch,Search Engines,Searching — Patrick Durusau @ 3:39 pm

Exploring ElasticSearch: A human-friendly tutorial for ElasticSearch. by Andrew Cholakian.

An incomplete tutorial on ElasticSearch.

However, unlike printed (dead tree) and PDF (dead electrons) tutorials, this one lets you suggest additional topics, and I suspect useful comments would be appreciated as well.

A “live” tutorial on popular software like ElasticSearch, one that follows the software as it develops, could prove to be almost as popular as the software itself.

July 12, 2013

Aggregation Module – Phase 1 – Functional Design (ElasticSearch #3300)

Filed under: Aggregation,ElasticSearch,Merging,Search Engines,Topic Maps — Patrick Durusau @ 2:47 pm

Aggregation Module – Phase 1 – Functional Design (ElasticSearch Issue #3300)

From the post:

The new aggregations module is due for the elasticsearch 1.0 release, and aims to serve as the next generation replacement for the functionality we currently refer to as “faceting”. Facets currently provide a great way to aggregate data within a document set context. This context is defined by the executed query in combination with the different levels of filters that are defined (filtered queries, top level filters, and facet level filters). Although powerful as is, the current facets implementation was not designed from the ground up to support complex aggregations and is thus limited. The main problem with the current implementation stems from the fact that they are hard coded to work on one level and that the different types of facets (which account for the different types of aggregations we support) cannot be mixed and matched dynamically at query time. It is not possible to compose facets out of other facets, and the user is effectively bound to the top level aggregations that we defined and nothing more than that.

The goal with the new aggregations module is to break the barriers the current facet implementation put in place. The new name (“Aggregations”) also indicates the intention here – a generic yet extremely powerful framework for defining aggregations – any type of aggregation. The idea here is to have each aggregation defined as a “standalone” aggregation that can perform its task within any context (as a top level aggregation or embedded within other aggregations that can potentially narrow its computation scope). We would like to take all the knowledge and experience we’ve gained over the years working with facets and apply it when building the new framework.

(…)
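To see the composability the proposal is after, here is a hypothetical sketch against the 1.0-era Java API (index and field names are mine): a terms aggregation with an avg aggregation nested inside it, so the average is computed per bucket.

```java
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilders;

public class AggregationSketch {
    // Average price per color bucket: the one-level facets could compute
    // either aggregation alone, but not one inside the other.
    static SearchResponse avgPriceByColor(Client client) {
        return client.prepareSearch("products")
                .setQuery(QueryBuilders.matchAllQuery())
                .addAggregation(AggregationBuilders.terms("by_color").field("color")
                        .subAggregation(AggregationBuilders.avg("avg_price").field("price")))
                .execute().actionGet();
    }
}
```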

If you have been following the discussion about “what would we do differently with topic maps” in the XTM group at LinkedIn, this will be of interest.

What is an aggregation if it is not a selection of items matching some criteria, which you can then “merge” together for presentation to a user?

Or “merge” together for further querying?

That is inconsistent with the imperative programming model of the TMDM, but it has the potential to open up distributed and parallel processing of topic maps.

Same paradigm but with greater capabilities.

July 8, 2013

100 Search Engines For Academic Research

Filed under: Search Engines,Searching — Patrick Durusau @ 7:43 pm

100 Search Engines For Academic Research

From the post:

Back in 2010, we shared with you 100 awesome search engines and research resources in our post: 100 Time-Saving Search Engines for Serious Scholars. It’s been an incredible resource, but now, it’s time for an update. Some services have moved on, others have been created, and we’ve found some new discoveries, too. Many of our original 100 are still going strong, but we’ve updated where necessary and added some of our new favorites, too. Check out our new, up-to-date collection to discover the very best search engine for finding the academic results you’re looking for.

(…)

When I saw the title for this post I assumed it was source code for search engines. 😉

Not so!

But don’t despair!

Consider all of them as possible comparisons for your topic map interface.

Or should I say the results delivered by your topic map interface?

Some are better than others but I am sure you can do better with a curated topic map.

Querying ElasticSearch – A Tutorial and Guide

Filed under: ElasticSearch,Lucene,Search Engines — Patrick Durusau @ 6:59 pm

Querying ElasticSearch – A Tutorial and Guide by Rufus Pollock.

From the post:

ElasticSearch is a great open-source search tool that’s built on Lucene (like SOLR) but is natively JSON + RESTful. It’s been used quite a bit at the Open Knowledge Foundation over the last few years. Plus, as it’s easy to set up locally, it’s an attractive option for digging into data on your local machine.

While its general interface is pretty natural, I must confess I’ve sometimes struggled to find my way around ElasticSearch’s powerful, but also quite complex, query system and the associated JSON-based “query DSL” (domain specific language).

This post therefore provides a simple introduction and guide to querying ElasticSearch that provides a short overview of how it all works together with a good set of examples of some of the most standard queries.

(…)

This is a very nice introduction to ElasticSearch.

Read, bookmark and pass it along!

Better synonym handling in Solr

Filed under: Search Engines,Solr,Synonymy — Patrick Durusau @ 6:45 pm

Better synonym handling in Solr by Nolan Lawson.

From the post:

It’s a pretty common scenario when working with a Solr-powered search engine: you have a list of synonyms, and you want user queries to match documents with synonymous terms. Sounds easy, right? Why shouldn’t queries for “dog” also match documents containing “hound” and “pooch”? Or even “Rover” and “canis familiaris”?

(image omitted)

As it turns out, though, Solr doesn’t make synonym expansion as easy as you might like. And there are lots of good ways to shoot yourself in the foot.

A deep review of synonym handling in Solr and a patch to improve it.

The issue is now SOLR-4381 and is set for Solr 4.4.

Interesting discussion continues under the SOLR issue.
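For comparison with the query-time approaches Nolan analyzes, index-time expansion in raw Lucene looks roughly like this (a sketch; the synonym pairs and 4.4-era API usage are mine):

```java
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.Version;

public class SynonymSketch {
    static Analyzer build() throws Exception {
        SynonymMap.Builder builder = new SynonymMap.Builder(true); // dedup entries
        builder.add(new CharsRef("dog"), new CharsRef("hound"), true); // keep original
        builder.add(new CharsRef("dog"), new CharsRef("pooch"), true);
        final SynonymMap map = builder.build();

        return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String field, Reader reader) {
                Tokenizer tokens = new StandardTokenizer(Version.LUCENE_44, reader);
                // Injects "hound" and "pooch" at the same position as "dog",
                // so a query for any of them matches the document.
                TokenStream stream = new SynonymFilter(tokens, map, true); // ignore case
                return new TokenStreamComponents(tokens, stream);
            }
        };
    }
}
```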

July 7, 2013

Mini Search Engine…

Filed under: Graphs,Neo4j,Search Engines,Searching — Patrick Durusau @ 1:13 pm

Mini Search Engine – Just the basics, using Neo4j, Crawler4j, Graphstream and Encog by Brian Du Preez.

From the post:

Continuing to chapter 4 of Programming Collective Intelligence (PCI), which is implementing a search engine.

I may have bitten off a little more than I should have in 1 exercise. Instead of using the normal relational database construct as used in the book, I figured, I always wanted to have a look at Neo4j so now was the time. Just to say, this isn’t necessarily the ideal use case for a graph db, but how hard could it be to kill 3 birds with 1 stone.

Working through the tutorials trying to reset my SQL Server, Oracle mindset took a little longer than expected, but thankfully there are some great resources around Neo4j.

Just a couple:
neo4j – learn
Graph theory for busy developers
Graphdatabases

Since I just wanted to run this as a little exercise, I decided to go for an in-memory implementation and not run it as a service on my machine. In hindsight this was probably a mistake and the tools and web interface would have helped me visualise my data graph quicker in the beginning.

The general search space is filled by major contenders.

But that leaves open opportunities for domain specific search services.

Law and medicine have specialized search engines. What commercially viable areas are missing them?

July 6, 2013

Norch – a search engine for node.js

Filed under: Indexing,node-js,Search Engines — Patrick Durusau @ 4:30 pm

Norch – a search engine for node.js by Fergus McDowall.

From the post:

Norch is a search engine written for Node.js. Norch uses the Node search-index module which is in turn written using the super fast levelDB library that Google open-sourced in 2011.

The aim of Norch is to make a simple, fast search server, that requires minimal configuration to set up. Norch sacrifices complex functionality for a limited robust feature set, that can be used to set up a freetext search engine for most enterprise scenarios.

Currently Norch features

  • Full text search
  • Stopword removal
  • Faceting
  • Filtering
  • Relevance weighting (tf-idf)
  • Field weighting
  • Paging (offset and resultset length)

Norch can index any data that is marked up in the appropriate JSON format.

Download the first release of Norch (0.2.1) here

See: https://github.com/fergiemcdowall/norch for various details and instructions.

Interesting, but I am curious what advantage Norch offers over Solr or Elasticsearch, for example?

June 27, 2013

Apache Solr volume 1 -….

Filed under: Lucene,Search Engines,Solr — Patrick Durusau @ 1:13 pm

Apache Solr V[olume] 1 – Introduction, Features, Recency Ranking and Popularity Ranking by Ramzi Alqrainy.

I amended the title to expand “V” to “volume.” Just seeing the “v” made me think version. Not true in this case.

Nothing new or earthshaking, but a nice overview of Solr.

It is a “read along” slide deck so the absence of a presenter won’t impair its usefulness.

June 26, 2013

Scaling Through Partitioning and Shard Splitting in Solr 4 (Webinar)

Filed under: Indexing,Search Engines,Solr — Patrick Durusau @ 3:28 pm

Scaling Through Partitioning and Shard Splitting in Solr 4 by Timothy Potter.

Date: Thursday, July 18, 2013
Time: 10:00am Pacific Time

From the post:

Over the past several months, Solr has reached a critical milestone of being able to elastically scale-out to handle indexes reaching into the hundreds of millions of documents. At Dachis Group, we’ve scaled our largest Solr 4 index to nearly 900M documents and growing. As our index grows, so does our need to manage this growth.

In practice, it’s common for indexes to continue to grow as organizations acquire new data. Over time, even the best designed Solr cluster will reach a point where individual shards are too large to maintain query performance. In this Webinar, you’ll learn about new features in Solr to help manage large-scale clusters. Specifically, we’ll cover data partitioning and shard splitting.

Partitioning helps you organize subsets of data based on data contained in your documents, such as a date or customer ID. We’ll see how to use custom hashing to route documents to specific shards during indexing. Shard splitting allows you to split a large shard into 2 smaller shards to increase parallelism during query execution.

Attendees will come away from this presentation with a real-world use case that proves Solr 4 is elastically scalable, stable, and is production ready.
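Both features are exposed through stock Solr 4 HTTP APIs; a sketch, with hypothetical host, collection and shard names:

```
# Shard splitting via the Collections API (Solr 4.3+): split shard1 of
# "mycollection" into two smaller shards.
http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1

# Partitioning via the compositeId router: an ID such as
# "customer42!doc123" is hashed on the "customer42" prefix, so all of
# that customer's documents land on the same shard.
```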

Just in time for when you finish your current Solr reading! 😉

Definitely on the calendar!

June 25, 2013

Hadoop for Everyone: Inside Cloudera Search

Filed under: Cloudera,Hadoop,Search Engines,Searching — Patrick Durusau @ 12:26 pm

Hadoop for Everyone: Inside Cloudera Search by Eva Andreasson.

From the post:

CDH, Cloudera’s 100% open source distribution of Apache Hadoop and related projects, has successfully enabled Big Data processing for many years. The typical approach is to ingest a large set of a wide variety of data into HDFS or Apache HBase for cost-efficient storage and flexible, scalable processing. Over time, various tools to allow for easier access have emerged — so you can now interact with Hadoop through various programming methods and the very familiar structured query capabilities of SQL.

However, many users with less interest in programmatic interaction have been shut out of the value that Hadoop creates from Big Data. And teams trying to achieve more innovative processing struggle with a time-efficient way to interact with, and explore, the data in Hadoop or HBase.

Helping these users find the data they need without the need for Java, SQL, or scripting languages inspired integrating full-text search functionality, via Cloudera Search (currently in beta), with the powerful processing platform of CDH. The idea of using search on the same platform as other workloads is the key — you no longer have to move data around to satisfy your business needs, as data and indices are stored in the same scalable and cost-efficient platform. You can also not only find what you are looking for, but within the same infrastructure actually “do” things with your data. Cloudera Search brings simplicity and efficiency for large and growing data sets that need to enable mission-critical staff, as well as the average user, to find a needle in an unstructured haystack!

As a workload natively integrated with CDH, Cloudera Search benefits from the same security model, access to the same data pool, and cost-efficient storage. In addition, it is added to the services monitored and managed by Cloudera Manager on the cluster, providing a unified production visibility and rich cluster management – a priceless tool for any cluster admin.

In the rest of this post, I’ll describe some of Cloudera Search’s most important features.

You have heard the buzz about Cloudera Search, now get a quick list of facts and pointers to more resources!

The most significant fact?

Cloudera Search uses Apache Solr.

If you are looking for search capabilities, what more need I say?

June 23, 2013

A new Lucene suggester based on infix matches

Filed under: Lucene,Search Engines — Patrick Durusau @ 8:39 am

A new Lucene suggester based on infix matches by Michael McCandless.

From the post:

Suggest, sometimes called auto-suggest, type-ahead search or auto-complete, is now an essential search feature ever since Google added it almost 5 years ago.

Lucene has a number of implementations; I previously described AnalyzingSuggester. Since then, FuzzySuggester was also added, which extends AnalyzingSuggester by also accepting mis-spelled inputs.

Here I describe our newest suggester: AnalyzingInfixSuggester, now going through iterations on the LUCENE-4845 Jira issue.

Unlike the existing suggesters, which generally find suggestions whose whole prefix matches the current user input, this suggester will find matches of tokens anywhere in the user input and in the suggestion; this is why it has Infix in its name.

You can see it in action at the example Jira search application that I built to showcase various Lucene features.
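A hypothetical sketch of wiring it up (4.4-era constructor; the path, query and suggestions are mine):

```java
import java.io.File;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.suggest.Lookup.LookupResult;
import org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester;
import org.apache.lucene.util.Version;

public class InfixSuggestSketch {
    public static void main(String[] args) throws Exception {
        AnalyzingInfixSuggester suggester = new AnalyzingInfixSuggester(
                Version.LUCENE_44,
                new File("/tmp/infix-suggest"),
                new StandardAnalyzer(Version.LUCENE_44));

        // suggester.build(...) would be fed your weighted suggestion
        // entries here, before any lookups.

        // Unlike a prefix suggester, "ghost" can match mid-suggestion,
        // e.g. a suggestion like "the ghost writer".
        List<LookupResult> hits = suggester.lookup("ghost", 5, true, true);
        for (LookupResult hit : hits) {
            System.out.println(hit.key + " (weight=" + hit.value + ")");
        }
    }
}
```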

Lucene is a flagship open source project. It just keeps pushing the boundaries of its area of interest.

June 22, 2013

Tips for Tuning Solr Search: No Coding Required [June 25, 2013]

Filed under: Search Engines,Searching,Solr — Patrick Durusau @ 4:40 pm

Tips for Tuning Solr Search: No Coding Required

Date & time: Tuesday, June 25, 2013 01:00 PM EDT
Duration: 60 min
Speakers: Nick Veenhof, Senior Search Engineer, Acquia

Description:

Helping online visitors easily find what they’re looking for is key to a website’s success. In this webinar, you’ll learn how to improve search in ways that don’t require any coding or code changes. We’ll show you easy modifications to tune up the relevancy to more advanced topics, such as altering the display or configuring advanced facets.

Acquia’s Senior Search Engineer, Nick Veenhof, will guide you step by step through improving the search functionality of a website, using an in-house version of an actual conference site.

Some of the search topics we’ll demonstrate include:

  • Clean faceted URL’s
  • Adding sliders, checkboxes, sorting and more to your facets
  • Complete customization of your search displays using Display Suite
  • Tuning relevancy by using Solr optimization

This webinar will make use of the Facet API module suite in combination with the Apache Solr Search Integration module suite. We’ll also use some generic modules to improve the search results that are independent of the search technology that is used. All of the examples shown are fully supported by Acquia Search.

I haven’t seen a webinar from Acquia, so I am going to take a chance and attend.

Some webinars are pure gold; others, well, extended infomercials at best.

Will be reporting back on the experience!


First complaint: Why the long registration form for the webinar? Phone number? What? Is your marketing department going to pester me into buying your product or service?

If you want to offer a webinar, name and email should be enough. You need to know how many attendees to allow for but more than that is a waste of your time and mine.

June 21, 2013

::MG4J: Managing Gigabytes for Java™

Filed under: Indexing,MG4J,Search Engines — Patrick Durusau @ 4:43 pm

::MG4J: Managing Gigabytes for Java™

From the webpage:

Release 5.0 has several source and binary incompatibilities, and introduces quasi-succinct indices [broken link]. Benchmarks on the performance of quasi-succinct indices can be found here; for instance, this table shows the number of seconds to answer 1000 multi-term queries on a document collection of 130 million web pages:


           MG4J    MG4J*   Lucene 3.6.2
Terms       70.9   132.1   130.6
And         27.5    36.7   108.8
Phrase      78.2           127.2
Proximity  106.5           347.6

Both engines were set to just enumerate the results without scoring. The column labelled MG4J* gives the timings of an artificially modified version in which counts for each retrieved document have been read (MG4J now stores document pointers and counts in separate files, but Lucene interleaves them, so it has to read counts compulsorily). Proximity queries are conjunctive queries that must be satisfied within a window of 16 words. The row labelled “Terms” gives the timings for enumerating the posting lists of all terms appearing in the queries.

I tried the link for “quasi-succinct indices” and it consistently returns a 404.

In lieu of that reference, see: Quasi-Succinct Indices by Sebastiano Vigna.

Abstract:

Compressed inverted indices in use today are based on the idea of gap compression: document pointers are stored in increasing order, and the gaps between successive document pointers are stored using suitable codes which represent smaller gaps using less bits. Additional data such as counts and positions is stored using similar techniques. A large body of research has been built in the last 30 years around gap compression, including theoretical modeling of the gap distribution, specialized instantaneous codes suitable for gap encoding, and ad hoc document reorderings which increase the efficiency of instantaneous codes. This paper proposes to represent an index using a different architecture based on quasi-succinct representation of monotone sequences. We show that, besides being theoretically elegant and simple, the new index provides expected constant-time operations and, in practice, significant performance improvements on conjunctive, phrasal and proximity queries.
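The “quasi-succinct” machinery here is the Elias-Fano representation of monotone sequences; its headline space bound is worth seeing (my summary, not from the abstract):

```latex
% A monotone sequence 0 \le x_1 \le \dots \le x_n \le u
% (e.g., n sorted document pointers in a collection of u documents)
% fits in
\text{space} \;\le\; n \left\lceil \log_2 \tfrac{u}{n} \right\rceil + 2n \ \text{bits},
% about 2 + \log_2(u/n) bits per pointer, provably less than half a
% bit per element above the information-theoretic minimum, while
% still supporting constant-time access to the i-th element.
```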

Heavy sledding, but with search results like those shown in the benchmark, well worth the time to master.

June 19, 2013

Terms filter lookup [ElasticSearch]

Filed under: ElasticSearch,Search Engines — Patrick Durusau @ 1:17 pm

Terms filter lookup by Zachary Tong.

From the post:

There is a new feature in the 0.90 branch that is pretty awesome: the Terms Filter now supports document lookups.

In a normal Terms Filter, you provide a list of Terms that you want to filter against. This is fine for small lists, but what if you have 100 terms? A thousand terms? That is a lot of data to pass over the wire. If that list of terms is stored in your index somewhere, you also have to retrieve it first…just so you can pass it back to Elasticsearch.

The new lookup feature tells Elasticsearch to use another document as your terms array. Instead of passing 1000 terms, you simply tell the Terms Filter “Hey, all the terms I want are in this document”. Elasticsearch will fetch that document internally, extract the terms and perform your query.
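The request body looks roughly like this (the 0.90 JSON DSL; the index, type and field names are hypothetical):

```json
{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "terms": {
          "user_id": {
            "index": "groups",
            "type": "group",
            "id": "devops",
            "path": "members"
          }
        }
      }
    }
  }
}
```

Elasticsearch fetches groups/group/devops internally, extracts the members array, and filters on those values as if you had sent them inline.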

Very cool!

Even has a non-Twitter example. 😉

Nutch/ElasticSearch News!

Filed under: ElasticSearch,Nutch,Search Engines — Patrick Durusau @ 12:51 pm

Apache Nutch-1527

To summarize: Elasticsearch indexer committed to the trunk of Apache Nutch in rev. 1494496.

Enjoy!

Apache Nutch: Web-scale search engine toolkit

Filed under: Nutch,Search Engines — Patrick Durusau @ 10:32 am

Apache Nutch: Web-scale search engine toolkit by Andrzej Białecki.

From the description:

This slideset presents the Nutch search engine (http://lucene.apache.org/nutch). A high-level architecture is described, as well as some challenges common in web-crawling and solutions implemented in Nutch. The presentation closes with a brief look into the Nutch future.

One of the best component-based descriptions of Nutch that I have ever seen.
