Search + Big Data: It’s (still) All About the User (Users or Documents?)

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

November 8, 2011

Search + Big Data: It’s (still) All About the User (Users or Documents?)

Filed under: Hadoop,Lucene,LucidWorks,Mahout,Solr,Topic Maps — Patrick Durusau @ 7:44 pm

Search + Big Data: It’s (still) All About the User by Grant Ingersoll.

Abstract:

Apache Hadoop has rapidly become the primary framework of choice for enterprises that need to store, process and manage large data sets. It helps companies to derive more value from existing data as well as collect new data, including unstructured data from server logs, social media channels, call center systems and other data sets that present new opportunities for analysis. This keynote will provide insight into how Apache Hadoop is being leveraged today and how it evolving to become a key component of tomorrow’s enterprise data architecture. This presentation will also provide a view into the important intersection between Apache Hadoop and search.

Awesome as always!

Please watch the presentation and review the slides before going further. What follows won’t make much sense without Grant’s presentation as a context. I’ll wait……

Back so soon?

On slide 4 (I said to review the slides), Grant presents four overlapping areas, starting with Documents: Models, Feature Selection; Content Relationships: Page Rank, etc., Organization; Queries: Phrases, NLP; User Interaction: Clicks, Ratings/Reviews, Learning to Rank, Social Graph; and the intersection of those four areas is where Grant says search is rapidly evolving.

On slide 5 (sorry, last slide reference), Grant say to mine that intersection is a loop composed of: Search -> Discovery -> Analytics -> (back to Search). All of which involve processing of data that has been collected from use of the search interface.

Grant’s presentation made clear something that I have been overlooking:

Search/Indexing, as commonly understood, does not capture any discoveries or insights of users.

Even the search trails that Grant mentions are just lemming tracks complete with droppings. You can follow them if you like, may find interesting data, may not.

My point being that there is no way to capture the user’s insight that LBJ, for instance, is a common acronym for Lyndon Baines Johnson. So that the next user who searches for LBJ will find the information contributed by a prior user. Such as distinguishing application of Lyndon Baines Johnson to a graduate school (Lyndon B. Johnson School of Public Affairs), a hospital (Lyndon B. Johnson General Hospital), a PBS show (American Experience . The Presidents . Lyndon B. Johnson), a biography (American President: Lyndon Baines Johnson), and that is in just the first ten (10) “hits.” Oh, and as the name of an American President.

Grant made that clear for me with his loop of Search -> Discovery -> Analytics -> (back to Search) because Search only ever focuses on the documents, never the user’s insight into the documents.

And with every search, every user (with the exception of search trails), starts over at the beginning.

What if a colleague found a bug in program code, but you have to start at the beginning of the program and work your way there. Good use of your time? To reset with every user? That is what happens with search, nearly a complete reset. (Not complete because of page rank, etc. but only just.)

If we are going to make it “All About the User,” shouldn’t we be indexing their insights* into data? (Big or otherwise.)

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

November 8, 2011

Search + Big Data: It’s (still) All About the User (Users or Documents?)

No Comments