Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 20, 2013

Big Data: Main Research/Business Challenges Ahead?

Filed under: Findability,Integration,Marketing,Personalization,Searching — Patrick Durusau @ 7:13 pm

Big Data Analytics at Thomson Reuters. Interview with Jochen L. Leidner by Roberto V. Zicari.

In case you don’t know, Jochen L. Leidner is the Lead Scientist of the London R&D group at Thomson Reuters.

Which goes a long way to explaining the importance of this Q&A exchange:

Q12 What are the main research challenges ahead? And what are the main business challenges ahead?

Jochen L. Leidner: Some of the main business challenges are the cost pressure that some of our customers face, and the increasing availability of low-cost or free-of-charge information sources, i.e. the commoditization of information. I would caution here that whereas the amount of information available for free is large, this in itself does not help you if you have a particular problem and cannot find the information that helps you solve it, either because the solution is not there despite the size, or because it is there but findability is low. Further challenges include information integration, making systems ever more adaptive, but only to the extent it is useful, or supporting better personalization. Having said this, sometimes systems need to be run in a non-personalized mode (e.g. in the field of e-discovery, you need to have a certain consistency, namely that the same legal search system retrieves the same things today and tomorrow, and the same things to different parties).

How are you planning to address the following?

  1. The required information is not available in the system (a semantic 404, as it were), as distinguished from the case where it is there but the wrong search terms are in use.
  2. Low findability.
  3. Information integration (not normalization).
  4. System adaptability/personalization, but for users, not developers.
  5. Search consistency: the same results tomorrow as today.

The rest of the interview is more than worth your time.

I singled out the research/business challenges as a possible map forward.

We all know where we have been.

November 6, 2013

elasticsearch 1.0.0.beta1 released

Filed under: ElasticSearch,Lucene,Search Engines,Searching — Patrick Durusau @ 8:04 pm

elasticsearch 1.0.0.beta1 released by Clinton Gormley.

From the post:

Today we are delighted to announce the release of elasticsearch 1.0.0.Beta1, the first public release on the road to 1.0.0. The countdown has begun!

You can download Elasticsearch 1.0.0.Beta1 here.

In each beta release we will add one major new feature, giving you the chance to try it out, to break it, to figure out what is missing and to tell us about it. Your use cases, ideas and feedback are essential to making Elasticsearch awesome.

The main feature we are showcasing in this first beta is Distributed Percolation.

WARNING: This is a beta release – it is not production ready, features are not set in stone and may well change in the next version, and once you have made any changes to your data with this release, it will no longer be readable by older versions!

distributed percolation

For those of you who aren’t familiar with percolation, it is “search reversed”. Instead of running a query to find matching docs, percolation allows you to find queries which match a doc. Think of people registering alerts like: tell me when a newspaper publishes an article mentioning “Elasticsearch”.

Percolation has been supported by Elasticsearch for a long time. In the current implementation, queries are stored in a special _percolator index which is replicated to all nodes, meaning that all queries exist on all nodes. The idea was to have the queries alongside the data.

But users are using it at a scale that we never expected, with hundreds of thousands of registered queries and high indexing rates. Having all queries on every node just doesn’t scale.

Enter Distributed Percolation.

In the new implementation, queries are registered under the special .percolator type within the same index as the data. This means that queries are distributed along with the data, and percolation can happen in a distributed manner across potentially all nodes in the cluster. It also means that an index can be made as big or small as required. The more nodes you have the more percolation you can do.
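
Percolation is easier to see with a request or two. Here is a minimal Java sketch against the 1.0-era REST API, assuming a local node at localhost:9200 and made-up index, type and query names; the JSON bodies follow the .percolator convention described above, but treat this as a sketch, not a reference.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class PercolateSketch {

    // Hypothetical local node; index/type/query names are made up.
    static final String BASE = "http://localhost:9200";

    static String call(String method, String path, String json) throws Exception {
        HttpURLConnection con = (HttpURLConnection) new URL(BASE + path).openConnection();
        con.setRequestMethod(method);
        con.setDoOutput(true);
        try (OutputStream out = con.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        try (Scanner s = new Scanner(con.getInputStream(), "UTF-8").useDelimiter("\\A")) {
            return s.hasNext() ? s.next() : "";
        }
    }

    public static void main(String[] args) throws Exception {
        // Register a query under the special .percolator type of the data index.
        call("PUT", "/articles/.percolator/elasticsearch-alert",
             "{\"query\": {\"match\": {\"body\": \"elasticsearch\"}}}");

        // Percolate a candidate document: which registered queries match it?
        String matches = call("POST", "/articles/message/_percolate",
             "{\"doc\": {\"body\": \"Newspaper publishes article mentioning Elasticsearch\"}}");

        // The response lists matching query ids, e.g. elasticsearch-alert.
        System.out.println(matches);
    }
}
```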

After reading the news release I understand why Twitter traffic on the elasticsearch release surged today. 😉

A new major feature with each beta release? That should attract some attention.

Not to mention “distributed percolation.”

Getting closer to a result being the “result” at X time on the system clock.

Introduction to Information Retrieval

Filed under: Classification,Indexing,Information Retrieval,Probabilistic Models,Searching — Patrick Durusau @ 5:10 pm

Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze.

A bit dated now (2008) but the underlying principles of information retrieval remain the same.

I have a hard copy but the additional materials and ability to cut-n-paste will make this a welcome resource!

We’d be pleased to get feedback about how this book works out as a textbook, what is missing, or covered in too much detail, or what is simply wrong. Please send any feedback or comments to: informationretrieval (at) yahoogroups (dot) com

Online resources

Apart from small differences (mainly concerning copy editing and figures), the online editions should have the same content as the print edition.

The following materials are available online.

Information retrieval resources

A list of information retrieval resources is also available.

Introduction to Information Retrieval: Table of Contents (each chapter available as PDF and HTML)

Front matter (incl. table of notations)

01. Boolean retrieval
02. The term vocabulary & postings lists
03. Dictionaries and tolerant retrieval
04. Index construction
05. Index compression
06. Scoring, term weighting & the vector space model
07. Computing scores in a complete search system
08. Evaluation in information retrieval
09. Relevance feedback & query expansion
10. XML retrieval
11. Probabilistic information retrieval
12. Language models for information retrieval
13. Text classification & Naive Bayes
14. Vector space classification
15. Support vector machines & machine learning on documents
16. Flat clustering
17. Hierarchical clustering
18. Matrix decompositions & latent semantic indexing
19. Web search basics
20. Web crawling and indexes
21. Link analysis

Bibliography & Index, plus a BibTeX file of the references.

November 3, 2013

Penguins in Sweaters…

Filed under: Searching,Semantics,Serendipity — Patrick Durusau @ 8:38 pm

Penguins in Sweaters, or Serendipitous Entity Search on User-generated Content by Ilaria Bordino, Yelena Mejova and Mounia Lalmas.

Abstract:

In many cases, when browsing the Web users are searching for specific information or answers to concrete questions. Sometimes, though, users find unexpected, yet interesting and useful results, and are encouraged to explore further. What makes a result serendipitous? We propose to answer this question by exploring the potential of entities extracted from two sources of user-generated content – Wikipedia, a user-curated online encyclopedia, and Yahoo! Answers, a more unconstrained question/answering forum – in promoting serendipitous search. In this work, the content of each data source is represented as an entity network, which is further enriched with metadata about sentiment, writing quality, and topical category. We devise an algorithm based on lazy random walk with restart to retrieve entity recommendations from the networks. We show that our method provides novel results from both datasets, compared to standard web search engines. However, unlike previous research, we find that choosing highly emotional entities does not increase user interest for many categories of entities, suggesting a more complex relationship between topic matter and the desirable metadata attributes in serendipitous search.

From the introduction:

A system supporting serendipity must provide results that are surprising, semantically cohesive, i.e., relevant to some information need of the user, or just interesting. In this paper, we tackle the question of what makes a result serendipitous.
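
The ranking machinery behind the paper is a random walk with restart (RWR). As a toy illustration only (the authors' version runs over large entity networks enriched with sentiment, quality and category metadata), here is a plain power-iteration sketch in Java; the graph and its weights are invented.

```java
import java.util.Arrays;

/** Random walk with restart by power iteration over a toy entity graph. */
public class RandomWalkWithRestart {

    static double[] rwr(double[][] adj, int seed, double restart, int iters) {
        int n = adj.length;
        // Column-normalize the adjacency matrix into transition probabilities.
        double[][] p = new double[n][n];
        for (int j = 0; j < n; j++) {
            double colSum = 0;
            for (int i = 0; i < n; i++) colSum += adj[i][j];
            for (int i = 0; i < n; i++) p[i][j] = colSum == 0 ? 0 : adj[i][j] / colSum;
        }
        double[] rank = new double[n];
        rank[seed] = 1.0; // start at the query entity
        for (int it = 0; it < iters; it++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    next[i] += (1 - restart) * p[i][j] * rank[j];
            next[seed] += restart; // restart mass flows back to the seed
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Entities 0..3; edge weights stand in for co-occurrence counts (made up).
        double[][] adj = {
            {0, 3, 1, 0},
            {3, 0, 2, 1},
            {1, 2, 0, 4},
            {0, 1, 4, 0}
        };
        // Entities ranked by proximity to the seed become recommendation candidates.
        System.out.println(Arrays.toString(rwr(adj, 0, 0.15, 50)));
    }
}
```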

Serendipity, now that would make a very interesting product demonstration!

In particular if the search results were interesting to the client.

I must admit when I saw the first part of the title I was expecting an article on Linux. 😉


October 26, 2013

Integrating Nutch 1.7 with ElasticSearch

Filed under: ElasticSearch,Nutch,Searching — Patrick Durusau @ 3:03 pm

Integrating Nutch 1.7 with ElasticSearch

From the post:

With Nutch 1.7 the possibility for integrating with ElasticSearch became available. However setting up the integration turned out to be quite a treasure hunt for me. For anybody else wanting to achieve the same result without tearing out as much hair as I did please find some simple instructions on this page that hopefully will help you in getting Nutch to talk to ElasticSearch.

I’m assuming you have both Nutch and ElasticSearch running fine, by which I mean that Nutch does its crawl, fetch, parse thing and ElasticSearch is doing its indexing and searching magic, however not yet together.

All of the work involved is in Nutch and you need to edit nutch-site.xml in the conf directory to get things going. First off you need to activate the elasticsearch indexer plugin by adding the following line to nutch-site.xml:

A post that will be much appreciated by anyone who wants to integrate Nutch with ElasticSearch.

A large number of software issues are matters of configuration, once you know the configuration.

The explorers who find those configurations and share them with others are under appreciated.

October 24, 2013

The Gap Between Documents and Answers

Filed under: Search Behavior,Search Engines,Searching,Semantic Search — Patrick Durusau @ 1:49 pm

I mentioned the webinar: Driving Knowledge-Worker Performance with Precision Search Results a few days ago in Findability As Value Proposition.

There was one nugget (among many) in the webinar that I want to capture before I lose sight of how important it is to topic maps and semantic technologies in general.

Dan Taylor (Earley and Associates) was presenting a maturation diagram for knowledge technologies.

See the presentation for the details but what struck me was that on the left side (starting point) there were documents. On the right side (the goal) were answers.

Think about that for a moment.

When you search in Google or any other search engine, what do you get back? Pointers to documents, presentations, videos, etc.

What task remains? Digging out answers from those documents, presentations, videos.

A mature knowledge technology goes beyond what an average user is searching for (the Google model) and returns information based on a specific user for a particular domain, that is, an answer.

For the average user there may be no better option than to drop them off in the neighborhood of a correct answer. Or what may be a correct answer to the average user. No guarantees that you will find it.

The examples in the webinar are in specific domains where user queries can be modeled accurately enough to formulate answers (not documents) in response.

Reminds me of TaxMap. You?

If you want to do a side by side comparison, try USC: Title 26 – Internal Revenue Code from the Legal Information Institute (Cornell).

Don’t get me wrong, the Cornell materials are great but they reflect the U.S. Code, nothing more or less. That is to say the text you find there isn’t engineered to provide answers. 😉

I will update this post with the webinar address as soon as it appears.

October 11, 2013

Free Text and Spatial Search…

Filed under: Lucene,Searching,Spatial Index — Patrick Durusau @ 3:08 pm

Free Text and Spatial Search with Spatial4J and Lucene Spatial by Steven Citron-Pousty.

From the post:

Hey there, Shifters. One of my talks at FOSS4G 2013 covered Lucene Spatial. Today’s post is going to follow up on my post about creating Lucene indices by adding spatial capabilities to the index. In the end you will have a full example of how to create a fast and full-featured full text spatial search on any documents you want to use.

How to add spatial to your Lucene index

In the last post I covered how to create a Lucene index so in this post I will just cover how to add spatial. The first thing you need to understand is the two pieces of how spatial is handled by Lucene. A lot of this work is done by Dave Smiley. He gave a great presentation on all this technology at Lucene/Solr Revolution 2013. If you really want to dig in deep, I suggest you watch his hour-and-15-minute video – my blog post is more the Too Long Didn’t Listen (TL;DL) version.

  • Spatial4J: This Java library provides geospatial shapes, distance calculations, and importing and exporting shapes. It is Apache licensed so it can be used with other ASF projects. Lucene Spatial uses Spatial4J to create the spatial objects that get indexed along with the documents. It will also be used when calculating distances in a query or when we want to convert between distance units. Spatial4J is able to handle real-world on-a-sphere coordinates (what comes out of a GPS unit) and projected coordinates (any 2D map) for both shapes and distances.

Short aside: The oldest Java-based spatial library is JTS, which is used in many other open source Java geospatial projects. Spatial4J uses JTS under the hood if you want to work with Polygon shapes. Unfortunately, until recently it was LGPL and so could not be included in Lucene. JTS has announced its intention to move to a BSD-type license, which should allow Spatial4J and JTS to start working together for more Java spatial goodness for all. One of the beauties of FOSS is the ability to see development discussions happen in the open.

  • Lucene Spatial: After many different and custom iterations, there is now a spatial module built right into Lucene as a standard library. It is new with the 4.x releases of Lucene. What Lucene Spatial does is provide the indexing and search strategies for Spatial4J shapes stored in a Lucene index. It has SpatialStrategy as the base class to define the signature that any spatial strategy must fulfill. You then use the same strategy for the index writing and reading.

Today I will show the code to use Spatial4J with Lucene Spatial to add a spatially indexed field to your Lucene index.
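
To give a flavor of what that code looks like, here is a hedged Java sketch along the lines of the Lucene 4.x spatial example, assuming Lucene Spatial 4.x and Spatial4J on the classpath; the field name and precision level are arbitrary choices, and Steven's post has the full, working version.

```java
import com.spatial4j.core.context.SpatialContext;
import com.spatial4j.core.distance.DistanceUtils;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.search.Filter;
import org.apache.lucene.spatial.SpatialStrategy;
import org.apache.lucene.spatial.prefix.RecursivePrefixTreeStrategy;
import org.apache.lucene.spatial.prefix.tree.GeohashPrefixTree;
import org.apache.lucene.spatial.query.SpatialArgs;
import org.apache.lucene.spatial.query.SpatialOperation;

public class SpatialIndexSketch {

    // Geodetic context: shapes and distances live on a sphere.
    static final SpatialContext ctx = SpatialContext.GEO;

    // Geohash grid; level 11 is roughly meter precision. Field name is our choice.
    static final SpatialStrategy strategy =
        new RecursivePrefixTreeStrategy(new GeohashPrefixTree(ctx, 11), "location");

    static Document newDoc(String id, double lon, double lat) {
        Document doc = new Document();
        doc.add(new StringField("id", id, Store.YES));
        // The strategy turns a shape into one or more indexable fields.
        for (Field f : strategy.createIndexableFields(ctx.makePoint(lon, lat))) {
            doc.add(f);
        }
        return doc;
    }

    static Filter within10km(double lon, double lat) {
        // Query shape: a circle whose radius is converted from km to degrees.
        SpatialArgs args = new SpatialArgs(SpatialOperation.Intersects,
            ctx.makeCircle(lon, lat,
                DistanceUtils.dist2Degrees(10, DistanceUtils.EARTH_MEAN_RADIUS_KM)));
        return strategy.makeFilter(args);
    }
}
```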

Pay special attention to the changes that made it possible for Spatial4J and JTS to work together.

Cooperation between projects makes the resulting whole stronger.

Some office projects need to come to that realization.

October 9, 2013

Explore the world’s constitutions with a new online tool

Filed under: Law,Searching — Patrick Durusau @ 7:52 pm

Explore the world’s constitutions with a new online tool

From the post:

Constitutions are as unique as the people they govern, and have been around in one form or another for millennia. But did you know that every year approximately five new constitutions are written, and 20-30 are amended or revised? Or that Africa has the youngest set of constitutions, with 19 out of the 39 constitutions written globally since 2000 from the region?

With this in mind, Google Ideas supported the Comparative Constitutions Project to build Constitute, a new site that digitizes and makes searchable the world’s constitutions. Constitute enables people to browse and search constitutions via curated and tagged topics, as well as by country and year. The Comparative Constitutions Project cataloged and tagged nearly 350 themes, so people can easily find and compare specific constitutional material. This ranges from the fairly general, such as “Citizenship” and “Foreign Policy,” to the very specific, such as “Suffrage and turnouts” and “Judicial Autonomy and Power.”

I applaud the effort but wonder about the claim that people can “easily find and compare specific constitutional material.”

Legal systems are highly contextual.

See the Constitution Annotated (U.S.) if you want to see interpretations of words that would not occur to you. Promise.

ElasticHQ

Filed under: ElasticSearch,Searching — Patrick Durusau @ 7:33 pm

ElasticHQ

From the homepage:

Real-Time Monitoring

From monitoring individual cluster nodes, to viewing real-time threads, ElasticHQ enables up-to-the-second insight into ElasticSearch cluster runtime metrics and configurations, using the ElasticSearch REST API. ElasticHQ’s real-time update feature works by polling your ElasticSearch cluster intermittently, always pulling the latest aggregate information and deltas, keeping you up-to-date with the internals of your working cluster.

Full Cluster Management

Elastic HQ gives you complete control over your ElasticSearch clusters, nodes, indexes, and mappings. The sleek, intuitive UI gives you all the power of the ElasticSearch Admin API, without having to tangle with REST and large cumbersome JSON requests and responses.

Search and Query

Easily find what you’re looking for by querying a specific Index or several Indices at once. ElasticHQ provides a Query interface, along with all of the other Administration UI features.

No Software to Install

ElasticHQ does not require any software. It works in your web browser, allowing you to manage and monitor your ElasticSearch clusters from anywhere at any time. Built on responsive CSS design, ElasticHQ adjusts itself to any screen size on any device.

I don’t know of any compelling reason to make ElasticSearch management and monitoring difficult for sysadmins. 😉

If approaches like ElasticHQ make their lives easier, perhaps they won’t begrudge users having better UIs as well.

October 8, 2013

Quepid [Topic Map Tuning?]

Filed under: Recommendation,Searching,Solr — Patrick Durusau @ 4:36 pm

Measure and Improve Search Quality with Quepid by Doug Turnbull.

From the post:

Let’s face it, returning good search results means making money. To this end, we’re often hired to tune search to ensure that search results are as close as possible to the intent of a user’s search query. Matching users’ intent to results, what we call “relevancy”, is what gets us up in the morning. It’s what drives us to think hard about the dark mysteries of tuning Solr or machine-learning topics such as recommendation-based product search.

While we can do amazing feats of wizardry to make individual improvements, it’s impossible with today’s tools to do much more than prove that one problem has been solved. Search engines rank results based on a single set of rules. This single set of rules is in charge of how all searches are ranked. It’s very likely that even as we solve one problem by modifying those rules, we create another problem — or dozens of them, perhaps far more devastating than the original problem we solved.

Quepid is our instant search quality testing product. Born out of our years of experience tuning search, Quepid has become our go to tool for relevancy problems. Built around the idea of Test Driven Relevancy, Quepid allows the search developer to collaborate with product and content experts to

  1. Identify, store, and execute important queries
  2. Provide statistics/rankings that measure the quality of a search query
  3. Tune search relevancy
  4. Immediately visualize the impact of tuning on queries
  5. Rinse & Repeat Instantly

The result is a tool that empowers search developers to experiment with the impact of changes across the search experience and prove to their bosses that nothing broke. Confident that data will prove or disprove their ideas instantly, developers are even freer to experiment than they might ever have before.

Any thoughts on automating a similar cycle to test the adding of subjects to a topic map?

Or adding subject identifiers that would trigger additional merging?

Or just reporting the merging over and above what was already present?
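
On automating such a cycle: one low-tech approach is a standing set of judged queries replayed after every change, whether the change is a relevancy tweak or a new round of merging. A minimal sketch, assuming a local Solr core and made-up judgments (the crude regex JSON scrape is for brevity only, not something Quepid does):

```java
import java.net.URL;
import java.net.URLEncoder;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy test-driven relevancy check: precision@10 for a set of stored queries. */
public class RelevancyRegression {

    static final String SOLR = "http://localhost:8983/solr/products/select";

    static List<String> topIds(String query) throws Exception {
        String url = SOLR + "?wt=json&rows=10&fl=id&q=" + URLEncoder.encode(query, "UTF-8");
        String json = new Scanner(new URL(url).openStream(), "UTF-8").useDelimiter("\\A").next();
        List<String> ids = new ArrayList<>();
        Matcher m = Pattern.compile("\"id\":\"(.*?)\"").matcher(json); // crude JSON scrape
        while (m.find()) ids.add(m.group(1));
        return ids;
    }

    public static void main(String[] args) throws Exception {
        // Judged queries: expert-approved relevant doc ids per query (invented).
        Map<String, Set<String>> judgments = new LinkedHashMap<>();
        judgments.put("organic milk", new HashSet<>(Arrays.asList("p12", "p98")));
        judgments.put("gluten free bread", new HashSet<>(Arrays.asList("p7")));

        for (Map.Entry<String, Set<String>> e : judgments.entrySet()) {
            List<String> got = topIds(e.getKey());
            long hits = got.stream().filter(e.getValue()::contains).count();
            System.out.printf("%-20s precision@10 = %.2f%n",
                e.getKey(), got.isEmpty() ? 0.0 : (double) hits / got.size());
        }
        // Re-run after every tweak; falling scores mean you broke something.
    }
}
```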

Search-Aware Product Recommendation in Solr (Users vs. Experts?)

Filed under: Interface Research/Design,Recommendation,Searching,Solr — Patrick Durusau @ 10:43 am

Search-Aware Product Recommendation in Solr by John Berryman.

From the post:

Building upon earlier work with semantic search, OpenSource Connections is excited to unveil exciting new possibilities with Solr-based product recommendation. With this technology, it is now possible to serve user-specific, search-aware product recommendations directly from Solr.

In this post, we will review a simple Search-Aware Recommendation using an online grocery service as an example of e-commerce product recommendation. In this example I have built up a basic keyword search over the product catalog. We’ve also added two fields to Solr: purchasedByTheseUsers and recommendToTheseUsers. Both fields contain lists of userIds. Recall that each document in the index corresponds to a product. Thus the purchasedByTheseUsers field literally lists all of the users who have purchased said product. The next field, recommendToTheseUsers, is the special sauce. This field lists all users who might want to purchase the corresponding product. We have extracted this field using a process called collaborative filtering, which is described in my previous post, Semantic Search With Solr And Python Numpy. With collaborative filtering, we make product recommendations by mathematically identifying similar users (based on products purchased) and then providing recommendations based upon the items that these users have purchased.

Now that the background has been established, let’s look at the results. Here we search for 3 different products using two different, randomly-selected users who we will refer to as Wendy and Dave. For each product: We first perform a raw search to gather a base understanding about how the search performs against user queries. We then search for the intersection of these search results and the products recommended to Wendy. Finally we also search for the intersection of these search results and the products recommended to Dave.
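
In query terms, the intersection John describes is just the keyword search plus a filter on the recommendation field. A hedged SolrJ 4.x sketch, with hypothetical core, field and user names:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SearchAwareRecommendation {
    public static void main(String[] args) throws Exception {
        // SolrJ 4.x client; core name and user ids are assumptions.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/groceries");

        // Raw keyword search, for a baseline.
        SolrQuery raw = new SolrQuery("cheddar cheese");

        // Intersection with Wendy's recommendations: the same search, filtered
        // to products whose recommendToTheseUsers list contains her id.
        SolrQuery forWendy = new SolrQuery("cheddar cheese");
        forWendy.addFilterQuery("recommendToTheseUsers:wendy");

        QueryResponse r = solr.query(forWendy);
        r.getResults().forEach(doc -> System.out.println(doc.getFieldValue("id")));
    }
}
```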

BTW, don’t miss the invitation to be an alpha tester for Solr Search-Aware Product Recommendation at the end of John’s post.

Reading John’s post it occurred to me that as an alternative to mining other users’ choices, you could have an expert develop the recommendations.

Much like we use experts to develop library classification systems.

But we don’t, do we?

Isn’t that interesting?

I suspect we don’t use experts for product recommendations because we know that shopping choices depend on a similarity between consumers.

We may not know what the precise nature of the similarity may be, but it is sufficient that we can establish its existence in the aggregate and sell more products based upon it.

Shouldn’t the same be true for finding information or data?

If similar (in some possibly unknown way) consumers of information find information in similar ways, why don’t we organize information based on similar patterns of finding?

How an “expert” finds information may be more “precise” or “accurate,” but if a user doesn’t follow that path, the user doesn’t find the information.

A great path that doesn’t help users find information is like a great road with sidewalks, a bike path, cross-walks, and good signage that goes nowhere.

How do you incorporate user paths in your topic map application?

September 27, 2013

Google Alters Search… [Pushy Suggestions]

Filed under: Google Knowledge Graph,Search Engines,Searching — Patrick Durusau @ 4:27 pm

Google Alters Search to Handle More Complex Queries by Claire Cain Miller.

From the post:

Google on Thursday announced one of the biggest changes to its search engine, a rewriting of its algorithm to handle more complex queries that affects 90 percent of all searches.

The change, which represents a new approach to search for Google, required the biggest changes to the company’s search algorithm since 2000. Now, Google, the world’s most popular search engine, will focus more on trying to understand the meanings of and relationships among things, as opposed to its original strategy of matching keywords.

The company made the changes, executives said, because Google users are asking increasingly long and complex questions and are searching Google more often on mobile phones with voice search.

“They said, ‘Let’s go back and basically replace the engine of a 1950s car,’ ” said Danny Sullivan, founding editor of Search Engine Land, an industry blog. “It’s fair to say the general public seemed not to have noticed that Google ripped out its engine while driving down the road and replaced it with something else.”

One of the “other” changes is “pushy suggestions.”

In the last month I have noticed that if my search query is short that I will get Google’s suggested completion rather than my search request.

How short? It just has to be shorter than the completion suggested by Google.

A simple return means Google adopts its suggestion and not your request.

You don’t believe me?

OK, type in:

charter

Note the autocompletion to:

charter.com

That’s OK if I am searching for the cable company but not if I am searching for “charter” as in a charter for technical work.

I am required to actively avoid the suggestion by Google.

I can avoid Google’s “pushy suggestions” by hitting the space bar.

But like many people, I toss off Google searches without ever looking at the search or URL box. I don’t look up until I have the results. And now sometimes the wrong results.

I would rather have a search engine execute my search by default and its suggestions only when asked.

How about you?

September 23, 2013

Broadening Google Patents [Patent Troll Indigestion]

Filed under: Law,Patents,Searching — Patrick Durusau @ 12:42 pm

Broadening Google Patents by Jon Orwant.

From the post:

Last year, we launched two improvements to Google Patents: the Prior Art Finder and European Patent Office (EPO) patents. Today we’re happy to announce the addition of documents from four new patent agencies: China, Germany, Canada, and the World Intellectual Property Organization (WIPO). Many of these documents may provide prior art for future patent applications, and we hope their increased discoverability will improve the quality of patents in the U.S. and worldwide.

The broadening of Google Patents is welcome news!

Especially following the broadening of “prior art” under the America Invents Act (AIA).

On the expansion of prior art, such as publication before date of filing the patent (old rule was before the date of invention), a good summary can be found at: The Changing Boundaries of Prior Art under the AIA: What Your Company Needs to Know.

The information you find needs to remain found, intertwined with other information you find.

Regular search engines won’t help you there. May I suggest topic maps?

September 22, 2013

…Introducing … Infringing Content Online

Filed under: Intellectual Property (IP),Search Engines,Searching — Patrick Durusau @ 12:43 pm

New Study Finds Search Engines Play Critical Role in Introducing Audiences To Infringing Content Online

From the summary at Full Text Reports:

Today, MPAA Chairman Senator Chris Dodd joined Representatives Howard Coble, Adam Schiff, Marsha Blackburn and Judy Chu on Capitol Hill to release the results of a new study that found that search engines play a significant role in introducing audiences to infringing movies and TV shows online. Infringing content is a TV show or movie that has been stolen and illegally distributed online without any compensation to the show or film’s owner.

The study found that search is a major gateway to the initial discovery of infringing content online, even in cases when the consumer was not looking for infringing content. 74% of consumers surveyed cited using a search engine as a navigational tool the first time they arrived at a site with infringing content. And the majority of searches (58%) that led to infringing content contained only general keywords — such as the titles of recent films or TV shows, or phrases related to watching films or TV online — and not specific keywords aimed at finding illegitimate content.

I rag on search engines fairly often about the quality of their results so in light of this report, I wanted to give them a shout out of: Well done!

They may not be good at the sophisticated content discovery that I find useful, but on the other hand, when sweat hogs are looking for entertainment, search content can fill the bill.

On the other hand, knowing that infringing content can be found may be good for PR purposes but not much more. Search results don’t capture (read: identify) enough subjects to enable the mining of patterns of infringement and other data analysis relevant to opposing infringement.

Infringing content is easy to find, so the business case for topic maps lies with content providers, who need more detail (read: subjects and associations) than a search engine can provide.

New Study Finds Search Engines Play Critical Role in Introducing Audiences To Infringing Content Online (PDF of the news release)


Update: Understanding the Role of Search in Online Piracy. The full report. Additional detail but no links to the data.

September 21, 2013

Search Rules using Mahout’s Association Rule Mining

Filed under: Machine Learning,Mahout,Searching — Patrick Durusau @ 2:05 pm

Search Rules using Mahout’s Association Rule Mining by Sujit Pal.

From the post:

This work came about based on a conversation with one of our domain experts, who was relaying a conversation he had with one of our clients. The client was looking for ways to expand the query based on terms already in the query – for example, if a query contained “cattle” and “neurological disorder”, then we should also serve results for “bovine spongiform encephalopathy”, also known as “mad cow disease”.

We do semantic search, which involves annotating words and phrases in documents with concepts from our taxonomy. One view of an annotated document is the bag of concepts view, where a document is modeled as a sparsely populated array of scores, each position corresponding to a concept. One way to address the client’s requirement would be to do Association Rule Mining on the concepts, looking for significant co-occurrences of a set of concepts per document across the corpus.

The data I used to build this proof-of-concept came from one of my medium sized indexes, and contains 12,635,756 rows and 342,753 unique concepts. While Weka offers the Apriori algorithm, I suspect that it won’t be able to handle this data volume. Mahout is probably a better fit, and it offers the FPGrowth algorithm running on Hadoop, so that’s what I used. This post describes the things I had to do to prepare my data for Mahout, run the job with Mahout on the Amazon Elastic Map Reduce (EMR) platform, then post-process the data to get useful information out of it.
(…)

I don’t know that I would call these “search rules” but they would certainly qualify as input into defining merging rules.

Particularly if I were mining domain literature where co-occurrences of terms are likely to have the same semantics. Not always, but likely. The likelihood of semantic sameness is something you can sample for and develop confidence measures about.
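
For sampling along those lines, the quantities you care about are support and confidence. A small Java sketch of the arithmetic (this is the idea behind what the FPGrowth job emits, not the Mahout API; the concept sets are invented):

```java
import java.util.*;

/** Support and confidence for concept pairs across a toy corpus. */
public class ConceptAssociations {
    public static void main(String[] args) {
        // Each row: the set of concept ids annotated on one document (invented).
        List<Set<String>> docs = Arrays.asList(
            new HashSet<>(Arrays.asList("cattle", "neurological-disorder", "bse")),
            new HashSet<>(Arrays.asList("cattle", "bse")),
            new HashSet<>(Arrays.asList("cattle", "dairy")),
            new HashSet<>(Arrays.asList("neurological-disorder", "bse")));

        Map<String, Integer> single = new HashMap<>();
        Map<String, Integer> pair = new HashMap<>();
        for (Set<String> d : docs) {
            for (String a : d) {
                single.merge(a, 1, Integer::sum);
                for (String b : d) {
                    if (a.compareTo(b) < 0) pair.merge(a + " & " + b, 1, Integer::sum);
                }
            }
        }
        // confidence(a -> a,b) = support(a,b) / support(a); high values suggest
        // a pair that co-occurs often enough to drive query expansion (or merging).
        pair.forEach((p, n) -> {
            String a = p.split(" & ")[0];
            System.out.printf("%s: support=%d confidence(%s)=%.2f%n",
                p, n, a, (double) n / single.get(a));
        });
    }
}
```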

September 19, 2013

Context Aware Searching

Filed under: Context,RDF,Searching,Semantic Graph,Semantic Web — Patrick Durusau @ 9:53 am

Scaling Up Personalized Query Results for Next Generation of Search Engines

From the post:

North Carolina State University researchers have developed a way for search engines to provide users with more accurate, personalized search results. The challenge in the past has been how to scale this approach up so that it doesn’t consume massive computer resources. Now the researchers have devised a technique for implementing personalized searches that is more than 100 times more efficient than previous approaches.

At issue is how search engines handle complex or confusing queries. For example, if a user is searching for faculty members who do research on financial informatics, that user wants a list of relevant webpages from faculty, not the pages of graduate students mentioning faculty or news stories that use those terms. That’s a complex search.

“Similarly, when searches are ambiguous with multiple possible interpretations, traditional search engines use impersonal techniques. For example, if a user searches for the term ‘jaguar speed,’ the user could be looking for information on the Jaguar supercomputer, the jungle cat or the car,” says Dr. Kemafor Anyanwu, an assistant professor of computer science at NC State and senior author of a paper on the research. “At any given time, the same person may want information on any of those things, so profiling the user isn’t necessarily very helpful.”

Anyanwu’s team has come up with a way to address the personalized search problem by looking at a user’s “ambient query context,” meaning they look at a user’s most recent searches to help interpret the current search. Specifically, they look beyond the words used in a search to associated concepts to determine the context of a search. So, if a user’s previous search contained the word “conservation” it would be associated with concepts like “animals” or “wildlife” and even “zoos.” Then, a subsequent search for “jaguar speed” would push results about the jungle cat higher up in the results — and not the automobile or supercomputer. And the more recently a concept has been associated with a search, the more weight it is given when ranking results of a new search.
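
The mechanics are easy to sketch: keep the concepts from recent searches, decay their weight with age, and boost results that share them. A toy Java illustration only, with invented concepts and an invented decay rate (the paper's actual approach is index-based and far more involved):

```java
import java.util.*;

/** Recency-weighted ambient context: boost results sharing concepts with recent searches. */
public class AmbientContextRanker {

    // Concepts from the user's last few queries, most recent first (invented).
    static final List<String> recentConcepts =
        Arrays.asList("conservation", "wildlife", "animals");

    static double contextBoost(Set<String> resultConcepts) {
        double boost = 0, weight = 1.0;
        for (String c : recentConcepts) {
            if (resultConcepts.contains(c)) boost += weight;
            weight *= 0.5; // older context counts for less
        }
        return boost;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> results = new LinkedHashMap<>();
        results.put("Jaguar the car", new HashSet<>(Arrays.asList("automobile", "speed")));
        results.put("Jaguar the cat", new HashSet<>(Arrays.asList("animals", "wildlife", "speed")));

        results.entrySet().stream()
            .sorted((a, b) -> Double.compare(contextBoost(b.getValue()),
                                             contextBoost(a.getValue())))
            .forEach(e -> System.out.println(e.getKey()));
        // With "conservation" in the ambient context, the jungle cat outranks the automobile.
    }
}
```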

I rather like the contrast of ambiguous searches being resolved with “impersonal techniques.”

The paper, Scaling Concurrency of Personalized Semantic Search over Large RDF Data by Haizhou Fu, Hyeongsik Kim, and Kemafor Anyanwu, has this abstract:

Recent keyword search techniques on the Semantic Web are moving away from shallow, information retrieval-style approaches that merely find “keyword matches” towards more interpretive approaches that attempt to induce structure from keyword queries. The process of query interpretation is usually guided by structures in data and schema, and is often supported by a graph exploration procedure. However, graph exploration-based interpretive techniques are impractical for multi-tenant scenarios for large databases because separate expensive graph exploration states need to be maintained for different user queries. This leads to significant memory overhead in situations of large numbers of concurrent requests. This limitation could negatively impact the possibility of achieving the ultimate goal of personalizing search. In this paper, we propose a lightweight interpretation approach that employs indexing to improve throughput and concurrency with much less memory overhead. It is also more amenable to distributed or partitioned execution. The approach is implemented in a system called “SKI” and an experimental evaluation of SKI’s performance on the DBPedia and Billion Triple Challenge datasets shows orders-of-magnitude performance improvement over existing techniques.

If you are interested in scaling issues for topic maps, note the use of indexing as opposed to graph exploration techniques in this paper.

Also consider mining “discovered” contexts that lead to “better” results from the viewpoint of users. Those could be the seeds for serializing those contexts as topic maps.

Perhaps even directly applicable to work by researchers, librarians, intelligence analysts.

Seasoned searchers use richer contexts in searching than the average user, and if those contexts are captured, they could enrich the search contexts of the average user.

September 16, 2013

Building better search tools: problems and solutions

Filed under: Search Behavior,Search Engines,Searching — Patrick Durusau @ 4:38 pm

Building better search tools: problems and solutions by Vincent Granville.

From the post:

Have you ever done a Google search for mining data? It returns the same results as for data mining. Yet these are two very different keywords: mining data usually means data about mining. And if you search for data about mining you still get the same results anyway.

(graphic omitted)

Yet Google has one of the best search algorithms. Imagine an e-store selling products, allowing users to search for products via a catalog powered with search capabilities, but returning irrelevant results 20% of the time. What a loss of money! Indeed, if you were an investor looking on Amazon to purchase a report on mining data, all you will find are books on data mining and you won’t buy anything: possibly a $500 loss for Amazon. Repeat this million times a year, and the opportunity cost is in billions of dollars.

There are a few issues that make this problem difficult to fix. While the problem is straightforward for decision makers, CTOs, or CEOs to notice, understand and assess the opportunity cost (just run 200 high-value random search queries and see how many return irrelevant results), the communication between the analytic teams and business people is faulty: there is a short somewhere.

There might be multiple analytics teams working as silos – computer scientists, statisticians, engineers – sometimes aggressively defending their own turfs and having conflicting opinions. What the decision makers eventually hear is a lot of noise and lots of technicalities, and they don’t know how to start, how much it will cost to fix it, how complex the issue is, and who should fix it.

Here I discuss the solution and explain it in very simple terms, to help any business with a search engine and an analytics team easily fix the issue.

Vincent has some clever insights into this particular type of search problem but I think it falls short of being “easily” fixed.

Read his original post and see if you think the solution is an “easy” one.

Questions

Filed under: Humor,Searching — Patrick Durusau @ 4:26 pm

Greg Linden pointed out an excellent xkcd cartoon composed of auto-completed questions from Google.

Maximize your enjoyment by entering a few of the terms in your search box.

The auto-completed questions and their “answers” may surprise you.

September 13, 2013

Client-side full-text search in CSS

Filed under: CSS3,Full-Text Search,Searching — Patrick Durusau @ 4:40 pm

Client-side full-text search in CSS by François Zaninotto.

Not really “full-text search” in any meaningful sense of the phrase.

But I can imagine it being very useful and the comments to his post about “appropriate” use of CSS are way off base.

The only value of CSS or Javascript or (fill in your favorite technology) is creation and/or delivery of content to a user.

Despite some naming issues, this has the potential to deliver content to users.

You may have other criteria that influence your choice of mechanisms but “appropriate” should not be one of them.

September 10, 2013

Migrating [AMA] search to Solr

Filed under: Searching,Solr — Patrick Durusau @ 3:59 am

Migrating American Medical Association’s search to Solr by Doug Turnbull.

Read the entire post but one particular point is important to me:

Research journal users value recent publications very highly. Users want to see recent research, not just documents that score well due to how frequently search terms occur in a document. If you were a doctor, would you rather see brain cancer research that occurred this decade or in the early 20th century?

I call this out because it is one of my favorite peeves about Google search results.

Even if generalized date parsing is too hard, Google should know when it first encountered a particular resource.

At the very least a listing by “age” of a link should be trivially possible.
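
For comparison, Solr makes this straightforward. A common recency recipe, hedged: the field name is hypothetical, and the constant is the standard one-year value from the Solr function query documentation, looks like this in SolrJ:

```java
import org.apache.solr.client.solrj.SolrQuery;

public class RecencyBoost {
    public static void main(String[] args) {
        SolrQuery q = new SolrQuery("brain cancer radiotherapy");
        q.set("defType", "edismax");
        // recip(x,m,a,b) = a / (m*x + b), applied to document age in ms.
        // 3.16e-11 is roughly 1 / (ms per year), so a document's boost
        // halves when it is a year old. "pub_date" is a hypothetical field.
        q.set("boost", "recip(ms(NOW/DAY,pub_date),3.16e-11,1,1)");
        System.out.println(q);
    }
}
```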

How important is recent information to your users?

September 1, 2013

Notes on DIH Architecture: Solr’s Data Import Handler

Filed under: Searching,Solr — Patrick Durusau @ 6:43 pm

Notes on DIH Architecture: Solr’s Data Import Handler by Mark Bennett.

From the post:

What the world really needs are some awesome examples of extending DIH (Solr DataImportHandler), beyond the classes and unit tests that ship with Solr. That’s a tall order given DIH’s complexity, and sadly this post ain’t it either! After doing a lot of searches online, I don’t think anybody’s written an “Extending DIH Guide” yet – everybody still points to the Solr wiki, quick start, FAQ, source code and unit tests.

However, in this post, I will review a few concepts to keep in mind. And who knows, maybe in a future post I’ll have some concrete code.

When I make notes, I highlight the things that are different from what I’d expect and why, so I’m going to start with that. Sure, DIH has an XML config where you tell it about your database or filesystem or RSS feed, and map those things into your Solr schema, so no surprise there. But the layering of that configuration really surprised me. (And it turns out there are good reasons for it.)

If you aspire to be Solr proficient, print this article and work through it.

It will be time well spent.

August 30, 2013

Choosing a PostgreSQL text search method

Filed under: PostgreSQL,Searching — Patrick Durusau @ 8:59 am

Choosing a PostgreSQL text search method by Craig Ringer.

From the post:

(This article is written with reference to PostgreSQL 9.3. If you’re using a newer version please check to make sure any limitations described remain in place.)

PostgreSQL offers several tools for searching and pattern matching text. The challenge is choosing which to use for a job. There’s:

  • LIKE and ILIKE pattern matching
  • POSIX-style regular expressions
  • Full-text search with tsvector and tsquery
  • Trigram similarity via the pg_trgm extension

There’s also SIMILAR TO, but we don’t speak of that in polite company, and PostgreSQL turns it into a regular expression anyway.

If you are thinking about running a PostgreSQL backend and need text searching, this will be a useful post for you.
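
If you want to try the full-text option before reading further, here is a minimal JDBC sketch, with hypothetical connection details and table; the tsvector/tsquery SQL inside is the standard PostgreSQL full-text idiom:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PgTextSearchSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical database, table and credentials; needs the PostgreSQL JDBC driver.
        try (Connection c = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/docs", "app", "secret");
             PreparedStatement ps = c.prepareStatement(
                 // Full-text search: tsvector @@ tsquery, ranked by ts_rank.
                 "SELECT title, ts_rank(to_tsvector('english', body), q) AS rank " +
                 "FROM articles, to_tsquery('english', ?) q " +
                 "WHERE to_tsvector('english', body) @@ q " +
                 "ORDER BY rank DESC")) {
            ps.setString(1, "mining & data"); // & means both terms must match
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) System.out.println(rs.getString("title"));
            }
        }
    }
}
```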

I really appreciated Craig’s closing paragraph:

At no point did I try to determine whether LIKE or full-text search is faster for a given query. That’s because it usually doesn’t matter; they have different semantics. Which goes faster, a car or a boat? In most cases it doesn’t matter because speed isn’t your main selection criteria, it’s “goes on water” or “goes on land”.

Something to keep in mind when the “web scale” chorus comes along.

Most of the data of interest to me (not all) isn’t of web scale.

How about you?

August 24, 2013

Name Search in Solr

Filed under: Searching,Solr — Patrick Durusau @ 6:46 pm

Name Search in Solr by Doug Turnbull.

From the post:

Searching names is a pretty common requirement for many applications. Searching by book authors, for example, is a pretty crucial component of a book store. And as it turns out, names are actually a surprisingly hard thing to get perfect. Regardless, we can get something pretty good working in Solr, at least for the vast majority of Anglicized representations.

We can start with the assumption that, aside from all the diversity in human names, a name in our Authors field is likely going to be a small handful of tokens in a single field. We’ll avoid breaking these names up by first, last, and middle names (if these are even appropriate in all cultural contexts). Let’s start by looking at some sample names in our “Authors” field:

Doug has a photo of library shelves in his post with the caption:

Remember the good ole days of “Alpha by Author”?

True but books listed their authors in various forms. Librarians were the ones who imposed a canonical representation on author names.

Doug goes through basic Solr techniques for matching author names when you don’t have the benefit of librarians.
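
As a taste of the normalization involved, here is a hedged Lucene 4.3-era sketch in Java (Solr would express the same analysis chain as a fieldType in schema.xml): case folding plus diacritic folding, so “Schütze” and “schutze” meet in the index.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

/** Fold case and diacritics so variant spellings of a name match. */
public class NameAnalyzerSketch {

    static final Analyzer NAME_ANALYZER = new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String field, Reader reader) {
            Tokenizer src = new StandardTokenizer(Version.LUCENE_43, reader);
            TokenStream ts = new LowerCaseFilter(Version.LUCENE_43, src);
            ts = new ASCIIFoldingFilter(ts); // ü -> u, é -> e, etc.
            return new TokenStreamComponents(src, ts);
        }
    };

    public static void main(String[] args) throws IOException {
        try (TokenStream ts =
                 NAME_ANALYZER.tokenStream("authors", new StringReader("Hinrich Schütze"))) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) System.out.println(term.toString()); // hinrich, schutze
        }
    }
}
```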

August 21, 2013

ack 2.04

Filed under: Programming,Searching — Patrick Durusau @ 4:36 pm

ack 2.04

From the webpage:

Top 5 reasons to use ack

Blazing fast
It’s fast because it only searches the stuff it makes sense to search.

Better search
Searches entire trees by default while ignoring Subversion, Git and other VCS directories and other files that aren’t your source code.

Designed for code search
Where grep is a general text search tool, ack is especially for the programmer searching source code. Common tasks take fewer keystrokes.

Highly portable
ack is pure Perl, so it easily runs on a Windows installation of Perl (like Strawberry Perl) without modifications.

Free and open
Ack costs nothing. It’s 100% free and open source under Artistic License v2.0.

I was doubtful until I saw the documentation page.

I had to concede that there were almost enough command line switches to qualify for a man page. 😉

I suspect it is going to be a matter of personal preference.

See what your personal preference says.

August 20, 2013

Solr Tutorial [No Ads]

Filed under: Search Engines,Searching,Solr — Patrick Durusau @ 2:38 pm

Solr Tutorial from the Apache Software Foundation.

A great tutorial on Solr that is different from most of the Solr tutorials you will ever see.

There are no ads, popup or otherwise. 😉

Should be the first tutorial that you recommend for anyone new to Solr!

PS: You do not have to give your email address, phone number, etc. to view the tutorial.

The Curse of Enterprise Search… [9% Solutions]

Filed under: Marketing,Search Requirements,Searching — Patrick Durusau @ 2:23 pm

The Curse of Enterprise Search and How to Break It by Maish Nichani.

From the post:

The Curse

Got enterprise search? Try answering these questions: Are end users happy? Has decision-making improved? Productivity up? Knowledge getting reused nicely? Your return-on-investment positive? If you’re finding it tough to answer these questions then most probably you’re under the curse of enterprise search.

The curse is cast when you purchase enterprise search software and believe that it will automagically solve all your problems the moment you switch it on. You believe that the boatload of money you just spent on it justifies the promised magic of instant findability. Sadly, this belief could not be further from the truth.

Search needs to be designed. Your users and content are unique to your organisation. Search needs to work with your users. It needs to make full use of the type of content you have. Search really needs to be designed.

Don’t believe in the curse? Consider these statistics from the Enterprise Search and Findability Survey 2013 done by Findwise with 101 practitioners working for global companies:

  • Only 9% said it was easy to find the right information within the organisation
  • Only 19% said they were happy with the existing search application in their organisation
  • Only 20% said they had a search strategy in place

Just in case you need some more numbers when pushing your better solution to enterprise search.

I wonder how search customers would react to an application that made it easy to find the right data 20% of the time?

Just leaving room for future versions and enhancements. 😉

Maish isn’t handing out silver bullets but a close read will improve your search application (topic map or not).

August 15, 2013

Search Humor

Filed under: Humor,Searching — Patrick Durusau @ 3:38 pm

I saw a tweet today by nickbarnwell:

Pro-tip: “Hickey Facts” and “Hickey Facts Datomic” turn up vastly different search results #clojure #datomic

Don’t take my word for it:

Hickey Facts

Hickey Facts Datomic

In case you don’t already know the answer, the first query returns “about” 1,400,000 results and the second query “about” 12,000 results. 😉

August 13, 2013

Of collapsing in Solr

Filed under: Search Engines,Searching,Solr,Topic Maps — Patrick Durusau @ 4:35 pm

Of collapsing in Solr by Paul Masurel.

From the post:

This post is about the inner workings of one of the two most popular open source search engines: Solr. I noticed that many questions (one or two every day) on the solr-user mailing list were about Solr’s collapsing functionality.

I thought it would be a good idea to explain how Solr’s collapsing works, because its documentation is very sparse, and because a search engine is the kind of car you want to take a peek under the hood of to make sure you’ll drive it right.

The Solr documentation at Apache refers to field collapsing and result grouping as “different ways to think about the same Solr feature.”

I read the post along with the Solr documentation.

BTW, note from “Known Limitations” in the Solr documentation:

Support for grouping on a multi-valued field has not yet been implemented.

That would be really nice with subjectIdentifier and subjectLocator having the potential to be sets of values.
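
For reference, the grouping parameters themselves are small. A hedged SolrJ sketch, with hypothetical field names and subject, of course, to the single-valued limitation just quoted:

```java
import org.apache.solr.client.solrj.SolrQuery;

public class CollapsingSketch {
    public static void main(String[] args) {
        // Result grouping, aka field collapsing: one group per distinct author,
        // keeping the two best-scoring documents from each group.
        SolrQuery q = new SolrQuery("topic maps");
        q.set("group", "true");
        q.set("group.field", "author"); // must be single-valued, per the limitation above
        q.set("group.limit", "2");
        System.out.println(q);
    }
}
```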

Google Search Operators [Improving over Google]

Filed under: Search Behavior,Searching,Topic Maps — Patrick Durusau @ 3:46 pm

How To Make Good Use Of Google’s Search Operators by Craig Snyder.

From the post:

Some of you might not have the slightest clue what an operator is, in terms of using a search engine. Luckily enough, both Google and MakeUseOf offer some pretty good examples of how to use them with the world’s most popular search engine. In plain English, an operator is a tag that you can include within your Google search to make it more precise and specific.

With operators, you’re able to display results that pertain only to certain websites, search through a range of numbers, or even completely exclude a word from your results. When you master the use of Google’s search engine, finding the answer to nearly anything you can think of is a power that you have right at your fingertips. In this article, let’s make that happen.

8 Google Search Tips To Keep Handy At All Times by Dave Parrack.

From the post:

Google isn’t the only game in town when it comes to search. Alternatives such as Bing, DuckDuckGo, and Wolfram Alpha also provide the tools necessary to search the Web. However, the figures don’t lie, and the figures suggest that the majority of Internet users choose Google over the rest of the competition.

With that in mind it’s important to make sure all of those Google users are utilizing all that Google has to offer when it comes to its search engine. Everyone knows how to conduct a normal search by typing some words and/or a phrase into the box provided and following the links that emerge from the overcrowded fog. But Google Search offers a lot more than just the basics.

If friends or colleagues are using Google, I thought these posts might come in handy.

Speaking of the numbers, as of June 13, 2013, Google’s share of the search market was 66.7 percent. Bing was at 17.9%, and AOL, Inc., the smallest one listed, was at 1.3%. (What does that say to you about DuckDuckGo and Wolfram Alpha?)

Google’s majority share of the search market should be encouraging to anyone working on alternatives.

Why?

Google has left so much room for better search results.

For example, let’s say you find an article and you want to find other articles that rely on it. So you enter the title as a quoted phrase. What do you get back?

If it is a popular article, you may get hundreds of results. You and I both know you are not going to look at every article.

But a number of those articles are just citing the article of interest in a block of citations. Doesn’t have much to do with the results of the article at all.

But Google returns all of those, ranked for sure, but you don’t know enough about the ranking to decide whether two pages of search results are enough or not. Gold may be waiting on the third page. No way to tell.

Document level search results are just that. Document level search results. You can refine them for yourself but that’s not going to be captured by Google.

What is your example of improvement over the search results we get from Google now?

August 11, 2013

Embedding Concepts in text for smarter searching with Solr4

Filed under: Concept Detection,Indexing,Searching,Solr — Patrick Durusau @ 7:08 pm

Embedding Concepts in text for smarter searching with Solr4 by Sujit Pal.

From the post:

Storing the concept map for a document in a payload field works well for queries that can treat the document as a bag of concepts. However, if you want to consider the concept’s position(s) in the document, then you are out of luck. For queries that resolve to multiple concepts, it makes sense to rank documents with these concepts close together higher than those which had these concepts far apart, or even drop them from the results altogether.

We handle this requirement by analyzing each document against our medical taxonomy, and annotating recognized words and phrases with the appropriate concept ID before it is sent to the index. At index time, a custom token filter similar to the SynonymTokenFilter (described in the LIA2 Book) places the concept ID at the start position of the recognized word or phrase. Resolved multi-word phrases are retained as single tokens – for example, the phrase “breast cancer” becomes “breast0cancer”. This allows us to rewrite queries such as “breast cancer radiotherapy”~5 as “2790981 2791965″~5.

One obvious advantage is that synonymy is implicitly supported with the rewrite. Medical literature is rich with synonyms and acronyms – for example, “breast cancer” can be variously called “breast neoplasm”, “breast CA”, etc. Once we rewrite the query, 2790981 will match against a document annotation that is identical for each of these various synonyms.

Another advantage is the increase of precision since we are dealing with concepts rather than groups of words. For example, “radiotherapy for breast cancer patients” would not match our query since “breast cancer patient” is a different concept than “breast cancer” and we choose the longest subsequence to annotate.

Yet another advantage of this approach is that it can support mixed queries. Assume that a query can only be partially resolved to concepts. You can still issue the partially resolved query against the index, and it would pick up the records where the pattern of concept IDs and words appear.

Finally, since this is just a (slightly rewritten) Solr query, all the features of standard Lucene/Solr proximity searches are available to you.

In this post, I describe the search side components that I built to support this approach. It involves a custom TokenFilter and a custom Analyzer that wraps it, along with a few lines of configuration code. The code is in Scala and targets Solr 4.3.0.
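
Sujit's code is Scala; as a rough Java sketch of the core idea only, a token filter can swap recognized surface forms for concept ids. His version also preserves positions and handles multi-word phrases, which this deliberately omits, and the ids and vocabulary here are invented:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/**
 * Replace recognized terms with taxonomy concept ids, so synonyms
 * ("breast0cancer", "breast0neoplasm") index to the same id.
 */
public final class ConceptFilter extends TokenFilter {

    private final Map<String, String> concepts = new HashMap<>();
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public ConceptFilter(TokenStream input) {
        super(input);
        concepts.put("breast0cancer", "2790981");   // ids are invented
        concepts.put("breast0neoplasm", "2790981");
        concepts.put("radiotherapy", "2791965");
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) return false;
        String id = concepts.get(termAtt.toString());
        if (id != null) {
            termAtt.setEmpty().append(id); // swap surface form for concept id
        }
        return true;
    }
}
```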

So if Solr4 can make documents smarter, can the same be said about topics?

Recalling that “document” for Solr is defined by your indexing, not some arbitrary byte count.

As we are indexing topics we could add information to topics to make merging more robust.

One possible topic map flow being:

Index -> addToTopics -> Query -> Results -> Merge for Display.

Yes?
