Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 5, 2012

What’s so cool about elasticsearch?

Filed under: ElasticSearch,Search Engines — Patrick Durusau @ 6:04 am

What’s so cool about elasticsearch? by Luca Cavanna.

From the post:

Whenever there’s a new product out there and you start using it, suggest it to customers or colleagues, you need to be prepared to answer this question: “Why should I use it?”. Well, the answer could be as simple as “Because it’s cool!”, which of course is the case with elasticsearch, but then at some point you may need to explain why. I recently had to answer the question, “So what’s so cool about elasticsearch?”, that’s why I thought it might be worthwhile sharing my own answer in this blog.

It’s not a staid comparison piece but a partisan, “this is cool” piece.

You will find it both entertaining and informative. Good weekend reading.

It will give you something to hold a strong opinion about (one way or the other) next Monday!

September 21, 2012

Local Search – How Hard Can It Be? [Unfolding Searches?]

Filed under: Local Search,Search Behavior,Search Engines,Searching — Patrick Durusau @ 2:20 pm

Local Search – How Hard Can It Be? by Matthew Hurst.

From the post:

This week, Apple got a rude awakening with its initial foray into the world of local search and mapping. The media and user backlash to their iOS upgrade which removes Google as the maps and local search partner and replaces it with their own application (built on licensed data) demonstrates just how important the local scenario is to the mobile space.

While the pundits are reporting various (and sometimes amusing) issues with the data and the search service, it is important to remind ourselves how hard local search can be.

For example, if you search on Google for Key Arena – a major venue in Seattle located in the famous Seattle Center, you will find some severe data quality problems.

See Matthew’s post for the details, but I am mostly interested in his final observation:

One of the ironies of local data conflation is that landmark entities (like stadia, large complex hotels, hospitals, etc.) tend to have lots of data (everyone knows about them) and lots of complexity (the Seattle Center has lots of things within it that can be confused). These factors conspire to make the most visible entities in some ways the entities more prone to problems.

Every library student is (or should be) familiar with the “reference interview.” A patron asks a question (consider this to be the search request, “Key Arena”) and a librarian uses the reference interview to further identify the information being requested.

Contrast that unfolding of the search request, which at any juncture offers different paths to different goals, with the “if you can identify it, you can find it,” approach of most search engines.

Computers have difficulty searching complex entities such as “Key Arena” successfully, whereas a librarian starting from the same query does not.

Doesn’t that suggest to you that “unfolding” searches may be a better model for computer searching than simple identification?

More than static facets: a presentation of the details most likely to distinguish the subjects searched for by users under similar circumstances. Dynamically.

Sounds like the sort of heuristic knowledge that topic maps could capture quite handily.
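
To make “unfolding” a bit more concrete, here is a minimal sketch (the data and names are invented for illustration) of picking the question most worth asking next: the attribute whose values best split the current candidate subjects. A librarian does this intuitively; an interface could do it dynamically from usage data.

```python
from collections import Counter
from math import log2

def best_question(candidates, attributes):
    """Pick the attribute whose values best split the candidate
    subjects, i.e. the question most worth asking next."""
    def entropy(attr):
        counts = Counter(c.get(attr) for c in candidates)
        total = sum(counts.values())
        return -sum((n / total) * log2(n / total) for n in counts.values())
    return max(attributes, key=entropy)

# Toy candidates for a "Key Arena" query: same name, different subjects.
candidates = [
    {"name": "Key Arena", "kind": "stadium",    "within": "Seattle Center"},
    {"name": "Key Arena", "kind": "box office", "within": "Seattle Center"},
    {"name": "Key Arena", "kind": "parking",    "within": "Seattle Center"},
]
print(best_question(candidates, ["kind", "within"]))  # -> "kind"
```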

September 15, 2012

Blame Google? Different Strategy: Let’s Blame Users! (Not!)

Let me quote from A Simple Guide To Understanding The Searcher Experience by Shari Thurow to start this post:

Web searchers have a responsibility to communicate what they want to find. As a website usability professional, I have the opportunity to observe Web searchers in their natural environments. What I find quite interesting is the “Blame Google” mentality.

I remember a question posed to me during World IA Day this past year. An attendee said that Google constantly gets search results wrong. He used a celebrity’s name as an example.

“I wanted to go to this person’s official website,” he said, “but I never got it in the first page of search results. According to you, it was an informational query. I wanted information about this celebrity.”

I paused. “Well,” I said, “why are you blaming Google when it is clear that you did not communicate what you really wanted?”

“What do you mean?” he said, surprised.

“You just said that you wanted information about this celebrity,” I explained. “You can get that information from a variety of websites. But you also said that you wanted to go to X’s official website. Your intent was clearly navigational. Why didn’t you type in [celebrity name] official website? Then you might have seen your desired website at the top of search results.”

The stunned silence at my response was almost deafening. I broke that silence.

“Don’t blame Google or Yahoo or Bing for your insufficient query formulation,” I said to the audience. “Look in the mirror. Maybe the reason for the poor searcher experience is the person in the mirror…not the search engine.”

People need to learn how to search. Search experts need to teach people how to search. Enough said.

What a novel concept! If the search engine/software doesn’t work, must be the user’s fault!

I can save you a trip down the hall to the marketing department. They are going to tell you that is an insane sales strategy. Satisfying to the geeks in your life but otherwise untenable, from a business perspective.

Remember the stats on using Library of Congress subject headings I posted under Subject Headings and the Semantic Web:

Overall percentages of correct meanings for subject headings in the original order of subdivisions were as follows: children, 32%, adults, 40%, reference 53%, and technical services librarians, 56%.

That is after decades of teaching people to search both manual and automated systems using Library of Congress classification.

Test Question: I have a product to sell. 60% of all my buyers can’t find it with a search engine. Do I:

  • Teach all users everywhere better search techniques?
  • Develop better search engines/interfaces to compensate for potential buyers’ poor searching?

I suspect the “stunned silence” came from an audience with greater marketing skills than the speaker.

September 12, 2012

Fast integer compression: decoding billions of integers per second

Filed under: Algorithms,Integers,Search Engines — Patrick Durusau @ 3:14 pm

Fast integer compression: decoding billions of integers per second by Daniel Lemire.

At > 2 billion integers per second, you may find there is plenty of action left in your desktop processor!

From the post:

Databases and search engines often store arrays of integers. In search engines, we have inverted indexes that map a query term to a list of document identifiers. This list of document identifiers can be seen as a sorted array of integers. In databases, indexes often work similarly: they map a column value to row identifiers. You also get arrays of integers in databases through dictionary coding: you map all column values to an integer in a one-to-one manner.

Our modern processors are good at processing integers. However, you also want to keep much of the data close to the CPU for better speed. Hence, computer scientists have worked on fast integer compression techniques for the last 4 decades. One of the earliest clever techniques is Elias coding. Over the years, many new techniques have been developed: Golomb and Rice coding, Frame-of-Reference and PFOR-Delta, the Simple family, and so on.

The general story is that while people initially used bit-level codes (e.g., gamma codes), simpler byte-level codes like Google’s group varint are more practical. Byte-level codes like what Google uses do not compress as well, and there is less opportunity for fancy information theoretical mathematics. However, they can be much faster.

Yet we noticed that there was no trace in the literature of a sensible integer compression scheme running on a desktop processor able to decompress data at a rate of billions of integers per second. The best schemes, such as Stepanov et al.’s varint-G8IU, report top speeds of 1.5 billion integers per second.

As you may expect, we eventually found out that it was entirely feasible to decode billions of integers per second. We designed a new scheme that typically compresses better than Stepanov et al.’s varint-G8IU or Zukowski et al.’s PFOR-Delta, sometimes quite a bit better, while being twice as fast over real data residing in RAM (we call it SIMD-BP128). That is, we cleanly exceed a speed of 2 billion integers per second on a regular desktop processor.

We posted our paper online together with our software. Note that our scheme is not patented whereas many other schemes are.

So, how did we do it? Some insights:
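
The insights, and SIMD-BP128 itself, are in Daniel’s post and paper. For orientation only, here is a sketch of the byte-level varint idea the post contrasts against: each byte carries 7 bits of payload plus a continuation flag.

```python
def varint_encode(n):
    """Encode one non-negative integer, 7 bits per byte,
    high bit set on every byte except the last."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def varint_decode(buf, pos=0):
    """Decode one integer starting at pos; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if not b & 0x80:
            return value, pos
        shift += 7

data = varint_encode(300)   # b'\xac\x02'
print(varint_decode(data))  # (300, 2)
```

Real inverted indexes store the gaps between sorted document identifiers, which keeps the integers, and hence the varints, small.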

September 8, 2012

QRU-1: A Public Dataset…

Filed under: Query Expansion,Query Rewriting,Search Behavior,Search Data,Search Engines — Patrick Durusau @ 4:58 pm

QRU-1: A Public Dataset for Promoting Query Representation and Understanding Research by Hang Li, Gu Xu, W. Bruce Croft, Michael Bendersky, Ziqi Wang and Evelyne Viegas.

ABSTRACT

A new public dataset for promoting query representation and understanding research, referred to as QRU-1, was recently released by Microsoft Research. The QRU-1 dataset contains reformulations of Web TREC topics that are automatically generated using a large-scale proprietary web search log, without compromising user privacy. In this paper, we describe the content of this dataset and the process of its creation. We also discuss the potential uses of the dataset, including a detailed description of a query reformulation experiment.

And the data set:

Query Representation and Understanding Set

The Query Representation and Understanding (QRU) data set contains a set of similar queries that can be used in web research such as query transformation and relevance ranking. QRU contains similar queries that are related to existing benchmark data sets, such as TREC query sets. The QRU data set was created by extracting 100 TREC queries, training a query-generation model and a commercial search engine, generating similar queries from TREC queries with the model, and removing mistakenly generated queries.

Are query reformulations in essence different identifications of the subject of a search?

But the issue isn’t “more” search results but rather higher quality search results.

Why search engines bother (other than bragging rights) to report “hits” beyond the ones displayed isn’t clear. Just have a “next N hits” button.

You could consider the number of “hits” you don’t look at as a measure of your search engine’s quality. The higher the number…, well, you know. There could be gold in those “hits” but you will never know. And your current search engine will never say.
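
Back to the reformulation question above: one rough test is to treat two reformulations as identifying the same subject when their top results overlap heavily. A sketch, with the result sets as stand-ins for whatever your search API returns:

```python
def jaccard(a, b):
    """Overlap between two result sets: 1.0 = identical, 0.0 = disjoint."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical top-5 result URLs for two reformulations of one TREC topic.
results_q1 = {"u1", "u2", "u3", "u4", "u5"}
results_q2 = {"u3", "u4", "u5", "u6", "u7"}

same_subject = jaccard(results_q1, results_q2) >= 0.5
print(jaccard(results_q1, results_q2), same_subject)  # 0.428... False
```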

September 4, 2012

Solr vs. ElasticSearch: Part 2 – Data Handling

Filed under: ElasticSearch,Search Engines,Solr — Patrick Durusau @ 1:47 pm

Solr vs. ElasticSearch: Part 2 – Data Handling by Rafał Kuć.

In the previous part of Solr vs. ElasticSearch series we talked about general architecture of these two great search engines based on Apache Lucene. Today, we will look at their ability to handle your data and perform indexing and language analysis.

  1. Solr vs. ElasticSearch: Part 1 – Overview
  2. Solr vs. ElasticSearch: Part 2 – Data Handling
  3. Solr vs. ElasticSearch: Part 3 – Querying
  4. Solr vs. ElasticSearch: Part 4 – Faceting
  5. Solr vs. ElasticSearch: Part 5 – API Usage Possibilities

Rafał takes a dive into indexing and data handling under Solr and ElasticSearch.

PS: Can you suggest a search engine that does not befoul URLs with tracking information? Or at least consistently presents a “clean” version alongside a tracking version?

August 3, 2012

20% of users – 80% of security breaches?

Filed under: Search Engines,Security — Patrick Durusau @ 6:26 pm

I was reading about the Google Hacking Diggity Project today when it occurred to me to ask:

Are 20% of users responsible for 80% of security breaches?

I ask because:

The Google Hacking Diggity Project is a research and development initiative dedicated to investigating the latest techniques that leverage search engines, such as Google and Bing, to quickly identify vulnerable systems and sensitive data in corporate networks. This project page contains downloads and links to our latest Google Hacking research and free security tools. Defensive strategies are also introduced, including innovative solutions that use Google Alerts to monitor your network and systems.

OK, but that just means you are playing catch-up on security breaches. You aren’t ever getting ahead. Scanning for weaknesses is, at best, hoping to discover them before others do.

If you couple a topic map with your security scans, you can track users as they move from department to department, anticipating the next security breach.

And/or providing management with the ability to avoid security breaches in the first place.
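
If you keep incident logs, the 80/20 question is easy to test. A sketch, assuming a simple list of (user, incident) records:

```python
from collections import Counter

def pareto_share(incidents, top_fraction=0.2):
    """What share of incidents do the top `top_fraction` of users cause?"""
    per_user = Counter(user for user, _ in incidents)
    ranked = [n for _, n in per_user.most_common()]
    k = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)

# Hypothetical log: (user, incident_id) pairs.
log = [("alice", 1), ("alice", 2), ("alice", 3), ("alice", 4),
       ("bob", 5), ("carol", 6), ("dave", 7), ("eve", 8)]
print(pareto_share(log))  # 0.5 -> top 20% of users caused 50% here
```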

I first saw this at KDNuggets.

July 30, 2012

Search Solutions 2012: your opportunity to shape this year’s event

Filed under: Conferences,Search Engines,Searching — Patrick Durusau @ 3:06 pm

Search Solutions 2012: your opportunity to shape this year’s event by Tony Russell-Rose.

From the post:

We’re just in the process of drafting the programme for Search Solutions 2012, to be held on November 28-29 at BCS London. As in previous years, we aim to offer a topical selection of presentations, panels and keynote talks by influential industry leaders on novel and emerging applications in search and information retrieval, whilst maintaining the collegiate spirit of a community event. If you’ve never been before, take a look at last year’s programme.

We don’t normally issue a formal “call for papers” as such, but if you’d like to get involved (perhaps as a panellist or speaker) and have an interesting case study or demo to present, then drop me a line. In the meantime, save the date: tutorials day on November 28, main event on November 29.

The 2011 programme page has presentation downloads if you are having trouble making up your mind. 😉

July 20, 2012

…10 billion lines of code…

Filed under: Open Data,Programming,Search Data,Search Engines — Patrick Durusau @ 5:46 pm

Also known as (aka):

Black Duck’s Ohloh lets data from nearly 500,000 open source projects into the wild by Chris Mayer.

From the post:

In a bumper announcement, Black Duck Software have embraced the FOSS mantra by revealing their equivalent of a repository Yellow Pages, through the Ohloh Open Data Initiative.

The website tracks 488,823 projects, allowing users to compare data from a vast amount of repositories and forges. But now, Ohloh’s huge dataset has been licensed under the Creative Commons Attribution 3.0 Unported license, encouraging further transparency across the companies who have already bought into Ohloh’s aggregation mission directive.

“Licensing Ohloh data under Creative Commons offers both enterprises and the open source community a new level of access to FOSS data, allowing trending, tracking, and insight for the open source community,” said Tim Yeaton, President and CEO of Black Duck Software.

He added: “We are constantly looking for ways to help the open source developer community and enterprise consumers of open source. We’re proud to freely license Ohloh data under this respected license, and believe that making this resource more accessible will allow contributors and consumers of open source to gain unique insight, leading to more rapid development and adoption.”

What sort of insight would you expect to gain from “…10 billion lines of code…?”

How would you capture it? Pass it on to others in your project?

Mix or match semantics with other lines of code? Perhaps your own?

July 17, 2012

elasticsearch. The Company

Filed under: ElasticSearch,Lucene,Search Engines — Patrick Durusau @ 3:45 pm

elasticsearch. The Company

ElasticSearch needs no introduction to readers of this blog or really anyone active in the search “space.”

It was encouraging to hear that, after years of building an increasingly useful product, ElasticSearch has matured into a company.

With all the warm fuzzies that support contracts and such bring.

Sounds like they will demonstrate that the open source and commercial worlds aren’t, you know, incompatible.

It helps that they have a good product in which they have confidence and not a product that their PR/Sales department is pushing as a “good” product. The fear of someone “finding out” would make you really defensive in the latter case.

Looking forward to good fortune for ElasticSearch, its founders and anyone who wants to follow a similar model.

Searching Legal Information in Multiple Asian Languages

Filed under: Law,Legal Informatics,Search Engines — Patrick Durusau @ 2:42 pm

Searching Legal Information in Multiple Asian Languages by Philip Chung, Andrew Mowbray, and Graham Greenleaf.

Abstract:

In this article the Co-Directors of the Australasian Legal Information Institute (AustLII) explain the need for an open source search engine which can search simultaneously over legal materials in European languages and also in Asian languages, particularly those that require a ‘double byte’ representation, and the difficulties this task presents. A solution is proposed, the ‘u16a’ modifications to AustLII’s open source search engine (Sino) which is used by many legal information institutes. Two implementations of the Sino u16A approach, on the Hong Kong Legal Information Institute (HKLII), for English and Chinese, and on the Asian Legal Information Institute (AsianLII), for multiple Asian languages, are described. The implementations have been successful, though many challenges (discussed briefly) remain before this approach will provide a full multi-lingual search facility.
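
Much of the “double byte” difficulty comes down to tokenization: Chinese text has no spaces to split on, so a common fallback (the approach Lucene’s CJKAnalyzer takes) is to index overlapping character bigrams. A minimal sketch:

```python
def cjk_bigrams(text):
    """Index overlapping character pairs, the usual fallback
    when there are no spaces to split on."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(cjk_bigrams("香港法例"))  # ['香港', '港法', '法例']
```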

If the normal run of legal information retrieval, across jurisdictions, vocabularies, etc., isn’t challenging enough, you can try your hand at cross-language retrieval with European and Asian languages, plus synonyms, etc.

😉

I would like to think the synonymy issue, which is noted as open by this paper, could be addressed in part through the use of topic maps. It would be an evolutionary solution, to be updated as our use and understanding of language evolves.

Any thoughts on Sino versus Lucene/Solr 4.0 (alpha, I know, but it won’t stay that way forever)?

I first saw this at Legal Informatics.

If you are in Kolkata/Pune, India…a request.

Filed under: Search Engines,Synonymy,Word Meaning,XML — Patrick Durusau @ 1:55 pm

No emails are given for the authors of Identify Web-page Content meaning using Knowledge based System for Dual Meaning Words, but their locations were listed as Kolkata and Pune, India. I would appreciate your pointing the authors to this blog as one source of information on topic maps.

The authors have re-invented a small part of topic maps to deal with synonymy using XSD syntax. Quite doable but I think they would be better served by either using topic maps or engaging in improving topic maps.

Reinvention is rarely a step forward.

Abstract:

The meaning of Web-page content plays a big role when producing a search result from a search engine. In most cases the Web-page meaning is stored in the title or meta-tag area, but those meanings do not always match the Web-page content. To overcome this situation we need to go through the Web-page content to identify the Web-page meaning. In cases where the Web-page content holds dual-meaning words, it is really difficult to identify the meaning of the Web-page. In this paper, we introduce a new design and development mechanism for identifying the meaning of Web-page content which holds dual-meaning words.

July 15, 2012

Nutch 1.5/1.5.1 [Cloud Setup for Experiments?]

Filed under: Cloud Computing,Nutch,Search Engines — Patrick Durusau @ 3:41 pm

Before the release of Nutch 2.0, there was the release of Nutch 1.5 and 1.5.1.

From the 1.5 release note:

The 1.5 release of Nutch is now available. This release includes several improvements including upgrades of several major components including Tika 1.1 and Hadoop 1.0.0, improvements to LinkRank and WebGraph elements as well as a number of new plugins covering blacklisting, filtering and parsing to name a few. Please see the list of changes

http://www.apache.org/dist/nutch/CHANGES-1.5.txt

[WRONG URL – should be: http://www.apache.org/dist/nutch/1.5/CHANGES-1.5.txt (the version segment “/1.5/” is missing from the path; it took me a while to notice the nature of the problem.)]

made in this version for a full breakdown of the 50 odd improvements the release boasts. A full PMC release statement can be found below

http://nutch.apache.org/#07+June+2012+-+Apache+Nutch+1.5+Released

Apache Nutch is an open source web-search software project. Stemming from Apache Lucene, it now builds on Apache Solr adding web-specifics, such as a crawler, a link-graph database and parsing support handled by Apache Tika for HTML and an array of other document formats. Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster. The system can be enhanced (e.g., other document formats can be parsed) using a highly flexible, easily extensible and thoroughly maintained plugin infrastructure.

Nutch is available in source and binary form (zip and tar.gz) from the following download page: http://www.apache.org/dyn/closer.cgi/nutch/

And 1.5.1:

http://www.apache.org/dist/nutch/1.5.1/CHANGES.txt

Nutch is available in source and binary form (zip and tar.gz) from the following download page: http://www.apache.org/dyn/closer.cgi/nutch/

Question: Would you put together some commodity boxes for local experimentation or would you spin up an installation in one of the clouds?

As hot as the summer promises to be near Atlanta, I am leaning towards the cloud route.

As I write that I can hear a close friend from the West Coast shouting “…trust, trust issues….” But I trust the local banking network, credit card, utilities, finance, police/fire, etc., with just as little reason as any of the “clouds.”

It’s not really even “trust”; I don’t even think about it. The credit card industry knows $X fraud is going to occur every year and treats it as a cost of liquid transactions. So they allow for it in their fees. They proceed in the face of known rates of fraud. How’s that for trust? 😉 Trusting that fraud is going to happen.

Same will be true for the “clouds” and mechanisms will evolve to regulate the amount of exposure versus potential damage. I am going to be experimenting with non-client data so the worst exposure I have is loss of time. Perhaps some hard lessons learned on configuration/security. But hardly a reason to avoid the “clouds” and to incur the local hardware cost.

I was serious when I suggested governments should start requiring side-by-side comparisons of hardware costs for local installs versus cloud services. I would call the major cloud services up and ask them for independent bids.

Would the “clouds” be less secure? Possibly, but I don’t think any of them allow Lady Gaga CDs on premises.

Apache Nutch v2.0 Release

Filed under: Nutch,Search Engines — Patrick Durusau @ 10:18 am

Apache Nutch v2.0 Release

From the post:

The Apache Nutch PMC are very pleased to announce the release of Apache Nutch v2.0. This release offers users an edition focused on large scale crawling which builds on storage abstraction (via Apache Gora™) for big data stores such as Apache Accumulo™, Apache Avro™, Apache Cassandra™, Apache HBase™, HDFS™, an in memory data store and various high profile SQL stores. After some two years of development Nutch v2.0 also offers all of the mainstream Nutch functionality and it builds on Apache Solr™ adding web-specifics, such as a crawler, a link-graph database and parsing support handled by Apache Tika™ for HTML and an array of other document formats. Nutch v2.0 shadows the latest stable mainstream release (v1.5.X) based on Apache Hadoop™ and covers many use cases from small crawls on a single machine to large scale deployments on Hadoop clusters. Please see the list of changes

http://www.apache.org/dist/nutch/2.0/CHANGES.txt made in this version for a full breakdown.

A full PMC release statement can be found below:

http://nutch.apache.org/#07+July+2012+-+Apache+Nutch+v2.0+Released

Nutch v2.0 is available in source (zip and tar.gz) from the following download page: http://www.apache.org/dyn/closer.cgi/nutch/2.0

July 13, 2012

Broccoli: Semantic Full-Text Search at your Fingertips

Filed under: Broccoli,Ontology,Search Engines,Search Interface,Searching,Semantic Search — Patrick Durusau @ 5:51 pm

Broccoli: Semantic Full-Text Search at your Fingertips by Hannah Bast, Florian Bäurle, Björn Buchhold, and Elmar Haussmann.

Abstract:

We present Broccoli, a fast and easy-to-use search engine for what we call semantic full-text search. Semantic full-text search combines the capabilities of standard full-text search and ontology search. The search operates on four kinds of objects: ordinary words (e.g. edible), classes (e.g. plants), instances (e.g. Broccoli), and relations (e.g. occurs-with or native-to). Queries are trees, where nodes are arbitrary bags of these objects, and arcs are relations. The user interface guides the user in incrementally constructing such trees by instant (search-as-you-type) suggestions of words, classes, instances, or relations that lead to good hits. Both standard full-text search and pure ontology search are included as special cases. In this paper, we describe the query language of Broccoli, a new kind of index that enables fast processing of queries from that language as well as fast query suggestion, the natural language processing required, and the user interface. We evaluated query times and result quality on the full version of the English Wikipedia (32 GB XML dump) combined with the YAGO ontology (26 million facts). We have implemented a fully-functional prototype based on our ideas, see this http URL
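
A toy sketch of the query shape the abstract describes, as I read it (not the authors’ code): nodes are bags of words, classes or instances, and arcs are relations.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A bag of query objects: words, classes, or instances."""
    items: list                                # e.g. ["plants", "edible"]
    arcs: list = field(default_factory=list)   # (relation, Node) pairs

# "edible plants that occur with Broccoli", roughly:
query = Node(["plants", "edible"],
             arcs=[("occurs-with", Node(["Broccoli"]))])
```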

It’s good to see CS projects work so hard to find unambiguous names that won’t be confused with far more common uses of the same names. 😉

For all that, on quick review it does look like a clever, if annoyingly named, project.

Hmmm, it doesn’t like the “-” (hyphen) character: “graph-theoretical tree” returns 0 results, “graph theoretical tree” returns 1 (the expected one).

Definitely worth a close read.

One puzzle though. There are a number of projects that use Wikipedia data dumps. The problem is most of the documents I am interested in searching aren’t in Wikipedia data dumps. Like the Enron emails.

Techniques that work well with clean data may work less well with documents composed of the vagaries of human communication. Or attempts at communication.

June 27, 2012

Become a Google Power Searcher

Filed under: Search Engines,Search Interface,Searching — Patrick Durusau @ 9:15 am

Become a Google Power Searcher by Terry Ednacot.

From the post:

You may already be familiar with some shortcuts for Google Search, like using the search box as a calculator or finding local movie showtimes by typing [movies] and your zip code. But there are many more tips, tricks and tactics you can use to find exactly what you’re looking for, when you most need it.

Today, we’ve opened registration for Power Searching with Google, a free, online, community-based course showcasing these techniques and how you can use them to solve everyday problems. Our course is aimed at empowering you to find what you need faster, no matter how you currently use search. For example, did you know that you can search for and read pages written in languages you’ve never even studied? Identify the location of a picture your friend took during his vacation a few months ago? How about finally identifying that green-covered book about gardening that you’ve been trying to track down for years? You can learn all this and more over six 50-minute classes.

Lessons will be released daily starting on July 10, 2012, and you can take them according to your own schedule during a two-week window, alongside a worldwide community. The lessons include interactive activities to practice new skills, and many opportunities to connect with others using Google tools such as Google Groups, Moderator and Google+, including Hangouts on Air, where world-renowned search experts will answer your questions on how search works. Googlers will also be on hand during the course period to help and answer your questions in case you get stuck.

I know, I know, you are way beyond using Google but you may know some people who are not.

Try to suggest this course in a positive, i.e., non-sneering, sort of way.

Will be a new experience.

You may want to “audit” the course.

Would be unfortunate for someone to ask you a Google search question you can’t answer.

😉

June 25, 2012

In the red corner – PubMed and in the blue corner – Google Scholar

Filed under: Bioinformatics,Biomedical,PubMed,Search Engines,Searching — Patrick Durusau @ 7:40 pm

Medical literature searches: a comparison of PubMed and Google Scholar by Eva Nourbakhsh, Rebecca Nugent, Helen Wang, Cihan Cevik and Kenneth Nugent. (Health Information & Libraries Journal, Article first published online: 19 JUN 2012)

From the abstract:

Background

Medical literature searches provide critical information for clinicians. However, the best strategy for identifying relevant high-quality literature is unknown.

Objectives

We compared search results using PubMed and Google Scholar on four clinical questions and analysed these results with respect to article relevance and quality.

Methods

Abstracts from the first 20 citations for each search were classified into three relevance categories. We used the weighted kappa statistic to analyse reviewer agreement and nonparametric rank tests to compare the number of citations for each article and the corresponding journals’ impact factors.

Results

Reviewers ranked 67.6% of PubMed articles and 80% of Google Scholar articles as at least possibly relevant (P = 0.116) with high agreement (all kappa P-values < 0.01). Google Scholar articles had a higher median number of citations (34 vs. 1.5, P < 0.0001) and came from higher impact factor journals (5.17 vs. 3.55, P = 0.036).

Conclusions

PubMed searches and Google Scholar searches often identify different articles. In this study, Google Scholar articles were more likely to be classified as relevant, had higher numbers of citations and were published in higher impact factor journals. The identification of frequently cited articles using Google Scholar for searches probably has value for initial literature searches.

I have several concerns that may or may not be allayed by further investigation:

  • Four queries seem like an inadequate basis for evaluation. Not that I expect to see one “winner” and one “loser,” but I am more concerned with what led to the differences in results.
  • It is unclear why a citation from a journal with a higher impact factor is superior to one from a journal with a lesser impact factor. I assume the point of the query is to obtain a useful result (in the sense of medical treatment, not tenure).
  • Neither system enabled users to build upon the query experience of prior users with a similar query.
  • Neither system enabled users to avoid re-reading the same texts others had read before them.

Thoughts?

Google search parameters in 2012

Filed under: Search Engines,Search Interface,Searching — Patrick Durusau @ 3:02 pm

Google search parameters in 2012

From the post:

Knowing the parameters Google uses in its search is not only important for SEO geeks. It allows you to use shortcuts and play with the Google filters. The parameters also reveal more juicy things: Is it safe to share your Google search URLs or screenshots of your Google results? This post argues that it is important to be aware of the complicated nature of the Google URL. As we will see later, posting your own Google URL can reveal personal information about you that you might not feel too comfortable sharing. So read on to learn more about the Google search parameters used in 2012.

Why do I say “in 2012”? Well, the Google URL changed over time and more parameters were added to keep pace with the increasing complexity of the search product, the Google interface and the integration of verticals. Before looking at the parameter table below, though, I encourage you to quickly perform the following 2 things:

  1. Go directly to Google and search for your name. Look at the URL.
  2. Go directly to DuckDuckGo and perform the same search. Look at the URL.

This little exercise serves well to demonstrate just how simple and how complicated URLs used by search engines can look. These two cases are at the opposing ends: While DuckDuckGo has only one search parameter, your query, and is therefore quite readable, Google uses a cryptic construct that only IT professionals can try to decipher. What I find interesting is that on my Smartphone, though, the Google search URL is much simpler than on the desktop.

This blog post is primarily aimed at Google’s web search. I will not look at their other verticals such as scholar or images. But because image search is so useful, I encourage you to look at the image section of the Unofficial Google Advanced Search guide.

The tables of search parameters are a nice resource.
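
On the privacy point: before sharing a search URL, you can strip everything except the query itself. A quick sketch that keeps only the q parameter (the tracking parameter names in the example URL are illustrative):

```python
from urllib.parse import urlparse, parse_qs, urlencode

def clean_search_url(url):
    """Keep only the q= parameter; drop session/tracking parameters."""
    parts = urlparse(url)
    q = parse_qs(parts.query).get("q", [""])[0]
    return f"{parts.scheme}://{parts.netloc}{parts.path}?{urlencode({'q': q})}"

dirty = "https://www.google.com/search?q=topic+maps&ei=abc123&ved=xyz&client=firefox"
print(clean_search_url(dirty))  # https://www.google.com/search?q=topic+maps
```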

Suggestions of similar information for other search engines?

June 24, 2012

What’s Your Default Search Engine?

Filed under: Search Engines,Search Interface,Searching — Patrick Durusau @ 3:55 pm

Bing’s Evolving Local Search by Matthew Hurst.

From the post:

Recently, there have been a number of announcements regarding the redesign of Bing’s main search experience. The key difference is the use of three parallel zones in the SERP. Along with the traditional page results area, there are two new results columns: the task pane, which highlights factual data and the social pane which currently highlights social information from individuals (I distinguish social from ‘people’ as entities – for example a restaurant – can have a social presence even though they are only vaguely regarded as people).

I don’t get out much but I can appreciate the utility of the aggregate results for local views.

Matthew writes:

  1. When we provide flat structured data (as Bing did in the past), while we continued to strive for high quality data, there is no burning light focused on any aspect of the data. However, when we require to join the data to the web (local results are ‘hanging off’ the associated web sites), the quality of the URL associated with the entity record becomes a critical issue.
  2. The relationship between the web graph and the entity graph is subtle and complex. Our legacy system made do with the notion of a URL associated with an entity. As we dug deeper into the problem we discovered a very rich set of relationships between entities and web sites. Some entities are members of chains, and the relationships between their chain home page and the entity is quite different from the relationship between a singleton business and its home page. This also meant that we wanted to treat the results differently. See below for the results for {starbucks in new york}
  3. The structure of entities in the real world is subtle and complex. Chains, franchises, containment (shop in mall, restaurant in casino, hotel in airport), proximity – all these qualities of how the world works scream out for rich modeling if the user is to be best supported in navigating her surroundings.

Truth be told, the structure of entities in the “real world” and their representatives (somewhere other than the “real” world), not to mention their relationships to each other, are all subtle and complex.

That is part of what makes searching, discovery, mapping such exciting areas for exploration. There is always something new just around the next corner.

June 22, 2012

Virtual Documents: “Search the Impossible Search”

Filed under: Indexing,Search Data,Search Engines,Virtual Documents — Patrick Durusau @ 4:17 pm

Virtual Documents: “Search the Impossible Search”

From the post:

The solution was to build an indexing pipeline specifically to address this user requirement, by creating “virtual documents” about each member of staff. In this case, we used the Aspire content processing framework as it provided a lot more flexibility than the indexing pipeline of the incumbent search engine, and many of the components that were needed already existed in Aspire’s component library.

[graphic omitted]

Merging was done selectively. For example, documents were identified that had been authored by the staff member concerned and from those documents, certain entities were extracted including customer names, dates and specific industry jargon. The information captured was kept in fields, and so could be searched in isolation if necessary.

The result was a new class of documents, which existed only in the search engine index, containing extended information about each member of staff; from basic data such as their billing rate, location, current availability and professional qualifications, through to a range of important concepts and keywords which described their previous work, and customer and industry sector knowledge.

Another tool to put in your belt, but I wonder if there is a deeper lesson to be learned here?

Creating a “virtual” document, unlike any that existed in the target collection, and indexing those “virtual” documents was a clever solution.
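
A minimal sketch of the idea, with the field names invented: fold each staff member’s profile and the entities mined from their authored documents into one synthetic record that exists only in the index.

```python
def build_virtual_doc(person, authored_docs, extract_entities):
    """Merge per-person facts and entities mined from their documents
    into a single synthetic record for indexing."""
    doc = {"name": person["name"],
           "location": person["location"],
           "billing_rate": person["billing_rate"],
           "customers": set(), "jargon": set()}
    for d in authored_docs:
        ents = extract_entities(d["text"])   # your entity-extraction step
        doc["customers"] |= ents.get("customer", set())
        doc["jargon"] |= ents.get("jargon", set())
    return doc
```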

But it retains the notion of a “container” or “document” that is examined in isolation from all other “documents.”

Is that necessary? What are we missing if we retain it?

I don’t have any answers to those questions but will be thinking about them.

Comments/suggestions?

June 8, 2012

Apache Nutch 1.5 Released!

Filed under: Nutch,Search Engines,Searching — Patrick Durusau @ 8:55 pm

Apache Nutch 1.5 Released!

From the homepage:

The 1.5 release of Nutch is now available. This release includes several improvements including upgrades of several major components including Tika 1.1 and Hadoop 1.0.0, improvements to LinkRank and WebGraph elements as well as a number of new plugins covering blacklisting, filtering and parsing to name a few. Please see the list of changes made in this version for a full breakdown of the 50 odd improvements the release boasts. The release is available here.

If you are looking for documentation, may I suggest the Nutch wiki?

June 2, 2012

Social Meets Search with the Latest Version of Bing…

Filed under: Search Engines,Searching,Social Media — Patrick Durusau @ 10:29 am

Social Meets Search with the Latest Version of Bing…

Two things are obvious:

  • I am running a day behind.
  • Bing isn’t my default search engine. (Or I would have noticed this yesterday.)

From the post:

A few weeks ago, we introduced you to the most significant update to Bing since our launch three years ago, combining the best of search with relevant people from your social networks, including Facebook and Twitter. After the positive response to the preview, the new version of Bing is available today in the US at www.bing.com. You can now access Bing’s new three column design, including the snapshot feature and social features.

According to a recent internal survey, nearly 75% of people spend more time than they would like searching for information online. With Bing’s new design, you can access information from the Web, including friends you do know and relevant experts you may not know, letting you spend less time searching and more time doing.

(screenshot omitted)

Today, we’re also unveiling a new advertising campaign to support the introduction of search plus social and announcing the Bing Summer of Doing, in celebration of the new features and designed to inspire people to do amazing things this summer.

BTW, I have corrected the broken HTML link for Bing in the quoted post: www.bing.com.

When I arrived, the “top” searches were:

  • Nazi parents
  • Hosni Mubarak

“Popular” searches range from the inane to the irrelevant.

I need something a bit more focused on subjects of interest to me.

Perhaps automated queries that are filtered, then processed into a topic map?

Something to think about over the summer. More posts to follow on that theme.

May 30, 2012

LessJunk.org

Filed under: LessJunk.org,Search Engines — Patrick Durusau @ 2:22 pm

LessJunk.org

From the about page:

Less Junk is a search engine that aims to help sift through the junk on the internet. Let’s face it, there is way too much on the internet, and sometimes, you can’t find good information with a regular search engine. Less Junk searches only the top 5000 sites in the world based on user votes, so you know that you’re only searching the good stuff on the internet. A typical search can return literally millions of results, so you can see why this is helpful in a lot of cases. Our goal isn’t to replace the big name search engines, but rather to supplement them, and take off where they left off. Less Junk brings the social, human element to an industry that is defined by “crawlers” and “robots”.

A crude measure for “junk”: ranking below the top 5,000 sites based on votes. Is that as useful as more elaborate and expensive measures of quality? Looking at the numbers for total votes of all time:

  • Apple 13 votes
  • Facebook 17 votes
  • Microsoft 13 votes
  • Yahoo 14 votes
  • Youtube 23 votes

It looks like a social site that hasn’t quite gone social. If you know what I mean.

You have to wonder about Youtube getting 23 votes as “less junk” in any contest.

Still, limiting the range of search content, perhaps not by voting, may be a good idea.

Thoughts on criteria other than social popularity for limiting the range of material to be searched?

May 28, 2012

The Anatomy of Search Technology: Crawling using Combinators [blekko – part 2]

Filed under: blekko,Search Engines — Patrick Durusau @ 7:09 pm

The Anatomy of Search Technology: Crawling using Combinators by Greg Lindahl.

From the post:

This is the second guest post (part 1) of a series by Greg Lindahl, CTO of blekko, the spam free search engine. Previously, Greg was Founder and Distinguished Engineer at PathScale, at which he was the architect of the InfiniPath low-latency InfiniBand HCA, used to build tightly-coupled supercomputing clusters.

What’s so hard about crawling the web?

Web crawlers have been around as long as the Web has — and before the web, there were crawlers for gopher and ftp. You would think that 25 years of experience would render crawling a solved problem, but the vast growth of the web and new inventions in the technology of webspam and other unsavory content results in a constant supply of new challenges. The general difficulty of tightly-coupled parallel programming also rears its head, as the web has scaled from millions to 100s of billions of pages.

In part 2, you learn why you were supposed to pay attention to combinators in part 1.

Want to take a few minutes to refresh on part 1?

Crawler problems still exist but you may have some new approaches to try.

The Anatomy of Search Technology: blekko’s NoSQL database [part 1]

Filed under: blekko,Search Engines — Patrick Durusau @ 6:57 pm

The Anatomy of Search Technology: blekko’s NoSQL database by Greg Lindahl.

From the post:

This is a guest post by Greg Lindahl, CTO of blekko, the spam free search engine that had over 3.5 million unique visitors in March. Greg Lindahl was Founder and Distinguished Engineer at PathScale, at which he was the architect of the InfiniPath low-latency InfiniBand HCA, used to build tightly-coupled supercomputing clusters.

Imagine that you’re crazy enough to think about building a search engine. It’s a huge task: the minimum index size needed to answer most queries is a few billion webpages. Crawling and indexing a few billion webpages requires a cluster with several petabytes of usable disk — that’s several thousand 1 terabyte disks — and produces an index that’s about 100 terabytes in size.

Greg starts with the storage aspects of the blekko search engine before taking on crawling in part 2 of this series.

Pay special attention to the combinators. You will be glad you did.
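
If you skipped ahead: combinators, as I read the series, are small merge operations that are associative and commutative, so partial results can be combined at any stage of the pipeline. A toy sketch (the names are mine, not blekko’s):

```python
def add_combine(a, b):
    """The simplest combinator: counts merge by addition,
    so partial counts can be combined in any order, anywhere."""
    return a + b

def topn_combine(a, b, n=3):
    """Keep only the n highest-scored items when merging."""
    return sorted(a + b, reverse=True)[:n]

# Partial results from two crawler shards merge cleanly:
print(add_combine(12, 30))                   # 42
print(topn_combine([9, 4, 1], [8, 7], n=3))  # [9, 8, 7]
```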

Whoosh

Filed under: Python,Search Engines,Whoosh — Patrick Durusau @ 9:24 am

Whoosh

From the documentation:

Whoosh is a fast, pure Python search engine library.

The primary design impetus of Whoosh is that it is pure Python. You should be able to use Whoosh anywhere you can use Python, no compiler or Java required.

Like one of its ancestors, Lucene, Whoosh is not really a search engine, it’s a programmer library for creating a search engine [1].

Practically no important behavior of Whoosh is hard-coded. Indexing of text, the level of information stored for each term in each field, parsing of search queries, the types of queries allowed, scoring algorithms, etc. are all customizable, replaceable, and extensible.

[1] It would of course be possible to build a turnkey search engine on top of Whoosh, like Nutch and Solr use Lucene.
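
To show how little ceremony is involved, here is a minimal “hello world” with Whoosh (the schema and field names are mine):

```python
import os
from whoosh import index
from whoosh.fields import Schema, ID, TEXT
from whoosh.qparser import QueryParser

# Define a schema and create an on-disk index.
schema = Schema(path=ID(stored=True), body=TEXT)
os.makedirs("idx", exist_ok=True)
ix = index.create_in("idx", schema)

writer = ix.writer()
writer.add_document(path=u"/a", body=u"pure Python search engine library")
writer.add_document(path=u"/b", body=u"topic maps and semantic diversity")
writer.commit()

with ix.searcher() as searcher:
    query = QueryParser("body", ix.schema).parse(u"python")
    for hit in searcher.search(query):
        print(hit["path"])   # -> /a
```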

I haven’t inventoried script-based search engines, but perhaps I should.

Experiments with indexing/search behaviors might be easier (read: more widespread) with scripting languages.

Comments/suggestions?

May 15, 2012

History matters

Filed under: Search Behavior,Search Engines,Search History — Patrick Durusau @ 7:17 pm

History matters by Gene Golovchinsky.

Whose history? Your history. Your search history. Visualized.

Interested? Read more:

Exploratory search is an uncertain endeavor. Quite often, people don’t know exactly how to express their information need, and that need may evolve over time as information is discovered and understood. This is not news.

When people search for information, they often run multiple queries to get at different aspects of the information need, to gain a better understanding of the collection, or to incorporate newly-found information into their searches. This too is not news.

The multiple queries that people run may well retrieve some of the same documents. In some cases, there may be little or no overlap between query results; at other times, the overlap may be considerable. Yet most search engines treat each query as an independent event, and leave it to the searcher to make sense of the results. This, to me, is an opportunity.

Design goal: Help people plan future actions by understanding the present in the context of the past.

While web search engines such as Bing make it easy for people to re-visit some recent queries, and early systems such as Dialog allowed Boolean queries to be constructed by combining results of previously-executed queries, these approaches do not help people make sense of the retrieval histories of specific documents with respect to a particular information need. There is nothing new under the sun, however: Mark Sanderson’s NRT system flagged documents as having been previously retrieved for a given search task, VOIR used retrieval histograms for each document, and of course a browser maintains a limited history of activity to indicate which links were followed.

Our recent work in Querium (see here and here) seeks to explore this space further by providing searchers with tools that reflect patterns of retrieval of specific documents within a search mission.
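
The core bookkeeping is small. A sketch of per-document retrieval history across a search mission (the data structure is mine, not Querium’s):

```python
from collections import defaultdict

class SearchMission:
    """Remember which queries retrieved which documents."""
    def __init__(self):
        self.history = defaultdict(list)   # doc_id -> [query, ...]

    def record(self, query, results):
        for doc_id in results:
            self.history[doc_id].append(query)

    def seen_before(self, doc_id):
        return self.history[doc_id][:-1]   # earlier queries that found it

m = SearchMission()
m.record("key arena", ["d1", "d2"])
m.record("seattle center arena", ["d2", "d3"])
print(m.seen_before("d2"))   # ['key arena']
```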

Even more interested? Read Gene’s post in full.

If not, check your pulse.

May 13, 2012

Zero Tolerance Search : 24 year old neuroscientist

Filed under: Search Engines,Searching — Patrick Durusau @ 6:48 pm

Zero Tolerance Search : 24 year old neuroscientist

Matthew Hurst writes:

[The idea behind ‘zero tolerance search’ posts is to illustrate real life search interactions that show how far we have to go in leveraging the explicit and implicit data in the web and elsewhere.]

Yesterday, I heard part of an interview on NPR. The interview was around a new book on determinism and neuroscience. The only thing I remember about the author was his young age. I wanted to recover the name of the author and the title of his new book so that I could comment on his argument against determinism (which was, essentially, ‘I’m afraid of determinism therefore it can’t be right’).

Matthew continues by outlining how the text matching of major search engines fails.

How would you improve the results?

April 25, 2012

LAILAPS

LAILAPS

From the website:

LAILAPS combines a keyword-driven search engine for integrative access to life science databases, machine learning for content-driven relevance ranking, recommender systems for suggesting related data records and query refinements, and a user feedback tracking system for self-learning relevance training.

Features:

  • ultra fast keyword based search
  • non-static relevance ranking
  • user specific relevance profiles
  • suggestion of related entries
  • suggestion of related query terms
  • self learning by user tracking
  • deployable at standard desktop PC
  • 100% JAVA
  • installer for in-house deployment

I like the idea of a recommender system that “suggests” related data records and query refinements. It could be wrong.

I am as guilty as anyone of thinking in terms of “correct” recommendations that always lead to relevant data.

That is applying “crisp” set thinking to what is obviously a “rough” set situation. We as readers have to sort out the items in the “rough” set and construct for ourselves a temporary and fleeting “crisp” set for some particular purpose.
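
The crisp/rough distinction can be made concrete. In rough set terms, with records partitioned into indiscernible groups, a target set (say, records a user found relevant) has a lower approximation (groups certainly inside it) and an upper approximation (groups that merely touch it). A toy sketch:

```python
def approximations(partition, target):
    """Lower: blocks wholly inside target. Upper: blocks touching it."""
    lower = {x for b in partition if b <= target for x in b}
    upper = {x for b in partition if b & target for x in b}
    return lower, upper

# Records grouped by indiscernible metadata; target = records a user liked.
partition = [{"r1", "r2"}, {"r3", "r4"}, {"r5"}]
target = {"r1", "r2", "r3"}
print(approximations(partition, target))
# lower: {'r1', 'r2'}   upper: {'r1', 'r2', 'r3', 'r4'}
```

A recommender that reports the upper approximation admits its own roughness, instead of pretending the set is crisp.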

If you are using LAILAPS, I would appreciate a note about your experiences and impressions.

April 16, 2012

Query Rewriting in Search Engines

Filed under: Query Expansion,Query Rewriting,Search Engines — Patrick Durusau @ 7:12 pm

Query Rewriting in Search Engines by Hugh Williams was mentioned in Amazon CloudSearch, Elastic Search as a Service by Jeff Dalton. (You need to read Jeff’s comments on Amazon CloudSearch, but on to query rewriting.)

From the post:

There’s countless information on search ranking – creating ranking functions, and their factors such as PageRank and text. Query rewriting is less conspicuous but equally important. Our experience at eBay is that query rewriting has the potential to deliver as much improvement to search as core ranking, and that’s what I’ve seen and heard at other companies.

What is query rewriting?

Let’s start with an example. Suppose a user queries for Gucci handbags at eBay. If we take this literally, the results will be those that have the words Gucci and handbags somewhere in the matching documents. Unfortunately, many great answers aren’t returned. Why?

Consider a document that contains Gucci and handbag, but never uses the plural handbags. It won’t match the query, and won’t be returned. Same story if the document contains Gucci and purse (rather than handbag). And again for a document that contains Gucci but doesn’t contain handbags or a synonym – instead it’s tagged in the “handbags” category on eBay; the user implicitly assumed it’d be returned when a buyer types Gucci handbags as their query.

To solve this problem, we need to do one of two things: add words to the documents so that they match other queries, or add words to the queries so that they match other documents. Query rewriting is the latter approach, and that’s the topic of this post. What I will say about expanding documents is there are tradeoffs: it’s always smart to compute something once in search and store it, rather than compute it for every query, and so there’s a certain attraction to modifying documents once. On the other hand, there are vastly more words in documents than there are words in queries, and doing too much to documents gets expensive and leads to imprecise matching (or returning too many irrelevant documents). I’ve also observed over the years that what works for queries doesn’t always work for documents.

You really need to read the post by Hugh a couple of times.

Query rewriting is approaching the problem of subject identity from the other side of topic maps.

Topic maps collect different identifiers for a subject as a basis for “merging”.

Query rewriting changes a query so it specifies different identifiers for a subject.

Let me try to draw a graphic for you (my graphic skills are crude at best):

Topic Maps versus Query Rewrite

I used “/” as an alternative marker for topic maps to illustrate that matching any identifier returns all of them. For query rewrite, the “+” sign indicates that each identifier is searched for in addition to the others.

The result is the same set of identifiers and results from using them on a query set.

From a different point of view.
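
A minimal sketch of the rewrite side, with an invented synonym table: expand each query term into an OR over its known alternates, the mirror image of a topic map merging those identifiers.

```python
# Hypothetical synonym/variant table; real systems mine these from logs.
SYNONYMS = {"handbags": ["handbags", "handbag", "purse", "purses"]}

def rewrite(query):
    """Expand each term into an OR over its known alternates."""
    parts = []
    for term in query.lower().split():
        alts = SYNONYMS.get(term, [term])
        parts.append("(" + " OR ".join(alts) + ")" if len(alts) > 1 else term)
    return " AND ".join(parts)

print(rewrite("Gucci handbags"))
# gucci AND (handbags OR handbag OR purse OR purses)
```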
