Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

April 15, 2012

Facebook Search- The fall of the machines

Filed under: Facebook,Search Engines,Searching — Patrick Durusau @ 7:15 pm

Facebook Search- The fall of the machines by Ajay Ohri.

Ajay gives five numbered reasons and then one more for preferring Facebook searching.

I hardly ever visit Facebook (I do have an account) and certainly don’t search using it.

But we could trade stories, rumors, etc. all day.

How would we test Facebook versus other search engines?

Or for that matter, how would we test search engines in general?

When we say search A got a “better” result using search engine Z, by what measure do we mean “better?”
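The classic yardsticks from information retrieval would be precision and recall, computed against human relevance judgments:

P = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{retrieved}|} \qquad R = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|}

Fine for a fixed test collection, but notice the hidden assumption: someone already decided what counts as "relevant." For Facebook versus another engine, whose judgments would we use?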

April 12, 2012

Amazon CloudSearch – Start Searching in One Hour for Less Than $100 / Month

Filed under: Amazon CloudSearch,Search Engines,Searching — Patrick Durusau @ 10:50 am

Amazon CloudSearch – Start Searching in One Hour for Less Than $100 / Month

Jeff Barr, AWS Evangelist, has the easiest job on the Net! How hard can it be to bring “good news” (the original meaning of evangelist) when it just pours out from AWS. If you are an Amazon veep or some such, be assured that managing that much good news is hard. What do you say first?

From the post:

Continuing along in our quest to give you the tools that you need to build ridiculously powerful web sites and applications in no time flat at the lowest possible cost, I’d like to introduce you to Amazon CloudSearch. If you have ever searched Amazon.com, you’ve already used the technology that underlies CloudSearch. You can now have a very powerful and scalable search system (indexing and retrieval) up and running in less than an hour.

You, sitting in your corporate cubicle, your coffee shop, or your dorm room, now have access to search technology at a very affordable price. You can start to take advantage of many years of Amazon R&D in the search space for just $0.12 per hour (I’ll talk about pricing in depth later).

What is Search?

Search plays a major role in many web sites and other types of online applications. The basic model is seemingly simple. Think of your set of documents or your data collection as a book or a catalog, composed of a number of pages. You know that you can find the desired content quickly and efficiently by simply consulting the index.

Search does the same thing by indexing each document in a way that facilitates rapid retrieval. You enter some terms into a search box and the site responds (rather quickly if you use CloudSearch) with a list of pages that match the search terms.
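Jeff's book-index analogy is literally how an inverted index works. A toy version at the Unix command line (the docs directory and file names are hypothetical, just to make the model concrete):

# build a crude inverted index: one "term <tab> file" line per term per document
for f in docs/*.txt; do
  tr -cs '[:alnum:]' '\n' < "$f" | tr '[:upper:]' '[:lower:]' |
    sort -u | awk -v f="$f" 'length($0) > 0 { print $0 "\t" f }'
done | sort > index.tsv

# "search": list the documents that contain a term
awk -F'\t' '$1 == "search" { print $2 }' index.tsv

Real systems add ranking, stemming and positional data, but the consult-the-index model is the same.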

My only quibble with the announcement is that it makes search sound too easy. Jeff does mention all the complex things you can do but a casual reader is left with the impression that search isn’t all that hard.

Well, I suppose search isn’t that hard but good searching is. Some very large concerns have made mediocre searching a real cash cow.

That model, the mediocre searching model, may not work for you. In that case, you can still use Amazon CloudSearch but you had best get some expert searching advice to go along with it.

April 3, 2012

Getting More Out of a Solr Solution

Filed under: Search Engines,Searching,Solr — Patrick Durusau @ 4:17 pm

Getting More Out of a Solr Solution

Jasmine Ashton reports:

Carsabi, a used car search engine, recently posted a write-up that provides readers with useful tips on how to make Solr run faster in “Optimizing Solr (or How to 7x Your Search Speed).”

According to the article, after experiencing problems with their current system, Carsabi switched to WebSolr’s Solr solution which worked until the website reached 1 million listings. In order to make the process work better, Carsabi expanded its hardware storage capacity. While this improved the speed a great deal, they still weren’t satisfied.

A good example of why we should all look beyond the traditional journals, conference proceedings and blogs for experiences with search engines. Where we will find advice like:

Software: Shard that Sh*t

Is that “active” voice? 😉

I should write so clearly and directly.
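For anyone wondering what that advice looks like in practice, distributed search in Solr (3.x) is mostly a matter of listing the shards at query time; a minimal sketch (host and core names are hypothetical):

curl 'http://host1:8983/solr/cars/select?q=make:honda&shards=host1:8983/solr/cars,host2:8983/solr/cars'

The node receiving the request queries each shard's slice of the index and merges the ranked results, which is where the speed-up comes from.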

April 1, 2012

Precise data extraction with Apache Nutch

Filed under: Nutch,Search Data,Search Engines,Searching — Patrick Durusau @ 7:12 pm

Precise data extraction with Apache Nutch by Emir Dizdarevic.

From the post:

Nutch’s HtmlParser parses the whole page and returns parsed text, outlinks and additional meta data. Some parts of this are really useful like the outlinks but that’s basically it. The problem is that the parsed text is too general for the purpose of precise data extraction. Fortunately the HtmlParser provides us a mechanism (extension point) to attach an HtmlParserFilter to it.

We developed a plugin, which consists of HtmlParserFilter and IndexingFilter extensions, which provides a mechanism to fetch and index the desired data from a web page through use of XPath 1.0. The name of the plugin is the filter-xpath plugin.

Using this plugin we are now able to extract the desired data from web sites with known structure. Unfortunately the plugin is an extension of the HtmlParserFilter extension point, which is tightly coupled to the HtmlParser, hence the plugin won’t work without the HtmlParser. The HtmlParser generates its own metadata (host, site, url, content, title, cache and tstamp) which will be indexed too. One way to control this is by not including IndexingFilter plugins which depend on that metadata to generate the indexing data (NutchDocument). The other way is to change the SOLR index mappings in the solrindex-mapping.xml file (which maps NutchDocument fields to SolrInputDocument fields). That way we will index only the fields we want.

The next problem arises when it comes to indexing. We want Nutch to fetch every page on the site but we don’t want to index them all. If we use the UrlRegexFilter to control this we will lose the indirect links which we also want to index and add to our URL DB. To address this problem we developed another plugin, an extension of the IndexingFilter extension point, called the index-omit plugin. Using this plugin we are able to omit indexing on the pages we don’t need.
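To make the mapping step concrete, Nutch's solrindex-mapping.xml is a small XML file along these lines (a minimal sketch; the price field stands in for whatever your filter-xpath expressions extract and is purely hypothetical):

<mapping>
  <fields>
    <!-- pass through only the fields we actually want in Solr -->
    <field dest="url" source="url"/>
    <field dest="title" source="title"/>
    <field dest="price" source="price"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>

This is the control the post relies on to index only the fields it wants.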

Great post on precision and data extraction.

And a lesson that indexing more isn’t the same thing as indexing smarter.

March 25, 2012

A Twelve Step Program for Searching the Internet

Filed under: Common Crawl,Search Data,Search Engines,Searching — Patrick Durusau @ 7:14 pm

OK, the real title is: Twelve steps to running your Ruby code across five billion web pages

From the post:

Common Crawl is one of those projects where I rant and rave about how world-changing it will be, and often all I get in response is a quizzical look. It's an actively-updated and programmatically-accessible archive of public web pages, with over five billion crawled so far. So what, you say? This is going to be the foundation of a whole family of applications that have never been possible outside of the largest corporations. It's mega-scale web-crawling for the masses, and will enable startups and hackers to innovate around ideas like a dictionary built from the web, reverse-engineering postal codes, or any other application that can benefit from huge amounts of real-world content.

Rather than grabbing each of you by the lapels individually and ranting, I thought it would be more productive to give you a simple example of how you can run your own code across the archived pages. It's currently released as an Amazon Public Data Set, which means you don't pay for access from Amazon servers, so I'll show you how on their Elastic MapReduce service.

I'm grateful to Ben Nagy for the original Ruby code I'm basing this on. I've made minimal changes to his original code, and built a step-by-step guide describing exactly how to run it. If you're interested in the Java equivalent, I recommend this alternative five-minute guide.

A call to action and an awesome post!

If you have ever forwarded a blog post, forward this one.

This would make a great short course topic. Will have to give that some thought.

March 23, 2012

Trouble at the text mine

Filed under: Data Mining,Search Engines,Searching — Patrick Durusau @ 7:24 pm

Trouble at the text mine by Richard Van Noorden.

From the post:

When he was a keen young biology graduate student in 2006, Max Haeussler wrote a computer program that would scan, or ‘crawl’, plain text and pull out any DNA sequences. To test his invention, the naive text-miner downloaded around 20,000 research papers that his institution had paid to access — and promptly found his IP address blocked by the papers’ publisher.

It was not until 2009 that Haeussler, then at the University of Manchester, UK, and now at the University of California, Santa Cruz, returned to the project in earnest. He had come to realize that standard site licences do not permit systematic downloads, because publishers fear wholesale theft of their content. So Haeussler began asking for licensing terms to crawl and text-mine articles. His goal was to serve science: his program is a key part of the text2genome project, which aims to use DNA sequences in research papers to link the publications to an online record of the human genome. This could produce an annotated genome map linked to millions of research articles, so that biologists browsing a genomic region could immediately click through to any relevant papers.

But Haeussler and his text2genome colleague Casey Bergman, a genomicist at the University of Manchester, have spent more than two years trying to agree terms with publishers — and often being ignored or rebuffed. “We’ve learned it’s a long, hard road with every journal,” says Bergman.

What Haeussler and Bergman don’t seem to “get” is that publishers have no interest in advancing science. Their sole goal is profiting from the content they have published. (I am not going to argue right or wrong but am simply trying to call out the positions in question.)

The question that Haeussler and Bergman should answer for publishers is this one: What is in this “indexing” for the publishers?

I suspect one acceptable answer would run along the lines of:

  • The full content of articles cannot be reconstructed from the indexes. The largest block of content delivered will be the article abstract, along with bibliographic reference data.
  • Pointers to the articles will point towards either the publisher’s content site and/or other commercial content providers that carry the publisher’s content.
  • The publisher’s designated journal logo (of some specified size) will appear with every reported citation.
  • The indexed content will be provided to the publishers at no charge.

Does this mean that publishers will be benefiting from allowing the indexing of their content? Yes. Next question.

Challenges in maintaining a high performance search engine written in Java

Filed under: Lucene,Search Algorithms,Search Engines — Patrick Durusau @ 7:24 pm

Challenges in maintaining a high performance search engine written in Java

You will find this on the homepage or may have to search for it. I was logged in when I accessed it.

Very much worth your time for a high level overview of issues that Lucene will face sooner rather than later.

After reviewing, think about it, make serious suggestions and if possible, contributions to the future of Lucene.

Just off the cuff, I would really like to see Lucene become a search engine framework with a default data structure that admits to either extension or replacement by other data structures. Some data structures may have higher performance costs than others, but if that is what your requirements call for, they can hardly be wrong. Yes? A “fast” search engine that doesn’t meet your requirements is no prize.

March 14, 2012

Keyword Indexing for Books vs. Webpages

Filed under: Books,Indexing,Keywords,Search Engines — Patrick Durusau @ 7:35 pm

I was watching a lecture on keyword indexing that started off with a demonstration of an index to a book, which was being compared to indexing web pages. The statement was made that the keyword pointed the reader to a page where that keyword could be found, much like a search engine does for a web page.

Leaving aside the more complex roles that indexes for books play, such as giving alternative terms, classifying the nature of the occurrence of the term (definition, mentioned, footnote, etc.), cross-references, etc., I wondered if there is a difference between a page reference in a book index vs. a web page reference by a search engine?

In some 19th century indexes I have used, the page references are followed by a letter of the alphabet, to indicate that the page is divided into sections, sometimes as many as a – h or even higher. Mostly those are complex reference works, dictionaries, lexicons, works of that type, where the information is fairly dense. (Do you know of any modern examples of indexes where pages are divided? A note would be appreciated.)

I have the sense that an index of a book, without sub-dividing a page, is different from an index pointing to a web page. It may be a difference that has never been made explicit but I think it is important.

Some observations about text length on a “page:”

With a short amount of content (the average book page), the user has little difficulty finding an index term on the page. But the longer the web page, the less useful our instinctive (trained?) scan of the page becomes.

In part that is because portions of the page scroll out of view. As you may know, that doesn’t happen with a print book.

Scanning of a print book is different from scanning of a webpage. How to account for that difference I don’t know.

Before you suggest Ctrl-F, see Do You Ctrl-F?. What was it you were saying about Ctrl-F?

Web pages (or other electronic media) that don’t replicate the fixed display of book pages result in a different indexing experience for the reader.

If a search engine index could point into a page, it would still be different from a traditional index but would come closer to a traditional index.

(The W3C has steadfastly resisted any effective subpage pointing. See the sad history of XLink/XPointer. You will probably have to ask insiders but it is a well known story.)

BTW, in case you are interested in blog length, see: Bloggers: This Is How Long Your Posts Should Be. Informative and amusing.

February 28, 2012

Metaphorical search engine finds creative new meanings

Filed under: Metaphors,Search Engines — Patrick Durusau @ 8:42 pm

Metaphorical search engine finds creative new meanings

From the post:

TYPING “love” into Google, I find the Wikipedia entry, a “relationship calculator” and Lovefilm, a DVD rental service. Doing the same in YossarianLives, a new search engine due to launch this year, I might receive quite different results: “river”, “sleep” and “prison”. Its creators claim YossarianLives is a metaphorical search engine, designed to spark creativity by returning disparate but conceptually related terms. So the results perhaps make sense if you accept that love can ebb and flow, provide rejuvenating comfort or just make you feel trapped.

“Today’s internet search tells us what the world already knows,” explains the CEO of YossarianLives, J. Paul Neeley. “We don’t want you to know what everyone else knows, we want you to generate new knowledge.” He says that metaphors help us see existing concepts in a new way and create innovative ideas. For example, using a Formula 1 pit crew as a metaphor for doctors in an emergency room has helped improve medical procedures. YossarianLives aims to create new metaphors for designers, artists, writers or even scientists.

The name is derived from the anti-hero of the novel Catch-22, as the company wants to solve the catch-22 of existing search engines, which they say help us to access current knowledge but also harm us by reinforcing that knowledge above all else.

Sounds too good to be true but good things do happen.

What do you think?

February 14, 2012

Querying joined data within a search engine index U.S. Patent 8,073,840

Filed under: Patents,Search Engines — Patrick Durusau @ 5:03 pm

Querying joined data within a search engine index U.S. Patent 8,073,840

Abstract:

Techniques and systems for indexing and retrieving data and documents stored in a record-based database management system (RDBMS) utilize a search engine interface. Search-engine indices are created from tables in the RDBMS and data from the tables is used to create “documents” for each record. Queries that require data from multiple tables may be parsed into a primary query and a set of one or more secondary queries. Join mappings and documents are created for the necessary tables. Documents matching the query string are retrieved using the search-engine indices and join mappings.

Is anyone maintaining an index or topic map of search engine/technique patents?

If such a resource was public it might be of assistance to patent examiners.

I say “might” because I have yet to see a search technology patent that would survive even minimal knowledge of prior art.

Knowledge of prior art in a field isn’t a qualification, or at least not an important one, for patent examiners.

My suggestion is that we triple the estimated cost of a patent and start selling them on a same day basis. Skip the fiction of examination and make some money for the government in the process.

People can pay their lawyers to fight out overlapping patents in the courts.

February 10, 2012

Semantic based web search engines- Changing the world of Search

Filed under: Search Engines — Patrick Durusau @ 4:11 pm

Semantic based web search engines- Changing the world of Search by Prachi Nagpal.

From the post:

An important quality that the majority of search engines functional today lack is the ability to take into account the intention of the user behind the overall query. Basing the matching of web pages on keyword frequency and a ranking metric such as PageRank returns various results that may be of high ranking but still irrelevant to the user’s intended context. Therefore I explored and realized that there is a need to add semantics to the web search.

There are some semantic search engines that have already come up in the market, e.g. Hakia, Swoogle, Kosmix, etc., that take a semantic based approach which is different from the traditional search engines. I really liked their idea of implementing and adding semantics to the web search. This provoked me to do more research in this field and to think of different ways to add semantics.

Following is an Algorithm that can be used in Semantic Based Web Search Engines :-

So, to find web pages on the internet that match a user’s query based not only on the important keywords in a user query, but also based on the intention of the user, behind that query, first the user’s entered query is expanded using WordNet ontology.

This Algorithm focuses on work that uses the Hypernym/Hyponymy and Synset relations in WordNet for query expansion algorithm. A set of words that are highly related to the words in the user query, determined by the frequency of their occurrence in the Synset and Hyponym tree for each of the user query terms is created. This set is now refined using the same relations to get a more precise and accurate expanded query.
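If you want to poke at the Synset/Hypernym/Hyponym relations yourself, the WordNet command line tool exposes them directly; a quick sketch, assuming the wn tool from the WordNet distribution is installed:

wn car -synsn    # synonyms (the Synset) for the noun "car"
wn car -hypen    # hypernyms: car is a kind of ...
wn car -hypon    # hyponyms: kinds of car

Walking those trees and keeping the most frequent neighbors is essentially the expansion step the algorithm describes.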

Interesting approach but as the comments indicate, a lack of use of RDF makes it problematic.

I would rephrase the problem statement from:

…the majority of search engines functional today lack is the ability to take into account the intention of the user behind the overall query.

to: …the majority of search engines lack the ability to accurately interpret the semantics of web accessible content.

I think we would all agree that web accessible content has semantics.

The problem is how to bring those semantics to the attention of search engines?

Or perhaps better, how do we take advantage of those semantics with current search engines, which are semantically deaf and dumb?

January 26, 2012

Employee productivity: 21 critical minutes (no-line-item (nli) in the budget?)

Filed under: Marketing,Search Engines,Searching — Patrick Durusau @ 6:54 pm

Employee productivity: 21 critical minutes by Gilles ANDRE.

From the post:

Twenty-one minutes a day. That’s how long employees spend each day searching for information they know exists but is hard to find. These 21 minutes cost their company the equivalent of €1,500 per year per employee. That’s an average of two whole working weeks. This particular Mindjet study is, of course, somewhat anecdotal and some research firms such as IDC put the figure as high as €10,000 per year. These findings signal a new challenge facing businesses: employees know that the information is there, but they cannot find it. This stalemate can become extremely costly and, in some cases, can even kill off a business. Are companies really aware of this problem?

(paragraph and graphic omitted)

So far, companies have responded to this rising tide of data by spending money. They have invested large, even enormous sums in solutions to store, secure and access their information – one of the key assets of their business. They have also invested heavily in a range of different applications to meet their operational needs. Yet these same applications have created vast information silos spanning their entire organisation. Interdepartmental communication is stifled and information travels like vehicles on the M25 during rush hour.

The link to Mindjet is to their corporate website and not to the study. Ironically, I did search the Mindjet site (search being the solution Polyspot suggests) and came up empty for “21 minutes.” You would think that would appear in the report somewhere as a string.

I suspect 21 minutes would be on the low side of lost employee productivity on a daily basis.

But it isn’t hard to discover why businesses have failed to address that loss in employee productivity.

Take out the latest annual report for your business with a line item budget in it. Examine it carefully and then answer the following question:

At what line item is lost employee productivity reported?

Now imagine that your CIO proposes to make information once found, found for all employees: a mixture of a search engine, indexing and a topic map, with a process to keep it updated.

You don’t know the exact figures but do you think there would be a line item in the budget for such a project?

And, would there be metrics to determine if the project succeeded or failed?

Ah, so, if the business continues to lose employee productivity there is no metric for success or failure and it never shows up as a line item in the budget.

That is the safe position.

At least until the business collapses and/or is overtaken by other companies.

If you are interested in overtaking no-line-item (nli) companies consider evolving search applications that incorporate topic maps.

Topic maps: Information once found, stays found.

January 25, 2012

Searching and Browsing Linked Data with SWSE: the Semantic Web Search Engine

Filed under: Linked Data,RDF,Search Engines,Semantic Web — Patrick Durusau @ 3:30 pm

Searching and Browsing Linked Data with SWSE: the Semantic Web Search Engine by Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres and Stefan Decker.

Abstract:

In this paper, we discuss the architecture and implementation of the Semantic Web Search Engine (SWSE). Following traditional search engine architecture, SWSE consists of crawling, data enhancing, indexing and a user interface for search, browsing and retrieval of information; unlike traditional search engines, SWSE operates over RDF Web data (loosely also known as Linked Data), which implies unique challenges for the system design, architecture, algorithms, implementation and user interface. In particular, many challenges exist in adopting Semantic Web technologies for Web data: the unique challenges of the Web (in terms of scale, unreliability, inconsistency and noise) are largely overlooked by the current Semantic Web standards. Herein, we describe the current SWSE system, initially detailing the architecture and later elaborating upon the function, design, implementation and performance of each individual component. In so doing, we also give an insight into how current Semantic Web standards can be tailored, in a best-effort manner, for use on Web data. Throughout, we offer evaluation and complementary argumentation to support our design choices, and also offer discussion on future directions and open research questions. Later, we also provide candid discussion relating to the difficulties currently faced in bringing such a search engine into the mainstream, and lessons learnt from roughly six years working on the Semantic Web Search Engine project.

This is the paper that Ivan Herman mentions at Nice reading on Semantic Search.

It covers a lot of ground in fifty-five (55) pages but it doesn’t take long to hit an issue I wanted to ask you about.

At page 2, Google is described as follows:

In the general case, Google is not suitable for complex information gathering tasks requiring aggregation from multiple indexed documents: for such tasks, users must manually aggregate tidbits of pertinent information from various recommended heterogeneous sites, each such site presenting information in its own formatting and using its own navigation system. In effect, Google’s limitations are predicated on the lack of structure in HTML documents, whose machine interpretability is limited to the use of generic markup-tags mainly concerned with document rendering and linking. Although Google arguably makes the best of the limited structure available in such documents, most of the real content is contained in prose text which is inherently difficult for machines to interpret. Addressing this inherent problem with HTML Web data, the Semantic Web movement provides a stack of technologies for publishing machine-readable data on the Web, the core of the stack being the Resource Description Framework (RDF).

A couple of observations:

Although Google needs no defense from me, I would argue that Google never set itself the task of aggregating information from indexed documents. Historically speaking, IR has always been concerned with returning relevant documents and not returning irrelevant documents.

Second, the lack of structure in HTML documents (although the article mixes in sites with different formatting) is no deterrent to a human reader aggregating “tidbits of pertinent information….” I rather doubt that writing all the documents in valid Springer LaTeX would make that much difference on the “tidbits of pertinent information” score.

This is my first pass through the article and I suspect it will take three or more to become comfortable with it.

Do you agree/disagree that the task of IR is to retrieve documents, not “tidbits of pertinent information?”

Do you agree/disagree that HTML structure (or lack thereof) is that much of an issue for the interpretation of documents?

Thanks!

Documents as geometric objects: how to rank documents for full-text search

Filed under: PageRank,Search Engines,Vector Space Model (VSM) — Patrick Durusau @ 3:27 pm

Documents as geometric objects: how to rank documents for full-text search by Michael Nielsen, July 7, 2011.

From the post:

When we type a query into a search engine – say “Einstein on relativity” – how does the search engine decide which documents to return? When the document is on the web, part of the answer to that question is provided by the PageRank algorithm, which analyses the link structure of the web to determine the importance of different webpages. But what should we do when the documents aren’t on the web, and there is no link structure? How should we determine which documents most closely match the intent of the query?

In this post I explain the basic ideas of how to rank different documents according to their relevance. The ideas used are very beautiful. They are based on the fearsome-sounding vector space model for documents. Although it sounds fearsome, the vector space model is actually very simple. The key idea is to transform search from a linguistic problem into a geometric problem. Instead of thinking of documents and queries as strings of letters, we adopt a point of view in which both documents and queries are represented as vectors in a vector space. In this point of view, the problem of determining how relevant a document is to a query is just a question of determining how parallel the query vector and the document vector are. The more parallel the vectors, the more relevant the document is.

This geometric way of treating documents turns out to be very powerful. It’s used by most modern web search engines, including (most likely) web search engines such as Google and Bing, as well as search libraries such as Lucene. The ideas can also be used well beyond search, for problems such as document classification, and for finding clusters of related documents. What makes this approach powerful is that it enables us to bring the tools of geometry to bear on the superficially very non-geometric problem of understanding text.
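The whole trick fits in one formula. With query q and document d represented as term-weight vectors (tf-idf weights being the usual choice), relevance is the cosine of the angle between them:

\text{score}(q, d) = \cos\theta = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert} = \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2}\,\sqrt{\sum_i d_i^2}}

Parallel vectors score 1; vectors sharing no terms are orthogonal and score 0.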

Very much looking forward to future posts in this series. There is no denying the power of the vector space model, but that leaves unasked: what is lost in the transition from linguistic to geometric space?

January 24, 2012

CS 101: Build a Search Engine

Filed under: CS Lectures,Search Engines — Patrick Durusau @ 3:41 pm

CS 101: Build a Search Engine

David Evans and Sebastian Thrun teach CS 101 by teaching students how to build a search engine.

There is an outline syllabus but not any more detail at this time.

January 19, 2012

Beepl Launches A Twitter-Simple, “Social Q&A Site”

Filed under: Search Engines — Patrick Durusau @ 7:42 pm

Beepl Launches A Twitter-Simple, “Social Q&A Site” by Kit Eaton.

From the post:

People, meet Beepl. It launched to the general public yesterday in the online expertise-sharing/question-and-answer sphere after a short private test run. Branding itself as a “social Q&A site” that “lets users seek answers and opinion from subject specialists, enthusiasts and their social graph,” Beepl also “understands the topics that questions relate to and users’ interests and expertise so that questions automatically reach the best people to reach them.” That bit of lateral thinking differentiates Beepl in a pretty bustling market, but it’s only one of the novel surprises from the company (starting with the lack of a launch press release).

There was this exchange with the founder, Steve O’Hear:

….How can you trust that it’ll connect you to something interesting to you, or perhaps something you have vital insight into for others? Does it mean you may miss out on fringe questions about things you never knew about, but may be fascinated by?

Beepl addresses this, O’Hear says, because the “most aggressive part is for people that are actively using the site. It looks at questions you’ve clicked on, any you’ve answered, any you’ve asked. It even takes a tiny amount from if you do a search on the site.”

I guess if you think politicians really answer each other in debates you could consider that to be a response. 😉 Well, from a dialogue standpoint it was a response but it wasn’t a very helpful one.

From a topic map standpoint, how would you go about mapping the stream of questions and answers? Clues you would look for? Not quite as short as tweets. Enough for context?

January 17, 2012

The Long Tail of Semantics

Filed under: Search Engines,Searching,Semantics — Patrick Durusau @ 8:09 pm

It came up in a conversation with Sam Hunting recently that search engines are holding a large end of a long tail of semantics. Well, that is how I would summarize almost 30 minutes of hitting at and around the idea!

Think about it: search engines, by their present construction, are bound to the large end of a long tail of search results. That is the end of the long tail that they report to users, with varying degrees of filtering and enhancement, not to mention paid ads.

Problem: The long tail of semantics hasn’t been established in general, for some particular set of terms and certainly not for any particular user. Oops (as Rick Perry would say).

And each search result represents some unknown position in some long tail of semantics for a particular user. Oops, again.

Search engines do well enough to keep users coming back, so they are hitting some part of the long tail of semantics, they just don’t know what part for any particular user.

I am sure it is easier to count occurrences, queries and the like and trust that the search engine is hitting high enough somewhere on the long tail to justify ad rates.

But what if we could improve that? That is, not banging around somewhere on the long tail of semantics in general, but hitting some particular sub-tail of semantics.

For example, we know when terminology is being taken from an English language journal on cardiology. We have three semantic indicators: English as a language, journal as means of publication and cardiology as a subject area. What is more, we can discover without too much difficulty the papers cited by authors of that journal, which more likely than not would be recognized by other readers of that journal. So what if we kept the results from that area segregated from other search results and did the same (virtually) with other recognized areas? (Mathematics, for example, has varying terms even within disciplines, set theory being one, so work would remain to be done.)

Rather than putting search results together and later trying to disambiguate them, start that process at the beginning and preserve as much data as we can that may help divide one long tail into smaller ones.

(This sounds like “personalization” to me as I write it but personalization has its own hazards and dangers. Some of which can be avoided by asking a librarian. More on that another time.)

January 14, 2012

ToChildBlockJoinQuery in Lucene

Filed under: Lucene,Search Engines — Patrick Durusau @ 7:34 pm

ToChildBlockJoinQuery in Lucene.

Mike McCandless writes:

In my last post I described a known limitation of BlockJoinQuery: it joins in only one direction (from child to parent documents). This can be a problem because some applications need to join in reverse (from parent to child documents) instead.

This is now fixed! I just committed a new query, ToChildBlockJoinQuery, to perform the join in the opposite direction. I also renamed the previous query to ToParentBlockJoinQuery.

This will be included in Lucene 3.6.0 and 4.0.

Custom Search JavaScript API is now fully documented!

Filed under: Google CSE,Javascript,Search Engines,Searching — Patrick Durusau @ 7:33 pm

Custom Search JavaScript API is now fully documented!

Riona MacNamara writes:

The Custom Search engineers spent 2011 launching great features. But we still hear from our users that our documentation could do with improvement. We hear you. Today we’re launching some updates to our docs:

  • Comprehensive JavaScript reference for the Custom Search Element. We’ve completely overhauled our Custom Search Element API documentation to provide a comprehensive overview of all the JavaScript methods available. We can’t wait to see what you build with it.
  • More languages. The Help Center is now available in Danish, Dutch, Finnish, French, German, Italian, Japanese, Norwegian, Spanish, and Swedish.
  • Easier navigation and cleaner design. We’ve reorganized the Help Center to make it easier to find the information you’re looking for. Navigation is simpler and more streamlined. Individual articles have been revised and updated, and designed to be more readable.

Documentation is an ongoing effort, and we’ll be continuing to improve both our Help Center and our developer documentation. If you have comments or suggestions, we’d love to see them in our user forum.

Granted, a Google CSE can give you more focused results (along with ads), but don’t you still have the problem of re-using search results?

It’s a good thing that users can more quickly find relevant content in a particular domain, but do you really want your users searching for the same information over and over again?

Hmmm, what if you kept a search log with the “successful” results as chosen by your users? That could be a start in terms of locating subjects and information about them that is important to your users. Subjects that could then be entered into your topic map.
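A first pass at such a log needs nothing fancier than the command line. Assuming a tab-separated click log of (timestamp, query, chosen result), where the file name and format are hypothetical, the most frequently “successful” pairs fall out directly:

# top 20 (query, chosen-result) pairs from a click log
cut -f2,3 search_clicks.log | sort | uniq -c | sort -rn | head -20

Each recurring pair is a candidate subject (and name) for your topic map.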

January 13, 2012

OpenSearch.org

Filed under: OpenSearch.org,Search Engines,Searching — Patrick Durusau @ 8:14 pm

OpenSearch.org

From the webpage:

OpenSearch is a collection of simple formats for the sharing of search results.

The OpenSearch description document format can be used to describe a search engine so that it can be used by search client applications.

The OpenSearch response elements can be used to extend existing syndication formats, such as RSS and Atom, with the extra metadata needed to return search results.
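A description document is short enough to sketch here (a minimal example; the example.com values are placeholders):

<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
  <ShortName>Example Search</ShortName>
  <Description>Search the content of example.com</Description>
  <Url type="application/rss+xml"
       template="http://example.com/search?q={searchTerms}&amp;start={startIndex?}"/>
</OpenSearchDescription>

A client that understands the format can then send queries by filling in the template.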

I like this line from the FAQ:

Different types of content require different types of search engines. The best search engine for a particular type of content is frequently the search engine written by the people that know the content the best.

Not a lot of people are using the OpenSearch description document, but I wonder if you could write something similar for websites? That is, declare for a website (your own or someone else’s) what is to be found there, or what vocabulary is used there.

With just a small boost from their users, search engines could do a lot better in terms of producing sane results.

Will have to think about a way to test the declaration of a vocabulary for a website, or even a group of websites, and compare that to a search without the benefit of such a vocabulary. Suggestions welcome!

January 2, 2012

Web Search – It’s Worse Than You Think

Filed under: Search Engines,Searching — Patrick Durusau @ 9:50 am

Web Search – It’s Worse Than You Think

Matthew Hurst writes (in part):

While it seems like everything in the online space is hunky dory and progress is making predictable strides towards our inevitable AI infested future, I often see such utter failures in search engine results that makes me think we haven’t even started to lay the foundations.

Here’s the story: as I’ve become interested in mining the news cycle for various reasons, I’ve started attempting to understand who the editors of major news sources are. The current version of the Hapax Page on d8taplex tracks the attribution of article authors and editors (I conflate the concept of writer, reporter and un-typed contributors under the term ‘author’ while explicit editors are tracked separately). From this analysis, I see that there is someone called Cynthia Johnston who is often associated with articles from Reuters (in fact, she is currently at the top of the list ranked by count of articles).

You need to read his post in full to get the real flavor of his experience with the Cynthia Johnston search request.

Two quick points:

+1 to we have not laid the foundations for adequate searching. Not surprising since I don’t think we understand what adequate “searching” means in a multi-semantic context such as the WWW. Personally I don’t think we understand searching in a mono-semantic context either, but that is a separate issue.

As to his blog post changing the search experience for anyone seeking information on Cynthia Johnston, do we need to amend:

Observer effect (information technology)

or

Observer effect (physics)

at Wikipedia, or do we need a new subject:

Observer effect (search technology)?

December 17, 2011

Google removes more search functionality

Filed under: Advertising,Search Engines,Search Interface,Searching — Patrick Durusau @ 6:32 am

Google removes more search functionality by Phil Bradley.

From the post:

In Google’s apparently lemming-like attempt to throw as much search functionality away as they can, they have now revamped their advanced search page. Regular readers will recall that I wrote about Google making it harder to find, and now they’re reducing the available options. The screen is now following the usual grey/white/red design, but to refresh your memory, this is what it used to look like:

Just in case you are looking for search opportunities in the near future.

The smart money says not to try to be everything to everybody. Pick off a popular (read: advertising-supporting) subpart of all content and work it up really well. Offer users in that area what seem like useful defaults for it. The defaults for television/movie types are likely to be different from those for the Guns & Ammo crowd. As would the advertising you would sell.

Remind me to write about using topic maps to create pull-model advertising. So that viewers pre-qualify themselves and you can charge more for “hits” on ads.

December 14, 2011

Nutch Tutorial: Supplemental III

Filed under: Nutch,Search Engines,Searching — Patrick Durusau @ 7:45 pm

Apologies for the diversion in Nutch Tutorial: Supplemental II.

We left off last time with a safe way to extract the URLs from the RDF text without having to parse the XML and without having to expand the file onto the file system. And we produced a unique set of URLs.

We still need a random set of URLs; 1,000 was the amount mentioned in the Nutch Tutorial at Option 1.

Since we did not parse the RDF, we can’t use the subset option for org.apache.nutch.tools.DmozParser.

So, back to the Unix command line and our file with 3,838,759 lines in it, each with a unique URL.

Let’s do this a step at a time and we can pipe it all together below.

First, our file is: dmoz.urls.gz, so we expand it with gunzip:

gunzip dmoz.urls.gz

Results in dmoz.urls

Then we run the shuf command, which randomly shuffles the lines in the file:

shuf dmoz.urls > dmoz.shuf.urls

Remember the > character redirects the output into another file.

Now the lines are in random order. But it is still the full set of URLs.

So we run the head command to take the first 1,000 lines off of our now randomly sorted file:

head -1000 dmoz.shuf.urls > dmoz.shuf.1000.urls

So now we have a file with 1,000 randomly chosen URLs from our DMOZ source file.

Here is how to do all that in one line:

gunzip -c dmoz.urls.gz | shuf | head -1000 > dmoz.shuf.1000.urls

BTW, in case you are worried about the randomness of your set (randomness keeps all of us from hitting the same servers with our test installations), don’t be.

I ran shuf twice in a row on my set of URLs and then ran diff, which reported the first 100 lines were in a completely different order.

BTW, to check yourself on the extracted set of 1,000 URLs, run the following:

wc -l dmoz.shuf.1000.urls

Result should be 1000.

The wc command prints newline, word and byte counts. With the -l option, it prints new line counts.

In case you don’t have the shuf command on your system, I would try:

sort -R dmoz.urls > dmoz.sort.urls

as a substitute for shuf dmoz.urls > dmoz.shuf.urls

Hilary Mason (source of the sort suggestion) has collected more ways to extract one line (not exactly our requirement but you can be creative) at: How to get a random line from a file in bash.

I am still having difficulties with one way to use Nutch/Solr so we will cover the “other” path, the working one, tomorrow. It looks like a bug between versions and I haven’t found the correct java class to copy over at this point. It’s not as though the tutorial mentions that sort of thing. 😉

December 13, 2011

Which search engine when?

Filed under: Search Engines,Search Interface,Searching — Patrick Durusau @ 9:51 pm

Which search engine when?

A listing of search engines in the following categories:

  • keyword search
  • index or directory based
  • multi or meta search engines
  • visual results
  • category
  • blended results

There are fifty-three (53) entries so plenty to choose from if you are bored with your current search “experience.”

Not to mention learning about different ways to present search results to users.

BTW, if you run across a blog mentioning that AllPlus was listed in two separate categories, like this one, realize that SearchLion was also listed in two separate categories.

Search engines are an important topic for topic mappers because they are one of the places where semantic impedance and the lack of useful organization of information are a major time sink for all users.

Getting 400,000 “hits” is just a curiosity; getting 402 “hits” in a document archive, as I did this morning, is a considerable amount of content but a manageable one.

No, it wasn’t a topic map that I was searching but the results may well find their way into a topic map.

December 12, 2011

NLM Plus

Filed under: Bioinformatics,Biomedical,Search Algorithms,Search Engines — Patrick Durusau @ 10:22 pm

NLM Plus

From the webpage:

NLMplus is an award winning Semantic Search Engine and Biomedical Knowledge Base application that showcases a variety of natural language processing tools to provide an improved level of access to the vast collection of biomedical data and services of the National Library of Medicine.

Utilizing its proprietary Web Knowledge Base, WebLib LLC can apply the universal search and semantic technology solutions demonstrated by NLMplus to libraries, businesses, and research organizations in all domains of science and technology and Web applications.

Any medical librarians in the audience? Or ones you can forward this post to?

Curious what professional researchers make of NLM Plus? I don’t have the domain expertise to evaluate it.

Thanks!

Nutch Tutorial: Supplemental II

Filed under: Nutch,Search Engines,Searching — Patrick Durusau @ 10:20 pm

This continues Nutch Tutorial: Supplemental.

I am getting a consistent error from:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

and have posted to the solr list, although my post has yet to appear on the list. More on that in my next post.

I wanted to take a quick detour into 3.2 Using Individual Commands for Whole-Web Crawling as it has some problematic advice in it.

First, Open Directory Project data can be downloaded from How to Get ODP Data. (Always nice to have a link, I think they call them hyperlinks.)

Second, as of last week, the content.rdf.u8.gz file is 295,831,712 bytes. Something about that size should warn you that simply expanding this file with gunzip is a bad idea.

A better one?

Run: gunzip -l (the switch is lowercase “l” as in Larry) which delivers the following information:

compressed size: size of the compressed file
uncompressed size: size of the uncompressed file
ratio: compression ratio (0.0% if unknown)
uncompressed_name: name of the uncompressed file

Or, in this case:

gunzip -l content.rdf.u8.gz

compressed   uncompressed  ratio  uncompressed_name
295831712    1975769200    85.0%  content.rdf.u8

Yeah, that leading 1 under uncompressed makes it 1,975,769,200 bytes, just a tad shy of 2 GB of data.

Not everyone who is keeping up with search technology wants to burn a couple of spare GBs of drive space on a file they never needed to expand, although large drives are becoming more common.

What got my attention was the lack of documentation of the file size or potential problems such a download could cause casual experimenters.

But, if we are going to work with this file we have to do so without decompressing it.

Since the tutorial only extracts URLs, I am taking that as the initial requirement although we will talk about more sophisticated requirements in just a bit.

On a *nix system it is possible to move the results of one command to another command by what are called pipes. My thinking in this case was to use the decompression command not to expand the file on disk, but to stream its uncompressed contents to another command that would extract the URLs. After I got that part working, I sorted and then deduped the URL set.

Here is the command with [step] numbers that you should remove before trying to run it:

[1]gunzip -c content.rdf.u8.gz [2]| [3]grep -o 'http://[^"]*' [2]| [4]sort [2]| [5]uniq [6]> [7]dmoz.urls

  1. gunzip -c content.rdf.u8.gz – With the -c switch, gunzip does not change the original file but streams the uncompressed content of the file to standard out. This is our starting point for dealing with files too large to expand.
  2. | – This is the pipe character that moves the output of one command to be used by another. The shell command in this case has three (3) pipe commands.
  3. grep -o 'http://[^"]*' – With the -o switch, grep prints only the matched parts of a matching line (grep normally prints the entire line), with each part on a separate line. The 'http://[^"]*' is a regular expression matching parts that start with http:// and continue with any character other than the double quote mark. When the double quote mark is reached, the match is complete and that part prints. Note the "*" quantifier, which allows any number of characters up to the closing double quote. The entire expression is surrounded with single quote characters because it contains a double quote character.
  4. sort – The result of #3 is piped into #4, where it is sorted. The sort is necessary because of the next command in the pipe.
  5. uniq – The sorted result is delivered to the uniq command which deletes any duplicate URLs. A requirement for the uniq command is that the duplicates be located next to each other, hence the sort command.
  6. > – Writes the results of the uniq command to a file.
  7. dmoz.urls – The file name for the results.

The results were as follows:

  • dmoz.urls = 130,279,429 bytes – Remember the projected expansion of the original was 1,975,769,200, or 1,845,489,771 larger.
  • dmoz.urls.gz = 27,832,013 bytes – The original was 295,831,712, or 267,999,699 larger.
  • unique urls – 3,838,759 (I have no way to compare to the original)

Note that it wasn’t necessary to process the RDF in order to extract a set of URLs for seeding a search engine.

Murray Altheim made several very good suggestions with regard to Java libraries and code for this task. Those don’t appear here but will appear in a command line tool for this dataset that allows the user to choose categories of websites to be extracted for seeding a search engine.

All that is preparatory to a command line tool for creating a topic map from a selected part of this data set and then enhancing it with the results of the use of a search engine.

Apologies for getting off track on the Nutch tutorial. There are some issues that remain to be answered, typos and the like, which I will take up in the next post on this subject.

December 9, 2011

dmoz – open directory project

Filed under: Search Data,Search Engines — Patrick Durusau @ 8:25 pm

dmoz – open directory project

This came up in the discussion of the Nutch Tutorial and I thought it might be helpful to have an entry on the site.

It is a collection of hand-edited resources which as of today claims:

4,952,266 sites – 92,824 editors – over 1,008,717 categories

The information you will find under the “help” menu item will be very valuable as you learn to make use of the data files from this source.

December 7, 2011

YaCy Search Engine

Filed under: Search Engines,Webcrawler — Patrick Durusau @ 8:11 pm

YaCy – Decentralized Web Search

Has anyone seen this?

From the homepage:

YaCy is a free search engine that anyone can use to build a search portal for their intranet or to help search the public internet. When contributing to the world-wide peer network, the scale of YaCy is limited only by the number of users in the world and can index billions of web pages. It is fully decentralized, all users of the search engine network are equal, the network does not store user search requests and it is not possible for anyone to censor the content of the shared index. We want to achieve freedom of information through a free, distributed web search which is powered by the world’s users.

Limited demo interface: http://search.yacy.net/

Interesting idea.

It would be more interesting if it used a language that permitted dynamic updating of software while it is running. Otherwise, you are going to have the YaCy search engine you installed and nothing more.

Reportedly Google improves its search algorithm many times every quarter. How many of those changes are ad-driven they don’t say.

The documentation for YaCy is slim at best, particularly on technical details. For example, it uses a NoSQL database. OK, a custom one or one of the standard ones? I could go on but it would not produce any answers. As I explore the software I will post what I find out about it.

December 1, 2011

elasticsearch version 0.18.5

Filed under: ElasticSearch,Search Engines — Patrick Durusau @ 7:40 pm

elasticsearch version 0.18.5

From the blog entry:

You can download it here. It includes an upgraded Lucene version (3.5), featuring bug fixes and memory improvements, as well as more bug fixes in elasticsearch itself. Changes can be found here.

November 27, 2011

Siri’s Sibling Launches Intelligent Discovery Engine

Filed under: Agents,Artificial Intelligence,Search Engines,Searching — Patrick Durusau @ 8:56 pm

Siri’s Sibling Launches Intelligent Discovery Engine

Completely unintentional but I ran across this article that concerns Siri as well:

We’re all familiar with the standard search engines such as Google and Yahoo, but there is a new technology on the scene that does more than just search the web – it discovers it.

Trapit, which is a personalized discovery engine for the web that’s powered by the same artificial intelligence technology behind Apple’s Siri, launched its public beta last week. Just like Siri, Trapit is a product of the $200 million CALO Project (Cognitive Assistant that Learns and Organizes), which was the largest artificial intelligence project in U.S. history, according to Mashable. This million-dollar project was funded by DARPA (Defense Advanced Research Projects Agency), the Department of Defense’s research arm.

Trapit, which was first unveiled in June, is a system that personalizes content for its users based on keywords, URLs and reading habits. This service, which can identify related content based on contextual data from more than 50,000 sources, provides a simple, carefree way to discover news articles, images, videos and other content on specific topics.

So, I put in keywords and Trapit uses those to return content to me; if I then “trapit,” the system will continue to hunt for related content. Yawn. Stop me if you have heard this story before.

Keywords? That’s what we get from “…the largest artificial intelligence project in U.S. history?”

From Wikipedia on CALO:

Its five-year contract brought together 300+ researchers from 25 of the top university and commercial research institutions, with the goal of building a new generation of cognitive assistants that can reason, learn from experience, be told what to do, explain what they are doing, reflect on their experience, and respond robustly to surprise.

And we got keywords. Which Trapit uses to feed back similar content to us. I don’t need similar content, I need content that doesn’t use my keywords and yet is relevant to my query.

But rather than complain, why not build a topic map system based upon “…cognitive assistants that can reason, learn from experience, be told what to do, explain what they are doing, reflect on their experience, and respond robustly to surprise.” Err, that would be crowdsourcing topic map authoring, yes?

