Archive for the ‘Search Analytics’ Category

Google search poisoning – old dogs learn new tricks

Tuesday, July 7th, 2015

Google search poisoning – old dogs learn new tricks by Dmitry Samosseiko.

From the post:

These days, every company knows that having its website appear at the top of Google’s results for relevant keyword searches makes a big difference in traffic and helps the business. Numerous search engine optimization (SEO) techniques have existed for years and provided marketers with ways to climb up the PageRank ladder.

In a nutshell, to be popular with Google, your website has to provide content relevant to specific search keywords and also to be linked to by a high number of reputable and relevant sites. (These act as recommendations, and are rather confusingly known as “back links,” even though it’s not your site that is doing the linking.)

Google’s algorithms are much more complex than this simple description, but most of the optimization techniques still revolve around those two goals. Many of the optimization techniques that are being used are legitimate, ethical and approved by Google and other search providers. But there are also other, and at times more effective, tricks that rely on various forms of internet abuse, with attempts to fool Google’s algorithms through forgery, spam and even hacking.

One of the techniques used to mislead Google’s page indexer is known as cloaking. A few days ago, we identified what we believe is a new type of cloaking that appears to work very well in bypassing Google’s defense algorithms.

Dmitry reports that Google was notified of this new form of cloaking so it may be work for much longer.

I first read about this in Notes from SophosLabs: Poisoning Google search results and getting away with it by Paul Ducklin.

I’m not sure I would characterize this as “poisoning Google search.” Altering a Google search result to be sure but poisoning implies that standard Google search results represent some “standard” of search results. Google search results are the outcome of undisclosed algorithms run on undisclosed content, subject to undisclosed processing of the scores from processing content with algorithms, and output with more undisclosed processing of the results.

Just putting it into large containers, I see four large boxes of undisclosed algorithms and content, all of which impact the results presented as Google Search results. Are Google Search results the standard output from four or more undisclosed processing steps of unknown complexity?

That doesn’t sound like much of a standard to me.

You?

Maybe Friday (17th April) or Monday (20th April) DARPA – Dark Net

Wednesday, April 15th, 2015

Memex In Action: Watch DARPA Artificial Intelligence Search For Crime On The ‘Dark Web’ by Thomas Fox-Brewster.

Is DARPA’s Memex search engine a Google-killer? by Mark Stockleyhttps

A couple of “while you wait” pieces to read while you expect part of the DARPA Memex project to appear on its Open Catalog page, either this coming Friday (17th of April) or Monday (20th of April).

Fox-Brewster has video of a part of the system that:

It is trying to overcome one of the main barriers to modern search: crawlers can’t click or scroll like humans do and so often don’t collect “dynamic” content that appears upon an action by a user.

If you think searching is difficult now, with an estimated 5% of the web being indexed, just imagine bumping that up 10X or more.

Entirely manual indexing is already impossible and you have experienced the short comings of page ranking.

Perhaps the components of Memex will enable us to step towards a fusion of human and computer capabilities to create curated information resources.

Imagine an electronic The Art of Computer Programming that has several human experts per chapter who are assisted by deep searching and updating references and the text on an ongoing basis? So readers don’t have to weed through all the re-inventions of particular algorithms across numerous computer and math journals.

Or perhaps a more automated search of news reports so the earliest/most complete report is returned with the notation: “There are NNNNNN other, later and less complete versions of this story.” It isn’t that every major paper adds value, more often just content.

BTW, the focus on the capabilities of the search engine, as opposed to the analysis of those results most welcome.

See my post on its post-search capabilities: DARPA Is Developing a Search Engine for the Dark Web.

Looking forward to Friday or Monday!

Introducing Splainer…

Monday, August 25th, 2014

Introducing Splainer — The Open Source Search Sandbox That Tells You Why by Doug Turnbull.

Splainer is a step towards addressing two problems:

From the post:

  • Collaboration: At OpenSource Connections, we believe that collaboration with non-techies is the secret ingredient of search relevancy. We need to arm business analysts and content experts with a human readable version of the explain information so they can inform the search tuning process.
  • Usability: I want to paste a Solr URL, full of query paramaters and all, and go! Then, once I see more helpful explain information, I want to tweak (and tweak and tweak) until I get the search results I want. Much like some of my favorite regex tools. Get out of the way and let me tune!
  • ….

    We hope you’ll give it a spin and let us know how it can be improved. We welcome your bugs, feedback, and pull requests. And if you want to try the Splainer experience over multiple queries, with diffing, results grading, a develoment history, and more — give Quepid a spin for free!

Improving the information content of the tokens you are searching is another way to improve search results.

What “viable search engine competition” really looks like

Sunday, January 5th, 2014

What “viable search engine competition” really looks like by Alex Clemmer.

From the post:

Hacker News is up in arms again today about the RapGenius fiasco. See RapGenius statement and HN comments. One response article argues that we need more “viable search engine competition” and the HN community largely seems to agree.

In much of the discussion, there is a picaresque notion that the “search engine problem” is really just a product problem, and that if we try really hard to think of good features, we can defeat the giant.

I work at Microsoft. Competing with Google is hard work. I’m going to point out some of the lessons I’ve learned along the way, to help all you spry young entrepreneurs who might want to enter the market.

Alex has six (6) lessons for would-be Google killers:

Lesson 1: The problem is not only hiring smart people, but hiring enough smart people.

Lesson 2: competing on market share is possible; relevance is much harder

Lesson 3: social may pose an existential threat to Google’s style of search

Lesson 4: large companies have access to technology that is often categorically better than OSS state of the art

Lesson 5: large companies are necessarily limited by their previous investments

Lesson 6: large companies have much more data than you, and their approach to search is sophisticated

See Alex’s post for the details under each lesson.

What has always puzzled me is why compete on general search? General search services are “free” save for the cost of a users time to mine the results. It is hard to think of a good economic model to compete with “free.” Yes?

If we are talking about medical, legal, technical, engineering search, where services are sold to professionals and the cost is passed onto consumers, that could be a different story. Even there, costs have to be offset by a reasonable expectation of profit against established players in each of those markets.

One strategy would be to supplement or enhance existing search services and pitch that to existing market holders. Another strategy would be to propose highly specialized searching of unique data archives.

Do you think Alex is right in saying “…most traditional search problems have really been investigated thoroughly”?

I don’t because of the general decline in information retrieval from the 1950’s-1960’s to date.

If you doubt my observation, pick a Readers’ Guide to Periodical Literature (hard copy) for 1968 and choose some subject at random. Repeat that exercise with the search engine of your choice, limiting your results to 1968.

Which one gave you more relevant references for 1968, including synonyms? Say in the first 100 entries.

I first saw this in a tweet by Stefano Bertolo.

PS: I concede that the analog book does not have digital hyperlinks to take you to resources but it does have analog links for the same purpose. And it doesn’t have product ads. 😉

Prostitutes Appeal to Pope: Text Analytics applied to Search

Sunday, April 29th, 2012

Prostitutes Appeal to Pope: Text Analytics applied to Search by Tony Russell-Rose.

It is hard for me to visit Tony’s site and not come away with several posts he has written that I want to mention. Today was no different.

Here is a sampling of what Tony talks about in this post:

Consider the following newspaper headlines, all of which appeared unambiguous to the original writer:

  • DRUNK GETS NINE YEARS IN VIOLIN CASE
  • PROSTITUTES APPEAL TO POPE
  • STOLEN PAINTING FOUND BY TREE
  • RED TAPE HOLDS UP NEW BRIDGE
  • DEER KILL 300,000
  • RESIDENTS CAN DROP OFF TREES
  • INCLUDE CHILDREN WHEN BAKING COOKIES
  • MINERS REFUSE TO WORK AFTER DEATH

Although humorous, they illustrate much of the ambiguity in natural language, and just how much pragmatic and linguistic knowledge must be employed by NLP tools to function accurately.

A very informative and highly amusing post.

What better way to start the week?

Enjoy!

Relevance Tuning and Competitive Advantage via Search Analytics

Sunday, January 8th, 2012

Relevance Tuning and Competitive Advantage via Search Analytics

It must be all the “critical” evaluation of infographics I have been reading but I found myself wondering about the following paragraph:

This slide shows how Search Analytics can be used to help with A/B testing. Concretely, in this slide we see two Solr Dismax handlers selected on the right side. If you are not familiar with Solr, think of a Dismax handler as an API that search applications call to execute searches. In this example, each Dismax handler is configured differently and thus each of them ranks search hits slightly differently. On the graph we see the MRR (see Wikipedia page for Mean Reciprocal Rank details) for both Dismax handlers and we can see that the one corresponding to the blue line is performing much better. That is, users are clicking on search hits closer to the top of the search results page, which is one of several signals of this Dismax handler providing better relevance ranking than the other one. Once you have a system like this in place you can add more Dismax handlers and compare 2 or more of them at a time. As the result, with the help of Search Analytics you get actual, real feedback about any changes you make to your search engine. Without a tool like this, you cannot really tune your search engine’s relevance well and will be doing it blindly.

Particularly the line:

That is, users are clicking on search hits closer to the top of the search results page, which is one of several signals of this Dismax handler providing better relevance ranking than the other one.

Really?

Here is one way to test that assumption:

Report for any search as the #1 or #2 result, “private cell-phone number for …” and pick one of the top ten movie actresses for 2011. And you can do better than that, make sure the cell-phone number is one that rings at your search analytics desk. Now see how many users are “…clicking on search hits closer to the top of the search results page….”

Are your results more relevant than a movie star?

Don’t get me wrong, search analytics are very important, but let’s not get carried away about what we can infer from largely opaque actions.

Some other questions: Did users find the information they needed? Can they make use of that information? Does that use improve some measurable or important aspect of the company business? Let’s broaden search analytics to make search results less opaque.

Search Analytics

Sunday, November 6th, 2011

Search Analytics

From the post:

Here is another take on Search Analytics, this one being presented at Enterprise Search Summit Fall 2011 in Washington DC, to an audience coming mainly from the US government agencies, very large enterprises, and large international companies with 10s of thousands of employees world wide. The audience was good and posed a number of good questions after the talk. The full slide deck is below as well as in Sematext@Slideshare.

I like the:

If you can’t measure it, you can’t fix it! [emphasis in original, I did fix the punctuation to move the comma from “measure, it” to “measure it,”.]

line. Although I would have liked it better when I was an undergraduate student taking empirical methodology in political science. A number of years later I still agree that measurement is important but am less militant that measurement is always possible or even useful.

Still, a very good slide deck and a good way to start off the week!

Search Analytics for Your Site

Thursday, October 20th, 2011

Search Analytics for Your Site

From the website:

Any organization that has a searchable web site or intranet is sitting on top of hugely valuable and usually under-exploited data: logs that capture what users are searching for, how often each query was searched, and how many results each query retrieved. Search queries are gold: they are real data that show us exactly what users are searching for in their own words. This book shows you how to use search analytics to carry on a conversation with your customers: listen to and understand their needs, and improve your content, navigation and search performance to meet those needs.

I haven’t read this book so don’t take this post as an endorsement or “buy” recommendation.

While watching the slide deck, it occurred to me that if search analytics could improve your website, why not use search analytics to develop the design and content of a topic map?

The design aspect in the sense that the most prominent, easiest to use/find content is what is popular with users. Could even be by time of the day if you have a topic map that is accessible 24 x 7.

The content aspect in the sense of what is included, what we say about it and perhaps how it is findable is based on search analysis.

If you were developing a topic map about Sarah Palin, perhaps searching for “dude” should return her husband as a topic. I can think of other nicknames but this isn’t a political blog.

Comments on this book or suggestions of other search analytics resources appreciated.