Archive for the ‘Search Data’ Category

Wouldn’t it be fun to build your own Google?

Thursday, December 11th, 2014

Wouldn’t it be fun to build your own Google? by Martin Kleppmann.

Martin writes:

Imagine you had your own copy of the entire web, and you could do with it whatever you want. (Yes, it would be very expensive, but we’ll get to that later.) You could do automated analyses and surface the results to users. For example, you could collate the “best” articles (by some definition) written on many different subjects, no matter where on the web they are published. You could then create a tool which, whenever a user is reading something about one of those subjects, suggests further reading: perhaps deeper background information, or a contrasting viewpoint, or an argument on why the thing you’re reading is full of shit.

Unfortunately, at the moment, only Google and a small number of other companies that have crawled the web have the resources to perform such analyses and build such products. Much as I believe Google try their best to be neutral, a pluralistic society requires a diversity of voices, not a filter bubble controlled by one organization. Surely there are people outside of Google who want to work on this kind of thing. Many a start-up could be founded on the basis of doing useful things with data extracted from a web crawl.

He goes on to discuss current search efforts such as Common Crawl and Wayfinder before hitting full stride with his suggestion for a distributed web search engine. Painting in the broadest of strokes, Martin makes it sound almost plausible to contemplate such an effort.

While conceding that the technological issues would be many, Martin contends that the payoff would be immense, though in ways we won’t know until such an engine is available. I suspect Martin is right, but if so, we should be able to see a similar impact from Common Crawl. Yes?

Not to rain on a parade I would like to join, but extracting value from a web crawl like Common Crawl is not a guaranteed thing. A more complete crawl of the web only multiplies the problems of extraction; it doesn’t make them easier to solve.

On the whole I think a distributed crawl of the web is a great idea, but while that develops, we had best hone our skills at extracting value from the partial crawls that already exist.

Undated Search Results

Tuesday, December 24th, 2013

Looking for HTML5 resources to mention in Design, Math, and Data was complicated by the lack of dating in search results.

Searching on “html5 interfaces examples,” the highest ranked result was:

HTML5 Website Showcase: 48 Potential Flash-Killing Demos (2009, est.)

That’s right, a four-year-old post.

Never mind the changes in CSS, jQuery, etc. over the last four years.

Several pages into the first search results I found:

40+ Useful HTML5 Examples and Tutorials (2012)

It was in a mixture of undated or variously dated resources.

Finally, after following an old post and then searching that site, I uncovered:

21 Fresh Examples of Websites Using HTML5 (2013)

Even there it wasn’t the highest ranked page at the site.

I realize that parsing dates for sites could be difficult but surely search engines know the date when they first encountered a page? That would make it trivial to order search results by time.

Pages would not fall into a strict chronological sequence, but the ordering would be better than the current hodgepodge of dates in the results.
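As a rough sketch of how little is needed, assume the search engine keeps a first-seen date for every page. The record layout, the `first_seen` field and the `newest_first` helper are all hypothetical, and since only the year of each post above is known, January 1st is used as a placeholder date:

```python
from datetime import date

# Hypothetical result records, listed in the order the engine ranked them.
# first_seen stands for the date the crawler first encountered the page;
# only the years come from the posts above, the month and day are placeholders.
results = [
    {"rank": 1, "title": "HTML5 Website Showcase: 48 Potential Flash-Killing Demos",
     "first_seen": date(2009, 1, 1)},
    {"rank": 2, "title": "40+ Useful HTML5 Examples and Tutorials",
     "first_seen": date(2012, 1, 1)},
    {"rank": 3, "title": "21 Fresh Examples of Websites Using HTML5",
     "first_seen": date(2013, 1, 1)},
]


def newest_first(hits):
    """Order hits by first-seen date, newest first; ties keep their input order."""
    return sorted(hits, key=lambda hit: hit["first_seen"], reverse=True)


for hit in newest_first(results):
    print(hit["first_seen"].year, hit["title"])
```

A first-seen date is only a stand-in for a publication date, which is why the ordering would be approximate rather than strictly chronological.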

Google Transparency Report

Saturday, December 21st, 2013

Google Transparency Report

The Google Transparency Report consists of five parts:

  1. Government requests to remove content

    A list of the number of requests we receive from governments to review or remove content from Google products.

  2. Requests for information about our users

    A list of the number of requests we received from governments to hand over user data and account information.

  3. Requests by copyright owners to remove search results

    Detailed information on requests by copyright owners or their representatives to remove web pages from Google search results.

  4. Google product traffic

    The real-time availability of Google products around the world, historic traffic patterns since 2008, and a historic archive of disruptions to Google products.

  5. Safe Browsing

    Statistics on how many malware and phishing websites we detect per week, how many users we warn, and which networks around the world host malware sites.

I pointed out the visualizations of the copyright holder data earlier today.

There are a number of visualizations of the Google Transparency Report and I may assemble some of the more interesting ones for your viewing pleasure.

You certainly should download the data sets and/or view them as Google Docs Spreadsheets.

I say that because while Google is more “transparent” than the current White House, it’s not all that transparent.

Take the government takedown requests for example.

According to the raw data file, the United States has made five (5) requests on the basis of national security, four (4) of which were for YouTube videos and one (1) was for one web search result.


And for no government request is there sufficient information to identify the information that any government sought to conceal.

Google may have qualms about information governments want to conceal but that sounds like a marketing opportunity to me. (Being mindful of your availability to such governments.)

A Look Inside Our 210TB 2012 Web Crawl

Wednesday, August 21st, 2013

A Look Inside Our 210TB 2012 Web Crawl by Lisa Green.

From the post:

Want to know more detail about what data is in the 2012 Common Crawl corpus without running a job? Now you can thanks to Sebastian Spiegler!

Sebastian is a highly talented data scientist who works at the London based startup SwiftKey and volunteers at Common Crawl. He did an exploratory analysis of the 2012 Common Crawl data and produced an excellent summary paper on exactly what kind of data it contains: Statistics of the Common Crawl Corpus 2012.

From the conclusion section of the paper:

The 2012 Common Crawl corpus is an excellent opportunity for individuals or businesses to cost-effectively access a large portion of the internet: 210 terabytes of raw data corresponding to 3.83 billion documents or 41.4 million distinct second-level domains. Twelve of the top-level domains have a representation of above 1% whereas documents from .com account to more than 55% of the corpus. The corpus contains a large amount of sites from […], blog publishing services like […] and […] as well as online shopping sites such as […]. These sites are good sources for comments and reviews. Almost half of all web documents are utf-8 encoded whereas the encoding of the 43% is unknown. The corpus contains 92% HTML documents and 2.4% PDF files. The remainder are images, XML or code like JavaScript and cascading style sheets.

View or download a pdf of Sebastian’s paper here. If you want to dive deeper you can find the non-aggregated data at s3://aws-publicdatasets/common-crawl/index2012 and the code on GitHub.

Don’t have your own server farm crawling the internet?

Take a long look at CommonCrawl and their publicly accessible crawl data.

If the enterprise search bar is at 9%, the Internet search bar is even lower.

Use CommonCrawl data as a practice field.

Does your first ten “hits” include old data because it is popular?

Ultimate library challenge: taming the internet

Saturday, April 6th, 2013

Ultimate library challenge: taming the internet by Jill Lawless.

From the post:

Capturing the unruly, ever-changing internet is like trying to pin down a raging river. But the British Library is going to try.

For centuries, the library has kept a copy of every book, pamphlet, magazine and newspaper published in Britain. Starting on Saturday, it will also be bound to record every British website, e-book, online newsletter and blog in a bid to preserve the nation’s “digital memory”.

As if that’s not a big enough task, the library also has to make this digital archive available to future researchers – come time, tide or technological change.

The library says the work is urgent. Ever since people began switching from paper and ink to computers and mobile phones, material that would fascinate future historians has been disappearing into a digital black hole. The library says firsthand accounts of everything from the 2005 London transit bombings to Britain’s 2010 election campaign have already vanished.

“Stuff out there on the web is ephemeral,” said Lucie Burgess, the library’s head of content strategy. “The average life of a web page is only 75 days, because websites change, the contents get taken down.

“If we don’t capture this material, a critical piece of the jigsaw puzzle of our understanding of the 21st century will be lost.”

For more details, see Jill’s post, Click to save the nation’s digital memory (British Library press release), or 100 websites: Capturing the digital universe (sample of results of archiving with only 100 sites).

The content gathered by the project will be made available to the public.

A welcome venture, particularly since the results will be made available to the public.

An unanswerable question but I do wonder how we would view Greek drama if all of it had been preserved?

Hundreds if not thousands of plays were written and performed every year.

The Complete Greek Drama lists only forty-seven (47) that have survived to this day.

If wholesale preservation is the first step, how do we preserve paths to what’s worth reading in a data labyrinth as a second step?

I first saw this in a tweet by Jason Ronallo.

URL Search Tool!

Wednesday, March 6th, 2013

URL Search Tool! by Lisa Green.

From the post:

A couple months ago we announced the creation of the Common Crawl URL Index and followed it up with a guest post by Jason Ronallo describing how he had used the URL Index. Today we are happy to announce a tool that makes it even easier for you to take advantage of the URL Index!

URL Search is a web application that allows you to search for any URL, URL prefix, subdomain or top-level domain. The results of your search show the number of files in the Common Crawl corpus that came from that URL and provide a downloadable JSON metadata file with the address and offset of the data for each URL. Once you download the JSON file, you can drop it into your code so that you only run your job against the subset of the corpus you specified. URL Search makes it much easier to find the files you are interested in and significantly reduces the time and money it takes to run your jobs since you can now run them across only the files of interest instead of the entire corpus.

Imagine that.

Searching relevant data instead of “big data.”

What a concept!
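To make the idea concrete, here is a minimal sketch of turning such a metadata file into a plan for reading only the relevant records. I have not reproduced the tool’s real JSON schema: the key names (`url`, `file`, `offset`), the sample entries and the `offsets_by_file` helper are all assumptions for illustration.

```python
import json
from collections import defaultdict

# Hypothetical metadata in the spirit of the URL Search download: one entry
# per URL with the address of the containing file and a byte offset.
# The key names and values are illustrative, not the tool's real schema.
metadata_json = """
[
  {"url": "http://example.com/a", "file": "segment-1/part-0001", "offset": 1024},
  {"url": "http://example.com/b", "file": "segment-1/part-0001", "offset": 4096},
  {"url": "http://example.org/",  "file": "segment-2/part-0007", "offset": 0}
]
"""


def offsets_by_file(entries, url_prefix):
    """Map each file to the sorted offsets of records whose URL has the prefix."""
    wanted = defaultdict(list)
    for entry in entries:
        if entry["url"].startswith(url_prefix):
            wanted[entry["file"]].append(entry["offset"])
    return {name: sorted(offsets) for name, offsets in wanted.items()}


entries = json.loads(metadata_json)
plan = offsets_by_file(entries, "http://example.com/")
print(plan)
```

A job driven by `plan` touches one file and two offsets instead of scanning everything, which is the whole point of searching relevant data.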

Is Google Hijacking Semantic Markup/Structured Data? [FALSE]

Saturday, January 26th, 2013

Is Google Hijacking Semantic Markup/Structured Data? by Barbara Starr.

From the post:

On December 12, 2012, Google rolled out a new tool, called the Google Data Highlighter for event data. Upon a cursory read, it seems to be a tagging tool, where a human trains the Data Highlighter using a few pages on their website, until Google can pick up enough of a pattern to do the remainder of the site itself.

Better yet, you can see all of these results in the structured data dashboard. It appears as if event data is marked up and is compatible with […]. However, there is a caveat here that some folks may not notice.

No actual markup is placed on the page, meaning that none of the semantic markup using this Data Highlighter tool is consumable by Bing, Yahoo or any other crawler on the Web; only Google can use it!

Google is essentially hi-jacking semantic markup so only Google can take advantage of it. Google has the global touch and the ability to execute well-thought-out and brilliantly strategic plans.

Let’s do this by the numbers:

  1. Google develops a service for webmasters to add semantic annotations to their webpages.
  2. Google allows webmasters to use that service at no charge.
  3. Google uses those annotations to improve the search results it provides users (for free).

Google used its own resources to develop a valuable service for webmasters that enhances their websites and user experience with Google, for free.

Perhaps there is a new definition of hijacking?

Webster says the traditional definition includes “to steal or rob as if by hijacking.”

The Semantic Web:


(a) Failing to whitewash the Semantic Web’s picket fence while providing free services to webmasters and users to enhance searching of web content.

(b) Failing to give away data from free services to webmasters and users to those who did not plant, reap, spin, weave or sew.

I don’t find the Semantic Web’s definition of “hijacking” persuasive.


I first saw this at: Google’s Structured Data Take Over by Angela Guess.

Bitly Social Data APIs

Wednesday, January 9th, 2013

Bitly Social Data APIs by Hilary Mason.

From the post:

We just released a bunch of social data analysis APIs over at bitly. I’m really excited about this, as it’s offering developers the power to use social data in a way that hasn’t been available before. There are three types of endpoints and each one is awesome for a different reason.

First, we share the analysis that we do at the link level….

Second, we’ve opened up access to a realtime search engine. …

Finally, we asked the question — what is the world paying attention to right now?…”bursting phrases”…

See Hilary’s post for the details, or even better, take a shot at the APIs!

I first saw this in a tweet by Dave Fauth.

QRU-1: A Public Dataset…

Saturday, September 8th, 2012

QRU-1: A Public Dataset for Promoting Query Representation and Understanding Research by Hang Li, Gu Xu, W. Bruce Croft, Michael Bendersky, Ziqi Wang and Evelyne Viegas.


A new public dataset for promoting query representation and understanding research, referred to as QRU-1, was recently released by Microsoft Research. The QRU-1 dataset contains reformulations of Web TREC topics that are automatically generated using a large-scale proprietary web search log, without compromising user privacy. In this paper, we describe the content of this dataset and the process of its creation. We also discuss the potential uses of the dataset, including a detailed description of a query reformulation experiment.

And the data set:

Query Representation and Understanding Set

The Query Representation and Understanding (QRU) data set contains a set of similar queries that can be used in web research such as query transformation and relevance ranking. QRU contains similar queries that are related to existing benchmark data sets, such as TREC query sets. The QRU data set was created by extracting 100 TREC queries, training a query-generation model and a commercial search engine, generating similar queries from TREC queries with the model, and removal of mistakenly generated queries.

Are query reformulations in essence different identifications of the subject of a search?

But the issue isn’t “more” search results but rather higher quality search results.

Why search engines bother (other than bragging rights) to report “hits” beyond the ones displayed isn’t clear. Just have a “next N hits” button.

You could consider the number of “hits” you don’t look at as a measure of your search engine’s quality. The higher the number… well, you know. Could be gold in those “hits” but you will never know. And your current search engine will never say.

…10 billion lines of code…

Friday, July 20th, 2012

Also known as (aka):

Black Duck’s Ohloh lets data from nearly 500,000 open source projects into the wild by Chris Mayer.

From the post:

In a bumper announcement, Black Duck Software have embraced the FOSS mantra by revealing their equivalent of a repository Yellow Pages, through the Ohloh Open Data Initiative.

The website tracks 488,823 projects, allowing users to compare data from a vast amount of repositories and forges. But now, Ohloh’s huge dataset has been licensed under the Creative Commons Attribution 3.0 Unported license, encouraging further transparency across the companies who have already bought into Ohloh’s aggregation mission directive.

“Licensing Ohloh data under Creative Commons offers both enterprises and the open source community a new level of access to FOSS data, allowing trending, tracking, and insight for the open source community,” said Tim Yeaton, President and CEO of Black Duck Software.

He added: “We are constantly looking for ways to help the open source developer community and enterprise consumers of open source. We’re proud to freely license Ohloh data under this respected license, and believe that making this resource more accessible will allow contributors and consumers of open source gain unique insight, leading to more rapid development and adoption.”

What sort of insight would you expect to gain from “…10 billion lines of code…?”

How would you capture it? Pass it on to others in your project?

Mix or match semantics with other lines of code? Perhaps your own?

Virtual Documents: “Search the Impossible Search”

Friday, June 22nd, 2012

Virtual Documents: “Search the Impossible Search”

From the post:

The solution was to build an indexing pipeline specifically to address this user requirement, by creating “virtual documents” about each member of staff. In this case, we used the Aspire content processing framework as it provided a lot more flexibility than the indexing pipeline of the incumbent search engine, and many of the components that were needed already existed in Aspire’s component library.

[graphic omitted]

Merging was done selectively. For example, documents were identified that had been authored by the staff member concerned and from those documents, certain entities were extracted including customer names, dates and specific industry jargon. The information captured was kept in fields, and so could be searched in isolation if necessary.

The result was a new class of documents, which existed only in the search engine index, containing extended information about each member of staff; from basic data such as their billing rate, location, current availability and professional qualifications, through to a range of important concepts and keywords which described their previous work, and customer and industry sector knowledge.

Another tool to put in your belt but I wonder if there is a deeper lesson to be learned here?

Creating a “virtual” document, unlike any that existed in the target collection, and indexing those “virtual” documents was a clever solution.

But it retains the notion of a “container” or “document” that is examined in isolation from all other “documents.”

Is that necessary? What are we missing if we retain it?

I don’t have any answers to those questions but will be thinking about them.
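For anyone who wants to see the mechanics, the merging step described in the post can be sketched roughly as follows. This is not the Aspire pipeline: the sample documents, the staff records, the field names and the `build_virtual_documents` helper are all hypothetical, and entity extraction is assumed to have happened already.

```python
from collections import defaultdict

# Hypothetical source documents with entities already extracted; in the post
# this work was done by the Aspire content processing framework.
documents = [
    {"author": "alice", "customers": ["Acme"], "keywords": ["audit", "tax"]},
    {"author": "alice", "customers": ["Globex"], "keywords": ["tax"]},
    {"author": "bob", "customers": ["Acme"], "keywords": ["litigation"]},
]

# Hypothetical basic data about each member of staff.
staff = {
    "alice": {"location": "London", "billing_rate": 200},
    "bob": {"location": "Leeds", "billing_rate": 150},
}


def build_virtual_documents(staff, documents):
    """Create one fielded record per staff member, merged from authored documents."""
    merged = defaultdict(lambda: {"customers": set(), "keywords": set()})
    for doc in documents:
        merged[doc["author"]]["customers"].update(doc["customers"])
        merged[doc["author"]]["keywords"].update(doc["keywords"])
    virtual = {}
    for name, basics in staff.items():
        fields = dict(basics)  # copy so the staff records are not modified
        fields["customers"] = sorted(merged[name]["customers"])
        fields["keywords"] = sorted(merged[name]["keywords"])
        virtual[name] = fields
    return virtual


virtual_docs = build_virtual_documents(staff, documents)
# Fielded search over the virtual documents: who has worked for Acme?
print([name for name, doc in virtual_docs.items() if "Acme" in doc["customers"]])
```

Note that each record is still a “container” searched on its own, which is exactly the notion questioned above.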


Precise data extraction with Apache Nutch

Sunday, April 1st, 2012

Precise data extraction with Apache Nutch by Emir Dizdarevic.

From the post:

Nutch’s HtmlParser parses the whole page and returns parsed text, outlinks and additional meta data. Some parts of this are really useful like the outlinks but that’s basically it. The problem is that the parsed text is too general for the purpose of precise data extraction. Fortunately the HtmlParser provides us a mechanism (extension point) to attach an HtmlParserFilter to it.

We developed a plugin, which consists of HtmlParserFilter and IndexingFilter extensions, which provides a mechanism to fetch and index the desired data from a web page through use of XPath 1.0. The name of the plugin is filter-xpath plugin.

Using this plugin we are now able to extract the desired data from web site with known structure. Unfortunately the plugin is an extension of the HtmlParserFilter extension point which is hardly coupled to the HtmlParser, hence plugin won’t work without the HtmlParser. The HtmlParser generates its own metadata (host, site, url, content, title, cache and tstamp) which will be indexed too. One way to control this is by not including IndexFilter plugins which depend on the metadata data to generate the indexing data (NutchDocument). The other way is to change the SOLR index mappings in the solrindex-mapping.xml file (maps NutchDocument fields to SolrInputDocument field). That way we will index only the fields we want.

The next problem arises when it comes to indexing. We want that Nutch fetches every page on the site but we don’t want to index them all. If we use the UrlRegexFilter to control this we will lose the indirect links which we also want to index and add to our URL DB. To address this problem we developed another plugin which is an extension of the IndexingFilter extension point which is called index-omit plugin. Using this plugin we are able to omit indexing on the pages we don’t need.

Great post on precision and data extraction.

And a lesson that indexing more isn’t the same thing as indexing smarter.
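The two ideas in the post, XPath-driven field extraction and omitting pages from the index while still following their links, can be sketched outside of Nutch. What follows is a Python stand-in, not the filter-xpath or index-omit plugins: the sample page, the field mapping, the omit rule and both helper functions are hypothetical, and Python’s ElementTree supports only a subset of XPath and requires well-formed markup.

```python
import re
import xml.etree.ElementTree as ET

# A small, well-formed page standing in for a site "with known structure".
page = """
<html><body>
  <h1 class="title">Blue Widget</h1>
  <span class="price">9.99</span>
  <a href="/widgets/red">Red Widget</a>
</body></html>
"""

# Hypothetical mapping from index field to XPath expression.
FIELD_XPATHS = {
    "title": ".//h1[@class='title']",
    "price": ".//span[@class='price']",
}

# Hypothetical rule: category pages are fetched for their links, not indexed.
OMIT_FROM_INDEX = re.compile(r"/category/")


def extract_fields(html):
    """Return (fields, outlinks): the XPath matches plus every link on the page."""
    root = ET.fromstring(html)
    fields = {}
    for name, xpath in FIELD_XPATHS.items():
        node = root.find(xpath)
        if node is not None and node.text:
            fields[name] = node.text.strip()
    outlinks = [a.get("href") for a in root.iter("a") if a.get("href")]
    return fields, outlinks


def should_index(url):
    """Decide whether a fetched page is indexed; its links are followed either way."""
    return OMIT_FROM_INDEX.search(url) is None


fields, outlinks = extract_fields(page)
print(fields, outlinks)
print(should_index("http://example.com/category/widgets"))
```

Keeping the outlinks separate from the indexing decision is what preserves the indirect links that a URL filter alone would throw away.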

A Twelve Step Program for Searching the Internet

Sunday, March 25th, 2012

OK, the real title is: Twelve steps to running your Ruby code across five billion web pages

From the post:

Common Crawl is one of those projects where I rant and rave about how world-changing it will be, and often all I get in response is a quizzical look. It's an actively-updated and programmatically-accessible archive of public web pages, with over five billion crawled so far. So what, you say? This is going to be the foundation of a whole family of applications that have never been possible outside of the largest corporations. It's mega-scale web-crawling for the masses, and will enable startups and hackers to innovate around ideas like a dictionary built from the web, reverse-engineering postal codes, or any other application that can benefit from huge amounts of real-world content.

Rather than grabbing each of you by the lapels individually and ranting, I thought it would be more productive to give you a simple example of how you can run your own code across the archived pages. It's currently released as an Amazon Public Data Set, which means you don't pay for access from Amazon servers, so I'll show you how on their Elastic MapReduce service.

I'm grateful to Ben Nagy for the original Ruby code I'm basing this on. I've made minimal changes to his original code, and built a step-by-step guide describing exactly how to run it. If you're interested in the Java equivalent, I recommend this alternative five-minute guide.

A call to action and an awesome post!

If you have ever forwarded a blog post, forward this one.

This would make a great short course topic. Will have to give that some thought.

Workshop on Entity-Oriented Search (EOS) – Beijing – Proceedings

Monday, January 16th, 2012

Workshop on Entity-Oriented Search (EOS) – Beijing – Proceedings (PDF file)

There you will find:

Session 1

  • High Performance Clustering for Web Person Name Disambiguation Using Topic Capturing by Zhengzhong Liu, Qin Lu, and Jian Xu (The Hong Kong Polytechnic University)
  • Extracting Dish Names from Chinese Blog Reviews Using Suffix Arrays and a Multi-Modal CRF Model by Richard Tzong-Han Tsai (Yuan Ze University, Taiwan)
  • LADS: Rapid Development of a Learning-To-Rank Based Related Entity Finding System using Open Advancement by Bo Lin, Kevin Dela Rosa, Rushin Shah, and Nitin Agarwal (Carnegie Mellon University)
  • Finding Support Documents with a Logistic Regression Approach by Qi Li and Daqing He (University of Pittsburgh)
  • The Sindice-2011 Dataset for Entity-Oriented Search in the Web of Data by Stephane Campinas (National University of Ireland), Diego Ceccarelli (University of Pisa), Thomas E. Perry (National University of Ireland), Renaud Delbru (National University of Ireland), Krisztian Balog (Norwegian University of Science and Technology) and Giovanni Tummarello (National University of Ireland)

Session 2

  • Cross-Domain Bootstrapping for Named Entity Recognition by Ang Sun and Ralph Grishman (New York University)
  • Semi-supervised Statistical Inference for Business Entities Extraction and Business Relations Discovery by Raymond Y.K. Lau and Wenping Zhang (City University of Hong Kong)
  • Unsupervised Related Entity Finding by Olga Vechtomova (University of Waterloo)

Session 3

  • Learning to Rank Homepages For Researcher-Name Queries by Sujatha Das, Prasenjit Mitra, and C. Lee Giles (The Pennsylvania State University)
  • An Evaluation Framework for Aggregated Temporal Information Extraction by Enrique Amigó (UNED University), Javier Artiles (City University of New York), Heng Ji (City University of New York) and Qi Li (City University of New York)
  • Entity Search Evaluation over Structured Web Data by Roi Blanco (Yahoo! Research), Harry Halpin (University of Edinburgh), Daniel M. Herzig (Karlsruhe Institute of Technology), Peter Mika (Yahoo! Research), Jeffrey Pound (University of Waterloo), Henry S. Thompson (University of Edinburgh) and Thanh Tran Duc (Karlsruhe Institute of Technology)

A good start on what promises to be a strong conference series on entity-oriented search.

dmoz – open directory project

Friday, December 9th, 2011

dmoz – open directory project

This came up in the discussion of the Nutch Tutorial and I thought it might be helpful to have an entry on the site.

It is a collection of hand-edited resources which as of today claims:

4,952,266 sites – 92,824 editors – over 1,008,717 categories

The information you will find under the “help” menu item will be very valuable as you learn to make use of the data files from this source.

Common Crawl

Wednesday, November 30th, 2011

Common Crawl

From the webpage:

Common Crawl is a non-profit foundation dedicated to building and maintaining an open crawl of the web, thereby enabling a new wave of innovation, education and research.

As the largest and most diverse collection of information in human history, the web grants us tremendous insight if we can only understand it better. For example, web crawl data can be used to spot trends and identify patterns in politics, economics, health, popular culture and many other aspects of life. It provides an immensely rich corpus for scientific research, technological advancement, and innovative new businesses. It is crucial for our information-based society that the web be openly accessible to anyone who desires to utilize it.

We strive to be transparent in all of our operations and we support nofollow and robots.txt. For more information about the ccBot, please see FAQ. For more information on Common Crawl data and how to access it, please see Data. For access to our open source code, please see our GitHub repository.

Current crawl is reported to be 5 billion pages. That should keep your hard drives spinning enough to help with heating in cold climes!

Looks like a nice place to learn a good bit about searching as well as processing serious sized data.

Enhancing search results using machine learning

Sunday, August 28th, 2011

Enhancing search results using machine learning by Emmanuel Espina

From the introduction:

To introduce you in the topic let’s think about how the users are used to work with “information retrieval platforms” (I mean, search engines). The user enters your site, sees a little rectangular box with a button that reads “search” besides it, and figures out that he must think about some keywords to describe what he wants, write them in the search box and hit search. Despite we are all very used to this, a deeper analysis of the workings of this procedure leads to the conclusion that it is a quite unintuitive procedure. Before search engines, the action of “mentally extracting keywords” from concepts was not a so common activity.

It is something natural to categorize things, to classify the ideas or concepts, but extracting keywords is a different intellectual activity. While searching, the user must think like the search engine! The user must think “well, this machine will give me documents with the words I am going to enter, so which are the words that have the best chance to give me what I want” (emphasis added)

Hmmmm, but prior to full-text search, users learned how to think like the indexers who created the index they were using. Indexers were a first line of defense against unbounded information as indexes covered particular resources and had mechanisms to account for changing terminology. Not to mention domain specific vocabularies that users could master.

A second line of defense was librarians, who not only mastered domain specific indexes but could also move from one specialized finding aid to another, collating information as they went. The ability to transition from one finding aid to another has yet to be duplicated by automatic means, in part because it depends on the resources available in a particular library.

Do read the article to see how the author proposes to use machine learning to improve search results.

BTW, do you know of any sets of query responses that are publicly available?

Search Your Gmail Messages with ElasticSearch and Ruby

Thursday, May 19th, 2011

Search Your Gmail Messages with ElasticSearch and Ruby

From the website:

If you’d like to check out ElasticSearch, there’s already lots of options where to get the data to feed it with. You can use a Twitter or Wikipedia river to fill it with gigabytes of public data, or you can feed it very quickly with some RSS feeds.

But, let’s get a bit personal, shall we? Let’s feed it with your own e-mail, imported from your own Gmail account.

A useful way to teach basic searching.

After all, a search of Wikipedia or Twitter may return impressive results, but are they correct results?

Hard for a user to say because both Wikipedia and Twitter are large enough that verification (other than by other programs) of search results isn’t possible.

Assuming your Gmail inbox is smaller than Wikipedia, you should be able to recognize which results are “correct” and which ones look “off.”

And you may learn some Ruby in the bargain.

Not a bad day’s work. 😉

PS: You may want to try the links on mining Twitter, Wikipedia and RSS feeds with ElasticSearch.
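If you want to see the “small enough to verify” point before setting anything up, here is a toy sketch. It is a plain Python inverted index, not ElasticSearch and not the post’s Ruby code, and the three messages and the `build_index` and `search` helpers are made up for illustration:

```python
from collections import defaultdict

# Hypothetical stand-ins for a few messages from a small mailbox.
messages = {
    1: "Lunch on Friday?",
    2: "Invoice for March attached",
    3: "Friday meeting moved to Monday",
}


def build_index(docs):
    """Map each lowercased token to the set of message ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token.strip("?.,!")].add(doc_id)
    return index


def search(index, term):
    """Return the ids of messages containing the term, in ascending order."""
    return sorted(index.get(term.lower(), set()))


index = build_index(messages)
print(search(index, "friday"))
```

With three messages you can check every result by eye, which is the property a personal mailbox gives you and Wikipedia does not.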

How To Model Search Term Data To Classify User Intent & Match Query Expectations – Post

Saturday, January 15th, 2011

How To Model Search Term Data To Classify User Intent & Match Query Expectations by Mark Sprague, courtesy of […], is an interesting piece on analysis of search data to extract user intent.

As interesting as that is, I think it could be used by topic map authors for a slightly different purpose.

What if we were to use search data to classify how users were seeking particular subjects?

That is to mine search data for patterns of subject identification, which really isn’t all that different than deciding what product or what service to market to a user.
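A minimal sketch of what such mining might look like, assuming a log that pairs each query with the subject of the result the user finally selected. The log entries, the pairing itself and the `identifications_by_subject` helper are hypothetical; real search logs would need far more cleaning:

```python
from collections import Counter, defaultdict

# Hypothetical search log: the query a user typed and the subject of the
# result they finally selected.
search_log = [
    ("jfk", "John F. Kennedy"),
    ("president kennedy", "John F. Kennedy"),
    ("jfk", "John F. Kennedy"),
    ("jfk", "JFK Airport"),
    ("kennedy airport", "JFK Airport"),
]


def identifications_by_subject(log):
    """Count how often each query string was used to reach each subject."""
    by_subject = defaultdict(Counter)
    for query, subject in log:
        by_subject[subject][query] += 1
    return by_subject


patterns = identifications_by_subject(search_log)
for subject, queries in patterns.items():
    print(subject, queries.most_common())
```

Each subject ends up with a weighted list of the ways users identified it, and a query such as “jfk” that appears under two subjects marks an identification that needs more context.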

As a matter of fact, I suspect that many of the tools used by marketeers could be dual purposed to develop subject identifications for non-marketing information systems.

Such as library catalogs or professional literature searches.

The latter being often pay-per-view, maintaining high customer satisfaction means repeat business and word-of-mouth advertising.

I am sure there is already literature on this sort of mining of search data for subject identifications. If you have a pointer or two, please send them my way.