Sehrch.com: A Structured Search Engine Powered By Hypertable
From the introduction:
Sehrch.com is a structured search engine. It provides powerful querying capabilities that enable users to quickly complete complex information retrieval tasks. It gathers conceptual awareness from the Linked Open Data cloud, and can be used as (1) a regular search engine or (2) as a structured search engine. In both cases conceptual awareness is used to build entity centric result sets.
To facilitate structured search we have introduced a new simple search query syntax that allows for bound properties and relations (contains, less than, more than, between, etc). The initial implementation of Sehrch.com was built over an eight month period. The primary technologies used are Hypertable and Lucene. Hypertable is used to store all Sehrch.com data which is in RDF (Resource Description Framework). Lucene provides the underlying searcher capability. This post provides an introduction to structured web search and an overview of how we tackled our big data problem.
A bit later you read:
We achieved a stable loading throughput of 22,000 triples per second (tps), peaking at 30,000 tps. Within 24 hours we had loaded the full 1.3 billion triples on a single node, on hardware that was at least two years out of date. We were shocked, mostly because on the same hardware the SPARQL compliant triplestores had managed 80 million triples (Virtuoso) at the most and Hypertable had just loaded all 1.3 billion. The 500GB of input RDF had become a very efficient 50GB of Hypertable data files. But then, loading was only half of the problem, could we query? We wrote a multi-threaded data exporter that would query Hypertable for entities by subject (Hypertable row key) randomly. We ran the exporter, and achieved speeds that peaked at 1,800 queries per second. Again we were shocked. Now that the data the challenge had set forth was loaded, we wondered how far Hypertable could go on the same hardware.
So we reloaded the data, this time appending the row keys with 1. Hypertable completed the load again, in approximately the same time. So we ran it again, now appending the keys with 2. Hypertable completed again, again in the same time frame. We now had a machine which was only 5% of our eventual production specification that stored 3.6 billion triples, three copies each of DBpedia and Freebase. We reran our data exporter and achieved query speeds that ranged between 1,000-1,300 queries per second. From that day on we have never looked back, Hypertable solved our data storage problem, it smashed the challenge that we set forth that would determine if Sehrch.com was at all possible. Hypertable made it possible.
That’s performance by design, not brute force.
On the other hand, the results of: Pop singers less than 20 years old, could be improved. Page after page of Miley Cyrus results gets old in a hurry. 😉
I am sure the team at Sehrch.com would appreciate your suggestions and comments.