Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

June 5, 2011

Your Data, Your Search

Filed under: ElasticSearch,Search Engines,Searching — Patrick Durusau @ 3:21 pm

Your Data, Your Search by Karel Minařík.

A slide deck, but a very interesting one. It covers the shortcomings of search, gives an overview of inverted indexing, and ends up with ElasticSearch. Along the way he observes that the “analysis” step is often more important than the “search” step. I suspect that analysis is nearly always more important than searching. And certainly harder.
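The point about analysis is easy to see in a toy example (mine, not from the slides): the same query matches or misses depending on what the analyzer did to the text before it reached the inverted index.

```python
def analyze(text, lowercase=True, strip_plural=True):
    """A deliberately crude analyzer: tokenize, lowercase, strip a trailing 's'."""
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if strip_plural:
        tokens = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    return tokens

def build_index(docs, **opts):
    """Build an inverted index: token -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyze(text, **opts):
            index.setdefault(token, set()).add(doc_id)
    return index

docs = {1: "Search Engines", 2: "searching the engine room"}

raw = build_index(docs, lowercase=False, strip_plural=False)
analyzed = build_index(docs)

# Without analysis, "engine" finds only doc 2; with it, both documents match.
print(raw.get("engine", set()))       # {2}
print(analyzed.get("engine", set()))  # {1, 2}
```

Change the analyzer and you change what is findable; no amount of clever searching recovers what analysis threw away.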

June 1, 2011

groonga

Filed under: groonga,Search Engines — Patrick Durusau @ 6:49 pm

groonga

From the website:

Groonga is an open-source fulltext search engine and column store. It lets you write high-performance applications that require fulltext search.

Be aware that most of the documentation is written in Japanese.

Consider it an incentive to learn Japanese, practice Japanese if you already know it but are rusty, or to develop documentation in another language.

May 27, 2011

Zanran

Filed under: Data Source,Dataset,Search Engines — Patrick Durusau @ 12:36 pm

Zanran

A search engine for data and statistics.

I was puzzled by results containing mostly PDF files until I read:

Zanran doesn’t work by spotting wording in the text and looking for images – it’s the other way round. The system examines millions of images and decides for each one whether it’s a graph, chart or table – whether it has numerical content.

Admittedly you may have difficulty re-using such data but finding it is a big first step. You can then contact the source for the data in a more re-usable form.

From Hints & Helps:

Language. English only please… for now.
Phrase search. You can use double quotes to make phrases (e.g. “mobile phones”).
Vocabulary. We have only limited synonyms – please try different words in your query. And we don’t spell-check … yet.

From the website:

Zanran helps you to find ‘semi-structured’ data on the web. This is the numerical data that people have presented as graphs and tables and charts. For example, the data could be a graph in a PDF report, or a table in an Excel spreadsheet, or a barchart shown as an image in an HTML page. This huge amount of information can be difficult to find using conventional search engines, which are focused primarily on finding text rather than graphs, tables and bar charts.

Put more simply: Zanran is Google for data.

Well said.

May 26, 2011

Google Correlate & Party Games

Filed under: Authoring Topic Maps,Search Engines,Searching — Patrick Durusau @ 3:42 pm

Google Correlate

A new service from Google. From the blog entry:

It all started with the flu. In 2008, we found that the activity of certain search terms are good indicators of actual flu activity. Based on this finding, we launched Google Flu Trends to provide timely estimates of flu activity in 28 countries. Since then, we’ve seen a number of other researchers—including our very own—use search activity data to estimate other real world activities.

However, tools that provide access to search data, such as Google Trends or Google Insights for Search, weren’t designed with this type of research in mind. Those systems allow you to enter a search term and see the trend; but researchers told us they want to enter the trend of some real world activity and see which search terms best match that trend. In other words, they wanted a system that was like Google Trends but in reverse.

This is now possible with Google Correlate, which we’re launching today on Google Labs. Using Correlate, you can upload your own data series and see a list of search terms whose popularity best corresponds with that real world trend. In the example below, we uploaded official flu activity data from the U.S. CDC over the last several years and found that people search for terms like [cold or flu] in a similar pattern to actual flu rates…
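The “Trends in reverse” idea is, at heart, correlation ranking. A back-of-the-envelope sketch (my own, with made-up numbers, nothing like Google's scale): given a real-world time series, rank candidate search-term series by Pearson correlation.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical weekly flu-activity series and made-up term-popularity series.
flu = [2, 5, 9, 14, 9, 4, 2]
terms = {
    "cold or flu": [3, 6, 10, 15, 8, 5, 3],
    "sunscreen":   [9, 7, 4, 1, 4, 8, 10],
    "tax forms":   [5, 5, 6, 5, 6, 5, 5],
}

ranked = sorted(terms, key=lambda t: pearson(flu, terms[t]), reverse=True)
print(ranked[0])  # "cold or flu" tracks the flu series most closely
```

The hard parts Google solves, doing this against millions of candidate term series, efficiently, are of course not shown here.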

One use of Google Correlate would be a party game: guess the correlated terms.

I looked at the “rainfall” correlation example.

For “annual rainfall (in) |correlate| disney vacation package,” I would have guessed “prozac” and not “mildew remover.” Shows what I know.

I am sure topic map authors have other uses for these Google tools. What are yours?

May 23, 2011

Build a distributed realtime tweet search system in no time.

Filed under: Search Engines — Patrick Durusau @ 7:45 pm

Build a distributed realtime tweet search system in no time.

Part 1

Part 2

The next obvious step would be to overlay a topic map onto the tweet store.

Tweets of interest would likely be mapped fairly well; as for tweets that are not, well, that’s just the breaks.

Illustrates the principle that not every subject is going to get mapped.

Some may not be mapped out of lack of interest.

Some may not be mapped because they are outside the scope of a particular project.

Some may not be mapped due to oversight.

There isn’t any moral principle that every post, tweet, email list, or website has to be mapped or even indexed.

Here’s an interesting topic map experiment:

Using a web search engine, create a topic map of an international event but exclude any statements by government agencies or officials.

Think of your topic map as a noise reduction filter.

Suggestions on evaluation mechanisms? How much less noise does your topic map have than CNN?

May 19, 2011

Search Your Gmail Messages with ElasticSearch and Ruby

Filed under: Dataset,ElasticSearch,Search Data,Search Engines,Search Interface — Patrick Durusau @ 3:26 pm

Search Your Gmail Messages with ElasticSearch and Ruby

From the website:

If you’d like to check out ElasticSearch, there’s already lots of options where to get the data to feed it with. You can use a Twitter or Wikipedia river to fill it with gigabytes of public data, or you can feed it very quickly with some RSS feeds.

But, let’s get a bit personal, shall we? Let’s feed it with your own e-mail, imported from your own Gmail account.

A useful way to teach basic searching.

After all, a search of Wikipedia or Twitter may return impressive results, but are they correct results?

Hard for a user to say because both Wikipedia and Twitter are large enough that verification (other than by other programs) of search results isn’t possible.

Assuming your Gmail inbox is smaller than Wikipedia, you should be able to recognize which results are “correct” and which ones look “off.”

And you may learn some Ruby in the bargain.

Not a bad day’s work. 😉


PS: You may want to try the links on mining Twitter, Wikipedia and RSS feeds with ElasticSearch.

May 9, 2011

Google at CHI 2011

Google at CHI 2011

From the Google blog:

Google has an increasing presence at ACM CHI: Conference on Human Factors in Computing Systems, which is the premier conference for Human Computer Interaction research. Eight Google papers will appear at the conference. These papers not only touch on our core areas such as Search, Chrome and Android but also demonstrate our growing effort in new areas where HCI is essential, such as new search user interfaces, gesture-based interfaces and cross-device interaction. They showcase our efforts to address user experiences in diverse situations. Googlers are playing active roles in the conference in many other ways too: participating in conference committees, hosting panels, organizing workshops and teaching courses, as well as running demos and 1:1 sessions at Google’s booth.

The post also has a complete set of links to papers from Google and other materials.

I remember reading something recently about modulating the amount of information sent to a user based on their current activity level. That is, a person engaged in a task requiring immediate attention (does watching American Idol count?) is sent less information than a person doing something less important (watching a presidential address).

Is merging affected by my activity level or just delivery of less than all the results?

May 6, 2011

Building a better legal search engine, part 1: Searching the U.S. Code

Filed under: Law - Sources,Legal Informatics,Search Engines,Searching — Patrick Durusau @ 12:37 pm

Building a better legal search engine, part 1: Searching the U.S. Code

From the post:

As I mentioned last week, I’m excited to give a keynote in two weeks on Law and Computation at the University of Houston Law Center alongside Stephen Wolfram, Carl Malamud, Seth Chandler, and my buddy Dan from CLS. The first part in my blog series leading up to this talk will focus on indexing and searching the U.S. Code with structured, public domain data and open source software.

He closes with:

Stay tuned next week for the next part in the series. I’ll be using Apache Mahout to build an intelligent recommender system and cluster the sections of the Code.

It won’t pull the same audience share as the “Who shot J.R.?” episode of Dallas, but I have to admit I’m interested in the next part of this series. 😉

May 3, 2011

The History of Search [infographic]

Filed under: Search Engines,Searching — Patrick Durusau @ 1:06 pm

The History of Search [infographic]

James Anderson has produced a history of [Internet] search infographic.

To see the full page view, The History of Search.

Interesting and probably worth having printed as a decorative poster for the office wall.

An infographic that included search techniques, both digital and analog before the Internet would be even more interesting.

April 30, 2011

When Data Mining Goes Horribly Wrong

Filed under: Data Mining,Merging,Search Engines — Patrick Durusau @ 10:22 am

In When Data Mining Goes Horribly Wrong, Matthew Hurst brings us a cautionary tale about what can happen when “merging” decisions are made badly.

From the blog:

Consequently, when you see a details page – either on Google, Bing or some other search engine with a local search product – you are seeing information synthesized from multiple sources. Of course, these sources may differ in terms of their quality and, as a result, the values they provide for certain attributes.

When combining data from different sources, decisions have to be made as to firstly when to match (that is to say, assert that the data is about the same real world entity) and secondly how to merge (for example: should you take the phone number found in one source or another?).

This process – the conflation of data – is where you either succeed or fail.

Read Matthew’s post for encouraging signs that there is plenty of room for the use of topic maps.

What I find particularly amusing is that repairing the bad merge in this case does nothing to prevent it from happening again and again.

Not much of a repair if the same problem keeps recurring elsewhere.
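The two decisions Matthew names, when to match and how to merge, can be sketched in a few lines (my own toy rules, not his pipeline; the records and quality scores are made up):

```python
records = [
    {"source": "directory_a", "quality": 0.9, "name": "Joe's Pizza", "phone": "555-0100"},
    {"source": "scrape_b",    "quality": 0.4, "name": "Joes Pizza",  "phone": "555-0199"},
]

def match(a, b):
    """Crude matching rule: names agree after dropping punctuation and case."""
    canon = lambda s: "".join(c for c in s.lower() if c.isalnum())
    return canon(a["name"]) == canon(b["name"])

def merge(group):
    """Crude merge policy: take every attribute from the highest-quality source."""
    best = max(group, key=lambda r: r["quality"])
    return {"name": best["name"], "phone": best["phone"]}

if match(records[0], records[1]):
    entity = merge(records)
    print(entity["phone"])  # 555-0100, from the higher-quality source
```

Every horror story in Matthew's post comes from one of these two functions being wrong: matching records that are different entities, or merging in a value from the wrong source.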

April 15, 2011

Lucene Revolution

Filed under: Lucene,Search Engines,Searching,Solr — Patrick Durusau @ 6:27 am

Lucene Revolution

May 23 – 24, 2011 Training
May 25 – 26, 2011 Conference

San Francisco Airport Hyatt Regency

From the website:

Lucene Revolution 2011 is the largest conference dedicated to open source search. The Lucene Revolution 2011 brings together developers, thought leaders, and market makers who understand that the search technology they’ve been looking for has arrived. This is an event that should not be missed by anyone that is using, or considering, Apache Lucene/Solr or LucidWorks Enterprise for their search applications.

You will get a chance to hear from a wide range of speakers, from the foremost experts on open source search technology to a broad cross-section of users that have implemented Lucene, Solr, or LucidWorks Enterprise to improve search application performance, scalability, flexibility, and relevance, while lowering their costs. The two-day conference agenda is packed with technical sessions, developer content, user case studies, panels, and networking opportunities. You will learn new ways to develop, deploy, and enhance search applications using Lucene/Solr — and LucidWorks Enterprise.

Preceding the conference there are two days of intensive hands-on training on Solr, Lucene, and LucidWorks Enterprise on May 23 and 24. Whether you are new to Lucene/Solr, want to brush up on your skills, or want to get new insights, tips & tricks, you will get the knowledge you need to be successful.

This could be very cool.

April 11, 2011

ElasticSearch.org Website Search: Field Notes

Filed under: Search Engines,Searching — Patrick Durusau @ 5:40 am

ElasticSearch.org Website Search: Field Notes

From the post:

Field notes gathered during installing and configuring ElasticSearch for http://elasticsearch.org

ElasticSearch is something you are going to encounter and these sysadmin type notes should get you started.

April 7, 2011

How to search the documentation of all CRAN packages

Filed under: R,Search Engines,Searching — Patrick Durusau @ 7:27 pm

How to search the documentation of all CRAN packages

Now there is a damned odd title for a post these days. 😉

I mean after releases of Lucene 3.1, Solr 3.1, not to mention other indexing/searching clients/platforms, why would anyone need a post on finding a specific function or algorithm?

You just put what you are looking for in your favorite search tool and …., oh yeah, it isn’t just put your lips together and blow is it?

Rather than saying you can find it, this post should say you can search for it.

Because functions and algorithms may not have the names you expect.

To handle that problem you would need a topic map.

April 1, 2011

Solr 3.1 (Lucene 3.1) Released!

Filed under: Lucene,Search Engines,Searching,Solr — Patrick Durusau @ 4:10 pm

Solr 3.1 (Lucene 3.1) Released!

Solr 3.1, which contains Lucene 3.1, was released on 31 March 2011.


March 21, 2011

MG4J – Managing Gigabytes for Java

Filed under: Indexing,Search Engines,Searching — Patrick Durusau @ 8:52 am

MG4J – Managing Gigabytes for Java

From the website:

The main points of MG4J are:

  • Powerful indexing. Support for document collections and factories makes it possible to analyse, index and query consistently large document collections, providing easy-to-understand snippets that highlight relevant passages in the retrieved documents.
  • Efficiency. We do not provide meaningless data such as “we index x GiB per second” (with which configuration? which language? which data source?)—we invite you to try it. MG4J can index without effort the TREC GOV2 collection (document factories are provided to this purpose) and scales to hundreds of millions of documents.
  • Multi-index interval semantics. When you submit a query, MG4J returns, for each index, a list of intervals satisfying the query. This provides the base for several high-precision scorers and for very efficient implementation of sophisticated operators. The intervals are built in linear time using new research algorithms.
  • Expressive operators. MG4J goes far beyond the bag-of-words model, providing efficient implementation of phrase queries, proximity restrictions, ordered conjunction, and combined multiple-index queries. Each operator is represented internally by an abstract object, so you can easily plug in your favourite syntax.
  • Virtual fields. MG4J supports virtual fields—fields containing text for a different, virtual document; the typical example is anchor text, which must be attributed to the target document.
  • Flexibility. You can build much smaller indices by dropping term positions, or even term counts. It’s up to you. Several different types of codes can be chosen to balance efficiency and index size. Documents coming from a collection can be renumbered (e.g., to match a static rank or experiment with indexing techniques).
  • Openness. The document collection/factory interfaces provide an easy way to present your own data representation to MG4J, making it a breeze to set up a web-based search engine accessing directly your data. Every element along the path of query resolution (parsers, document-iterator builders, query engines, etc.) can be substituted with your own versions.
  • Distributed processing. Indices can be built for a collection split in several parts, and combined later. Combination of indices allows non-contiguous indices and even the same document can be split across different collections (e.g., when indexing anchor text).
  • Multithreading. Indices can be queried and scored concurrently.
  • Clustering. Indices can be clustered both lexically and documentally (possibly after a partitioning). The clustering system is completely open, and user-defined strategies decide how to combine documents from different sources. This architecture makes it possible, for instance, to load in RAM the part of an index that contains terms appearing more frequently in user queries.

March 20, 2011

99 Problems, But The Search Ain’t One

Filed under: ElasticSearch,Search Engines,Searching — Patrick Durusau @ 1:25 pm

99 Problems, But The Search Ain’t One

Slides and video from UK PHP presentation by Andrei Zmievski on ElasticSearch.

From the webpage:

ElasticSearch is the new kid on the search block. Built on top of Lucene and adhering to the best concepts of so-called NoSQL movement, ElasticSearch is a distributed, highly available, fast RESTful search engine, ready to be plugged into Web applications. Come to this session and learn how to set up, index, search, and tune ElasticSearch in less time than it takes to order a latte (disclaimer: at sufficiently busy central Starbucks locations. Side effects may include euphoria, stuff getting done, and extra time to spend with girlfriend).

While I appreciate an optimistic (enthusiastic?) presentation and I like ElasticSearch, predictions of the end of searching problems is a bit premature. 😉

I commend the article to you but would note that the search problems addressed by topic maps, such as:

  1. Different identifications of the same subject
  2. Re-use of the same identifiers for different subjects
  3. Inability to reliably merge indexes from more than one source

all remain with ElasticSearch.
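For contrast, here is a toy sketch of how topic maps attack problems 1 and 2: merge on shared subject identifiers (the URLs below are hypothetical), so two names for one subject collapse while one name for two subjects stays apart.

```python
topics = [
    {"names": {"The Big Apple"},  "identifiers": {"http://example.org/nyc"}},
    {"names": {"New York City"},  "identifiers": {"http://example.org/nyc"}},
    {"names": {"Paris"},          "identifiers": {"http://example.org/paris-france"}},
    {"names": {"Paris"},          "identifiers": {"http://example.org/paris-texas"}},
]

def merge_topics(topics):
    """Merge topics that share a subject identifier, never merely a name."""
    merged = []
    for topic in topics:
        for existing in merged:
            if existing["identifiers"] & topic["identifiers"]:  # shared identifier
                existing["names"] |= topic["names"]
                existing["identifiers"] |= topic["identifiers"]
                break
        else:
            merged.append({"names": set(topic["names"]),
                           "identifiers": set(topic["identifiers"])})
    return merged

result = merge_topics(topics)
print(len(result))  # 3: the two NYC topics merged; the two Parises did not
```

A term index merges on strings; subject identity is what keeps the two Parises apart.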

March 8, 2011

Toward Topic Search on the Web – Paper

Filed under: Probabilistic Models,Search Engines — Patrick Durusau @ 9:53 am

Toward Topic Search on the Web was reported by Infodocket.com.

Report from Microsoft researchers on “…framework that improves web search experiences through the use of a probabilistic knowledge base.”

Interesting report.

Even more so if you think about topic maps as declarative knowledge bases and consider the use of probabilistic knowledge bases as a means to add to the former.

BTW, user satisfaction was used as the criterion for success.

Now that is local semantics.

Probably at just about the right level as well.

Comments?

March 7, 2011

Microsoft Academic Search

Filed under: Dataset,Search Engines — Patrick Durusau @ 7:09 am

Microsoft Academic Search

I ran across a reference to this search engine in a thread bitching about ranking of publications, etc.

I suppose, but my first reaction was like a kid in a candy store.

Hard to know, out of:

  • Algorithms & Theory
  • Artificial Intelligence
  • Bioinformatics & Computational Biology
  • Computer Education
  • Computer Vision
  • Databases
  • Data Mining
  • Distributed & Parallel Computing
  • Graphics
  • Hardware & Architecture
  • Human-Computer Interaction
  • Information Retrieval
  • Machine Learning & Pattern Recognition
  • Multimedia
  • Natural Language & Speech
  • Networks & Communications
  • Operating Systems
  • Programming Languages
  • Real-Time & Embedded Systems
  • Scientific Computing
  • Security & Privacy
  • Simulation
  • Software Engineering
  • World Wide Web
  • Computer Science Overall
  • Other Domains Overall

…which to choose first!

As far as the critics of this site, I have to agree it isn’t everything it could be.

But that is a good thing because it leaves Microsoft and everyone else something to strive for.

I don’t have any illusions about corporate entities, including Microsoft.

But, all of them have people working for them who do good work, that benefits the public interest, and who are doing so while working for a corporate entity.

I know that because I know people who work for a number of the larger software corporate entities.

I am sure you know some of them too.

March 2, 2011

Collaborative Web Search (Haystack)

Filed under: Search Engines,Search Interface — Patrick Durusau @ 1:00 pm

Jeff Dalton reports the launch of Haystack, a collaborative web search startup.

I suspect that while useful within small groups, as shared search results propagate outwards, they will encounter the same semantic dissonance as tagging.

WhistlePig: A minimalist real-time search engine

Filed under: Search Engines,Search Interface — Patrick Durusau @ 12:39 pm

WhistlePig: A minimalist real-time search engine.

From Jeff Dalton’s blog:

William Morgan recently announced the release of Whistlepig, a real-time search engine written in C with Ruby bindings. It is now up to release 0.4. Whistlepig is a minimalist in-memory search system with ranking by reverse date. You can read William’s blog post for his motivations for writing it.

Of particular interest (at least to me):

  • A full query language and parser with conjunctions, disjunctions, phrases, negations, grouping, and nesting.
  • Labels: arbitrary tokens which can be added to and removed from documents at any point, and incorporated into search queries.
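The labels feature is worth a sketch (a toy in Python, nothing to do with Whistlepig's C internals): labels are attached to documents after indexing and intersected with term hits at query time.

```python
index = {}   # term -> set of document ids
labels = {}  # label -> set of document ids

def add_doc(doc_id, text):
    """Index a document's terms."""
    for term in text.lower().split():
        index.setdefault(term, set()).add(doc_id)

def add_label(doc_id, label):
    """Attach a label to an already-indexed document, at any point."""
    labels.setdefault(label, set()).add(doc_id)

def search(term, label=None):
    """Term hits, optionally restricted to documents carrying a label."""
    hits = index.get(term, set())
    if label is not None:
        hits = hits & labels.get(label, set())
    return hits

add_doc(1, "meeting notes for monday")
add_doc(2, "monday travel booking")
add_label(1, "starred")  # applied after indexing, like Whistlepig labels

print(search("monday"))                   # {1, 2}
print(search("monday", label="starred"))  # {1}
```

Because labels live outside the indexed text, they can be added and removed without re-indexing the document.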

February 22, 2011

Luke

Filed under: Hadoop,Lucene,Maps,Marketing,Search Engines — Patrick Durusau @ 1:34 pm

Luke

From the website:

Lucene is an Open Source, mature and high-performance Java search engine. It is highly flexible, and scalable from hundreds to millions of documents.

Luke is a handy development and diagnostic tool, which accesses already existing Lucene indexes and allows you to display and modify their content in several ways:

  • browse by document number, or by term
  • view documents / copy to clipboard
  • retrieve a ranked list of most frequent terms
  • execute a search, and browse the results
  • analyze search results
  • selectively delete documents from the index
  • reconstruct the original document fields, edit them and re-insert to the index
  • optimize indexes
  • open indexes consisting of multiple parts, and located on Hadoop filesystem
  • and much more…

Searching is interesting and I have several more search engines to report this week, but the real payoff is finding.

And recording the finding so that other users can benefit from it.

We could all develop our own maps of the London Underground, at the expense of repeating the effort of others.

Or, we can purchase a copy of the London Underground map.

Which one seems more cost effective for your organization?

elasticsearch

Filed under: Lucene,NoSQL,Search Engines — Patrick Durusau @ 1:18 pm

elasticsearch

From the website:

So, we build a web site or an application and want to add search to it, and then it hits us: getting search working is hard. We want our search solution to be fast, we want a painless setup and a completely free search schema, we want to be able to index data simply using JSON over HTTP, we want our search server to be always available, we want to be able to start with one machine and scale to hundreds, we want real-time search, we want simple multi-tenancy, and we want a solution that is built for the cloud.

“This should be easier”, we declared, “and cool, bonsai cool”.

elasticsearch aims to solve all these problems and more. It is an Open Source (Apache 2), Distributed, RESTful, Search Engine built on top of Lucene.

Another contender in the space for search engines.

Do you have a favorite search engine? If so, what about it makes it your favorite?

February 21, 2011

Sphinx

Filed under: Search Engines,Searching — Patrick Durusau @ 7:07 am

Sphinx – Open Source Search Server

Benjamin Bock mentioned Sphinx in a Twitter posting and so I had to go see what he was reading about.

The short version from the website:

Sphinx is an open source full text search server, designed from the ground up with performance, relevance (aka search quality), and integration simplicity in mind. It’s written in C++ and works on Linux (RedHat, Ubuntu, etc), Windows, MacOS, Solaris, FreeBSD, and a few other systems.

Sphinx lets you either batch index and search data stored in an SQL database, NoSQL storage, or just files quickly and easily — or index and search data on the fly, working with Sphinx pretty much as with a database server.

A variety of text processing features enable fine-tuning Sphinx for your particular application requirements, and a number of relevance functions ensures you can tweak search quality as well.

Searching via SphinxAPI is as simple as 3 lines of code, and querying via SphinxQL is even simpler, with search queries expressed in good old SQL.

For more specific details, see: About Sphinx.

Comments welcome.

January 19, 2011

Curation is the New Search is the New Curation – Post

Filed under: Indexing,Search Engines,Search Interface,Searching — Patrick Durusau @ 1:22 pm

Curation is the New Search is the New Curation

Paul Kedrosky sees a return to curation as the next phase in searching. In part because search algorithms can be gamed…, but read the post. He has an interesting take on the problem.

The one comment I would add is that curation will mean not everything is curated.

Should it be?

What criteria would you use for excluding material to be curated from your index of (insert your favorite topic)?

Proposition: It is an error to think everything that can be searched is worth indexing (or curation).

January 13, 2011

Document Indexing – Wrong Level?

Filed under: Indexing,Search Engines — Patrick Durusau @ 8:16 am

I was reading the Jaccard distance treatment in Anand Rajaraman and Jeffrey D. Ullman’s Mining of Massive Datasets and something that keeps nagging at me became clearer.
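As an aside for readers without the book at hand, the Jaccard measures are short enough to state in code: similarity is the size of the intersection over the size of the union, and distance is one minus that.

```python
def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| for two collections treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    return 1 - jaccard_similarity(a, b)

d1 = "the cat sat on the mat".split()
d2 = "the cat sat on the hat".split()

# 4 shared terms, 6 distinct terms overall: similarity 4/6.
print(round(jaccard_similarity(d1, d2), 2))  # 0.67
```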

Is document indexing the wrong level for indexing?

Take a traditional research paper as an example.

You would give me low marks if I handed in a paper with the following as one of my footnotes:

# Principia Mathematica, Volume 1

But that is a perfectly acceptable result for a search engine. I am pointed to an entire document as relevant to my search.

True enough but hardly very helpful.

Search engines can take me to a document but that still leaves all the hard work to me.

Not that I mind the hard work but that hard work is done over and over again, as each user encounters the document.

Seems terribly inefficient to have the same work done each time the document is returned.

Say for example that I am searching for the proof that 1 + 1 = 2, I should be able to create a representative for that subject that points every searcher to the same location. As opposed to them digging out that bit of information for themselves.

I have heard that bit of information assigned various locations in Principia Mathematica. I am acquiring a reprint so I can verify its location for myself and will be posting its location.

Topic maps help because they are about subject indexing which I take to be different from document indexing.

A document index only tells you that somewhere in a document, one or more terms relevant to your search may be found. Not terribly helpful.

A subject index, on the other hand, particularly one made using a topic map, not only isolates the location of a subject but can also tell you more about it, such as its relationships to other subjects.
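The contrast in code (my own illustration, not a topic map API; the exact location is left as a placeholder since I haven't verified it yet):

```python
# A document index answers "which documents?"
document_index = {
    "proof": {"Principia Mathematica, Volume 1"},  # somewhere in the whole volume
}

# A subject index answers "where exactly, and what else do we know?"
subject_index = {
    "1 + 1 = 2": {
        "locations": [("Principia Mathematica, Volume 1", "<location to be verified>")],
        "related": ["Peano axioms"],  # illustrative extra information
    },
}

print(document_index["proof"])
print(subject_index["1 + 1 = 2"]["locations"][0][0])
```

The hard work of pinning the subject to its location is done once, by the subject index's author, instead of by every reader the search engine sends to the volume.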

January 10, 2011

Engineering basic algorithms of an in-memory text search engine

Filed under: Data Structures,Indexing,Search Engines — Patrick Durusau @ 4:37 pm

Engineering basic algorithms of an in-memory text search engine

Authors: Frederik Transier, Peter Sanders

Keywords: Inverted index, in-memory search engine, randomization

Abstract:

Inverted index data structures are the key to fast text search engines. We first investigate one of the predominant operations on inverted indexes, which asks for intersecting two sorted lists of document IDs of different lengths. We explore compression and performance of different inverted list data structures. In particular, we present Lookup, a new data structure that allows intersection in expected time linear in the smaller list.

Based on this result, we present the algorithmic core of a full text data base that allows fast Boolean queries, phrase queries, and document reporting using less space than the input text. The system uses a carefully choreographed combination of classical data compression techniques and inverted-index-based search data structures. Our experiments show that inverted indexes are preferable over purely suffix-array-based techniques for in-memory (English) text search engines.

A similar system is now running in practice in each core of the distributed data base engine TREX of SAP.

An interesting comparison of inverted indexes with suffix-arrays.
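The asymmetry the paper exploits is easy to demonstrate (this is not their Lookup structure, just the idea behind it): if the longer posting list sits in a structure with constant-time membership tests, intersection costs time proportional to the shorter list alone.

```python
def intersect(short_list, long_lookup):
    """Probe each id from the short posting list against the longer list's lookup."""
    return [doc_id for doc_id in short_list if doc_id in long_lookup]

postings_rare   = [3, 17, 42, 99]         # rare term: short sorted list
postings_common = set(range(0, 1000, 3))  # common term, held as a lookup

print(intersect(postings_rare, postings_common))  # [3, 42, 99]
```

Four probes against a 334-entry list, rather than a merge over both; the paper's contribution is making such a lookup structure compact enough to be worth it.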

I am troubled by the “reconstruct the input” aspects of the paper.

While it is understandable and in some cases, more efficient, for data to be held in a localized data structure, my question is what do we do when data exceeds local storage capacity?

Think about the data held by Lexis/Nexis for example. Where would we put it while creating a custom data structure for its access?

There are data sets, important data sets, that have to be accessed in place.

And those data sets need to be addressed using topic maps.

*****
You may recall from the TAO paper by Steve Pepper the illustration of topics, associations and occurrences floating above a data set.

While topic map formats have been useful in many ways, they have distracted from the vision of topic maps as an information overlay as opposed to yet-another-format.

Formats are just that, formats. Pick one.

January 6, 2011

Lucene and Solr: 2010 in Review – Post

Filed under: Lucene,Search Engines,Solr — Patrick Durusau @ 2:55 pm

Lucene and Solr: 2010 in Review

Great highlights of a busy and productive year for both Lucene and Solr.

January 2, 2011

Why We Desperately Need a New (and Better) Google – Post

Filed under: Search Engines — Patrick Durusau @ 2:23 pm

Why We Desperately Need a New (and Better) Google

Vivek Wadhwa compares Blekko with Google.

The comments on the post, some of them anyway, were as interesting as the post itself.

Questions:

  1. What should Google (or any other search engine) do better in your opinion? (3-5 pages, no citations)
  2. Should new search engines re-index the Internet, or target sub-parts of the Internet? (3-5 pages, no citations)
  3. The unicorn of the Internet, some obscure site with relevant information, is mentioned. How serious is the requirement to find every relevant site? If results > 100 go unexamined, what does it matter? (3-5 pages, no citations)
  4. Which would you find more useful: 10 articles relevant to your search topic or > 100 search engine results? (3-5 pages, no citations)

December 20, 2010

When OCR Goes Bad: Google’s Ngram Viewer & The F-Word – Post

Filed under: Humor,Indexing,Search Engines — Patrick Durusau @ 7:24 pm

When OCR Goes Bad: Google’s Ngram Viewer & The F-Word

Courtesy of http://searchengineland.com’s Danny Sullivan, a highly amusing post on Google’s Ngram viewer.

Danny’s post only covers changing spelling and character rendering, but it serves to illustrate that the broader the time period covered, the greater the care needed to get results that make any sense at all.
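The underlying problem is the long s (“ſ”) of older typography being OCR’d as “f”, which splits one word’s history across two spellings. A toy sketch of the normalization needed before counting ngrams (the counts and the variant table are made up):

```python
# Hypothetical raw ngram counts from an OCR'd corpus.
ocr_counts = {"beft": 120, "best": 880, "fong": 40, "song": 960}

# Hypothetical variant table: long s misread as "f".
variants = {"beft": "best", "fong": "song"}

def merge_ocr_variants(counts, variant_map):
    """Fold counts for known OCR misreadings into the modern spelling."""
    merged = {}
    for word, n in counts.items():
        modern = variant_map.get(word, word)
        merged[modern] = merged.get(modern, 0) + n
    return merged

print(merge_ocr_variants(ocr_counts, variants))  # {'best': 1000, 'song': 1000}
```

The catch, as Danny's examples show, is that some misreadings collide with real words, so a simple table like this cannot be applied blindly.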

Quite the post for the holidays!

Building blocks of a scalable webcrawler

Filed under: Indexing,Search Engines,Webcrawler — Patrick Durusau @ 4:41 am

Building blocks of a scalable webcrawler

From Marc Seeger’s post about his thesis:

This thesis documents my experiences trying to handle over 100 million sets of data while keeping them searchable. All of that happens while collecting and analyzing about 100 new domains per second. It covers topics from the different Ruby VMs (JRuby, Rubinius, YARV, MRI) to different storage-backend (Riak, Cassandra, MongoDB, Redis, CouchDB, Tokyo Cabinet, MySQL, Postgres, …) and the data-structures that they use in the background.

Questions:

  1. What components would need to be added to make this a semantic crawling project? (3-5 pages, citations)
  2. What scalability issues would semantic crawling introduce? (3-5 pages, citations)
  3. Design a configurable, scalable, semantic crawler. (Project)

Powered by WordPress