Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 18, 2011

eTBLAST: a text-similarity based search engine

Filed under: eTBLAST,Search Engines,Similarity — Patrick Durusau @ 9:37 pm

eTBLAST: a text-similarity based search engine

eTBLAST from the Virginia Bioinformatics Institute at Virginia Tech lies at the heart of the Deja vu search service.

Unlike Deja vu, the eTBLAST interface is not limited to citations in PubMed, but includes:

  • MEDLINE
  • CRISP
  • NASA
  • Medical Cases
  • PMC Full Text
  • PMC METHODS
  • PMC INTRODUCTION
  • PMC RESULTS
  • PMC (paragraphs)
  • PMC Medical Cases
  • Clinical Trials
  • Arxiv
  • Wikipedia
  • VT Courses

There is a set of APIs, limited to some of the medical material.

While I was glad to see Arxiv included, given my research interests, CiteSeerX, or collections from the ACM, IEEE, ALA or other information/CS organizations, would be of greater interest.

Among other things, I would like to have the ability to create a map of synonyms (can you say “topic map?”) that could be substituted during the comparison of the literature.
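A minimal sketch of the idea (every synonym entry below is a hypothetical example): push each text through the synonym map before comparison, so synonymous terms count as identical tokens.

// Sketch only: canonicalize tokens through a synonym map, then compare.
object SynonymCompare {
  // Hypothetical domain synonyms; a topic map could supply and maintain these.
  val canonical = Map(
    "tumour"   -> "tumor",
    "neoplasm" -> "tumor",
    "renal"    -> "kidney")

  def normalize(text: String): Set[String] =
    text.toLowerCase.split("\\W+").filter(_.nonEmpty)
      .map(t => canonical.getOrElse(t, t)).toSet

  // Jaccard similarity over canonicalized token sets.
  def similarity(a: String, b: String): Double = {
    val (sa, sb) = (normalize(a), normalize(b))
    if (sa.isEmpty || sb.isEmpty) 0.0
    else (sa intersect sb).size.toDouble / (sa union sb).size
  }

  def main(args: Array[String]): Unit =
    println(similarity("renal neoplasm", "kidney tumour")) // 1.0
}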

But, in the meantime, I definitely will try the interface on Arxiv to see how it impacts my research on techniques relevant to topic maps.

October 26, 2011

Google removes plus (+) operator

Filed under: Search Engines,Search Interface,Searching — Patrick Durusau @ 6:58 pm

Google removes plus (+) operator

From the post:

The “+” operator used to mean “required” to Google, I think. But it also meant “and exactly that word is required, not an alternate form.” I think? Maybe it always was just a synonym for double quotes, and never meant ‘required’? Or maybe double quotes mean ‘required’ too?

At any rate, the plus operator is gone now.

I’m not entirely sure that the quotes will actually insist on the quoted word being present in the page? Can anyone find a counter-example?

I had actually noticed a while ago that the google advanced search page had stopped providing any fields that resulted in “+”, and was suggesting double quotes for “exactly this form of word” (not variants), rather than “phrase”. Exactly what given operators (and bare searches) do has continually evolved over time, and isn’t always documented or reflected in the “search tips” page or “advanced search” screen.

The post is a good example of using the Internet Archive to research the prior state of the web.

BTW, the comments and discussion on this were quite amusing. “Kelly Fee,” a Google employee, had these responses to questions about removal of the “+” operator:

We’ve made the ways you can tell Google exactly what you want more consistent by expanding the functionality of the quotation marks operator. In addition to using this operator to search for an exact phrase, you can now add quotation marks around a single word to tell Google to match that word precisely. So, if in the past you would have searched for [magazine +latina], you should now search for [magazine “latina”].

We’re constantly making changes to Google Search – adding new features, tweaking the look and feel, running experiments – all to get you the information you need as quickly and as easily as possible. This recent change is another step toward simplifying the search experience to get you to the info you want.

If you read the comments, having a simple search experience wasn’t the goal of most users. Finding relevant information was.

Kelly reassures users they are being heard, but ignored:

Thanks for sharing your thoughts. I especially appreciate everyone’s passion for search operators (if only every Google Search user were aware of these tools like you are…).

One thing I’d like to add to my original post is that, as with any change we make to our search engine, we put a lot of thought into this modification, but we’re always interested in user feedback.

I hope that you’ll continue to give us feedback in the future so that we can make your experience on Google more enjoyable.

After a number of posts on the loss of function caused by eliminating the “+” operator, Kelly creatively mis-hears the questions and comes up with an example that works.

I just tested out the quotes operator to make sure that it still works for phrases and it does. I searched for [from her eyes] and then [“from her eyes”] and got different results. I also tried [from her “eye”] and [from her eye] and got different results for each query, which is how it is intended to work.

Many people understand that putting quotes around a phrase tells a search engine to search for that exact phrase. This change applies that same idea to a specific word.

Would it help to know that Kelly Fee was a gymnast?

October 18, 2011

Search Algorithms with Google Director of Research Peter Norvig

Filed under: Search Algorithms,Search Engines,Searching — Patrick Durusau @ 2:40 pm

Search Algorithms with Google Director of Research Peter Norvig

From the post:

As you will see in the transcript below, this discussion focused on the use of artificial intelligence algorithms in search. Peter outlines for us the approach used by Google on a number of interesting search problems, and how they view search problems in general. This is fascinating reading for those of you who want to get a deeper understanding of how search is evolving and the technological approaches that are driving it. The types of things that are detailed in this interview include:

  1. The basic approach used to build Google Translate
  2. The process Google uses to test and implement algorithm updates
  3. How voice driven search works
  4. The methodology being used for image recognition
  5. How Google views speed in search
  6. How Google views the goals of search overall

Some of the particularly interesting tidbits include:

  1. Teaching automated translation systems vocabulary and grammar rules is not a viable approach. There are too many exceptions, and language changes and evolves rapidly. Google Translate uses a data driven approach of finding millions of real world translations on the web and learning from them. [A toy sketch of this idea follows the list.]
  2. Chrome will auto translate foreign language websites for you on the fly (if you want it to).
  3. Google tests tens of thousands of algorithm changes per year, and makes one to two actual changes every day.
  4. Testing is layered, starting with a panel of users comparing current and proposed results, perhaps a spin through the usability lab at Google, and finally a live test with a small subset of actual Google users.
  5. Google Voice Search relies on 230 billion real world search queries to learn all the different ways that people articulate given words. So people no longer need to train their speech recognition for their own voice, as Google has enough real world examples to make that step unnecessary.
  6. Google Image search allows you to drag and drop images onto the search box, and it will try to figure out what it is for you. I show a screen shot of an example of this for you below. I LOVE that feature!
  7. Google is obsessed with speed. As Peter says “you want the answer before you’re done thinking of the question”. Expressed from a productivity perspective, if you don’t have the answer that soon your flow of thought will be interrupted.
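To make tidbit 1 concrete, here is a toy sketch of the data-driven idea, and only the idea (Google’s actual alignment models are far more sophisticated): given aligned sentence pairs, bare co-occurrence counts already begin to recover word translations, with no grammar rules in sight.

// Toy sketch of data-driven translation: learn word pairings from aligned
// sentence pairs by co-occurrence counting alone.
object ToyTranslate {
  def learn(pairs: Seq[(String, String)]): Map[String, String] = {
    val counts = scala.collection.mutable.Map.empty[(String, String), Int]
    for ((src, tgt) <- pairs; s <- src.split(" "); t <- tgt.split(" "))
      counts((s, t)) = counts.getOrElse((s, t), 0) + 1
    // For each source word, keep the target word it co-occurred with most.
    counts.toSeq.groupBy(_._1._1).map { case (s, cs) => s -> cs.maxBy(_._2)._1._2 }
  }

  def main(args: Array[String]): Unit = {
    val corpus = Seq(
      ("the house", "la casa"),
      ("my house", "mi casa"),
      ("the white house", "la casa blanca"))
    println(learn(corpus)("house")) // casa
  }
}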

Reading the interview, it occurred to me that perhaps, just perhaps, in authoring semantic applications, whether Semantic Web or Topic Maps, we have been overly concerned with “correctness.” More so on the logic side, where applications fall on their sides when they encounter outliers, but precision is also the enemy of large scale production of topic maps.

What if we took a tack from Google’s use of a data driven approach to find mappings between data structures and the terms in data structures? I know automated techniques have been used for preliminary mapping of schemas before. What I am suggesting is that we capture the basis for the mapping, so we can improve or change it.

Although there are more than 70 names for “insurance policy number” in information systems, I suspect that within the domain those names stand in relationships to other subjects that could assist in refining a mapping of those terms over time. Rather than making mining/mapping a “run it again, Sam” type of event, capturing that information could improve our odds at other mappings.
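A rough sketch of what “capturing the basis” might look like (all names and weights below are hypothetical): evidence travels with the mapping, so later feedback refines it instead of forcing a fresh mining run.

// Sketch: a term mapping that carries its own justification.
case class Evidence(kind: String, detail: String, weight: Double)

case class TermMapping(from: String, to: String, evidence: List[Evidence]) {
  def confidence: Double = evidence.map(_.weight).sum min 1.0
  // New evidence, including user corrections, refines the mapping in place.
  def withEvidence(e: Evidence): TermMapping = copy(evidence = e :: evidence)
}

object MappingDemo {
  def main(args: Array[String]): Unit = {
    val m = TermMapping("POLICY_NO", "insurance policy number",
      List(Evidence("column-name-similarity", "POLICY_NO ~ policy_number", 0.4),
           Evidence("value-pattern", "both match AA-9999999", 0.3)))
    println(m.withEvidence(Evidence("user-feedback", "analyst confirmed", 0.3)).confidence) // 1.0
  }
}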

Depending on the domain, how accurate does it need to be? Particularly since we can build feedback into the systems so that as users encounter errors, those are corrected and cascade back to other users. Places users don’t visit may be wrong, but if no one visits, what difference does it make?

Very compelling interview and I suggest you read it in full.

October 16, 2011

Going Head to Head with Google (and winning)

Filed under: Search Engines,Searching — Patrick Durusau @ 4:07 pm

ETDEWEB versus the World-Wide-Web: A Specific Database/Web Comparison

I really need to do contract work writing paper titles. 😉

Cutting to the chase:

For the 15 topics in this study, ETDEWEB was shown to bring the user unique results not shown by Google or Google Scholar 86.7% of the time.

Caveat: these were topics where ETDEWEB is strong, and they did not include soft/hard porn, political blogs and similar material.

Abstract:

A study was performed comparing user search results from the specialized scientific database on energy related information, ETDEWEB, with search results from the internet search engines Google and Google Scholar. The primary objective of the study was to determine if ETDEWEB (the Energy Technology Data Exchange – World Energy Base) continues to bring the user search results that are not being found by Google and Google Scholar. As a multilateral information exchange initiative, ETDE’s member countries and partners contribute cost- and task-sharing resources to build the largest database of energy related information in the world. As of early 2010, the ETDEWEB database has 4.3 million citations to world-wide energy literature. One of ETDEWEB’s strengths is its focused scientific content and direct access to full text for its grey literature (over 300,000 documents in PDF available for viewing from the ETDE site and over a million additional links to where the documents can be found at research organizations and major publishers globally). Google and Google Scholar are well-known for the wide breadth of the information they search, with Google bringing in news, factual and opinion-related information, and Google Scholar also emphasizing scientific content across many disciplines. The analysis compared the results of 15 energy-related queries performed on all three systems using identical words/phrases. A variety of subjects was chosen, although the topics were mostly in renewable energy areas due to broad international interest. Over 40,000 search result records from the three sources were evaluated. The study concluded that ETDEWEB is a significant resource to energy experts for discovering relevant energy information. For the 15 topics in this study, ETDEWEB was shown to bring the user unique results not shown by Google or Google Scholar 86.7% of the time. Much was learned from the study beyond just metric comparisons. Observations about the strengths of each system and factors impacting the search results are also shared along with background information and summary tables of the results. If a user knows a very specific title of a document, all three systems are helpful in finding the user a source for the document. But if the user is looking to discover relevant documents on a specific topic, each of the three systems will bring back a considerable volume of data, but quite different in focus. Google is certainly a highly-used and valuable tool to find significant ‘non-specialist’ information, and Google Scholar does help the user focus on scientific disciplines. But if a user’s interest is scientific and energy-specific, ETDEWEB continues to hold a strong position in the energy research, technology and development (RTD) information field and adds considerable value in knowledge discovery.

October 10, 2011

A Basic Full Text Search Server in Erlang

Filed under: Erlang,Search Engines,Searching — Patrick Durusau @ 6:17 pm

A Basic Full Text Search Server in Erlang

From the post:

This post explains how to build a basic full text search server in Erlang. The server has the following features:

  • indexing
  • stemming
  • ranking
  • faceting
  • asynchronous search results
  • web frontend using websockets

Familiarity with the OTP design principles is recommended.

Looks like a good way to become familiar with Erlang and text search issues.
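The server itself is in Erlang; as a language-neutral picture of the core machinery, here is a minimal inverted index with crude term-frequency ranking, sketched in Scala. Stemming, faceting and the websocket frontend from the feature list all layer on top of a structure like this.

// Minimal inverted index with term-frequency ranking.
class TinyIndex {
  // term -> (docId -> term frequency)
  private var index = Map.empty[String, Map[Int, Int]]

  private def tokens(text: String): Array[String] =
    text.toLowerCase.split("\\W+").filter(_.nonEmpty)

  def add(docId: Int, text: String): Unit =
    for (t <- tokens(text)) {
      val postings = index.getOrElse(t, Map.empty[Int, Int])
      index += t -> postings.updated(docId, postings.getOrElse(docId, 0) + 1)
    }

  // Score documents by summed term frequency across query terms.
  def search(query: String): Seq[(Int, Int)] = {
    val scores = scala.collection.mutable.Map.empty[Int, Int]
    for (t <- tokens(query); (doc, tf) <- index.getOrElse(t, Map.empty[Int, Int]))
      scores(doc) = scores.getOrElse(doc, 0) + tf
    scores.toSeq.sortBy(-_._2) // (docId, score), best first
  }
}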

September 30, 2011

ElasticSearch: Beyond Full Text Search

Filed under: ElasticSearch,Search Algorithms,Search Engines,Searching — Patrick Durusau @ 7:07 pm

ElasticSearch: Beyond Full Text Search by Karel Minařík.

If you aren’t into hard core searching already, this is a nice introduction to the area. I would like to see the presentation that went with the slides, but even the slides alone should be useful.

September 28, 2011

Solr and LucidWorks Enterprise: When to use each

Filed under: LucidWorks,Search Engines,Solr — Patrick Durusau @ 7:36 pm

Solr and LucidWorks Enterprise: When to use each

From the post:

If LucidWorks Enterprise is built on Solr, how do you know which one to use when for your own circumstances? This article describes the difference between using straight Solr, using the LucidWorks Enterprise user interface, and using LucidWorks Enterprise’s ReST API for accomplishing various common tasks so you can see which fits your situation at a given moment.

In today’s world, building the perfect product is a lot like trying to repair a set of train tracks while the train is barreling down on you. The world just keeps moving, with great ideas and new possibilities tempting you every day. And to make things worse, innovation doesn’t just show its face for you; it regularly visits your competitors as well.

That’s why you use open source software in the first place. You have smart people; does it make sense to have them building search functionality when Apache Solr already provides it? Of course not. You’d rather rely on the solid functionality that’s already been built by the community of Solr developers, and let your people spend their time building innovation into your own products. It’s simply a more efficient use of resources.

But what if you need search-related functionality that’s not available in straight Solr? In some cases, you may be able to fill those holes and lighten your load with LucidWorks Enterprise. Built on Solr, LucidWorks Enterprise starts by simplifying the day-to-day use tasks involved in using Solr, and then moves on to adding additional features that can help free up your development team for work on your own applications. But how do you know which path would be right for you?

Since I posted the LucidWorks 2.0 announcement yesterday, I thought this might be helpful in terms of its evaluation. I did not see a date on it but it looks current enough.

Thoora is Your Robot Buddy for Exploring Web Topics

Filed under: Search Engines,Searching,Social Networks — Patrick Durusau @ 7:34 pm

Thoora is Your Robot Buddy for Exploring Web Topics by Jon Mitchell. (on ReadWriteWeb)

From the post:

With a Web full of stuff, discovery is a hard problem. Search engines were the first tools on the scene, but their rankings still have a hard time identifying relevance the same way a human user would. These days, social networks are the substitute for content discovery, and even the major search engines are using your social signals to determine what’s relevant for you. But the obvious problem with social search is that if your friends haven’t discovered it yet, it’s not on your radar.

At some point, someone in the social graph has to discover something for the first time. With so much new content getting churned out all the time, a Web surfer looking for something original could use some algorithmic help. A new app called Thoora, which launched its public beta last week, uses the power of machine learning to help users uncover new content on topics that interest them.

You create a topic, Thoora suggests keywords, you choose some (and can declare them to be equivalent), and the results are shared with others by default.

Users who create “good” topics can develop followings.

Although topics can be shared, the article does not mention sharing keywords.

Seems like a missed opportunity to crowd-source keywords from multiple “good” authors on the same topic to improve the results. That is, you supply five or six keywords for topic A, and I come along and suggest some additional keywords for topic A, perhaps from a topic I already have. That would require “acceptance” by the first user, but that should not be hard.
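The mechanics are simple enough to sketch (all names hypothetical): suggestions queue against a topic and merge into its keyword set only when the owner accepts them.

// Sketch of crowd-sourced keywords with owner acceptance.
case class Suggestion(keyword: String, byUser: String)

case class Topic(owner: String, keywords: Set[String], pending: List[Suggestion] = Nil) {
  def suggest(s: Suggestion): Topic = copy(pending = s :: pending)
  def accept(keyword: String): Topic =
    copy(keywords = keywords + keyword, pending = pending.filterNot(_.keyword == keyword))
}

object TopicDemo {
  def main(args: Array[String]): Unit = {
    val t = Topic("you", Set("solr", "lucene"))
      .suggest(Suggestion("elasticsearch", "me"))
      .accept("elasticsearch")
    println(t.keywords) // Set(solr, lucene, elasticsearch)
  }
}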

I was amused to read in the Thoora FAQ:

Finally, Google News has no social component. Thoora was created so that topics could be shared and followed, because your topics – once painted with your expert brush – are super-valuable to others and ripe for sharing.

Sharing keywords is far more powerful than sharing topics. We have all had the experience of searching for something when a companion suggests a different word and we find the object of our search. Sharing in Thoora now is like following tweets. Useful, but not all that it could be.

If you decide to use Thoora, I would appreciate your views and comments.

September 27, 2011

LucidWorks 2.0, the search platform for Apache Solr/Lucene (stolen post)

Filed under: Lucene,Search Engines,Solr — Patrick Durusau @ 6:48 pm

LucidWorks 2.0, the search platform for Apache Solr/Lucene by David M. Fishman.

Apologies to David because I stole his entire post, with links to the Lucid site. Could not figure out what to leave out so I included it all.

If you’re a search application developer or architect, if you’ve got big data on your hands or on the brain, or if you’ve got big plans for Apache Lucene/Solr, this announcement is for you.

Today marks the 2.0 release of LucidWorks, the search platform that accelerates and simplifies development of highly accurate, scalable, and cost-effective search applications. We’ve bottled the best of Apache Lucene/Solr, including key innovations from the 4.x branch, in a commercial-grade package that’s designed for the rigors of production search application deployment.

Killer search applications are popping up everywhere, and it’s no surprise. On the one hand, big data technologies are disrupting old barriers of speed, structure, cost and addressability of data storage; on the other, the new frontier of query-driven analytics is shifting from old-school reporting to instant, unlimited reach into mixed data structures, driven by users. (There are places these converge: 7 years of data in Facebook combine content with user context, creating a whole new way to look at life as we know it on line.)

Or, to put it a little less breathlessly: Search is now the UI for Big Data. LucidWorks 2.0 is the only distribution of Apache Solr/Lucene that lets you:

  • Build killer business-critical search apps more quickly and easily
  • Streamline search setup and optimization for more reliable operations
  • Access big data and enterprise content faster and more securely
  • Scale to billions without spending millions

If you surf through our website, you’ll find info on features and benefits, screenshots, a detailed technical overview, and access to product documentation. But that’s all talk. Download LucidWorks Enterprise 2.0, or apply for a spot in the Private Beta for LucidWorks Cloud, and take it for a spin.

They say imitation is the sincerest form of flattery. Maybe that will make David feel better!

Seriously, this is an important milestone, both for today and for what is yet to come in the search arena.

September 26, 2011

Ergodic Control and Polyhedral approaches to PageRank Optimization

Filed under: PageRank,Search Algorithms,Search Engines,Searching — Patrick Durusau @ 6:58 pm

Ergodic Control and Polyhedral approaches to PageRank Optimization by Olivier Fercoq, Marianne Akian, Mustapha Bouhtou, Stéphane Gaubert (Submitted on 10 Nov 2010 (v1), last revised 19 Sep 2011 (this version, v2))

Abstract:

We study a general class of PageRank optimization problems which consist in finding an optimal outlink strategy for a web site subject to design constraints. We consider both a continuous problem, in which one can choose the intensity of a link, and a discrete one, in which in each page, there are obligatory links, facultative links and forbidden links. We show that the continuous problem, as well as its discrete variant when there are no constraints coupling different pages, can both be modeled by constrained Markov decision processes with ergodic reward, in which the webmaster determines the transition probabilities of websurfers. Although the number of actions turns out to be exponential, we show that an associated polytope of transition measures has a concise representation, from which we deduce that the continuous problem is solvable in polynomial time, and that the same is true for the discrete problem when there are no coupling constraints. We also provide efficient algorithms, adapted to very large networks. Then, we investigate the qualitative features of optimal outlink strategies, and identify in particular assumptions under which there exists a “master” page to which all controlled pages should point. We report numerical results on fragments of the real web graph.

I mention this research to raise several questions:

  1. Does PageRank have a role to play in presentation for topic map systems?
  2. Should PageRank results in topic map systems be used to assign subject identifications?
  3. If your answer to #2 is yes, what sort of subjects and how would you design the user choices leading to them?
  4. Are you monitoring user navigations of your topic maps?
  5. Has user navigation of your topic maps affected their revision or the design of later maps?
  6. Are the navigations in #5 the same as choices based on search results? (In theory or practice.)
  7. Is there an optimal strategy for linking nodes in a topic map?
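For anyone who wants to experiment with these questions on a small topic map graph, plain PageRank is just a power iteration; a sketch follows (the paper’s optimization machinery is far richer than this).

// Basic PageRank by power iteration over an adjacency list.
// Dangling nodes simply leak rank mass here; real implementations redistribute it.
object PageRank {
  def rank(links: Map[Int, Seq[Int]],
           damping: Double = 0.85, iterations: Int = 50): Map[Int, Double] = {
    val nodes = (links.keys ++ links.values.flatten).toSet.toSeq
    val n = nodes.size
    var pr = nodes.map(_ -> 1.0 / n).toMap
    for (_ <- 1 to iterations) {
      val next = scala.collection.mutable.Map(nodes.map(_ -> (1 - damping) / n): _*)
      for ((src, outs) <- links; if outs.nonEmpty; dst <- outs)
        next(dst) += damping * pr(src) / outs.size
      pr = next.toMap
    }
    pr
  }

  def main(args: Array[String]): Unit = {
    val graph = Map(1 -> Seq(2, 3), 2 -> Seq(3), 3 -> Seq(1))
    rank(graph).toSeq.sortBy(-_._2).foreach(println) // node 3 should rank highest
  }
}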

September 14, 2011

Seven Deadly Sins of Solr

Filed under: Design,Enterprise Integration,Search Engines,Solr — Patrick Durusau @ 7:06 pm

7 Ways to Ensure Your Lucene/Solr Implementation Fails

From the post:

CMSWire spoke with Lucene/Solr expert Jay Hill of Lucid Imagination for a few tips on things to avoid when implementing Lucene/Solr to reduce the risk of your search project biting the dust. Hill calls them the “Seven Deadly Sins of Solr” – sloth, greed, pride, lust, envy, gluttony and wrath.

Read for Solr projects. Recast and read for other projects as well.

September 2, 2011

Groonga

Filed under: Column-Oriented,NoSQL,Search Engines,Searching — Patrick Durusau @ 7:54 pm

Groonga

From the webpage:

Groonga is an open-source fulltext search engine and column store. It lets you write high-performance applications that require fulltext search.

The latest release is 1.2.5, released 2011-08-29.

Most of the documentation is in Japanese so I can’t comment on it.

Think of this as an opportunity to (hopefully) learn some Japanese. Given the rate of computer science research in Japan, it will not be wasted effort.

PS: If you already read Japanese, feel free to contribute some comments on Groonga.

August 29, 2011

Building Search App for Public Mailing Lists

Filed under: ElasticSearch,Search Engines,Search Interface,Searching — Patrick Durusau @ 6:25 pm

Building Search App for Public Mailing Lists in 15 Minutes with ElasticSearch by Lukáš Vlček.

You will need the slides to follow the presentation: Building Search App for Public Mailing Lists.

Very cool, if fast, presentation on building an email search application with ElasticSearch.

BTW, the link to BigDesk (A tiny monitoring tool for ElasticSearch clusters) is incorrect. Try: https://github.com/lukas-vlcek/bigdesk.

August 28, 2011

Road To A Distributed Search Engine

Filed under: ElasticSearch,Lucene,Search Engines — Patrick Durusau @ 7:54 pm

Road To A Distributed Search Engine by Shay Banon.

If you are looking for a crash course on the construction details of Elasticsearch, you are in the right place.

My only quibble, and this is common to all really good presentations (this is one of those), is that there isn’t a transcript to go along with it. There is so much information that I will have to watch it more than once to take it all in.

If you watch the presentation, do pay attention so you are not like the person who suggested that Solr and Elasticsearch were similar. 😉

August 1, 2011

99 Problems, But the Search Ain’t One

Filed under: ElasticSearch,Search Engines,Searching — Patrick Durusau @ 3:55 pm

99 Problems, But the Search Ain’t One

A fairly comprehensive overview of elasticsearch, including replication/sharding and API summaries.

Depending on the type of search and “aggregation” (read merging) you require, this may fit the bill.

July 27, 2011

Open Source Search Engines (comparison)

Filed under: Information Retrieval,Search Algorithms,Search Engines,Searching — Patrick Durusau @ 7:02 pm

Open Source Search Engines (comparison)

A comparison of ten (10) open source search engines.

Appears as an appendix to Modern Information Retrieval, second edition.

I probably don’t need yet another IR book.

But the first edition was well written; the second edition website includes teaching slides for all chapters, a nice set of pointers to additional resources, and a problems-and-solutions section (“under construction” as of 27 July 2011), all of which are things I like to encourage in authors.

OK, I talked myself into it, I am ordering a copy today. 😉

More comments to follow.

July 23, 2011

Lucene.net is back on track

Filed under: Lucene,Search Algorithms,Search Engines — Patrick Durusau @ 3:08 pm

Lucene.net is back on track by Simone Chiaretta

From the post:

More than 6 months ago I blogged about Lucene.net starting its path toward extinction. Soon after that, due to the “stubbornness” of the main committer, a few forks appeared, the biggest of which was Lucere.net by Troy Howard.

At the end of the year, despite the promises of the main committer to comply with the request of the Apache board by himself, nothing happened and Lucene.net came really close to being shut down. But luckily, the same Troy Howard who had forked Lucene.net a few months before decided, together with a bunch of other volunteers, to resubmit the documents required by the Apache Board for starting a new project in the Apache Incubator; by the beginning of February the new proposal was voted for by the Board and the project re-entered the incubator.

If you are interested in search engines and have .Net skills (or want to acquire them), this would be a good place to start.

July 20, 2011

…Develop[ing] Personal Search Engine

Filed under: Marketing,Search Engines — Patrick Durusau @ 1:01 pm

Ness Computing Announces $5M Series A Financing to Develop Personal Search Engine

From the post:

SILICON VALLEY, Calif., July 19, 2011 /PRNewswire/ — Ness Computing is announcing that it raised a $5M Series A round of financing in November 2010. The round was led by Vinod Khosla and Ramy Adeeb of Khosla Ventures, with participation from Alsop Louie Partners, TomorrowVentures, Bullpen Capital, a co-founder of Palantir Technologies and several angel investors. This financing is enabling the company’s team of engineers and scientists, with expertise in information retrieval and machine learning, to pursue their vision to change the nature of search by building technology that delivers results and recommendations that are unique to each person using it.

The technology, which the company calls a Likeness Engine, represents a new approach to this complex engineering challenge by fusing a search engine and a recommendation engine, and will power the company’s first product, a mobile service called Ness. The Likeness Engine is different from traditional search engines that are useful for finding fact-based objective information that is the same for everyone, such as weather reports, dictionary terms, and stock prices. Ness Computing’s vision is to answer questions of a more subjective nature by understanding each person’s likes and dislikes, to deliver results that match his or her personal tastes. This can be seen in the difference between a person asking, “Which concerts are playing in New York City?” and “Which concerts would I most enjoy in New York City?” Ultimately, Ness aims to help people make decisions about dining, nightlife, entertainment, shopping, music, travel and more, culled expressly for them from the world’s almost limitless options.

Impressive array of previously successful talent.

I am not sure I buy the “objective” versus “subjective” information divide, but clearly Ness is interested in learning the user’s view of the world in order to “better” answer their questions.

Depending on how successful the searches by Ness become, a user could become insulated in a cocoon of previous expressions of likes and dislikes.

That isn’t an original insight, I saw it somewhere in an article about personalized search results from search engines. Nor is it a problem that arose due to personalization of search engines.

The average user (read: not a librarian) tends to search for terms in a field or subject area that they already know. So they are unlikely to encounter information that uses different terminology. In a very real way, users’ searches are already highly personalized.

Personalization isn’t a bad thing but it is a limiting thing. That is, it puts a border on the information that you will get back from a search, and you won’t have much of an opportunity to go beyond it. It simply never comes up. And information overload being what it is, having limited, safe results can be quite useful. Particularly if you like sleeping at the Holiday Inn, eating at McDonald’s and watching American Idol.

Hopefully Ness will address the semantic diversity issue in order to provide users, at least the ones who are interested, with a richer search experience. Topic maps would be useful in such an attempt.

July 19, 2011

Build your own internet search engine

Filed under: Erlang,Search Algorithms,Search Engines,Searching — Patrick Durusau @ 7:53 pm

Build your own internet search engine by Daniel Himmelein.

Uses Erlang but also surveys the Apache search stack.

Not that you have to roll your own search engine, but doing so will give you a different appreciation of the issues search engines face.


Update: Build your own internet search engine – Part 2

I ran across part 2 while cleaning up at year’s end. Enjoy!

July 17, 2011

Building blocks of a scalable web crawler

Filed under: Indexing,NoSQL,Search Engines,Searching,SQL — Patrick Durusau @ 7:29 pm

Building blocks of a scalable web crawler, a thesis by Marc Seeger (2010).

Abstract:

The purpose of this thesis was the investigation and implementation of a good architecture for collecting, analysing and managing website data on a scale of millions of domains. The final project is able to automatically collect data about websites and analyse the content management system they are using.

To be able to do this efficiently, different possible storage back-ends were examined and a system was implemented that is able to gather and store data at a fast pace while still keeping it searchable.

This thesis is a collection of the lessons learned while working on the project, combined with the necessary knowledge that went into architectural decisions. It presents an overview of the different infrastructure possibilities and general approaches, as well as explaining the choices that have been made for the implemented system.

From the conclusion:

The implemented architecture has been recorded processing up to 100 domains per second on a single server. At the end of the project the system gathered information about approximately 100 million domains. The collected data can be searched instantly and the automated generation of statistics is visualized in the internal web interface.

Most of your clients will have smaller information demands, but the lessons here will stand you in good stead with their systems too.
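The thesis works at architecture scale, but the skeleton every crawler grows from is small. A toy sketch (single-threaded, with no politeness delays, robots.txt handling, or real HTML parsing; those are exactly the things the thesis layers on top):

import scala.io.Source

// Toy crawler skeleton: a frontier queue, a visited set, naive link extraction.
object ToyCrawler {
  val linkPattern = "href=\"(https?://[^\"#]+)\"".r

  def crawl(seed: String, maxPages: Int): Set[String] = {
    var visited = Set.empty[String]
    val frontier = scala.collection.mutable.Queue(seed)
    while (frontier.nonEmpty && visited.size < maxPages) {
      val url = frontier.dequeue()
      if (!visited(url)) {
        visited += url
        try {
          val html = Source.fromURL(url, "UTF-8").mkString
          for (m <- linkPattern.findAllMatchIn(html)) frontier.enqueue(m.group(1))
        } catch { case _: Exception => () } // skip unfetchable pages
      }
    }
    visited
  }
}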

July 15, 2011

Cloudant Search

Filed under: Search Engines,Search Interface,Searching — Patrick Durusau @ 6:49 pm

Cloudant Search

Tim Anglade writes:

I’ve always strongly felt that using NOSQL wasn’t so much a choice as a necessity. That most successful NOSQL deployments start with the intimate knowledge that your set of requirements — from speed & availability to operational considerations and budget — cannot be met with a relational database, coupled with a deep understanding of the tradeoffs you are making. Among those, perhaps no tradeoff has been felt more deeply by NOSQL users worldwide, than the eponymous loss of a natural, instantaneous way of accessing your data through a structured query language. We all came up with our own remedies; more often than not, that substitute was based on MapReduce: Google’s novel, elegant way of explicitly parallelizing computation over distributed, unstructured data. But as the joke goes, it’s always been a non-starter for the more novice users out there, and where suits & ties are involved.

CouchDB Views (as our brand of MapReduce is called) come with additional concerns, as they are pre-computed and written to disk. While this is fine — and actually, extremely useful — for the use-cases and small scales a lot of Apache CouchDB deployments reside at (single instances working off a limited dataset), this behavior is somewhere North of nagging and South of suicidal for the data sizes & use-cases most Cloudant customers have to deal with. Part of the promise of our industry is — or should be, anyway — to make your life & business easier, no matter how much data you have. And so, while CouchDB Views have been, and will undoubtedly remain, an essential tool to index, filter & transform your data, once you know what to do with it; and while its various weaknesses (explicitly parallelized syntax, lengthy computation, heavy disk usage) are also the source of its most meaningful strengths (distributed processing, high performance on repeated queries, persistent transformations), we at Cloudant saw a clear opportunity to offer a novel, complementary way to interact with your data.

A way that would allow you to interact with your data instantaneously; wouldn’t force you to mess around with MapReduce jobs or complex languages; a way that would not require you to set up a third-party, financially or operationally expensive solution.

We call this way Cloudant Search. And today, we’re proud to announce its immediate availability, as a public beta.

Well, there goes the weekend!

July 14, 2011

Elasticsearch, Kettle and the CTools

Filed under: ElasticSearch,Search Engines,Searching — Patrick Durusau @ 4:11 pm

Elasticsearch, Kettle and the CTools

From the post:

I’m not much into the sql vs nosql discussion. I have enough years of BI to know that the important thing is to choose the right tool for the job. And that requires a lot of tools!

Here’s one more for our set: ElasticSearch. ElasticSearch is an Open Source (Apache 2), Distributed, RESTful Search Engine built on top of Lucene.

Adds Elasticsearch to Kettle for BI.


Updated 14 May 2012 (forgot the URL for the link, now fixed)

July 1, 2011

Indexing The World Wide Web:…

Filed under: Indexing,Search Algorithms,Search Engines,Searching — Patrick Durusau @ 2:57 pm

Indexing The World Wide Web: The Journey So Far by Abhishek Das and Ankit Jain.

Abstract:

In this chapter, we describe the key indexing components of today’s web search engines. As the World Wide Web has grown, the systems and methods for indexing have changed significantly. We present the data structures used, the features extracted, the infrastructure needed, and the options available for designing a brand new search engine. We highlight techniques that improve relevance of results, discuss trade-offs to best utilize machine resources, and cover distributed processing concepts in this context. In particular, we delve into the topics of indexing phrases instead of terms, storage in memory vs. on disk, and data partitioning. We will finish with some thoughts on information organization for the newly emerging data-forms.

A non-trivial survey of attempts at, and issues in, indexing the web. This is going to take a while to digest but it looks like a very good starting place to uncover what to try next.
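One of the chapter’s topics, indexing phrases rather than bare terms, is easiest to see with positional postings. A minimal sketch of the data structure and the adjacency check behind a phrase query:

// Sketch of a positional index: term -> docId -> positions in the document.
class PositionalIndex {
  private var postings = Map.empty[String, Map[Int, List[Int]]]

  def add(docId: Int, text: String): Unit =
    for ((term, pos) <- text.toLowerCase.split("\\W+").filter(_.nonEmpty).zipWithIndex) {
      val docs = postings.getOrElse(term, Map.empty[Int, List[Int]])
      postings += term -> docs.updated(docId, pos :: docs.getOrElse(docId, Nil))
    }

  // A document matches when each query term appears one position after its predecessor.
  def phrase(terms: Seq[String]): Set[Int] = {
    val perTerm = terms.map(t => postings.getOrElse(t, Map.empty[Int, List[Int]]))
    val candidates = perTerm.map(_.keySet).reduceOption(_ intersect _).getOrElse(Set.empty)
    candidates.filter { doc =>
      val posLists = perTerm.map(_(doc).toSet)
      posLists.head.exists(start =>
        posLists.zipWithIndex.forall { case (ps, i) => ps.contains(start + i) })
    }
  }
}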

Apache Lucene 3.3 / Solr 3.3

Filed under: Indexing,Lucene,Search Engines,Solr — Patrick Durusau @ 2:47 pm

Lucene 3.3 Announcement

Lucene Features:

  • The spellchecker module now includes suggest/auto-complete functionality, with three implementations: Jaspell, Ternary Trie, and Finite State.
  • Support for merging results from multiple shards, for both “normal” search results (TopDocs.merge) as well as grouped results using the grouping module (SearchGroup.merge, TopGroups.merge). [A sketch of shard merging follows this list.]
  • An optimized implementation of KStem, a less aggressive stemmer for English.
  • Single-pass grouping implementation based on block document indexing.
  • Improvements to MMapDirectory (now also the default implementation returned by FSDirectory.open on 64-bit Linux).
  • NRTManager simplifies handling near-real-time search with multiple search threads, allowing the application to control which indexing changes must be visible to which search requests.
  • TwoPhaseCommitTool facilitates performing a multi-resource two-phased commit, including IndexWriter.
  • The default merge policy, TieredMergePolicy, has a new method (set/getReclaimDeletesWeight) to control how aggressively it targets segments with deletions, and is now more aggressive than before by default.
  • PKIndexSplitter tool splits an index by a mid-point term.
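Of those features, shard merging is the easiest to picture in use. A hedged sketch of calling the new TopDocs.merge from Scala; the announcement confirms the method exists, but the argument list here is an assumption to check against the 3.3 javadocs:

import org.apache.lucene.search.{Sort, TopDocs}

// Assumed usage of Lucene 3.3's TopDocs.merge: each shard runs the query
// locally, then a coordinator merges the per-shard hits into a global top N.
// The exact signature is an assumption, not verified against 3.3.
object ShardMerge {
  def mergeShards(shardHits: Array[TopDocs], topN: Int): TopDocs =
    TopDocs.merge(Sort.RELEVANCE, topN, shardHits)
}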

Solr 3.3 Announcement

Solr Features:

  • Grouping / Field Collapsing
  • A new, automaton-based suggest/autocomplete implementation offering an order of magnitude smaller RAM consumption.
  • KStemFilterFactory, an optimized implementation of a less aggressive stemmer for English.
  • Solr defaults to a new, more efficient merge policy (TieredMergePolicy). See http://s.apache.org/merging for more information.
  • Important bugfixes, including extremely high RAM usage in spellchecking.
  • Bugfixes and improvements from Apache Lucene 3.3

June 27, 2011

TinySearchEngine

Filed under: Scala,Search Engines — Patrick Durusau @ 6:36 pm

TinySearchEngine

A search engine written in 30 lines of Scala.

Features:

  • in-memory index
  • norms and IDF calculated online
  • default OR operator between query terms
  • index a document per line from a single file
  • read stopwords from a file

June 24, 2011

How to use Scala and Lucene to create a basic search application

Filed under: Lucene,Scala,Search Engines — Patrick Durusau @ 10:45 am

How to use Scala and Lucene to create a basic search application

From the post:

How to use Scala and Lucene to create a basic search application. One of the powerful benefits of Scala is that it has full access to any Java library, giving you a tremendous number of available resources and technologies. This example doesn’t tap into the full power of Lucene, but highlights how easy it is to incorporate Java libraries into a Scala project.

This example is based off a Twitter analysis app I’ve been noodling on, for which I am utilizing Lucene. The code below takes a list of tweets from a text file and creates an index that you can search and extract info from.

Nice way to become familiar both with Scala and Lucene.
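The flow the post describes (read tweets from a text file, index each line, search) is only a handful of calls. A sketch against the Lucene 3.x API, with the file name and query as placeholders:

import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.document.{Document, Field}
import org.apache.lucene.index.{IndexWriter, IndexWriterConfig}
import org.apache.lucene.queryParser.QueryParser
import org.apache.lucene.search.IndexSearcher
import org.apache.lucene.store.RAMDirectory
import org.apache.lucene.util.Version

// Sketch: index each line of a file as a Lucene document, then search it.
object TweetSearch {
  def main(args: Array[String]): Unit = {
    val analyzer = new StandardAnalyzer(Version.LUCENE_33)
    val dir = new RAMDirectory()
    val writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_33, analyzer))
    for (line <- scala.io.Source.fromFile("tweets.txt").getLines()) {
      val doc = new Document()
      doc.add(new Field("text", line, Field.Store.YES, Field.Index.ANALYZED))
      writer.addDocument(doc)
    }
    writer.close()

    val searcher = new IndexSearcher(dir)
    val query = new QueryParser(Version.LUCENE_33, "text", analyzer).parse("scala")
    for (hit <- searcher.search(query, 10).scoreDocs)
      println(searcher.doc(hit.doc).get("text"))
    searcher.close()
  }
}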

SearchBlox Version 6.4 released

Filed under: Search Engines — Patrick Durusau @ 10:44 am

SearchBlox Version 6.4 released

From the post:

SearchBlox V6.4 is now available. This release has a few new features and some important bug fixes.

  • SearchBlox can now automatically detect text files on the file system and index them irrespective of their file extensions. This has been a long-standing feature request. You will now be able to use SearchBlox to search across repositories of text files such as source code files and log files. To exclude files with specific file extensions from being indexed, the Disallow Filters can be used.
  • The filename of the indexed document is now available as a separate tag in the XML search results
  • HTTPS indexing is now functional in the SearchBlox Server packages
  • Issue with indexing of some MS Office documents is now fixed
  • Foreign characters in search queries using the basic search form now work correctly

Detection of text files without relying on file extensions is a welcome improvement!

June 21, 2011

Heavy Committing: DocValues aka. Column Stride Fields in Lucene 4.0

Filed under: Indexing,Lucene,Search Engines — Patrick Durusau @ 7:10 pm

Heavy Committing: DocValues aka. Column Stride Fields in Lucene 4.0 by Simon Willnauer, Apache Lucene PMC.

Abstract:

Lucene 4.0 is on its way to delivering a tremendous number of new features and improvements. Besides Real-Time Search & Flexible Indexing, DocValues aka. Column Stride Fields is one of the “next generation” features. DocValues enable Lucene to efficiently store and retrieve type-safe Document & Value pairs in a column stride fashion, either entirely memory resident with random access or disk resident and iterator based, without the need to un-invert fields. Its final goal is to provide independently updateable per-document storage for scoring, sorting or even filtering. This talk will introduce the current state of development, implementation details, its features and how DocValues have been integrated into Lucene’s Codec API for full extendability.

Excellent video!
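The “column stride” picture is worth making concrete. A toy illustration (not the Lucene API): one value per document, stored in docID order, so sorting or scoring by the field is a direct array lookup, with no un-inverting of the index.

// Toy picture of column-stride per-document values: a flat column indexed by docID.
object ColumnStride {
  val price: Array[Float] = Array(9.99f, 4.50f, 12.00f) // index = docID

  // Sort doc IDs by the column, as a field-sorted search would.
  def sortedByPrice(docIds: Seq[Int]): Seq[Int] = docIds.sortBy(id => price(id))

  def main(args: Array[String]): Unit =
    println(sortedByPrice(Seq(0, 1, 2))) // List(1, 0, 2)
}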

June 16, 2011

Apache Lucene EuroCon Barcelona

Filed under: Conferences,Lucene,Search Engines — Patrick Durusau @ 3:40 pm

Apache Lucene EuroCon Barcelona

From the webpage:

Apache Lucene EuroCon 2011 is the largest conference for the European Apache Lucene/Solr open source search community. Now in its second year, Apache Lucene Eurocon provides an unparalleled opportunity for European search application developers, thought leaders and market makers to connect and network with their peers and get on board with the technology that’s changing the shape of search: Apache Lucene/Solr.

The conference, taking place in cosmopolitan Barcelona, features a wide range of hands-on technical sessions, spanning the breadth and depth of use cases and technical sessions — plus a complete set of technical training workshops. You will hear from the foremost experts on open source search technology, commiters and developers practiced in the art and science of search. When you’re at Apache Lucene Eurocon, you can…

Even with feel-me-up security measures at the airport, a trip to Barcelona would be worthwhile anytime. Add a Lucene conference to boot, and who could refuse?

Seriously, take advantage of this opportunity to travel this year. Next year, a U.S. presidential election year, will see rumors of security alerts, actual security alerts, FBI-informant-sponsored terror plots and the like, all of which will make travel more difficult.

June 6, 2011

Apache Lucene 3.2 / Solr 3.2

Filed under: Indexing,Lucene,Search Engines,Solr — Patrick Durusau @ 1:54 pm

Apache Lucene 3.2 / Solr 3.2 released!

From the website:

Lucene can be downloaded from http://www.apache.org/dyn/closer.cgi/lucene/java/ and Solr can be downloaded from http://www.apache.org/dyn/closer.cgi/lucene/solr/

Highlights of the Lucene release include:

  • A new grouping module, under lucene/contrib/grouping, enables search results to be grouped by a single-valued indexed field
  • A new IndexUpgrader tool fully converts an old index to the current format.
  • A new Directory implementation, NRTCachingDirectory, caches small segments in RAM, to reduce the I/O load for applications with fast NRT reopen rates.
  • A new Collector implementation, CachingCollector, is able to gather search hits (document IDs and optionally also scores) and then replay them. This is useful for Collectors that require two or more passes to produce results.
  • Index a document block using IndexWriter’s new addDocuments or updateDocuments methods. These experimental APIs ensure that the block of documents will forever remain contiguous in the index, enabling interesting future features like grouping and joins.
  • A new default merge policy, TieredMergePolicy, which is more efficient due to being able to merge non-contiguous segments. See http://s.apache.org/merging for details.
  • NumericField is now returned correctly when you load a stored document (previously you received a normal Field back, with the numeric value converted to a string).
  • Deleted terms are now applied during flushing to the newly flushed segment, which is more efficient than having to later initialize a reader for that segment.

Highlights of the Solr release include:

  • Ability to specify overwrite and commitWithin as request parameters when using the JSON update format.
  • TermQParserPlugin, useful when generating filter queries from terms returned from field faceting or the terms component.
  • DebugComponent now supports using a NamedList to model Explanation objects in its responses instead of Explanation.toString.
  • Improvements to the UIMA and Carrot2 integrations.
  • Highlighting performance improvements.
  • A test-framework jar for easy testing of Solr extensions.
  • Bugfixes and improvements from Apache Lucene 3.2.