Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

April 4, 2011

Lucene Scoring API

Filed under: Indexing,Lucene — Patrick Durusau @ 6:34 pm

Lucene Scoring API

The documentation for the Lucene scoring API makes for very interesting reading.

In more ways than one.

Important reading for anyone who wants to understand how Lucene scores documents, since scoring will influence the usefulness of searches in your particular domain.

But I think it is also important because it emphasizes that the scoring is for documents and not subjects.

Scoring documents is a very useful thing, because it (hopefully) puts the ones most relevant to a search at or near the top of the search results.

But isn’t that similar to the last mile problem with high speed internet delivery?

That is, it is one thing to get high-speed internet service to the local switching office. It is quite another to get it to each home; hence, the last mile problem.

An indexing solution like Lucene can, maybe, get you to the right document for a search, but that leaves you to go the last mile in terms of finding the subject of interest in the document.

And, just as importantly, relating that subject to other information about the same subject.

True enough, I was doing that very thing with print indexes and hard copy long before the arrival of widespread full-text indexes and on-demand versions of texts.

It seems like a terrible waste of time and resources for everyone interested in a particular subject to have to dig information out of documents, and then for that cycle to be repeated every time someone looks up that subject and finds a particular document.

We all keep running the last semantic mile.

The question is what would motivate us to shorten that to, say, the last 1/2 semantic mile, or less?

April 1, 2011

Solr 3.1 (Lucene 3.1) Released!

Filed under: Lucene,Search Engines,Searching,Solr — Patrick Durusau @ 4:10 pm

Solr 3.1 (Lucene 3.1) Released!

Solr 3.1, which contains Lucene 3.1, was released on 31 March 2011.

Some of the new features include:

Quick links:

March 27, 2011

Lucene’s FuzzyQuery is 100 times faster in 4.0 (and a topic map tale)

Filed under: Authoring Topic Maps,Lucene,Topic Maps — Patrick Durusau @ 3:16 pm

Lucene’s FuzzyQuery is 100 times faster in 4.0

I first saw this post mentioned in a tweet by Lars Marius Garshol.

From the post:

There are many exciting improvements in Lucene’s eventual 4.0 (trunk) release, but the awesome speedup to FuzzyQuery really stands out, not only from its incredible gains but also because of the amazing behind-the-scenes story of how it all came to be.

FuzzyQuery matches terms “close” to a specified base term: you specify an allowed maximum edit distance, and any terms within that edit distance from the base term (and, then, the docs containing those terms) are matched.

The QueryParser syntax is term~ or term~N, where N is the maximum allowed number of edits (for older releases N was a confusing float between 0.0 and 1.0, which translates to an equivalent max edit distance through a tricky formula).

FuzzyQuery is great for matching proper names: I can search for mcandless~1 and it will match mccandless (insert c), mcandles (remove s), mkandless (replace c with k) and a great many other “close” terms. With max edit distance 2 you can have up to 2 insertions, deletions or substitutions. The score for each match is based on the edit distance of that term; so an exact match is scored highest; edit distance 1, lower; etc.

Prior to 4.0, FuzzyQuery took the simple yet horribly costly brute force approach: it visits every single unique term in the index, computes the edit distance for it, and accepts the term (and its documents) if the edit distance is low enough.
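To make the syntax above concrete, here is a minimal sketch of running a fuzzy query through Lucene's QueryParser against a pre-4.0 index (the field name, index path and similarity value are assumptions for illustration, not anything from the post):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class FuzzyDemo {
        public static void main(String[] args) throws Exception {
            // Open an existing index; the path is hypothetical.
            IndexSearcher searcher = new IndexSearcher(
                IndexReader.open(FSDirectory.open(new File("/path/to/index"))));

            // "author" is an assumed field name. The trailing ~ asks for a fuzzy
            // match; in 3.x the optional number is a similarity between 0.0 and 1.0,
            // while in 4.0 it becomes a maximum edit distance (e.g. mcandless~2).
            QueryParser parser = new QueryParser(Version.LUCENE_30, "author",
                new StandardAnalyzer(Version.LUCENE_30));
            Query query = parser.parse("mcandless~0.8");

            // Closer terms (smaller edit distance) score higher.
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println("doc " + hit.doc + " scored " + hit.score);
            }
            searcher.close();
        }
    }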

The story is a good one and demonstrates the need for topic maps in computer science.

The authors used “Googling” to find an implementation by Jean-Philippe Barrette-LaPierre of an algorithm in a paper by Klaus Schulz and Stoyan Mihov that enabled this increase in performance.

That’s one way to do it, but it leaves it hit or miss whether other researchers will find the same implementation.

Moreover, once that connection has been made, associating the implementation with the algorithm/paper, it should be preserved for subsequent searchers.

As well as pointing to the implementation of this algorithm in Lucene, to other implementations, or even to other accounts by the same authors, such as their 2004 publication in Computational Linguistics, Fast Approximate Search in Large Dictionaries.

Sounds like a topic map to me. The question is: how do we make ad hoc authoring of a topic map practical?

Suggestions?

March 13, 2011

Maven-Lucene-Plugin

Filed under: Lucene — Patrick Durusau @ 4:22 pm

Maven-Lucene-Plugin

From the website:

This project is a maven plugin for Apache Lucene. Using it, a Lucene index (configuration inside a xml file) can be created from different datasources (file/database/xml etc.). A Searcher Util helps in searching the index. Use Lucene without coding.

New project that is looking for volunteers.

Looks like a good way to learn more about Lucene while possibly making a contribution to the community.

March 7, 2011

SearchBlox

Filed under: Lucene — Patrick Durusau @ 7:13 am

SearchBlox

From the website:

SearchBlox is an out-of-the-box Enterprise Search Solution built on top of Apache Lucene. It is fast to deploy, easy to manage and available for both on-premise and cloud deployment.

Best of all, it is free. No limitations. No restrictions.

Do note that downloading SearchBlox did not require registration or surrender of my phone number.

Usual questions: ease of use, ease of integration across deployments, etc.

I am gathering up a rather large data set for other purposes that I will be using to test this and other search software.

Lucid Imagination

Filed under: Lucene,Solr — Patrick Durusau @ 7:12 am

Lucid Imagination

Enterprise level search capabilities based entirely upon Apache Solr/Lucene software.

I had to register to download the installer. Will be installing it later this week or next. Expect posts on how that goes.

It was annoying that I had to provide my phone number as part of the registration.

It isn’t like I will forget how to contact them should I encounter a need for their services.

Nor am I likely to be receptive to pesky calls/emails asking: have you tried our software yet?

I actually have a life separate and apart from the various software packages that I use/evaluate, so I tend to work on my own schedule.

Two aspects of interest:

First, simply using this as an appliance for indexing/searching in the usual way.

Second, how difficult would it be to leverage that indexing/searching across Lucid installations?

That is, two separate and distinct enterprises have used Lucid to index/search mission-critical materials that now require merging.

Do we toss the time and experience that went into the separate indexes and build anew? Or can we leverage that investment?

NHibernate Search Tutorial with Lucene.Net and NHibernate 3.0

Filed under: .Net,Hibernate,Lucene,NHibernate — Patrick Durusau @ 7:11 am

NHibernate Search Tutorial with Lucene.Net and NHibernate 3.0

From the website:

Here’s another quickstart tutorial on NHibernate Search for NHibernate 3.0 using Lucene.Net. We’re going to be using Fluent NHibernate for NHibernate but attributes for NHibernate Search.

Uses NHibernate:

NHibernate is a mature, open source object-relational mapper for the .NET framework. It’s actively developed, fully featured and used in thousands of successful projects.

For those of you who are more comfortable in a .Net environment.

March 4, 2011

ApacheCon NA 2011

Filed under: Cassandra,Cloud Computing,Conferences,CouchDB,HBase,Lucene,Mahout,Solr — Patrick Durusau @ 7:17 am

ApacheCon NA 2011

Proposals: Be sure to submit your proposal no later than Friday, 29 April 2011 at midnight Pacific Time.

7-11 November 2011 Vancouver

From the website:

This year’s conference theme is “Open Source Enterprise Solutions, Cloud Computing, and Community Leadership”, featuring dozens of highly-relevant technical, business, and community-focused sessions aimed at beginner, intermediate, and expert audiences that demonstrate specific professional problems and real-world solutions that focus on “Apache and …”:

  • … Enterprise Solutions (from ActiveMQ to Axis2 to ServiceMix, OFBiz to Chemistry, the gang’s all here!)
  • … Cloud Computing (Hadoop, Cassandra, HBase, CouchDB, and friends)
  • … Emerging Technologies + Innovation (Incubating projects such as Libcloud, Stonehenge, and Wookie)
  • … Community Leadership (mentoring and meritocracy, GSoC and related initiatives)
  • … Data Handling, Search + Analytics (Lucene, Solr, Mahout, OODT, Hive and friends)
  • … Pervasive Computing (Felix/OSGi, Tomcat, MyFaces Trinidad, and friends)
  • … Servers, Infrastructure + Tools (HTTP Server, SpamAssassin, Geronimo, Sling, Wicket and friends)

February 22, 2011

Luke

Filed under: Hadoop,Lucene,Maps,Marketing,Search Engines — Patrick Durusau @ 1:34 pm

Luke

From the website:

Lucene is an Open Source, mature and high-performance Java search engine. It is highly flexible, and scalable from hundreds to millions of documents.

Luke is a handy development and diagnostic tool, which accesses already existing Lucene indexes and allows you to display and modify their content in several ways:

  • browse by document number, or by term
  • view documents / copy to clipboard
  • retrieve a ranked list of most frequent terms
  • execute a search, and browse the results
  • analyze search results
  • selectively delete documents from the index
  • reconstruct the original document fields, edit them and re-insert to the index
  • optimize indexes
  • open indexes consisting of multiple parts, and located on Hadoop filesystem
  • and much more…

Searching is interesting and I have several more search engines to report this week, but the real payoff is finding.

And recording the finding so that other users can benefit from it.

We could all develop our own maps of the London Underground, at the expense of repeating the effort of others.

Or, we could each purchase a copy of the London Underground map.

Which one seems more cost effective for your organization?

elasticsearch

Filed under: Lucene,NoSQL,Search Engines — Patrick Durusau @ 1:18 pm

elasticsearch

From the website:

So, we build a web site or an application and want to add search to it, and then it hits us: getting search working is hard. We want our search solution to be fast, we want a painless setup and a completely free search schema, we want to be able to index data simply using JSON over HTTP, we want our search server to be always available, we want to be able to start with one machine and scale to hundreds, we want real-time search, we want simple multi-tenancy, and we want a solution that is built for the cloud.

“This should be easier”, we declared, “and cool, bonsai cool”.

elasticsearch aims to solve all these problems and more. It is an Open Source (Apache 2), Distributed, RESTful, Search Engine built on top of Lucene.
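As a rough illustration of the “JSON over HTTP” part of that pitch, here is a minimal sketch that PUTs one document into an index (the index name, type, id and document are invented; it assumes a node listening on the default port 9200):

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class IndexOneDoc {
        public static void main(String[] args) throws Exception {
            // PUT a JSON document into a hypothetical "blog" index as post id 1.
            URL url = new URL("http://localhost:9200/blog/post/1");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            String json = "{\"title\":\"Lucene scoring\",\"body\":\"documents, not subjects\"}";
            OutputStream out = conn.getOutputStream();
            out.write(json.getBytes("UTF-8"));
            out.close();
            // A 200/201 response means the document was indexed and is searchable.
            System.out.println("HTTP " + conn.getResponseCode());
        }
    }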

Another contender in the space for search engines.

Do you have a favorite search engine? If so, what about it makes it your favorite?

February 21, 2011

TF-IDF Weight Vectors With Lucene And Mahout

Filed under: Authoring Topic Maps,Lucene,Mahout — Patrick Durusau @ 6:43 am

How To Easily Build And Observe TF-IDF Weight Vectors With Lucene And Mahout

From the website:

You have a collection of text documents, and you want to build their TF-IDF weight vectors, probably before doing some clustering on the collection or other related tasks.

You would like to be able for instance to see what are the tokens with the biggest TF-IDF weights in any given document of the collection.

Lucene and Mahout can help you to do that almost in a snap.

Why is this important for topic maps?

Wikipedia reports:

The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document’s relevance given a user query. (http://en.wikipedia.org/wiki/Tf-idf, cited in this posting)
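In symbols, a common form of the weight for a term t in a document d, over a corpus of N documents, is:

    \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log \frac{N}{\mathrm{df}(t)}

where tf(t, d) is how often t occurs in d and df(t) is the number of documents that contain t. (Wikipedia lists several variants; this is only the most familiar one.)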

Knowing the important terms in a document collection is one step towards a useful topic map. It may not be definitive, but it is a step in the right direction.
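For a sense of what is involved under the hood, here is a minimal, Lucene-only sketch that prints a crude tf-idf weight for every term in one document (the field name, index path and exact idf formula are assumptions; the post linked above uses Mahout's vector utilities rather than doing this by hand, and this sketch requires the field to have been indexed with term vectors):

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermFreqVector;
    import org.apache.lucene.store.FSDirectory;

    public class TfIdfPeek {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(
                FSDirectory.open(new File("/path/to/index")));
            int docId = 0;  // the document to inspect
            TermFreqVector tfv = reader.getTermFreqVector(docId, "contents");
            String[] terms = tfv.getTerms();
            int[] freqs = tfv.getTermFrequencies();
            int numDocs = reader.numDocs();
            for (int i = 0; i < terms.length; i++) {
                // df = number of documents containing the term
                int df = reader.docFreq(new Term("contents", terms[i]));
                double tfidf = freqs[i] * Math.log((double) numDocs / (1 + df));
                System.out.println(terms[i] + "\t" + tfidf);
            }
            reader.close();
        }
    }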

February 19, 2011

Open Search With Lucene & Solr

Filed under: Lucene,Solr — Patrick Durusau @ 4:13 pm

Open Search With Lucene & Solr

Nothing new but a nice overview of Lucene and Lucene based search applications.

For authoring and maintaining topic maps a good knowledge of search engines is essential.

February 17, 2011

HBase and Lucene for realtime search

Filed under: HBase,Lucene — Patrick Durusau @ 6:48 am

HBase and Lucene for realtime search

From the post that starts this exciting thread:

I’m curious as to what a ‘good’ approach would be for implementing search in HBase (using Lucene) with the end goal being the integration of realtime search into HBase. I think the use case makes sense as HBase is realtime and has a write-ahead log, performs automatic partitioning, splitting of data, failover, redundancy, etc. These are all things Lucene does not have out of the box, that we’d essentially get for ‘free’.

For starters: Where would be the right place to store Lucene segments or postings? Eg, we need to be able to efficiently perform a linear iteration of the per-term posting list(s).

Thanks!

Jason Rutherglen

This could definitely have legs for exploring data sets, authoring topic maps or, assuming a dynamic synonyms table composed of conditions for synonymy, even acting as a topic map engine.
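To make the storage question concrete, here is one naive way a posting could be laid out in HBase rows, purely as a sketch (the table name, column family and key scheme are my own invention, not a design from the thread; a real design has to worry about segment files, posting-list order and scan efficiency):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class NaivePosting {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Row key = term, one column per document id, value = term frequency,
            // so a single Get on the term's row returns its whole posting list.
            HTable postings = new HTable(conf, "postings");
            Put put = new Put(Bytes.toBytes("lucene"));
            put.add(Bytes.toBytes("p"), Bytes.toBytes("doc42"), Bytes.toBytes(3L));
            postings.put(put);
            postings.close();
        }
    }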

Will keep a close eye on this activity.

February 13, 2011

Apache Lucene 3.0 Tutorial

Filed under: Authoring Topic Maps,Indexing,Lucene — Patrick Durusau @ 1:34 pm

Apache Lucene 3.0 Tutorial by Bob Carpenter.

At 20 pages it isn’t your typical “Hello World” introduction. 😉

It should be the first document you hand a semi-technical person about Lucene.

Discovering the vocabulary of the documents/domain for which you are building a topic map is a critical first step.

Indexing documents gives you an important check on the accuracy and completeness of the information you are given by domain “experts” and users.

There will be terms that are so transparent to them that they can only be clarified if you ask.

January 6, 2011

Lucene and Solr: 2010 in Review – Post

Filed under: Lucene,Search Engines,Solr — Patrick Durusau @ 2:55 pm

Lucene and Solr: 2010 in Review

Great highlights of a busy and productive year for both Lucene and Solr.

December 11, 2010

Sensei

Filed under: Indexing,Lucene,NoSQL — Patrick Durusau @ 3:35 pm

Sensei

From the website:

Sensei is a distributed database that is designed to handle the following type of query:


SELECT f1,f2…fn FROM members
WHERE c1 AND c2 AND c3.. GROUP BY fx,fy,fz…
ORDER BY fa,fb…
LIMIT offset,count

Relies on Zoie, and hence Lucene, for indexing.

Another point of comparison for the development of TMQL, which of course will need to address semantic sameness.

December 7, 2010

Bobo: Fast Faceted Search With Lucene

Filed under: Facets,Information Retrieval,Lucene,Navigation,Subject Identity — Patrick Durusau @ 8:52 pm

Bobo: Fast Faceted Search With Lucene

From the website:

Bobo is a Faceted Search implementation written purely in Java, an extension of Apache Lucene.

While Lucene is good with unstructured data, Bobo fills in the missing piece to handle semi-structured and structured data.

Bobo Browse is an information retrieval technology that provides navigational browsing into a semi-structured dataset. Beyond the result set from queries and selections, Bobo Browse also provides the facets from this point of browsing.

Features:

  • No need for cache warm-up for the system to perform
  • multi value sort – sort documents on fields that have multiple values per doc, e.g. tokenized fields
  • fast field value retrieval – over 30x faster than IndexReader.document(int docid)
  • facet count distribution analysis
  • stable and small memory footprint
  • support for runtime faceting
  • result merge library for distributed facet search

I had to go look up the definition of facet. Merriam-Webster (I remember when it was just Webster) says:

any of the definable aspects that make up a subject (as of contemplation) or an object (as of consideration)

So a faceted search could search/browse, in theory at any rate, based on any property of a subject, even those I don’t recognize.

Different languages being the easiest example.

I could have aspects of a hotel room described in both German and Korean, both describing the same facets of the room.

Questions:

  1. How would you choose the facets for a subject to be included in faceted browsing? (3-5 pages, no citations)
  2. How would you design and test the presentation of facets to users? (3-5 pages, no citations)
  3. Compare the current TMQL proposal (post-Barta) with the query language for facet searching. If a topic map were treated (post-merging) as faceted subjects, which one would you prefer and why? (3-5 pages, no citations)

December 4, 2010

Zoie: Real-time search indexing

Filed under: Full-Text Search,Indexing,Lucene,Search Engines,Software — Patrick Durusau @ 10:04 am

Zoie: Real-time search indexing

Somehow appropriate that following the lead on Kafka would lead me to Zoie (and other goodies to be reported).

From the website:

Zoie is a real-time search and indexing system built on Apache Lucene.

Donated by LinkedIn.com on July 19, 2008, and has been deployed in a real-time large-scale consumer website: LinkedIn.com handling millions of searches as well as hundreds of thousands of updates daily.

News: Zoie 2.0.0 is released … – Compatible with Lucene 2.9.x.

In a real-time search/indexing system, a document is made available as soon as it is added to the index. This functionality is especially important to time-sensitive information such as news, job openings, tweets etc.

Design Goals:

  • Additions of documents must be made available to searchers immediately
  • Indexing must not affect search performance
  • Additions of documents must not fragment the index (which hurts search performance)
  • Deletes and/or updates of documents must not affect search performance.

In topic map terms:

  • Additions to topic map must be made available to searchers immediately
  • Indexing must not affect search performance
  • Additions to topic map must not fragment the index (which hurts search performance)
  • Deletes and/or updates of a topic map must not affect search performance.

I would say that #’s 3 and 4 are research questions at this point.

Additions, updates and deletions in a topic map may have unforeseen (unforeseeable?) consequences.

Such as causing:

  • merging to occur
  • merging to be undone
  • roles to be played
  • roles to not be played
  • associations to be valid
  • associations to be invalid

to name only a few.

It may be possible to formally prove the impact that certain events will have, but I am not aware of any definitive analysis on the subject.

Suggestions?

November 29, 2010

Lucene / Solr for Academia: PhD Thesis Ideas
(Disambiguated Indexes/Indexing)

Filed under: Lucene,Software,Solr — Patrick Durusau @ 5:57 am

Lucene / Solr for Academia: PhD Thesis Ideas

Excellent opportunity to make suggestions that could result not only in more academic work but also in advancement of useful open source software.

My big idea (don’t mind if you borrow/steal it for implementation):

We all know how traditional indexes work. They gather up single tokens and then point back to the locations in documents where they are found.

So they can’t distinguish among differing uses of the same string. One aspect of the original indexing problem that led to topic maps.

What if indexers could be given configuration files that said: when indexing http://www.medicalsite.org, create a tuple for indexing that includes site=www.medicalsite.org, type=medical, etc., and index the entire tuple as a single entry.

And enable indexers to index by members of the tuples so that if I decide that all uses of a term of type=medical mean the same subject, I can produce an index that represents that choice.
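A minimal sketch of what that might look like with plain Lucene follows (the field names, the tuple encoding and the example values are all invented for illustration; a real implementation would presumably read them from the configuration file described above):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class TupleIndexSketch {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(new RAMDirectory(),
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);

            Document doc = new Document();
            // Ordinary full-text field, indexed as usual.
            doc.add(new Field("text", "... cold is a common viral infection ...",
                Field.Store.NO, Field.Index.ANALYZED));
            // The whole tuple indexed as a single, untokenized entry ...
            doc.add(new Field("tuple", "term=cold|site=www.medicalsite.org|type=medical",
                Field.Store.YES, Field.Index.NOT_ANALYZED));
            // ... plus each tuple member on its own, so that a later decision that
            // all terms of type=medical name the same subject can be reflected
            // in an index built over these fields.
            doc.add(new Field("type", "medical", Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("site", "www.medicalsite.org", Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
            writer.close();
        }
    }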

Sounds a lot like merging doesn’t it?

I don’t know of any index that does what I just described, but I don’t know all indexes, so if I have overlooked something, please sing out.

If successful, it would be an entirely different way of authoring topic maps against large information stores.

Not to mention creating the opportunity to monetize indexes as separate from the information resources themselves. The Readers’ Guide to Periodical Literature is a successful example of that approach as a product.

Hmmm, needs a name, how about: Disambiguated Indexes/Indexing?

Suggestions?

November 25, 2010

Sig.ma – Live views on the Web of Data

Filed under: Indexing,Information Retrieval,Lucene,Mapping,RDF,Search Engines,Semantic Web — Patrick Durusau @ 10:27 am

Sig.ma – Live views on the Web of Data

From the website:

In Sig.ma, elements such as large scale semantic web indexing, logic reasoning, data aggregation heuristics, pragmatic ontology alignments and, last but not least, user interaction and refinement, all play together to provide entity descriptions which become live, embeddable data mash ups.

Read one of the various versions of the article on Sig.ma for the technical details.

From the Web Technologies article cited on the homepage:

Sig.ma revolves around the creation of Entity Profiles. An entity profile – which in the Sig.ma dataflow is represented by the “data cache” storage (Fig. 3) – is a summary of an entity that is presented to the user in a visual interface, or which can be returned by the API as a rich JSON object or a RDF document. Entity profiles usually include information that is aggregated from more than one source. The basic structure of an entity profile is a set of key-value pairs that describe the entity. Entity profiles often refer to other entities, for example the profile of a person might refer to their publications.

No, this isn’t an implementation of the TMRM.

This is an implementation of one way to view entities for a particular type of data. A very exciting one but still limited to a particular data set.

This is a big step forward.

For example, it isn’t hard to imagine entity profiles against particular websites or data sets. Entity profiles that are maintained and leased for use with search engines like Sig.ma.

Or going a bit further and declaring a basis for identification of subjects, such as the existence of properties a…n in an RDF graph.

Questions:

  1. Spend a couple of hours with Sig.ma researching library related questions. (Discussion)
  2. What did you like, dislike or find surprising about Sig.ma? (3-5 pages, no citations)
  3. Entity profiles for library science (Class project)

Sig.ma: Live Views on the Web of Data – bibliography issues

I normally start with a DOI here so you can see the article in question.

Not here.

Here’s why:

Sig.ma: Live views on the Web of Data, Journal of Web Semantics (10 pages)

Sig.ma: Live Views on the Web of Data, WWW ’10 Proceedings (demo, 4 pages)

Sig.ma: Live Views on the Web of Data (8 pages) http://richard.cyganiak.de/2008/papers/sigma-semwebchallenge2009.pdf

Sig.ma: Live Views on the Web of Data (4 pages) http://richard.cyganiak.de/2008/papers/sigma-demo-www2010.pdf

Sig.ma: Live Views on the Web of Data (25 pages) http://fooshed.net/paper/JWS2010.pdf

Before saying anything ugly, ;-), this is some of the most exciting research I have seen in a long time. I will cover that part of it in a following post. But, to the matter at hand, bibliographic control.

Five (5) different articles, two published in recognized journals, that all have the same name? (The demo articles are the same but have different headers/footers and page numbers, and so would likely be indexed as different articles.)

I will be able to resolve any confusion by obtaining the article in question.

But that isn’t an excuse.

I, along with everyone else interested in this research, will waste a small part of our time resolving the confusion. Confusion that could have been avoided for everyone.

Not unlike everyone who does the same search having to tread through the same Google glut.

With no way to pass on what we have resolved, for the benefit of others.

Questions:

  1. Help these authors out. How would you suggest they avoid this in the future? Use of the name is important. (3-5 pages, no citations)
  2. Help the library out. How will you deal with multiple papers with the same title, authors, pub year? (this isn’t uncommon) (3-5 pages, citations optional)
  3. How would you use topic maps to resolve this issue? (3-5 pages, no citations)

A Node Indexing Scheme for Web Entity Retrieval

Filed under: Entity Resolution,Full-Text Search,Indexing,Lucene,RDF,Topic Maps — Patrick Durusau @ 6:15 am

A Node Indexing Scheme for Web Entity Retrieval

Author(s): Renaud Delbru, Nickolai Toupikov, Michele Catasta, Giovanni Tummarello

Keywords: entity, entity search, full-text search, semi-structured queries, top-k query, node indexing, incremental index updates, entity retrieval system, RDF, RDFa, Microformats

Abstract:

Now motivated also by the partial support of major search engines, hundreds of millions of documents are being published on the web embedding semi-structured data in RDF, RDFa and Microformats. This scenario calls for novel information search systems which provide effective means of retrieving relevant semi-structured information. In this paper, we present an “entity retrieval system” designed to provide entity search capabilities over datasets as large as the entire Web of Data. Our system supports full-text search, semi-structural queries and top-k query results while exhibiting a concise index and efficient incremental updates. We advocate the use of a node indexing scheme and show that it offers a good compromise between query expressiveness, query processing time and update complexity in comparison to three other indexing techniques. We then demonstrate how such system can effectively answer queries over 10 billion triples on a single commodity machine.

Consider the requirements for this project:

  1. Support for the multiple formats which are used on the Web of Data;
  2. Support for searching an entity description given its characteristics (entity centric search);
  3. Support for context (provenance) of information: entity descriptions are given in the context of a website or a dataset;
  4. Support for semi-structural queries with full-text search, top-k query results, scalability over shard clusters of commodity machines, efficient caching strategy and incremental index maintenance. (emphasis added)

SIREn { Semantic Information Retrieval Engine }

Definitely a package to download, install and start to evaluate. More comments forthcoming.

Questions (more for topic map researchers):

  1. To what extent can “entity description” = properties of topics, associations, occurrences?
  2. Can XTM, etc., be regarded as “microformats” for the purposes of SIREn?
  3. To what extent does SIREn meet or exceed query requirements for XTM/TMDM based topic maps?
  4. Reports on use of SIREn by topic mappers?

