Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

June 21, 2013

MG4J: Managing Gigabytes for Java™

Filed under: Indexing,MG4J,Search Engines — Patrick Durusau @ 4:43 pm

MG4J: Managing Gigabytes for Java™

From the webpage:

Release 5.0 has several source and binary incompatibilities, and introduces quasi-succinct indices[broken link]. Benchmarks on the performance of quasi-succinct indices can be found here; for instance, this table shows the number of seconds to answer 1000 multi-term queries on a document collection of 130 million web pages:


              MG4J    MG4J*   Lucene 3.6.2
Terms         70.9    132.1   130.6
And           27.5     36.7   108.8
Phrase        78.2      –     127.2
Proximity    106.5      –     347.6

Both engines were set to just enumerate the results without scoring. The column labelled MG4J* gives the timings of an artificially modified version in which counts for each retrieved document have been read (MG4J now stores document pointers and counts in separate files, but Lucene interleaves them, so it has to read counts compulsorily). Proximity queries are conjunctive queries that must be satisfied within a window of 16 words. The row labelled “Terms” gives the timings for enumerating the posting lists of all terms appearing in the queries.

I tried the link for “quasi-succinct indices” and it consistently returns a 404.

In lieu of that reference, see: Quasi-Succinct Indices by Sebastiano Vigna.

Abstract:

Compressed inverted indices in use today are based on the idea of gap compression: document pointers are stored in increasing order, and the gaps between successive document pointers are stored using suitable codes which represent smaller gaps using less bits. Additional data such as counts and positions is stored using similar techniques. A large body of research has been built in the last 30 years around gap compression, including theoretical modeling of the gap distribution, specialized instantaneous codes suitable for gap encoding, and ad hoc document reorderings which increase the efficiency of instantaneous codes. This paper proposes to represent an index using a different architecture based on quasi-succinct representation of monotone sequences. We show that, besides being theoretically elegant and simple, the new index provides expected constant-time operations and, in practice, significant performance improvements on conjunctive, phrasal and proximity queries.

Heavy sledding, but with benchmark results like those shown above, well worth the time to master.
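If you want a feel for what “quasi-succinct” means, the core trick is the Elias-Fano representation of a monotone sequence, here a posting list of increasing document ids. A toy Python sketch, for intuition only and certainly not MG4J’s implementation:

```python
import math

def ef_encode(xs, u):
    """Elias-Fano encoding of a monotone sequence xs bounded by u."""
    n = len(xs)
    l = max(0, int(math.floor(math.log2(u / n)))) if n and u > n else 0
    low = [x & ((1 << l) - 1) for x in xs]       # l explicit low bits per element
    high, prev = [], 0
    for x in xs:                                 # high bits as unary-coded gaps
        h = x >> l
        high.extend([0] * (h - prev) + [1])
        prev = h
    return l, low, high

def ef_decode(l, low, high):
    xs, h, it = [], 0, iter(low)
    for bit in high:
        if bit:
            xs.append((h << l) | next(it))
        else:
            h += 1
    return xs

postings = [3, 4, 7, 13, 14, 15, 21, 43]         # toy posting list of doc ids
l, low, high = ef_encode(postings, u=50)
assert ef_decode(l, low, high) == postings
```

Storing the low bits verbatim and the high bits as unary-coded gaps is what makes the expected constant-time operations the abstract mentions possible; the paper builds the index machinery on top of exactly this kind of representation.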

June 18, 2013

Lux (0.9 – New Release)

Filed under: Indexing,Lux,XML,XQuery — Patrick Durusau @ 12:46 pm

Lux – The XML Search Engine

From the webpage:

Lux is an open source XML search engine formed by fusing two excellent technologies: the Apache Lucene/Solr search index and the Saxon XQuery/XSLT processor.

Release notes for 0.9 (released today)

This looks quite promising!

June 15, 2013

Indexing web sites in Solr with Python

Filed under: Indexing,Python,Solr — Patrick Durusau @ 3:44 pm

Indexing web sites in Solr with Python by Martijn Koster.

From the post:

In this post I will show a simple yet effective way of indexing web sites into a Solr index, using Scrapy and Python.

We see a lot of advanced Solr-based applications, with sophisticated custom data pipelines that combine data from multiple sources, or that have large scale requirements. Equally we often see people who want to start implementing search in a minimally-invasive way, using existing websites as integration points rather than implementing a deep integration with particular CMSes or databases which may be maintained by other groups in an organisation. While crawling websites sounds fairly basic, you soon find that there are gotchas, with the mechanics of crawling, but more importantly, with the structure of websites. If you simply parse the HTML and index the text, you will index a lot of text that is not actually relevant to the page: navigation sections, headers and footers, ads, links to related pages. Trying to clean that up afterwards is often not effective; you’re much better off preventing that cruft going into the index in the first place. That involves parsing the content of the web page, and extracting information intelligently. And there’s a great tool for doing this: Scrapy. In this post I will give a simple example of its use. See Scrapy’s tutorial for an introduction and further information.

Good practice with Solr, not to mention your search activities are yours to keep private if you like. 😉
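For the impatient, a bare-bones version of the approach Martijn describes might look something like this. A sketch only, not his code: the Solr core name, the site URL and the CSS selector that isolates the “real” content are all assumptions you would adjust:

```python
import requests
import scrapy

SOLR_UPDATE = "http://localhost:8983/solr/mysite/update?commit=true"  # assumed core name

class SiteSpider(scrapy.Spider):
    name = "site"
    start_urls = ["http://www.example.com/"]      # assumed site

    def parse(self, response):
        # index only the main content block, not navigation/header/footer cruft
        doc = {
            "id": response.url,
            "title": response.css("title::text").get(),
            "text": " ".join(response.css("div.content ::text").getall()),  # assumed selector
        }
        requests.post(SOLR_UPDATE, json=[doc])     # blocking call, fine for a sketch
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Run it with scrapy runspider spider.py. The point Martijn is making is that the selector decides what reaches the index, which is where the cruft gets kept out.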

June 11, 2013

Orthogonal Range Searching for Text Indexing

Filed under: Indexing,Text Mining — Patrick Durusau @ 10:32 am

Orthogonal Range Searching for Text Indexing by Moshe Lewenstein.

Abstract:

Text indexing, the problem in which one desires to preprocess a (usually large) text for future (shorter) queries, has been researched ever since the suffix tree was invented in the early 70’s. With textual data continuing to increase and with changes in the way it is accessed, new data structures and new algorithmic methods are continuously required. Therefore, text indexing is of utmost importance and is a very active research domain.

Orthogonal range searching, classically associated with the computational geometry community, is one of the tools that has increasingly become important for various text indexing applications. Initially, in the mid 90’s there were a couple of results recognizing this connection. In the last few years we have seen an increase in use of this method and are reaching a deeper understanding of the range searching uses for text indexing.

From the paper:

Orthogonal range searching refers to the preprocessing of a collection of points in d-dimensional space to allow queries on ranges defined by rectangles whose sides are aligned with the coordinate axes (orthogonal).

If you are not already familiar with this area, you may find Lecture 11: Orthogonal Range Searching useful.
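If the geometry vocabulary is new, the query itself is easy to picture: report every point that falls inside an axis-aligned rectangle. A brute-force sketch (real implementations use kd-trees, range trees or wavelet trees to avoid touching every point):

```python
points = [(2, 3), (5, 1), (4, 7), (9, 6), (7, 2)]   # toy 2-D point set

def range_query(points, x1, x2, y1, y2):
    """Report all points inside the axis-aligned rectangle [x1,x2] x [y1,y2]."""
    return [(x, y) for (x, y) in points if x1 <= x <= x2 and y1 <= y <= y2]

print(range_query(points, 3, 8, 1, 6))              # [(5, 1), (7, 2)]
```

The text-indexing applications in the paper map questions like “occurrences of a pattern restricted to a range of documents or positions” onto exactly this kind of rectangle query.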

In a very real sense, indexing, as in a human indexer, lies at the heart of topic maps.

A human indexer recognizes synonyms and the relationships represented by synonyms, and distinguishes other uses of identifiers.

Topic maps are an effort to record that process so it can be followed mechanically by a calculator.

Mechanical indexing is a powerful tool in the hands of a human indexer, whether working on a traditional index or its successor, a topic map.

What type of mechanical indexing are you using?

June 10, 2013

Advanced Suggest-As-You-Type With Solr

Filed under: Indexing,Searching,Solr — Patrick Durusau @ 10:10 am

Advanced Suggest-As-You-Type With Solr by John Berryman.

From the post:

In my previous post, I talked about implementing Search-As-You-Type using Solr. In this post I’ll cover a closely related functionality called Suggest-As-You-Type.

Here’s the use case: A user comes to your search-driven website to find something. And it is your goal to be as helpful as possible. Part of this is by making term suggestions as they type. When you make these suggestions, it is critical to make sure that your suggestion leads to search results. If you make a suggestion of a word just because it is somewhere in your index, but it is inconsistent with the other terms that the user has typed, then the user is going to get a results page full of white space and you’re going to get another dissatisfied customer!

A lot of search teams jump at the Solr suggester component because, after all, this is what it was built for. However I haven’t found a way to configure the suggester so that it suggests only completions that correspond to search results. Rather, it is based upon a dictionary lookup that is agnostic of what the user is currently searching for. (Please someone tell me if I’m wrong!) In any case, getting the suggester working takes a bit of configuration. — Why not use a solution that is based upon the normal, out-of-the-box Solr setup. Here’s how:
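John’s post goes on to supply the details. Just to make the general idea concrete, here is a rough sketch of a facet.prefix-based suggester against a stock Solr instance. It is not John’s exact recipe, and the core name and field name are assumptions:

```python
import requests

SOLR = "http://localhost:8983/solr/products/select"   # assumed core name

def suggest(typed):
    """Suggest completions for the last, partial term, restricted to documents
    that already match the completed terms the user has typed."""
    *done, partial = typed.lower().split()
    params = {
        "q": " ".join(done) if done else "*:*",
        "rows": 0,
        "facet": "true",
        "facet.field": "text",        # assumed field holding the indexed terms
        "facet.prefix": partial,
        "facet.limit": 10,
        "facet.mincount": 1,
        "wt": "json",
    }
    counts = requests.get(SOLR, params=params).json()["facet_counts"]["facet_fields"]["text"]
    return counts[::2]                # Solr returns [term, count, term, count, ...]

print(suggest("red wi"))
```

Because the facet counts are computed only over documents matching q, every suggestion is guaranteed to lead to at least one hit, which is the behavior John is after.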

Topic map authoring is what jumps to my mind as a use case for suggest-as-you-type. Particularly if you use fields for particular types of topics, making the suggestions more focused and manageable.

Good for search as well, for the same reasons.

John offers several cautions near the end of his post, but the final one is quite amusing:

Inappropriate Content: Be very cautious about content of the fields being used for suggestions. For instance, if the content has misspellings, so will the suggestions. And don’t include user comments unless you want to endorse their opinions and choice of language as your search suggestions!

I don’t think of auto-suggestions as “endorsing” anything. They are purely mechanical assistance for the user.

If some term or opinion offends a user, they don’t have to choose to follow it.

At least in my view, technology should not be used to build intellectual tombs for users, tombs that protect them from thoughts or expressions different from their own.

June 9, 2013

Build Your Own Lucene Codec!

Filed under: Indexing,Lucene — Patrick Durusau @ 3:09 pm

Build Your Own Lucene Codec! by Doug Turnbull.

From the post:

I’ve been having a lot of fun hacking on a Lucene Codec lately. My hope is to create a Lucene storage layer based on FoundationDB – a new distributed and transactional key-value store. It’s a fun opportunity to learn about both FoundationDB and low-level Lucene details.

But before we get into all that fun technical stuff, there’s some work we need to do. Our goal is going to be to get MyFirstCodec to work! Here’s the source code:

(…)

From the Lucene 4.1 documentation: Codec, a class in org.apache.lucene.codecs, encodes/decodes an inverted index segment.

How good do you want to be with your tools?

May 24, 2013

How Does A Search Engine Work?…

Filed under: Indexing,Lucene,Search Engines — Patrick Durusau @ 5:59 pm

How Does A Search Engine Work? An Educational Trek Through A Lucene Postings Format by Doug Turnbull.

From the post:

A new feature of Lucene 4 – pluggable codecs – allows for the modification of Lucene’s underlying storage engine. Working with codecs and examining their output yields fascinating insights into how exactly Lucene’s search works in its most fundamental form.

The centerpiece of a Lucene codec is its postings format. Postings are a commonly thrown around word in the Lucene space. A Postings format is the representation of the inverted search index – the core data structure used to lookup documents that contain a term. I think nothing really captures the logical look-and-feel of Lucene’s postings better than Mike McCandless’s SimpleTextPostingsFormat. SimpleText is a text-based representation of postings created for educational purposes. I’ve indexed a few documents in Lucene using SimpleText to demonstrate how postings are structured to allow for fast search:

A first step towards moving beyond being a search engine result consumer.

May 22, 2013

Dynamic faceting with Lucene

Filed under: Faceted Search,Facets,Indexing,Lucene,Search Engines — Patrick Durusau @ 2:08 pm

Dynamic faceting with Lucene by Michael McCandless.

From the post:

Lucene’s facet module has seen some great improvements recently: sizable (nearly 4X) speedups and new features like DrillSideways. The Jira issues search example showcases a number of facet features. Here I’ll describe two recently committed facet features: sorted-set doc-values faceting, already available in 4.3, and dynamic range faceting, coming in the next (4.4) release.

To understand these features, and why they are important, we first need a little background. Lucene’s facet module does most of its work at indexing time: for each indexed document, it examines every facet label, each of which may be hierarchical, and maps each unique label in the hierarchy to an integer id, and then encodes all ids into a binary doc values field. A separate taxonomy index stores this mapping, and ensures that, even across segments, the same label gets the same id.

At search time, faceting cost is minimal: for each matched document, we visit all integer ids and aggregate counts in an array, summarizing the results in the end, for example as top N facet labels by count.

This is in contrast to purely dynamic faceting implementations like ElasticSearch‘s and Solr‘s, which do all work at search time. Such approaches are more flexible: you need not do anything special during indexing, and for every query you can pick and choose exactly which facets to compute.

However, the price for that flexibility is slower searching, as each search must do more work for every matched document. Furthermore, the impact on near-real-time reopen latency can be horribly costly if top-level data-structures, such as Solr’s UnInvertedField, must be rebuilt on every reopen. The taxonomy index used by the facet module means no extra work needs to be done on each near-real-time reopen.

The dynamic range faceting sounds particularly useful.
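For comparison, Solr exposes search-time range faceting with a handful of request parameters; a minimal sketch, with the core and field names assumed:

```python
import requests

SOLR = "http://localhost:8983/solr/issues/select"   # assumed core name

params = {
    "q": "*:*",
    "rows": 0,
    "facet": "true",
    "facet.range": "updated",                        # assumed date field
    "facet.range.start": "NOW/DAY-30DAYS",
    "facet.range.end": "NOW",
    "facet.range.gap": "+1DAY",
    "wt": "json",
}
resp = requests.get(SOLR, params=params).json()
print(resp["facet_counts"]["facet_ranges"]["updated"]["counts"])
```

Michael describes the Lucene facet module gaining this kind of choose-your-ranges-at-search-time capability in the 4.4 release.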

May 20, 2013

The Index-Based Subgraph Matching Algorithm (ISMA)…

Filed under: Bioinformatics,Graphs,Indexing — Patrick Durusau @ 4:23 pm

The Index-Based Subgraph Matching Algorithm (ISMA): Fast Subgraph Enumeration in Large Networks Using Optimized Search Trees by Sofie Demeyer, Tom Michoel, Jan Fostier, Pieter Audenaert, Mario Pickavet, Piet Demeester. (Demeyer S, Michoel T, Fostier J, Audenaert P, Pickavet M, et al. (2013) The Index-Based Subgraph Matching Algorithm (ISMA): Fast Subgraph Enumeration in Large Networks Using Optimized Search Trees. PLoS ONE 8(4): e61183. doi:10.1371/journal.pone.0061183)

Abstract:

Subgraph matching algorithms are designed to find all instances of predefined subgraphs in a large graph or network and play an important role in the discovery and analysis of so-called network motifs, subgraph patterns which occur more often than expected by chance. We present the index-based subgraph matching algorithm (ISMA), a novel tree-based algorithm. ISMA realizes a speedup compared to existing algorithms by carefully selecting the order in which the nodes of a query subgraph are investigated. In order to achieve this, we developed a number of data structures and maximally exploited symmetry characteristics of the subgraph. We compared ISMA to a naive recursive tree-based algorithm and to a number of well-known subgraph matching algorithms. Our algorithm outperforms the other algorithms, especially on large networks and with large query subgraphs. An implementation of ISMA in Java is freely available at http://sourceforge.net/projects/isma/.

From the introduction:

Over the last decade, network theory has come to play a central role in our understanding of complex systems in fields as diverse as molecular biology, sociology, economics, the internet, and others [1]. The central question in all these fields is to understand behavior at the level of the whole system from the topology of interactions between its individual constituents. In this respect, the existence of network motifs, small subgraph patterns which occur more often in a network than expected by chance, has turned out to be one of the defining properties of real-world complex networks, in particular biological networks [2]. Network motifs act as the fundamental information processing units in cellular regulatory networks [3] and they form the building blocks of larger functional modules (also known as network communities) [4]–[6]. The discovery and analysis of network motifs crucially depends on the ability to enumerate all instances of a given query subgraph in a network or graph of interest, a classical problem in pattern recognition [7], that is known to be NP complete [8].

Heavy sledding but important for exploration of large graphs/networks and the subsequent representation of those findings in a topic map.
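If you want to play with subgraph enumeration before wading into ISMA itself, networkx ships a VF2-based matcher that does the same job, far more slowly, on small graphs. A sketch, not ISMA:

```python
import networkx as nx
from networkx.algorithms import isomorphism

# toy directed network and a feed-forward-loop style query subgraph
G = nx.DiGraph([(1, 2), (2, 3), (1, 3), (3, 4), (4, 1), (2, 4)])
Q = nx.DiGraph([("a", "b"), ("b", "c"), ("a", "c")])   # a -> b -> c, plus a -> c

matcher = isomorphism.DiGraphMatcher(G, Q)
for mapping in matcher.subgraph_isomorphisms_iter():   # induced embeddings of Q in G
    print(mapping)                                     # maps G-nodes to Q-nodes
```

ISMA’s contribution is the node ordering, symmetry handling and index structures that make this enumeration feasible on networks with millions of edges.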

I first saw this in Nat Torkington’s Four short links: 13 May 2013.

May 14, 2013

Labels and Schema Indexes in Neo4j

Filed under: Cypher,Indexing,Neo4j — Patrick Durusau @ 9:24 am

Labels and Schema Indexes in Neo4j by Tareq Abedrabbo.

From the post:

Neo4j recently introduced the concept of labels and their sidekick, schema indexes. Labels are a way of attaching one or more simple types to nodes (and relationships), while schema indexes allow to automatically index labelled nodes by one or more of their properties. Those indexes are then implicitly used by Cypher as secondary indexes and to infer the starting point(s) of a query.

I would like to shed some light in this blog post on how these new constructs work together. Some details will be inevitably specific to the current version of Neo4j and might change in the future but I still think it’s an interesting exercise.

Before we start though I need to populate the graph with some data. I’m more into cartoon for toddlers than second-rate sci-fi and therefore Peppa Pig shall be my universe.

So let’s create some labeled graph resources.

Nice review of the impact of the new label + schema index features in Neo4j.
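If you have not seen the new syntax, label creation and a schema index look roughly like this (Neo4j 2.0-era Cypher; the Character label and the names are placeholders of mine, not Tareq’s examples). The statements are wrapped here as Python strings you can paste into the Neo4j shell:

```python
# Neo4j 2.0-era Cypher; newer Neo4j versions use CREATE INDEX ... FOR (n:Label) ON (n.prop)
statements = [
    "CREATE (peppa:Character {name: 'Peppa'})",             # node carrying the Character label
    "CREATE (george:Character {name: 'George'})",
    "CREATE INDEX ON :Character(name)",                      # schema index over labelled nodes
    "MATCH (c:Character) WHERE c.name = 'Peppa' RETURN c",   # Cypher can now use the index
]
for s in statements:
    print(s)
```

The interesting part, and what the post digs into, is that the MATCH never names the index: Cypher infers the starting points from the label and the indexed property.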

I am still wondering why Neo4j “simple types” cannot be added to nodes and edges without the additional machinery of labels.

Allowing users to declare properties to be indexed and used by Cypher for queries would create a generalized mechanism that requires no changes to the data model.

I have a question pending with the Neo4j team on this issue and will report back with their response.

May 8, 2013

How Impoverished is the “current world of search?”

Filed under: Context,Indexing,Search Engines,Searching — Patrick Durusau @ 12:34 pm

Internet Content Is Looking for You

From the post:

Where you are and what you’re doing increasingly play key roles in how you search the Internet. In fact, your search may just conduct itself.

This concept, called “contextual search,” is improving so gradually the changes often go unnoticed, and we may soon forget what the world was like without it, according to Brian Proffitt, a technology expert and adjunct instructor of management in the University of Notre Dame’s Mendoza College of Business.

Contextual search describes the capability for search engines to recognize a multitude of factors beyond just the search text for which a user is seeking. These additional criteria form the “context” in which the search is run. Recently, contextual search has been getting a lot of attention due to interest from Google.

(…)

“You no longer have to search for content, content can search for you, which flips the world of search completely on its head,” says Proffitt, who is the author of 24 books on mobile technology and personal computing and serves as an editor and daily contributor for ReadWrite.com.

“Basically, search engines examine your request and try to figure out what it is you really want,” Proffitt says. “The better the guess, the better the perceived value of the search engine. In the days before computing was made completely mobile by smartphones, tablets and netbooks, searches were only aided by previous searches.

(…)

Context can include more than location and time. Search engines will also account for other users’ searches made in the same place and even the known interests of the user.

If time and location plus prior searches are the context that “…flips the world of search completely on its head…”, imagine what a traditional index must do.

A traditional index is created by a person with subject matter knowledge beyond the average reader’s, someone able to point to connections and facts (context) previously unknown to the user.

The “…current world of search…” is truly impoverished for time and location to have that much impact.

May 6, 2013

You Say Beowulf, I Say Biowulf [Does Print Shape Digital?]

Filed under: Indexing,Library,Manuscripts — Patrick Durusau @ 5:58 pm

You Say Beowulf, I Say Biowulf by Julian Harrison.

From the post:

Students of medieval manuscripts will know that it’s always instructive to consult the originals, rather than to rely on printed editions. There are many aspects of manuscript culture that do not translate easily onto the printed page — annotations, corrections, changes of scribe, the general layout, the decoration, ownership inscriptions.

Beowulf is a case in point. Only one manuscript of this famous Old English epic poem has survived, which is held at the British Library (Cotton MS Vitellius A XV). The writing of this manuscript was divided between two scribes, the first of whom terminated their stint with the first three lines of f. 175v, ending with the words “sceaden mæl scyran”; their counterpart took over at this point, implying that an earlier exemplar lay behind their text, from which both scribes copied.

(…)

Another distinction between those two scribes, perhaps less familiar to modern students of the text, is the varying way in which they spell the name of the eponymous hero Beowulf. On 40 occasions, Beowulf’s name is spelt in the conventional manner (the first is found in line 18 of the standard editions, the last in line 2510). However, in 7 separate instances, the name is instead spelt “Biowulf” (“let’s call the whole thing off”), the first case coming in line 1987 of the poem.

I think you will enjoy the post, to say nothing of the images of the manuscript.

My topic map concern is with:

There are many aspects of manuscript culture that do not translate easily onto the printed page — annotations, corrections, changes of scribe, the general layout, the decoration, ownership inscriptions.

I take it that non-facsimile publication in print loses some of the richness of the manuscript.

My question is: To what extent have we duplicated the limitations of print media in digital publications?

For example, a book may have more than one index, but not more than one index of the same kind.

That is, you can’t find a book that has multiple distinct subject indexes. Not surprising considering the printing cost of duplicate subject indexes, but we don’t have that limitation with electronic indexes.

Or do we?

In my experience anyway, electronic indexes mimic their print forefathers. Each electronic index stands on its own, even if each index is of the same work.

Assume we have a Spanish and English index, for the casual reader, to the plays of Shakespeare. Even in electronic form, I assume they would be created and stored as separate indexes.

But isn’t that simply replicating what we would experience with a print edition?

Can you think of other cases where our experience with print media has shaped our choices with digital artifacts?

April 29, 2013

Indexing Millions Of Documents…

Filed under: Indexing,Solr,Tika — Patrick Durusau @ 2:13 pm

Indexing Millions Of Documents Using Tika And Atomic Update by Patricia Gorla.

From the post:

On a recent engagement, we were posed with the problem of sorting through 6.5 million foreign patent documents and indexing them into Solr. This totaled about 1 TB of XML text data alone. The full corpus included an additional 5 TB of images to incorporate into the index; this blog post will only cover the text metadata.

Streaming large volumes of data into Solr is nothing new, but this dataset posed a unique challenge: Each patent document’s translation resided in a separate file, and the location of each translation file was unknown at runtime. This meant that for every document processed we wouldn’t know where its match would be. Furthermore, the translations would arrive in batches, to be added as they come. And lastly, the project needed to be open to different languages and different file formats in the future.

Our options for dealing with inconsistent data came down to: cleaning all data and organizing it before processing, or building an ingester robust enough to handle different situations.

We opted for the latter and built an ingester that would process each file individually and index the documents with an atomic update (new in Solr 4). To detect and extract the text metadata we chose Apache Tika. Tika is a document-detection and content-extraction tool useful for parsing information from many different formats.

On the surface Tika offers a simple interface to retrieve data from many sources. Our use case, however, required a deeper extraction of specific data. Using the built-in SAX parser allowed us to push Tika beyond its normal limits, and analyze XML content according to the type of information it contained.

No magic bullet but an interesting use case (patents in multiple languages).
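A much-simplified sketch of the ingestion side, using the tika Python wrapper rather than the Java SAX approach described in the post. The core name, file name and id scheme are assumptions:

```python
import requests
from tika import parser        # assumed: the tika-python wrapper (pip install tika)

SOLR_UPDATE = "http://localhost:8983/solr/patents/update?commit=true"   # assumed core

parsed = parser.from_file("patent-0001.xml")      # Tika detects the format and extracts text
doc = {
    "id": "patent-0001",                           # illustrative id scheme
    "text": parsed.get("content") or "",
    "content_type": (parsed.get("metadata") or {}).get("Content-Type"),
}
requests.post(SOLR_UPDATE, json=[doc])
```

The translations-arriving-later problem is then handled with Solr 4 atomic updates, which only touch the fields being supplied.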

April 19, 2013

Broccoli: Semantic Full-Text Search at your Fingertips

Filed under: Indexing,Search Algorithms,Search Engines,Semantic Search — Patrick Durusau @ 4:49 pm

Broccoli: Semantic Full-Text Search at your Fingertips by Hannah Bast, Florian Bäurle, Björn Buchhold, Elmar Haussmann.

Abstract:

We present Broccoli, a fast and easy-to-use search engine for what we call semantic full-text search. Semantic full-text search combines the capabilities of standard full-text search and ontology search. The search operates on four kinds of objects: ordinary words (e.g., edible), classes (e.g., plants), instances (e.g., Broccoli), and relations (e.g., occurs-with or native-to). Queries are trees, where nodes are arbitrary bags of these objects, and arcs are relations. The user interface guides the user in incrementally constructing such trees by instant (search-as-you-type) suggestions of words, classes, instances, or relations that lead to good hits. Both standard full-text search and pure ontology search are included as special cases. In this paper, we describe the query language of Broccoli, a new kind of index that enables fast processing of queries from that language as well as fast query suggestion, the natural language processing required, and the user interface. We evaluated query times and result quality on the full version of the English Wikipedia (32 GB XML dump) combined with the YAGO ontology (26 million facts). We have implemented a fully functional prototype based on our ideas, see http://broccoli.informatik.uni-freiburg.de.

The most impressive part of an impressive paper was the new index, context lists.

The second idea, which is the main idea behind our new index, is to have what we call context lists instead of inverted lists. The context list for a prefix contains one index item per occurrence of a word starting with that prefix, just like the inverted list for that prefix would. But along with that it also contains one index item for each occurrence of an arbitrary entity in the same context as one of these words.

The performance numbers speak for themselves.

This should be a feature in the next release of Lucene/Solr. Or perhaps even configurable for the number of entities that can appear in a “context list.”
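A toy illustration of the context-list idea as I read the quoted description: for a prefix, keep the usual word postings plus one item for every entity occurring in the same context:

```python
# toy "contexts": (context id, words in the context, entities recognized in it)
contexts = [
    (1, ["broccoli", "is", "edible"], ["Broccoli", "Plant"]),
    (2, ["broccoli", "native", "to", "italy"], ["Broccoli", "Italy"]),
    (3, ["basil", "is", "edible"], ["Basil", "Plant"]),
]

def context_list(prefix):
    """Items for every word occurrence starting with `prefix`, plus an item for
    every entity occurring in the same context as one of those words."""
    items = []
    for cid, words, entities in contexts:
        hits = [w for w in words if w.startswith(prefix)]
        if hits:
            items += [("word", cid, w) for w in hits]
            items += [("entity", cid, e) for e in entities]
    return items

print(context_list("edib"))
# [('word', 1, 'edible'), ('entity', 1, 'Broccoli'), ('entity', 1, 'Plant'),
#  ('word', 3, 'edible'), ('entity', 3, 'Basil'), ('entity', 3, 'Plant')]
```

Crossing words with the entities that share their contexts is what lets a single list answer the combined full-text-plus-ontology queries.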

Was it happenstance or a desire for simplicity that caused the original indexing engines to parse text into single tokens?

Literature references on that point?

April 6, 2013

Ultimate library challenge: taming the internet

Filed under: Data,Indexing,Preservation,Search Data,WWW — Patrick Durusau @ 3:40 pm

Ultimate library challenge: taming the internet by Jill Lawless.

From the post:

Capturing the unruly, ever-changing internet is like trying to pin down a raging river. But the British Library is going to try.

For centuries, the library has kept a copy of every book, pamphlet, magazine and newspaper published in Britain. Starting on Saturday, it will also be bound to record every British website, e-book, online newsletter and blog in a bid to preserve the nation’s ”digital memory”.

As if that’s not a big enough task, the library also has to make this digital archive available to future researchers – come time, tide or technological change.

The library says the work is urgent. Ever since people began switching from paper and ink to computers and mobile phones, material that would fascinate future historians has been disappearing into a digital black hole. The library says firsthand accounts of everything from the 2005 London transit bombings to Britain’s 2010 election campaign have already vanished.

”Stuff out there on the web is ephemeral,” said Lucie Burgess, the library’s head of content strategy. ”The average life of a web page is only 75 days, because websites change, the contents get taken down.

”If we don’t capture this material, a critical piece of the jigsaw puzzle of our understanding of the 21st century will be lost.”

For more details, see Jill’s post, Click to save the nation’s digital memory (British Library press release), or 100 websites: Capturing the digital universe (a sample of results from archiving only 100 sites).

A welcome venture, particularly since the content gathered by the project will be made available to the public.

An unanswerable question, but I do wonder how we would view Greek drama if all of it had been preserved.

Hundreds if not thousands of plays were written and performed every year.

The Complete Greek Drama lists only forty-seven (47) that have survived to this day.

If wholesale preservation is the first step, how do we preserve paths to what’s worth reading in a data labyrinth as a second step?

I first saw this in a tweet by Jason Ronallo.

Indexing PDF for OSINT and Pentesting [or Not!]

Filed under: Cybersecurity,Indexing,PDF,Security,Solr — Patrick Durusau @ 9:11 am

Indexing PDF for OSINT and Pentesting by Alejandro Nolla.

From the post:

Most of us, when conducting OSINT tasks or gathering information for preparing a pentest, draw on Google hacking techniques like site:company.acme filetype:pdf “for internal use only” or something similar to search for potential sensitive information uploaded by mistake. At other times, a customer will ask us to find out if through negligence they have leaked this kind of sensitive information and we proceed to make some google hacking fu.

But, what happens if we don’t want to make these queries against Google and, furthermore, follow links from search that could potentially leak referrers? Sure we could download documents and review them manually on a local machine, but it’s boring and time consuming. Here is where Apache Solr comes into play for processing documents and creating an index of them to give us almost real time searching capabilities.

A nice outline of using Solr for internal security testing of PDF files.

At the same time, a nice outline of using Solr for external security testing of PDF files. 😉

You can sweep sites for new PDF files on a periodic basis and retain only those meeting particular criteria.
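A rough sketch of that sweep, leaning on Solr’s ExtractingRequestHandler (Solr Cell) so Tika does the text extraction. The core name, the URLs and the assumption that extracted content lands in a text field are all illustrative:

```python
import requests

SOLR = "http://localhost:8983/solr/osint"                      # assumed core
pdf_urls = ["http://www.example.com/reports/q1-2013.pdf"]      # output of your periodic sweep

for url in pdf_urls:
    pdf = requests.get(url).content
    requests.post(
        f"{SOLR}/update/extract",
        params={"literal.id": url, "literal.source_url": url, "commit": "true"},
        data=pdf,
        headers={"Content-Type": "application/pdf"},
    )

# then apply your retention criteria, e.g. a marker phrase
hits = requests.get(
    f"{SOLR}/select",
    params={"q": 'text:"for internal use only"', "wt": "json"},
).json()
print(hits["response"]["numFound"])
```

Whether you point it at your own site or someone else’s is, as noted above, the difference between internal and external testing.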

Low grade ore but even low grade ore can have a small diamond every now and again.

March 30, 2013

ElasticSearch: Text analysis for content enrichment

Filed under: ElasticSearch,Indexing,Search Engines,Searching — Patrick Durusau @ 6:15 pm

ElasticSearch: Text analysis for content enrichment by Jaibeer Malik.

From the post:

Taking an example of a typical eCommerce site, serving the right content in search to the end customer is very important for the business. The text analysis strategy provided by any search solution plays a very big role in it. As a search user, I would prefer some typical search behavior for my query to automatically return:

  • should look for synonyms matching my query text
  • should match singular and plural words or words sounding similar to the entered query text
  • should not allow searching on protected words
  • should allow search for words mixed with numeric or special characters
  • should not allow search on html tags
  • should allow search text based on proximity of the letters and number of matching letters

Enriching the content here means adding the above search capabilities to your content while indexing and searching for the content.

I thought the “…look for synonyms matching my query text…” might get your attention. 😉

Not quite a topic map, because there isn’t any curation of the search results that would save the next searcher time and effort.

But in order to create and maintain a topic map, you are going to need expansion of your queries by synonyms.
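In Elasticsearch, that synonym expansion lives in the analysis chain. A minimal sketch; the index name and synonym pairs are made up, and the exact _analyze request shape varies a bit across Elasticsearch versions:

```python
import requests

ES = "http://localhost:9200"

settings = {
    "settings": {
        "analysis": {
            "filter": {
                "my_synonyms": {"type": "synonym",
                                "synonyms": ["tv, television", "sofa, couch, settee"]}
            },
            "analyzer": {
                "enriched": {"tokenizer": "standard",
                             "filter": ["lowercase", "my_synonyms"]}
            },
        }
    }
}
requests.put(f"{ES}/products", json=settings)        # create the index with the custom analyzer

# inspect what the analyzer does to a query term
tokens = requests.post(f"{ES}/products/_analyze",
                       json={"analyzer": "enriched", "text": "tv"}).json()
print([t["token"] for t in tokens["tokens"]])        # expect ['tv', 'television']
```

The expanded token stream is the machine-harvesting step; deciding which of those synonym sets actually identify the same subject is the curation step that stays with you.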

You will take the results of those expanded queries and fashion them into a topic map.

Think of it this way:

Machines can rapidly harvest, even sort content at your direction.

What they can’t do is curate the results of their harvesting.

That requires a secret ingredient.

That would be you.

I first saw this at DZone.

March 28, 2013

Build a search engine in 20 minutes or less

Filed under: Indexing,Lucene,Search Engines,Solr — Patrick Durusau @ 7:15 pm

Build a search engine in 20 minutes or less by Ben Ogorek.

I was suspicious but pleasantly surprised by the demonstration of the vector space model you will find here.

True, it doesn’t offer all the features of the latest Lucene/Solr releases but it will give you a firm grounding on vector space models.

Enjoy!

PS: One thing to keep in mind, semantics do not map to vector space. We can model word occurrences in vector space but occurrences are not semantics.
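To see that last point in miniature, here is a bare-bones term-vector and cosine-similarity sketch. d3 says nearly the same thing as d1, but because it uses different words it scores no better than d2: occurrences, not semantics.

```python
import math
from collections import Counter

docs = {
    "d1": "the quick brown fox",
    "d2": "the lazy brown dog",
    "d3": "a fast auburn fox",     # same idea as d1, different vocabulary
}

def vector(text):
    return Counter(text.lower().split())          # raw term counts as the vector

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

q = vector("brown fox")
for name, text in docs.items():
    print(name, round(cosine(q, vector(text)), 3))   # d1: 0.707, d2: 0.354, d3: 0.354
```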

March 25, 2013

Master Indexing and the Unified View

Filed under: Indexing,Master Data Management,Record Linkage — Patrick Durusau @ 4:32 pm

Master Indexing and the Unified View by David Loshin.

From the post:

1) Identity resolution – The master data environment catalogs the set of representations that each unique entity exhibits in the original source systems. Applying probabilistic aggregation and/or deterministic rules allows the system to determine that the data in two or more records refers to the same entity, even if the original contexts are different.

2) Data quality improvement – Linking records that share data about the same real-world entity enables the application of business rules to improve the quality characteristics of one or more of the linked records. This doesn’t specifically mean that a single “golden copy” record must be created to replace all instances of the entity’s data. Instead, depending on the scenario and quality requirements, the accessibility of the different sources and the ability to apply those business rules at the data user’s discretion will provide a consolidated view that best meets the data user’s requirements at the time the data is requested.

3) Inverted mapping – Because the scope of data linkage performed by the master index spans the breadth of both the original sources and the collection of data consumers, it holds a unique position to act as a map for a standardized canonical representation of a specific entity to the original source records that have been linked via the identity resolution processes.

In essence this allows you to use a master data index to support federated access to original source data while supporting the application of data quality rules upon delivery of the data.
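A minimal sketch of the first point, deterministic identity resolution: records that normalize to the same key are linked as representations of one entity. The normalization rule is invented for illustration, not David’s:

```python
from collections import defaultdict

records = [
    {"src": "crm",     "name": "Acme Corp.",        "phone": "+1 212 555 0100"},
    {"src": "billing", "name": "ACME CORP",         "phone": "(212) 555-0100"},
    {"src": "support", "name": "Acme Corporation",  "phone": "212-555-0100"},
]

def key(rec):
    name = "".join(ch for ch in rec["name"].lower() if ch.isalnum())
    phone = "".join(ch for ch in rec["phone"] if ch.isdigit())[-10:]
    return (name[:8], phone)            # crude blocking plus a deterministic rule

entities = defaultdict(list)
for r in records:
    entities[key(r)].append(r)          # all three records land on one key

for k, linked in entities.items():
    print(k, "->", [r["src"] for r in linked])
```

Swap the deterministic key for probabilistic scoring and add merge rules for the consolidated view and you have points 2 and 3 in miniature as well.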

It’s been a long day but does David’s output have all the attributes of a topic map?

  1. Identity resolution – Two or more representations of the same subject
  2. Data quality improvement – A consolidated view of the data, based on a subject and presented to the user
  3. Inverted mapping – Navigation from a specific entity into the original source records

Comments?

March 22, 2013

Lucene/Solr 4 – A Revolution in Enterprise Search Technology (Webinar)

Filed under: Indexing,Lucene,Solr — Patrick Durusau @ 10:34 am

Lucene/Solr 4 – A Revolution in Enterprise Search Technology (Webinar). Presenter: Erik Hatcher, Lucene/Solr Committer and PMC member.

Date: Wednesday, March 27, 2013
Time: 10:00am Pacific Time

From the signup page:

Lucene/Solr 4 is a ground breaking shift from previous releases. Solr 4.0 dramatically improves scalability, performance, reliability, and flexibility. Lucene 4 has been extensively upgraded. It now supports near real-time (NRT) capabilities that allow indexed documents to be rapidly visible and searchable. Additional Lucene improvements include pluggable scoring, much faster fuzzy and wildcard querying, and vastly improved memory usage.

The improvements in Lucene have automatically made Solr 4 substantially better. But Solr has also been considerably improved and magnifies these advances with a suite of new “SolrCloud” features that radically improve scalability and reliability.

In this Webinar, you will learn:

  • What are the Key Feature Enhancements of Lucene/Solr 4, including the new distributed capabilities of SolrCloud
  • How to Use the Improved Administrative User Interface
  • How Sharding has been improved
  • What are the improvements to GeoSpatial Searches, Highlighting, Advanced Query Parsers, Distributed search support, Dynamic core management, Performance statistics, and searches for rare values, such as Primary Key

Great way to get up to speed on the latest release of Lucene/Solr!

March 15, 2013

Using Solr’s New Atomic Updates

Filed under: Indexing,Solr — Patrick Durusau @ 4:08 pm

Using Solr’s New Atomic Updates by Scott Stults.

From the post:

A while ago we created a sample index of US patent grants roughly 700k documents big. Adjacently we pulled down the corresponding multi-page TIFFs of those grants and made PNG thumbnails of each page. So far, so good.

You see, we wanted to give our UI the ability to flip through those thumbnails and we wanted it to be fast. So our original design had a client-side function that pulled down the first thumbnail and then tried to pull down subsequent thumbnails until it ran out of pages or cache. That was great for a while, but it didn’t scale because a good portion of our requests were for non-existent resources.

Things would be much better if the UI got the page count along with the other details of the search hits. So why not update each record in Solr with that?

A new feature in Solr and one that I suspect will be handy, such as for updating an index of associations.
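The atomic update itself is just a partial document carrying update operators; a sketch against an assumed core (atomic updates need the other fields stored so Solr can rebuild the full document):

```python
import requests

SOLR_UPDATE = "http://localhost:8983/solr/patents/update?commit=true"  # assumed core name

# only page_count changes; the rest of the stored document is left alone
patch = [{"id": "US1234567", "page_count": {"set": 42}}]
requests.post(SOLR_UPDATE, json=patch)
```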

A Peek Behind the Neo4j Lucene Index Curtain

Filed under: Indexing,Lucene,Neo4j — Patrick Durusau @ 4:02 pm

A Peek Behind the Neo4j Lucene Index Curtain by Max De Marzi.

Max suggests using a copy of your Neo4j database for this exercise.

Could be worth your while to go exploring.

And you will learn something about Lucene in the bargain.

March 11, 2013

Learning Hash Functions Using Column Generation

Filed under: Hashing,Indexing,Similarity — Patrick Durusau @ 6:11 am

Learning Hash Functions Using Column Generation by Xi Li, Guosheng Lin, Chunhua Shen, Anton van den Hengel, Anthony Dick.

Abstract:

Fast nearest neighbor searching is becoming an increasingly important tool in solving many large-scale problems. Recently a number of approaches to learning data-dependent hash functions have been developed. In this work, we propose a column generation based method for learning data-dependent hash functions on the basis of proximity comparison information. Given a set of triplets that encode the pairwise proximity comparison information, our method learns hash functions that preserve the relative comparison relationships in the data as well as possible within the large-margin learning framework. The learning procedure is implemented using column generation and hence is named CGHash. At each iteration of the column generation procedure, the best hash function is selected. Unlike most other hashing methods, our method generalizes to new data points naturally; and has a training objective which is convex, thus ensuring that the global optimum can be identified. Experiments demonstrate that the proposed method learns compact binary codes and that its retrieval performance compares favorably with state-of-the-art methods when tested on a few benchmark datasets.

Interesting review of hashing techniques.

Raises the question of customized similarity (read sameness) hashing algorithms for topic maps.
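Not the learned hash functions of the paper, but for intuition, the classic random-projection (cosine LSH) scheme that such methods refine can be sketched in a few lines:

```python
import random

def random_hyperplanes(dim, n_bits, seed=0):
    rnd = random.Random(seed)
    return [[rnd.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def sim_hash(vec, planes):
    """One bit per hyperplane: the sign of the projection. Vectors pointing in
    similar directions tend to agree on most bits."""
    return tuple(1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
                 for plane in planes)

planes = random_hyperplanes(dim=4, n_bits=8)
a = [0.9, 0.1, 0.0, 0.2]
b = [1.0, 0.2, 0.1, 0.1]    # nearly the same direction as a
c = [0.0, 0.1, 1.0, 0.9]    # quite different
print(sim_hash(a, planes), sim_hash(b, planes), sim_hash(c, planes))
```

A subject-specific “sameness” hash would replace the random hyperplanes with functions learned from proximity comparisons, which is exactly what CGHash does with its triplets.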

I first saw this in a tweet by Stefano Bertolo.

March 4, 2013

A.nnotate

Filed under: Annotation,Indexing — Patrick Durusau @ 3:29 pm

A.nnotate

From the homepage:

A.nnotate is an online annotation, collaboration and indexing system for documents and images, supporting PDF, Word and other document formats. Instead of emailing different versions of a document back and forth you can now all comment on a single read-only copy online. Documents are displayed in high quality with fonts and layout just like the printed version. It is easy to use and runs in all common web browsers, with no software or plugins to install.

Hosted solutions are available for individuals and workgroups. For enterprise users the full system is available for local installation. Special discounts apply for educational use. A.nnotate technology can also be used to enhance existing document and content management systems with high quality online document viewing, annotation and collaboration facilities.

I suppose that is one way to solve the “index merging” problem.

Everyone use a common document.

Doesn’t help if a group starts with different copies of the same document.

Or if other indexes from other documents need to be merged with the present document.

Not to mention merging indexes/annotations separate from any particular document instance.

Still, a step away from the notion of a document as a static object.

Which is a good thing.

I first saw this in a tweet by Stian Danenbarger.

March 1, 2013

MongoDB + Fractal Tree Indexes = High Compression

Filed under: Fractal Trees,Indexing,MongoDB,Requirements — Patrick Durusau @ 5:31 pm

MongoDB + Fractal Tree Indexes = High Compression by Tim Callaghan.

You may have heard that MapR Technologies broke the MinuteSort record by sorting 15 billion 100-byte records in 60 seconds. The run used 2,103 virtual instances in the Google Compute Engine, each with four virtual cores and one virtual disk, totaling 8,412 virtual cores and 2,103 virtual disks. See: Google Compute Engine, MapR Break MinuteSort Record.

So, the next time you have 8,412 virtual cores and 2,103 virtual disks, you know what is possible. 😉

But if you have less firepower than that, you will need to be clever:

One doesn’t have to look far to see that there is strong interest in MongoDB compression. MongoDB has an open ticket from 2009 titled “Option to Store Data Compressed” with Fix Version/s planned but not scheduled. The ticket has a lot of comments, mostly from MongoDB users explaining their use-cases for the feature. For example, Khalid Salomão notes that “Compression would be very good to reduce storage cost and improve IO performance” and Andy notes that “SSD is getting more and more common for servers. They are very fast. The problems are high costs and low capacity.” There are many more in the ticket.

In prior blogs we’ve written about significant performance advantages when using Fractal Tree Indexes with MongoDB. Compression has always been a key feature of Fractal Tree Indexes. We currently support the LZMA, quicklz, and zlib compression algorithms, and our architecture allows us to easily add more. Our large block size creates another advantage as these algorithms tend to compress large blocks better than small ones.

Given the interest in compression for MongoDB and our capabilities to address this functionality, we decided to do a benchmark to measure the compression achieved by MongoDB + Fractal Tree Indexes using each available compression type. The benchmark loads 51 million documents into a collection and measures the size of all files in the file system (–dbpath).

More benchmarks are to follow, and you should remember that all benchmarks are just that: benchmarks.

Benchmarks do not represent experience with your data, under your operating load and network conditions, etc.

Investigate software based on the first, purchase software based on the second.

February 25, 2013

Drill Sideways faceting with Lucene

Filed under: Facets,Indexing,Lucene,Searching — Patrick Durusau @ 5:13 am

Drill Sideways faceting with Lucene by Mike McCandless.

From the post:

Lucene’s facet module, as I described previously, provides a powerful implementation of faceted search for Lucene. There’s been a lot of progress recently, including awesome performance gains as measured by the nightly performance tests we run for Lucene:

[3.8X speedup!]

….

For example, try searching for an LED television at Amazon, and look at the Brand field, seen in the image to the right: this is a multi-select UI, allowing you to select more than one value. When you select a value (check the box or click on the value), your search is filtered as expected, but this time the field does not disappear: it stays where it was, allowing you to then drill sideways on additional values. Much better!

LinkedIn’s faceted search, seen on the left, takes this even further: not only are all fields drill sideways and multi-select, but there is also a text box at the bottom for you to choose a value not shown in the top facet values.

To recap, a single-select field only allows picking one value at a time for filtering, while a multi-select field allows picking more than one. Separately, drilling down means adding a new filter to your search, reducing the number of matching docs. Drilling up means removing an existing filter from your search, expanding the matching documents. Drilling sideways means changing an existing filter in some way, for example picking a different value to filter on (in the single-select case), or adding another or’d value to filter on (in the multi-select case). (images omitted)
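Solr users get equivalent multi-select behavior by tagging a filter and excluding that tag when the facet is computed; a sketch with assumed core and field names:

```python
import requests

SOLR = "http://localhost:8983/solr/products/select"   # assumed core name

params = {
    "q": "led television",
    "fq": "{!tag=brandfq}brand:Samsung",               # the selected brand filters results...
    "facet": "true",
    "facet.field": "{!ex=brandfq}brand",               # ...but is ignored for the brand facet
    "rows": 10,
    "wt": "json",
}
resp = requests.get(SOLR, params=params).json()
print(resp["facet_counts"]["facet_fields"]["brand"])   # other brands stay visible
```

DrillSideways gives Lucene applications the same behavior without hand-rolling the tag-and-exclude bookkeeping per field.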

More details: DrillSideways class being developed under LUCENE-4748.

Just following the progress on Lucene is enough to make you dizzy!

February 24, 2013

Text processing (part 2): Inverted Index

Filed under: Indexing,Lucene — Patrick Durusau @ 5:41 pm

Text processing (part 2): Inverted Index by Ricky Ho.

From the post:

This is the second part of my text processing series. In this blog, we’ll look into how text documents can be stored in a form that can be easily retrieved by a query. I’ll use the popular open source Apache Lucene index for illustration.

Not only do you get to learn about inverted indexes but some Lucene in the bargain.

That’s not a bad deal!
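The core structure fits in a few lines of Python, which is a handy way to convince yourself what Lucene is doing at vastly larger scale (a toy sketch, obviously not Lucene’s implementation):

```python
from collections import defaultdict

docs = {
    1: "managing gigabytes for java",
    2: "lucene in action",
    3: "managing lucene indexes",
}

index = defaultdict(list)                      # term -> posting list of sorted doc ids
for doc_id, text in sorted(docs.items()):
    for term in set(text.lower().split()):
        index[term].append(doc_id)

def search_and(*terms):
    """Intersect posting lists, the core operation behind an AND query."""
    postings = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(index["lucene"])                         # [2, 3]
print(search_and("managing", "lucene"))        # [3]
```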

February 23, 2013

Indexing StackOverflow In Solr

Filed under: Indexing,Solr — Patrick Durusau @ 7:12 pm

Indexing StackOverflow In Solr by John Berryman.

From the post:

One thing I really like about Solr is that it’s super easy to get started. You just download solr, fire it up, and then after following the 10 minute tutorial you’ll have a basic understanding of indexing, updating, searching, faceting, filtering, and generally using Solr. But, you’ll soon get bored of playing with the 50 or so demo documents. So, quit insulting Solr with this puny, measly, wimpy dataset; index something of significance and watch what Solr can do.

One of the most approachable large datasets is the StackExchange data set which most notably includes all of StackOverflow, but also contains many of the other StackExchange sites (Cooking, English Grammar, Bicycles, Games, etc.) So if StackOverflow is not your cup of tea, there’s bound to be a data set in there that jives more with your interests.

Once you’ve pulled down the data set, then you’re just moments away from having your own SolrExchange index. Simply unzip the dataset that you’re interested in (7-zip format zip files), pull down this git repo which walks you through indexing the data, and finally, just follow the instructions in the README.md.

Interesting data set for Solr.

More importantly, a measure of how easy it needs to be to get started with software.

Software like topic maps.

Suggestions?

February 22, 2013

Machine Biases: Stop Word Lists

Filed under: Indexing,Searching — Patrick Durusau @ 7:09 am

Oracle Text Search doesn’t work on some words (Stackoverflow)

From the post:

I am using Oracle’ Text Search for my project. I created a ctxsys.context index on my column and inserted one entry “Would you like some wine???”. I executed the query

select guid, text, score(10) from triplet where contains (text, ‘Would’, 10) > 0

it gave me no results. Querying ‘you’ and ‘some’ also return zero results. Only ‘like’ and ‘wine’ matches the record. Does Oracle consider you, would, some as stop words?? How can I let Oracle match these words? Thank you.

I mention this because Oracle Text Workaround for Stop Words List (Beyond Search) reports this as “interesting.”

Hardly. Stop lists are a common feature of text searching.
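A toy illustration of why the query behaves that way: if the indexing pipeline drops stop words, only the remaining terms are ever searchable. The stop list below is a made-up fragment, not Oracle’s:

```python
STOP_WORDS = {"would", "you", "some", "the", "a", "is"}   # fragment of a typical stop list

def index_terms(text):
    tokens = [t.strip("?!.,").lower() for t in text.split()]
    return [t for t in tokens if t and t not in STOP_WORDS]

print(index_terms("Would you like some wine???"))   # ['like', 'wine'] are all that gets indexed
```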

Moreover, the stop list in question dates back to MySQL 3.23.

Section 11.9.4, Full-Text Stopwords, of the MySQL manual excludes “you” and “would” from indexing.

Suggestions of other stop word lists that exclude “you” and “would”?

February 20, 2013

NoSQL is Great, But You Still Need Indexes [MongoDB for example]

Filed under: Fractal Trees,Indexing,MongoDB,NoSQL,TokuDB,Tokutek — Patrick Durusau @ 9:23 pm

NoSQL is Great, But You Still Need Indexes by Martin Farach-Colton.

From the post:

I’ve said it before, and, as is the nature of these things, I’ll almost certainly say it again: your database performance is only as good as your indexes.

That’s the grand thesis, so what does that mean? In any DB system — SQL, NoSQL, NewSQL, PostSQL, … — data gets ingested and organized. And the system answers queries. The pain point for most users is around the speed to answer queries. And the query speed (both latency and throughput, to be exact) depend on how the data is organized. In short: Good Indexes, Fast Queries; Poor Indexes, Slow Queries.

But building indexes is hard work, or at least it has been for the last several decades, because almost all indexing is done with B-trees. That’s true of commercial databases, of MySQL, and of most NoSQL solutions that do indexing. (The ones that don’t do indexing solve a very different problem and probably shouldn’t be confused with databases.)

It’s not true of TokuDB. We build Fractal Tree Indexes, which are much easier to maintain but can still answer queries quickly. So with TokuDB, it’s Fast Indexes, More Indexes, Fast Queries. TokuDB is usually thought of as a storage engine for MySQL and MariaDB. But it’s really a B-tree substitute, so we’re always on the lookout for systems where we can improve the indexing.

Enter MongoDB. MongoDB is beloved because it makes deployment fast. But when you peel away the layers, you get down to a B-tree, with all the performance headaches and workarounds that they necessitate.

That’s the theory, anyway. So we did some testing. We ripped out the part of MongoDB that takes care of secondary indices and plugged in TokuDB. We’ve posted the blogs before, but here they are again, the greatest hits of TokuDB+MongoDB: we show a 10x insertion performance, a 268x query performance, and a 532x (or 53,200% if you prefer) multikey index insertion performance. We also discussed covered indexes vs. clustered Fractal Tree Indexes.
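Whatever the storage engine underneath, the workflow of declaring a secondary index and checking that a query can use it looks roughly like this in MongoDB (a pymongo sketch; database, collection and field names are made up):

```python
from pymongo import MongoClient, ASCENDING

coll = MongoClient()["shop"]["orders"]          # assumed database and collection

# compound secondary index; maintained by B-trees in stock MongoDB,
# by Fractal Tree indexes if TokuDB-style storage sits underneath
coll.create_index([("customer_id", ASCENDING), ("created_at", ASCENDING)])

print(coll.find({"customer_id": 42}).explain())  # confirm the query plan uses the index
```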

Did somebody declare February 20th to be performance release day?

Did I miss that memo? 😉

Like every geek, I like faster. But, here’s my question:

Have there been any studies on the impact of faster systems on searching and decision making by users?

My assumption is that the faster I get a non-responsive result from a search, the sooner I can improve the search.

But that’s an assumption on my part.

Is that really true?

