Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

February 1, 2013

Concurrency Improvements in TokuDB v6.6 (Part 1)

Filed under: Concurrent Programming,Indexing — Patrick Durusau @ 8:07 pm

Concurrency Improvements in TokuDB v6.6 (Part 1)

From the post:

With TokuDB v6.6 out now, I’m excited to present one of my favorite enhancements: concurrency within a single index. Previously, while there could be many SQL transactions in-flight at any given moment, operations inside a single index were fairly serialized. We’ve been working on concurrency for a few versions, and things have been getting a lot better over time. Today I’ll talk about what to expect from v6.6. Next time, we’ll see why.

Impressive numbers as always!

Should get you interested in learning how this was done as an engineering matter. (That’s in part 2.)

January 27, 2013

Getting real-time field values in Lucene

Filed under: Indexing,Lucene — Patrick Durusau @ 5:41 pm

Getting real-time field values in Lucene by Mike McCandless.

From the post:

We know Lucene’s near-real-time search is very fast: you can easily refresh your searcher once per second, even at high indexing rates, so that any change to the index is available for searching or faceting at most one second later. For most applications this is plenty fast.

But what if you sometimes need even better than near-real-time? What if you need to look up truly live or real-time values, so for any document id you can retrieve the very last value indexed?

Just use the newly committed LiveFieldValues class!

It’s simple to use: when you instantiate it you provide it with your SearcherManager or NRTManager, so that it can subscribe to the RefreshListener to be notified when new searchers are opened, and then whenever you add, update or delete a document, you notify the LiveFieldValues instance. Finally, call the get method to get the last indexed value for a given document id.
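
If you want to try it, here is a minimal sketch of the wiring Mike describes. This is my own sketch, assuming the generic LiveFieldValues<S, T> signature (the class is brand new, so details may shift between releases), and the "price" field and id values are purely hypothetical:

  import java.io.IOException;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.LiveFieldValues;
  import org.apache.lucene.search.SearcherManager;

  // Tracks the last indexed "price" per document id, even before a refresh.
  class LivePriceValues extends LiveFieldValues<IndexSearcher, Long> {

    LivePriceValues(SearcherManager mgr, Long missingValue) {
      super(mgr, missingValue); // subscribes to the manager's refresh events
    }

    @Override
    protected Long lookupFromSearcher(IndexSearcher s, String id) throws IOException {
      // Fallback when the id is already visible in a refreshed searcher:
      // look the document up (e.g. a TermQuery on the id field) and return
      // its stored value. Schema details are omitted in this sketch.
      return null;
    }
  }

  // Usage sketch:
  //   LivePriceValues live = new LivePriceValues(searcherManager, -1L);
  //   writer.updateDocument(new Term("id", "42"), doc);
  //   live.add("42", 17L);          // notify LiveFieldValues of the change
  //   Long price = live.get("42");  // freshest value, no refresh needed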

I saw a webinar by Mike McCandless that is probably the only webinar I would ever repeat watching.

Organized, high quality technical content, etc.

Compare that to a recent webinar I watched that spent fifty-five (55) minutes reviewing information known to anyone who could say the software’s name. The speaker then lamented the lack of time to get into substantive issues.

When you see a webinar like Mike’s, drop me a line. We need to promote that sort of presentation over the other.

January 24, 2013

Apache Lucene 4.1 and Apache SolrTM 4.1 available

Filed under: Indexing,Lucene,Searching,Solr — Patrick Durusau @ 8:10 pm

Lucene 4.1 can be downloaded from: http://lucene.apache.org/core/mirrors-core-latest-redir.html

Lucene CHANGES.text

Solr 4.1 can be downloaded from: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr CHANGES.txt

That’s getting the new year off to a great start!

January 7, 2013

A new Lucene highlighter is born [The final inch problem]

Filed under: Indexing,Lucene,Searching,Synonymy — Patrick Durusau @ 10:27 am

A new Lucene highlighter is born Mike McCandless.

From the post:

Robert has created an exciting new highlighter for Lucene, PostingsHighlighter, our third highlighter implementation (Highlighter and FastVectorHighlighter are the existing ones). It will be available starting in the upcoming 4.1 release.

Highlighting is crucial functionality in most search applications since it’s the first step of the hard-to-solve final inch problem, i.e. of getting the user not only to the best matching documents but getting her to the best spot(s) within each document. The larger your documents are, the more crucial it is that you address the final inch. Ideally, your user interface would let the user click on each highlight snippet to jump to where it occurs in the full document, or at least scroll to the first snippet when the user clicks on the document link. This is in general hard to solve: which application renders the content is dependent on its mime-type (i.e., the browser will render HTML, but will embed Acrobat Reader to render PDF, etc.).

Google’s Chrome browser has an ingenious solution to the final inch problem, when you use “Find…” to search the current web page: it highlights the vertical scroll bar showing you where the matches are on the page. You can then scroll to those locations, or, click on the highlights in the scroll bar to jump there. Wonderful!

All Lucene highlighters require search-time access to the start and end offsets per token, which are character offsets indicating where in the original content that token started and ended. Analyzers set these two integers per-token via the OffsetAttribute, though some analyzers and token filters are known to mess up offsets which will lead to incorrect highlights or exceptions during highlighting. Highlighting while using SynonymFilter is also problematic in certain cases, for example when a rule maps multiple input tokens to multiple output tokens, because the Lucene index doesn’t store the full token graph.

An interesting addition to the highlighters in Lucene.

Be sure to follow the link to Mike’s comments about the limitations on SynonymFilter and the difficulty of correction.
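
For the curious, usage looks roughly like this. A sketch only, assuming the 4.1 API described in the post; the "body" field name is my own, and the field has to be indexed with offsets in the postings for this highlighter to have anything to work with:

  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.FieldType;
  import org.apache.lucene.document.TextField;
  import org.apache.lucene.index.FieldInfo.IndexOptions;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.TopDocs;
  import org.apache.lucene.search.postingshighlight.PostingsHighlighter;

  public class HighlightSketch {

    // Index the field with character offsets stored in the postings.
    static Field makeBodyField(String text) {
      FieldType ft = new FieldType(TextField.TYPE_STORED);
      ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
      return new Field("body", text, ft);
    }

    // One snippet per hit, built from the postings offsets at search time.
    static String[] snippets(IndexSearcher searcher, Query query, TopDocs hits)
        throws Exception {
      PostingsHighlighter highlighter = new PostingsHighlighter();
      return highlighter.highlight("body", query, searcher, hits);
    }
  }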

December 18, 2012

Superlinear Indexes

Filed under: Indexing — Patrick Durusau @ 4:33 pm

Superlinear Indexes

From the webpage:

Multidimensional and String Indexes for Streaming Data

The Superlinear Index project is investigating data structures and algorithms for maintaining superlinear indexes on out-of-core storage (such as disk drives) with high incoming data rates. To understand what a superlinear index is, consider a linear index, which provides a total order on keys. A superlinear index is more complex than a total order. Examples of superlinear indexes include multidimensional indexes and full-text indexes.

A number of publications but none in 2012.

I will be checking to see if the project is still alive.

December 16, 2012

Closing the Knowledge Gap: A Case Study – How Cisco Unlocks Communications

Filed under: Indexing,Lucene,Searching — Patrick Durusau @ 8:12 pm

Closing the Knowledge Gap: A Case Study – How Cisco Unlocks Communications by Tony Frazier, Director of Product Management, Cisco Systems and David Fishman, Marketing, Lucid Imagination.

A bit dated (2010) but I think you will find this interesting reading.

A couple of snippets to tempt you into reading the full post:

Consider this: you’re looking for information and immediately search the documents at your disposal to find the answer. Are you the first person who conducted this search? If you are in a reasonably large organization, given the scope and mix of electronic communications today, there could be more than 10 other employees looking for the same answer. Unearthing documents, one employee at a time, may not be the best way of tapping into that collective intellect and maximizing resources across an organization. Wouldn’t it make more sense to tap into existing discussions taking place across the network—over email, voice and increasingly video communications?

and,

Cisco’s approach to this project centered on vocabulary-based tagging and search. Every organization has the ability to define keywords for their personalized library. Cisco Pulse then tags a user’s activity, content and behavior in electronic communications to match the vocabulary, presenting valuable information that simplifies and accelerates knowledge sharing across an organization. Vocabulary-based tagging makes unlocking the relevant content of electronic communications safe and efficient.

Add a little vocabulary mapping with topic maps, toss and serve!

December 8, 2012

Looking at a Plaintext Lucene Index

Filed under: Indexing,Lucene — Patrick Durusau @ 5:24 pm

Looking at a Plaintext Lucene Index by Florian Hopf.

From the post:

The Lucene file format is one of the reasons why Lucene is as fast as it is. An index consists of several binary files that you can’t really inspect if you don’t use tools like the fantastic Luke.

Starting with Lucene 4 the format for these files can be configured using the Codec API. Several implementations are provided with the release, among those the SimpleTextCodec that can be used to write the files in plaintext for learning and debugging purposes.

Good starting point for learning more about Lucene indexes.
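
Trying it out takes only a codec swap on the IndexWriterConfig. A minimal sketch, assuming Lucene 4.x and the SimpleTextCodec from the lucene-codecs module; the index path is hypothetical, and this codec is for learning and debugging only (slow and uncompressed):

  import java.io.File;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.codecs.simpletext.SimpleTextCodec;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.util.Version;

  public class PlaintextIndexSketch {
    public static void main(String[] args) throws Exception {
      Directory dir = FSDirectory.open(new File("plaintext-index"));
      IndexWriterConfig config =
          new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
      config.setCodec(new SimpleTextCodec()); // write index files as plain text
      IndexWriter writer = new IndexWriter(dir, config);
      // add documents as usual, then open the index files in a text editor
      writer.close();
    }
  }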

December 6, 2012

How We Read….[Does Your Topic Map Contribute to Information Overload?]

Filed under: Indexing,Information Overload,Interface Research/Design,Usability,Users — Patrick Durusau @ 11:43 am

How we read, not what we read, may be contributing to our information overload by Justin Ellis.

From the post:

Every day, a new app or service arrives with the promise of helping people cut down on the flood of information they receive. It’s the natural result of living in a time when an ever-increasing number of news providers push a constant stream of headlines at us every day.

But what if it’s the ways we choose to read the news — not the glut of news providers — that make us feel overwhelmed? An interesting new study out of the University of Texas looks at the factors that contribute to the concept of information overload, and found that, for some people, the platform on which news is being consumed can make all the difference between whether you feel overwhelmed.

The study, “News and the Overloaded Consumer: Factors Influencing Information Overload Among News Consumers” was conducted by Avery Holton and Iris Chyi. They surveyed more than 750 adults on their digital consumption habits and perceptions of information overload. On the central question of whether they feel overloaded with the amount of news available, 27 percent said “not at all”; everyone else reported some degree of overload.

The results imply that the more constrained the content-delivery platform, the less overwhelmed users feel: reading news on a cell phone at one extreme, for example, and the links and videos on Facebook at the other.

Which makes me curious about information interfaces in general and topic map interfaces in particular.

Does the traditional topic map interface (think Omnigator) contribute to a feeling of information overload?

If so, how would you alter that display to offer the user less information by default but allow its expansion upon request?

Compare to a book index, which offers sparse information on a subject, that can be expanded by following a pointer to fuller treatment of a subject.

I don’t think replicating a print index with hyperlinks in place of traditional references is the best solution but it might be a starting place for consideration.

November 21, 2012

Lucene with Zing, Part 2

Filed under: Indexing,Java,Lucene,Zing JVM — Patrick Durusau @ 9:22 am

Lucene with Zing, Part 2 by Mike McCandless.

From the post:

When I last tested Lucene with the Zing JVM the results were impressive: Zing’s fully concurrent C4 garbage collector had very low pause times with the full English Wikipedia index (78 GB) loaded into RAMDirectory, which is not an easy feat since we know RAMDirectory is stressful for the garbage collector.

I had used Lucene 4.0.0 alpha for that first test, so I decided to re-test using Lucene’s 4.0.0 GA release and, surprisingly, the results changed! MMapDirectory’s max throughput was now better than RAMDirectory’s (versus being much lower before), and the concurrent mark/sweep collector (-XX:+UseConcMarkSweepGC) was no longer hitting long GC pauses.

This was very interesting! What change could improve MMapDirectory’s performance, and lower the pressure on concurrent mark/sweep’s GC to the point where pause times were so much lower in GA compared to alpha?

Mike updates his prior experience with Lucene and Zing.

Covers the use of gcLogAnalyser and Fragger to understand “why” his performance test results changed from the alpha to the GA release.

Insights into both Lucene and Zing.

Have you considered loading your topic map into RAM?

October 15, 2012

Information needs of public health practitioners: a review of the literature [Indexing Needs]

Filed under: Biomedical,Indexing,Medical Informatics,Searching — Patrick Durusau @ 4:37 am

Information needs of public health practitioners: a review of the literature by Jennifer Ford and Helena Korjonen.

Abstract:

Objective

To review published literature covering the information needs of public health practitioners and papers highlighting gaps and potential solutions in order to summarise what is already known about this population and models tested to support them.

Methods

The search strategy included bibliographic databases LISTA, LISA, PubMed and Web of Knowledge. The results of this literature review were used to create two tables displaying published literature.

Findings

The literature highlighted that some research has taken place into different public health subgroups with consistent findings. Gaps in information provision have also been identified by looking at the information services provided.

Conclusion

There is a need for further research into information needs in subgroups of public health practitioners as this group is diverse, has different needs and needs varying information. Models of informatics that can support public health must be developed and published so that the public health information community can share experiences and solutions and begin to build an evidence-base to produce superior information systems for the goal of a healthier society.

One of the key points for topic map advocates:

The need for improved indexing of public health information was highlighted by Alpi, discussing the role of expert searching in public health information retrieval [2]. Existing taxonomies such as the MeSH system used by PubMed/Medline are perceived as inadequate for indexing the breadth of public health literature and are seen to be too clinically focussed [2]. There is also concern at the lack of systematic indexing of grey literature [2]. Given that more than one study has highlighted the high level of use of grey literature by public health practitioners, this lack of indexing should be of real concern to public health information specialists and practitioners. LaPelle also found that participants in her research had experienced difficulties with search terms for public health which is indicative of the current inadequacy of public health indexing [1].

Other opportunities for topic maps are apparent in the literature review, but inadequate indexing should be topic maps’ bread and butter.

October 6, 2012

Webinar: Introduction to TokuDB v6.5 (Oct. 10, 2012)

Filed under: Fractal Trees,Indexing,MariaDB,MySQL,TokuDB — Patrick Durusau @ 3:37 pm

Webinar: Introduction to TokuDB v6.5

From the post:

TokuDB® is a proven solution that scales MySQL® and MariaDB® from GBs to TBs with unmatched insert and query speed, compression, replication performance and online schema flexibility. Tokutek’s recently launched TokuDB v6.5 delivers all of these features and more, not just for HDDs, but also for flash memory.

Date: October 10th
Time: 2 PM EST / 11 AM PST
REGISTER TODAY

TokuDB v6.5:

  • Stores 10x More Data – TokuDB delivers 10x compression without any performance degradation. Users can therefore take advantage of much greater amounts of available space without paying more for additional storage.
  • Delivers High Insertion Speed – TokuDB Fractal Tree® indexes continue to change the game with huge insertion rates and greater scalability. Our latest release delivers an order of magnitude faster insertion performance than the competition, ideal for applications that must simultaneously query and update large volumes of rapidly arriving data (e.g., clickstream analytics).
  • Allows Hot Schema Changes — Hot column addition/deletion/rename/resize provides the ability to add/drop/change a column to a database without taking the database offline, enabling database administrators to redefine or add new fields with no downtime.
  • Extends Wear Life for Flash – TokuDB’s proprietary Fractal Tree indexing writes fewer, larger blocks which reduces overall wear, and more efficiently utilizes the FTL (Flash Translation Layer). This extends the life of flash memory by an order of magnitude for many applications.

This webinar covers TokuDB features, latest performance results, and typical use cases.

You have seen the posts about fractal indexing! Now see the demos!

September 28, 2012

Three Ways that Fractal Tree Indexes Improve SSD for MySQL

Filed under: Fractal Trees,Indexing,MySQL — Patrick Durusau @ 8:34 am

Three Ways that Fractal Tree Indexes Improve SSD for MySQL

The three advantages:

  • Advantage 1: Index maintenance performance.
  • Advantage 2: Compression.
  • Advantage 3: Reduced wear.

See the post for details and the impressive numbers one expects from Fractal tree indexes.

September 22, 2012

Damn Cool Algorithms: Levenshtein Automata

Filed under: Indexing,Levenshtein Distance — Patrick Durusau @ 3:06 pm

Damn Cool Algorithms: Levenshtein Automata by Nick Johnson.

From the post:

In a previous Damn Cool Algorithms post, I talked about BK-trees, a clever indexing structure that makes it possible to search for fuzzy matches on a text string based on Levenshtein distance – or any other metric that obeys the triangle inequality. Today, I’m going to describe an alternative approach, which makes it possible to do fuzzy text search in a regular index: Levenshtein automata.

Introduction

The basic insight behind Levenshtein automata is that it’s possible to construct a Finite state automaton that recognizes exactly the set of strings within a given Levenshtein distance of a target word. We can then feed in any word, and the automaton will accept or reject it based on whether the Levenshtein distance to the target word is at most the distance specified when we constructed the automaton. Further, due to the nature of FSAs, it will do so in O(n) time with the length of the string being tested. Compare this to the standard Dynamic Programming Levenshtein algorithm, which takes O(mn) time, where m and n are the lengths of the two input words! It’s thus immediately apparent that Levenshtein automata provide, at a minimum, a faster way for us to check many words against a single target word and maximum distance – not a bad improvement to start with!

Of course, if that were the only benefit of Levenshtein automata, this would be a short article. There’s much more to come, but first let’s see what a Levenshtein automaton looks like, and how we can build one.

Not recent but I think you will enjoy the post anyway.

I first saw this at DZone.
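
The post builds a real DFA, which is what makes the O(n) claim work. As a much smaller illustration of the accept/reject behaviour, here is my own sketch of the row-stepping view, where each dynamic-programming row plays the role of an automaton state. It answers the same question (is this candidate within distance k of the target?) but does O(mn) work per candidate, so it is the thing the compiled automaton improves on, not a replacement for it:

  public final class LevenshteinSketch {

    // row[i] = edit distance between the input consumed so far and the
    // first i characters of the target; each row acts as an automaton state.
    public static boolean withinDistance(String target, String candidate, int k) {
      int n = target.length();
      int[] row = new int[n + 1];
      for (int i = 0; i <= n; i++) row[i] = i;            // start state

      for (int j = 0; j < candidate.length(); j++) {
        int[] next = new int[n + 1];
        next[0] = j + 1;
        for (int i = 1; i <= n; i++) {
          int cost = target.charAt(i - 1) == candidate.charAt(j) ? 0 : 1;
          next[i] = Math.min(Math.min(row[i] + 1,         // deletion
                                      next[i - 1] + 1),   // insertion
                             row[i - 1] + cost);          // match / substitution
        }
        row = next;
        boolean alive = false;                            // dead-state check
        for (int v : row) if (v <= k) { alive = true; break; }
        if (!alive) return false;
      }
      return row[n] <= k;                                 // accepting state?
    }

    public static void main(String[] args) {
      System.out.println(withinDistance("food", "fxod", 1)); // true
      System.out.println(withinDistance("food", "fxxd", 1)); // false
    }
  }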

Real-Time Twitter Search by @larsonite

Filed under: Indexing,Java,Relevance,Searching,Tweets — Patrick Durusau @ 1:18 pm

Real-Time Twitter Search by @larsonite by Marti Hearst.

From the post:

Brian Larson gives a brilliant technical talk about how real-time search works at Twitter. He really knows what he’s talking about, given that he’s the tech lead for search and relevance at Twitter!

The coverage of real-time indexing, the Java memory model, and safe publication was particularly good.

As a bonus, also discusses relevance near the end of the presentation.

You may want to watch this more than once!

Brian recommends Java Concurrency in Practice by Brian Goetz as having good coverage of the Java memory model.
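
Safe publication is one of those Java memory model points that sounds abstract until it bites you. This is not Twitter’s code, just a minimal illustration of the idea: an immutable snapshot swapped in through a volatile field so reader threads are guaranteed to see a fully constructed object.

  // Not Twitter's code -- a minimal illustration of safe publication.
  final class SegmentSnapshot {
    final int[] postings;                       // final fields publish safely
    SegmentSnapshot(int[] postings) { this.postings = postings.clone(); }
  }

  final class SearcherHandle {
    // volatile supplies the happens-before edge; without it a reader thread
    // could keep seeing a stale snapshot indefinitely.
    private volatile SegmentSnapshot current = new SegmentSnapshot(new int[0]);

    void publish(SegmentSnapshot snapshot) { current = snapshot; } // writer thread
    SegmentSnapshot snapshot() { return current; }                 // reader threads
  }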

September 14, 2012

Looking for MongoDB users to test Fractal Tree Indexing

Filed under: Fractal Trees,Indexing,MongoDB,Tokutek — Patrick Durusau @ 10:03 am

Looking for MongoDB users to test Fractal Tree Indexing by Tim Callaghan.

In my three previous blogs I wrote about our implementation of Fractal Tree Indexes on MongoDB, showing a 10x insertion performance increase, a 268x query performance increase, and a comparison of covered indexes and clustered indexes. The benchmarks show the difference that rich and efficient indexing can make to your MongoDB workload.

It’s one thing for us to benchmark MongoDB + TokuDB and another to measure real world performance. If you are looking for a way to improve the performance or scalability of your MongoDB deployment, we can help and we’d like to hear from you. We have a preview build available for MongoDB v2.2 that you can run with your existing data folder, drop/add Fractal Tree Indexes, and measure the performance differences. Please email me at tim@tokutek.com if interested.

Here is your chance to try these speed improvements out on your data!

September 2, 2012

Understanding Indexing: …[Tokutek]

Filed under: Indexing,Tokutek — Patrick Durusau @ 6:34 pm

Understanding Indexing: Three rules on making indexes around queries to provide good performance (video) – slides

Tim Callaghan mentions this as coming up but here is the description from the on-demand version of this webinar:

Application performance often depends on how fast a query can respond and query performance almost always depends on good indexing. So one of the quickest and least expensive ways to increase application performance is to optimize the indexes. This talk presents three simple and effective rules on how to construct indexes around queries that result in good performance.

This is a general discussion applicable to all databases using indexes and is not specific to any particular MySQL® storage engine (e.g., InnoDB, TokuDB®, etc.). The rules are explained using a simple model that does NOT rely on understanding B-trees, Fractal Tree® indexing, or any other data structure used to store the data on disk.

Zardosht Kasheff presenting.

September 1, 2012

An Indexing Structure for Dynamic Multidimensional Data in Vector Space

Filed under: Indexing,Multidimensional,Vector Space Model (VSM) — Patrick Durusau @ 3:48 pm

An Indexing Structure for Dynamic Multidimensional Data in Vector Space by Elena Mikhaylova, Boris Novikov and Anton Volokhov. (Advances in Databases and Information Systems, Advances in Intelligent Systems and Computing, 2013, Volume 186, 185-193, DOI: 10.1007/978-3-642-32741-4_17)

Abstract:

The multidimensional k – NN (k nearest neighbors) query problem is relevant to a large variety of database applications, including information retrieval, natural language processing, and data mining. To solve it efficiently, the database needs an indexing structure that provides this kind of search. However, attempts to find an exact solution are hardly feasible in multidimensional space. In this paper, a novel indexing technique for the approximate solution of k – NN problem is described and analyzed. The construction of the indexing tree is based on clustering. Indexing structure is implemented on top of high-performance industrial DBMS.

The review of recent work is helpful but when the paper reaches the algorithm for indexing “…dynamic multidimensional data…,” it slips away from me.

Where is the dynamic nature of the data that is being overcome by the indexing?

I ask because we human observers are untroubled by the curse of dimensionality, even when data is dynamically changing.

Those are, however, two important aspects when we process data by machine:

  • The number of dimensions of data, and
  • The rate at which the data is changing.

August 27, 2012

HAIL – Only Aggressive Elephants are Fast Elephants

Filed under: Hadoop,Indexing — Patrick Durusau @ 6:26 pm

HAIL – Only Aggressive Elephants are Fast Elephants

From the post:

Typically we store data based on any one of the different physical layouts (such as row, column, vertical, PAX etc). And this choice determines its suitability for a certain kind of workload while making it less optimal for other kinds of workloads. Can we store data under different layouts at the same time? Especially within a HDFS environment where each block is replicated a few times. This is the big idea that HAIL (Hadoop Aggressive Indexing Library) pursues.

At a very high level it looks like to understand the working of HAIL we will have to look at the three distinct workflows the system is organized around namely –

  1. The data/file upload pipeline
  2. The indexing pipeline
  3. The query pipeline

Every unit of information makes its journey through these three pipelines.

Be sure to see the original paper.

How much of what we “know” about modeling is driven by the needs of ancestral storage layouts?

Given the performance of modern chips, are those “needs” still valid considerations?

Or perhaps better, at what size data store or processing requirement do the physical storage model needs re-assert themselves?

Not just a performance question but also one of uniformity of identification.

What was once a “performance” requirement, that data have some common system of identification, may no longer be the case.

August 1, 2012

Indexes in RAM?

Filed under: Indexing,Lucene,Zing JVM — Patrick Durusau @ 6:35 am

The Mike McCandless post: Lucene index in RAM with Azul’s Zing JVM will help make your case for putting your index in RAM!

From the post:

Google’s entire index has been in RAM for at least 5 years now. Why not do the same with an Apache Lucene search index?

RAM has become very affordable recently, so for high-traffic sites the performance gains from holding the entire index in RAM should quickly pay for the up-front hardware cost.

The obvious approach is to load the index into Lucene’s RAMDirectory, right?

Unfortunately, this class is known to put a heavy load on the garbage collector (GC): each file is naively held as a List of byte[1024] fragments (there are open Jira issues to address this but they haven’t been committed yet). It also has unnecessary synchronization. If the application is updating the index (not just searching), another challenge is how to persist ongoing changes from RAMDirectory back to disk. Startup is much slower as the index must first be loaded into RAM. Given these problems, Lucene developers generally recommend using RAMDirectory only for small indices or for testing purposes, and otherwise trusting the operating system to manage RAM by using MMapDirectory (see Uwe’s excellent post for more details).

While there are open issues to improve RAMDirectory (LUCENE-4123 and LUCENE-3659), they haven’t been committed and many users simply use RAMDirectory anyway.

Recently I heard about the Zing JVM, from Azul, which provides a pauseless garbage collector even for very large heaps. In theory the high GC load of RAMDirectory should not be a problem for Zing. Let’s test it! But first, a quick digression on the importance of measuring search response time of all requests.

There are obvious speed advantages to holding indexes in RAM.

Curious, is RAM just a quick disk? Or do we need to think about data structures/access differently with RAM? Pointers?
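
If you want to experiment, the two options look like this. A sketch assuming the Lucene 4.x API, with a hypothetical index path; Mike’s post explains why the MMapDirectory route is the usual recommendation:

  import java.io.File;
  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.IOContext;
  import org.apache.lucene.store.MMapDirectory;
  import org.apache.lucene.store.RAMDirectory;

  public class RamOrMmap {
    public static void main(String[] args) throws Exception {
      File indexDir = new File("/path/to/index");   // hypothetical path

      // Option 1: copy the whole on-disk index onto the JVM heap.
      // Heavy on the garbage collector for large indexes (see the post).
      Directory ram = new RAMDirectory(new MMapDirectory(indexDir), IOContext.READ);

      // Option 2: memory-map the files and let the OS page cache keep them hot.
      Directory mmap = new MMapDirectory(indexDir);

      IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(mmap));
      // ... run queries ...
    }
  }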

July 31, 2012

Running a UIMA Analysis Engine in a Lucene Analyzer Chain

Filed under: Indexing,Lucene,UIMA — Patrick Durusau @ 4:41 pm

Running a UIMA Analysis Engine in a Lucene Analyzer Chain by Sujit Pal.

From the post:

Last week, I wrote about a UIMA Aggregate Analysis Engine (AE) that annotates keywords in a body of text, optionally inserting synonyms, using a combination of pattern matching and dictionary lookups. The idea is that this analysis will be done on text on its way into a Lucene index. So this week, I describe the Lucene Analyzer chain that I built around the AE I described last week.

A picture is worth a thousand words, so here is one that shows what I am (or will be soon, in much greater detail) talking about.

[Graphic omitted]

As you can imagine, most of the work happens in the UimaAETokenizer. The tokenizer is a buffering (non-streaming) Tokenizer, ie, the entire text is read from the Reader and analyzed by the UIMA AE, then individual tokens returned on successive calls to its incrementToken() method. I decided to use the new (to me) AttributeSource.State object to keep track of the tokenizer’s state between calls to incrementToken() (found out about it by grokking through the Synonym filter example in the LIA2 book).

After (UIMA) analysis, the annotated tokens are marked as Keyword, any transformed values for the annotation are set into the SynonymMap (for use by the synonym filter, next in the chain). Text that is not annotated is split up (by punctuation and whitespace) and returned as plain Lucene Term (or CharTerm since Lucene 3.x) tokens. Here is the code for the Tokenizer class.

The second of two posts from Jack Park.

Part of my continuing interest in indexing. In part because we know that indexing scales. Seriously scales.
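
The buffering pattern Sujit describes is worth seeing stripped of the UIMA parts. This is my own skeleton, assuming the Lucene 4.x analysis API; the AnalyzedToken holder and the analyze() hook are hypothetical names standing in for the UIMA analysis engine call:

  import java.io.IOException;
  import java.io.Reader;
  import java.util.Iterator;
  import java.util.List;
  import org.apache.lucene.analysis.Tokenizer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
  import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

  // Buffering (non-streaming) Tokenizer: the whole text is analyzed up front,
  // then tokens are replayed one per incrementToken() call.
  public abstract class BufferingTokenizer extends Tokenizer {

    private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAttr = addAttribute(OffsetAttribute.class);
    private Iterator<AnalyzedToken> tokens;

    protected BufferingTokenizer(Reader input) { super(input); }

    // The expensive analysis step (e.g. the UIMA AE) goes here.
    protected abstract List<AnalyzedToken> analyze(String text) throws IOException;

    @Override
    public final boolean incrementToken() throws IOException {
      if (tokens == null) {                       // first call: buffer everything
        tokens = analyze(readAll(input)).iterator();
      }
      if (!tokens.hasNext()) return false;
      clearAttributes();
      AnalyzedToken t = tokens.next();
      termAttr.setEmpty().append(t.text);
      offsetAttr.setOffset(t.start, t.end);
      return true;
    }

    @Override
    public void reset() throws IOException {
      super.reset();
      tokens = null;                              // allow reuse on a new Reader
    }

    private static String readAll(Reader r) throws IOException {
      StringBuilder sb = new StringBuilder();
      char[] buf = new char[4096];
      for (int n; (n = r.read(buf)) != -1; ) sb.append(buf, 0, n);
      return sb.toString();
    }

    public static final class AnalyzedToken {
      final String text; final int start; final int end;
      public AnalyzedToken(String text, int start, int end) {
        this.text = text; this.start = start; this.end = end;
      }
    }
  }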

UIMA Analysis Engine for Keyword Recognition and Transformation

Filed under: Indexing,Lucene,UIMA — Patrick Durusau @ 4:39 pm

UIMA Analysis Engine for Keyword Recognition and Transformation by Sujit Pal.

From the post:

You have probably noticed that I’ve been playing with UIMA lately, perhaps a bit aimlessly. One of my goals with UIMA is to create an Analysis Engine (AE) that I can plug into the front of the Lucene analyzer chain for one of my applications. The AE would detect and mark keywords in the input stream so they would be exempt from stemming by downstream Lucene analyzers.

So couple of weeks ago, I picked up the bits and pieces of UIMA code that I had written and started to refactor them to form a sequence of primitive AEs that detected keywords in text using pattern and dictionary recognition. Each primitive AE places new KeywordAnnotation objects into an annotation index.

The primitive AEs I came up with are pretty basic, but offer a surprising amount of bang for the buck. There are just two annotators – the PatternAnnotator and DictionaryAnnotator – that do the processing for my primitive AEs listed below. Obviously, more can be added (and will, eventually) as required.

  • Pattern based keyword recognition
  • Pattern based keyword recognition and transformation
  • Dictionary based keyword recognition, case sensitive
  • Dictionary based keyword recognition and transformation, case sensitive
  • Dictionary based keyword recognition, case insensitive
  • Dictionary based keyword recognition and transformation, case insensitive

The first of two posts that I missed from last year, recently brought to my attention by Jack Park.

The ability to annotate implies, among other things, the ability to create synonym annotations for keywords.

July 29, 2012

Building a new Lucene postings format

Filed under: Indexing,Lucene — Patrick Durusau @ 10:08 am

Building a new Lucene postings format by Mike McCandless.

From the post:

As of 4.0 Lucene has switched to a new pluggable codec architecture, giving the application full control over the on-disk format of all index files. We have a nice collection of builtin codec components, and developers can create their own such as this recent example using a Redis back-end to hold updatable fields. This is an important change since it removes the previous sizable barriers to innovating on Lucene’s index formats.

A codec is actually a collection of formats, one for each part of the index. For example, StoredFieldsFormat handles stored fields, NormsFormat handles norms, etc. There are eight formats in total, and a codec could simply be a new mix of pre-existing formats, or perhaps you create your own TermVectorsFormat and otherwise use all the formats from the Lucene40 codec, for example.

Current testing of formats requires the entire format be specified, which means errors are hard to diagnose.

Mike addresses that problem by creating a layered testing mechanism.

Great stuff!

PS: I think it will also be useful as an educational tool. Changing defined formats and testing as changes are made.
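
Mike’s point that a codec can be “a new mix of pre-existing formats” is easy to demonstrate. A sketch, assuming the Lucene 4.1 per-field codec API; the “id” field is a hypothetical example, and the “Memory” postings format name is illustrative and may require the lucene-codecs jar on the classpath:

  import org.apache.lucene.codecs.PostingsFormat;
  import org.apache.lucene.codecs.lucene41.Lucene41Codec;
  import org.apache.lucene.index.IndexWriterConfig;

  // The simplest "new codec": an existing one with a single format swapped.
  // Here a hypothetical "id" field uses the RAM-resident Memory postings
  // format while every other field keeps the 4.1 defaults.
  public class PerFieldCodecSketch {
    public static void configure(IndexWriterConfig config) {
      config.setCodec(new Lucene41Codec() {
        @Override
        public PostingsFormat getPostingsFormatForField(String field) {
          return "id".equals(field)
              ? PostingsFormat.forName("Memory")
              : super.getPostingsFormatForField(field);
        }
      });
    }
  }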

July 26, 2012

Schema.org and One Hundred Years of Search

Filed under: Indexing,Searching,Text Mining,Web History — Patrick Durusau @ 2:13 pm

Schema.org and One Hundred Years of Search by Dan Brickley.

From the post:

Slides and video are already in the Web, but I wanted to post this as an excuse to plug the new Web History Community Group that Max and I have just started at W3C. The talk was part of the Libraries, Media and the Semantic Web meetup hosted by the BBC in March. It gave an opportunity to run through some forgotten history, linking Paul Otlet, the Universal Decimal Classification, schema.org and some 100 year old search logs from Otlet’s Mundaneum. Having worked with the BBC Lonclass system (a descendant of Otlet’s UDC), and collaborated with Aida Slavic of the UDC on their publication of Linked Data, I was happy to be given the chance to try to spell out these hidden connections. It also turned out that Google colleagues have been working to support the Mundaneum and the memory of this early work, and I’m happy that the talk led to discussions with both the Mundaneum and Computer History Museum about the new Web History group at W3C.

Sounds like a great starting point!

But the intellectual history of indexing and search runs far deeper than one hundred years. Our current efforts are likely to profit from a deeper knowledge of our roots.

Understanding Indexing [Webinar]

Filed under: Database,Indexing — Patrick Durusau @ 8:12 am

Understanding Indexing [Webinar]

July 31st 2012 Time: 2PM EDT / 11AM PDT

From the post:

Three rules on making indexes around queries to provide good performance

Application performance often depends on how fast a query can respond and query performance almost always depends on good indexing. So one of the quickest and least expensive ways to increase application performance is to optimize the indexes. This talk presents three simple and effective rules on how to construct indexes around queries that result in good performance.

[graphic button omitted]

This webinar is a general discussion applicable to all databases using indexes and is not specific to any particular MySQL® storage engine (e.g., InnoDB, TokuDB®, etc.). The rules are explained using a simple model that does NOT rely on understanding B-trees, Fractal Tree® indexing, or any other data structure used to store the data on disk.

Indexing is one of those “overloaded” terms in information technologies.

Indexing can refer to:

  1. Database indexing
  2. Search engine indexing
  3. Human indexing

just to name a few of the more obvious uses.

To be sure, you need to be aware of, if not proficient at, all three and this webinar should be a start on #1.

PS: If you know of a more complete typology of indexing, perhaps with pointers into the literature, please give a shout!

July 22, 2012

Apache Lucene 3.6.1 and Apache Solr 3.6.1 available

Filed under: Indexing,Lucene,Solr — Patrick Durusau @ 6:21 pm

Lucene/Solr news on 22 July 2012:

The Lucene PMC is pleased to announce the availability of Apache Lucene 3.6.1 and Apache Solr 3.6.1.

This release is a bug fix release for version 3.6.0. It contains numerous bug fixes, optimizations, and improvements, some of which are highlighted below.

Lucene can be downloaded from http://lucene.apache.org/core/mirrors-core-3x-redir.html and Solr can be downloaded from http://lucene.apache.org/solr/mirrors-solr-3x-redir.html

See the CHANGES.txt file included with the release for a full list of details.

Lucene 3.6.1 Release Highlights:

  • The concurrency of MMapIndexInput.clone() was improved, which caused a performance regression in comparison to Lucene 3.5.0.
  • MappingCharFilter was fixed to return correct final token positions.
  • QueryParser now supports +/- operators with any amount of whitespace.
  • DisjunctionMaxScorer now implements visitSubScorers().
  • Changed the visibility of Scorer#visitSubScorers() to public, otherwise it’s impossible to implement Scorers outside the Lucene package. This is a small backwards break, affecting a few users who implemented custom Scorers.
  • Various analyzer bugs were fixed: Kuromoji to not produce invalid token graph due to UNK with punctuation being decompounded, invalid position length in SynonymFilter, loading of Hunspell dictionaries that use aliasing, be consistent with closing streams when loading Hunspell affix files.
  • Various bugs in FST components were fixed: Offline sorter minimum buffer size, integer overflow in sorter, FSTCompletionLookup missed to close its sorter.
  • Fixed a synchronization bug in handling taxonomies in facet module.
  • Various minor bugs were fixed: BytesRef/CharsRef copy methods with nonzero offsets and subSequence off-by-one, TieredMergePolicy returned wrong-scaled floor segment setting.

Solr 3.6.1 Release Highlights:

  • The concurrency of MMapDirectory was improved, which caused a performance regression in comparison to Solr 3.5.0. This affected users with 64bit platforms (Linux, Solaris, Windows) or those explicitly using MMapDirectoryFactory.
  • ReplicationHandler “maxNumberOfBackups” was fixed to work if backups are triggered on commit.
  • Charset problems were fixed with HttpSolrServer, caused by an upgrade to a new Commons HttpClient version in 3.6.0.
  • Grouping was fixed to return correct count when not all shards are queried in the second pass. Solr no longer throws Exception when using result grouping with main=true and using wt=javabin.
  • Config file replication was made less error prone.
  • Data Import Handler threading fixes.
  • Various minor bugs were fixed.

What a nice way to start the week!

Thanks to the Lucene PMC!

July 14, 2012

Text Mining Methods Applied to Mathematical Texts

Filed under: Indexing,Mathematics,Mathematics Indexing,Search Algorithms,Searching — Patrick Durusau @ 10:49 am

Text Mining Methods Applied to Mathematical Texts (slides) by Yannis Haralambous, Département Informatique, Télécom Bretagne.

Abstract:

Up to now, flexiform mathematical text has mainly been processed with the intention of formalizing mathematical knowledge so that proof engines can be applied to it. This approach can be compared with the symbolic approach to natural language processing, where methods of logic and knowledge representation are used to analyze linguistic phenomena. In the last two decades, a new approach to natural language processing has emerged, based on statistical methods and, in particular, data mining. This method, called text mining, aims to process large text corpora, in order to detect tendencies, to extract information, to classify documents, etc. In this talk I will present math mining, namely the potential applications of text mining to mathematical texts. After reviewing some existing works heading in that direction, I will formulate and describe several roadmap suggestions for the use and applications of statistical methods to mathematical text processing: (1) using terms instead of words as the basic unit of text processing, (2) using topics instead of subjects (“topics” in the sense of “topic models” in natural language processing, and “subjects” in the sense of various mathematical subject classifications), (3) using and correlating various graphs extracted from mathematical corpora, (4) use paraphrastic redundancy, etc. The purpose of this talk is to give a glimpse on potential applications of the math mining approach on large mathematical corpora, such as arXiv.org.

An invited presentation at CICM 2012.

I know Yannis from a completely different context and may comment on that in another post.

No paper but 50+ slides showing existing text mining tools can deliver useful search results, while waiting for a unified and correct index to all of mathematics. 😉

Varying semantics, as in all human enterprises, is an opportunity for topic map based assistance.

An XML-Format for Conjectures in Geometry (Work-in-Progress)

Filed under: Geometry,Indexing,Keywords,Mathematical Reasoning,Mathematics,Ontology — Patrick Durusau @ 10:33 am

An XML-Format for Conjectures in Geometry (Work-in-Progress) by Pedro Quaresma.

Abstract:

With a large number of software tools dedicated to the visualisation and/or demonstration of properties of geometric constructions and also with the emerging of repositories of geometric constructions, there is a strong need of linking them, and making them and their corpora, widely usable. A common setting for interoperable interactive geometry was already proposed, the i2g format, but, in this format, the conjectures and proofs counterparts are missing. A common format capable of linking all the tools in the field of geometry is missing. In this paper an extension of the i2g format is proposed, this extension is capable of describing not only the geometric constructions but also the geometric conjectures. The integration of this format into the Web-based GeoThms, TGTP and Web Geometry Laboratory systems is also discussed.

The author notes open questions as:

  • The xml format must be complemented with an extensive set of converters allowing the exchange of information between as many geometric tools as possible.
  • The databases queries, as in TGTP, raise the question of selecting appropriate keywords. A fine grain index and/or an appropriate geometry ontology should be addressed.
  • The i2gatp format does not address proofs. Should we try to create such a format? The GATPs produce proofs in quite different formats; maybe the construction of such a unifying format is not possible and/or desirable in this area.

The “keywords,” “fine grain index,” and “geometry ontology” questions yell “topic map” to me.

You?

PS: Converters and different formats also say “topic map,” just not as loudly to me. Your volume may vary. (YVMV)


July 5, 2012

Search and Counterfactuals

Filed under: Counterfactual,Indexing,Search Algorithms,Searching — Patrick Durusau @ 6:53 pm

In Let the Children Play, It’s Good for Them! (Smithsonian, July/August 2012) Alison Gopnik writes:

Walk into any preschool and you’ll find toddling superheroes battling imaginary monsters. We take it for granted that young children play and, especially, pretend. Why do they spend so much time in fantasy worlds?

People have suspected that play helps children learn, but until recently there was little research that showed this or explained why it might be true. In my lab at the University of California at Berkeley, we’ve been trying to explain how very young children can learn so much so quickly, and we’ve developed a new scientific approach to children’s learning.

Where does pretending come in? It relates to what philosophers call “counterfactual” thinking, like Einstein wondering what would happen if a train went at the speed of light.

Do our current models for search encourage or discourage counterfactual thinking? Neutral?

There is place for “factual” queries: Has “Chipper” Jones, who plays for the Atlanta Braves, ever hit safely 5 out of 5 times in a game? 😉

But what of counterfactuals?

Do they lead us to new forms of indexing? By re-imagining how searching could be done, if and only if there were a new indexing structure?

Are advances in algorithms largely due to counterfactuals? Where the “factuals” are the world of processing as previously imagined?

We can search for the “factuals,” prior answers approved by authorities, but how does one search for a counterfactual?

Or search for what triggers a counterfactual?

I don’t have even an inkling of an answer, or of what an answer might look like, but thought it would be worth asking the question.

INSTRUCT: Space-Efficient Structure for Indexing and Complete Query Management of String Databases

Filed under: Indexing,Searching,String Matching — Patrick Durusau @ 3:21 pm

INSTRUCT: Space-Efficient Structure for Indexing and Complete Query Management of String Databases by Sourav Dutta and Arnab Bhattacharya.

Abstract:

The tremendous expanse of search engines, dictionary and thesaurus storage, and other text mining applications, combined with the popularity of readily available scanning devices and optical character recognition tools, has necessitated efficient storage, retrieval and management of massive text databases for various modern applications. For such applications, we propose a novel data structure, INSTRUCT, for efficient storage and management of sequence databases. Our structure uses bit vectors for reusing the storage space for common triplets, and hence, has a very low memory requirement. INSTRUCT efficiently handles prefix and suffix search queries in addition to the exact string search operation by iteratively checking the presence of triplets. We also propose an extension of the structure to handle substring search efficiently, albeit with an increase in the space requirements. This extension is important in the context of trie-based solutions which are unable to handle such queries efficiently. We perform several experiments portraying that INSTRUCT outperforms the existing structures by nearly a factor of two in terms of space requirements, while the query times are better. The ability to handle insertion and deletion of strings in addition to supporting all kinds of queries including exact search, prefix/suffix search and substring search makes INSTRUCT a complete data structure.

From the introduction:

As all strings are composed of a defined set of characters, reusing the storage space for common characters promises to provide the most compressed form of representation. This redundancy linked with the need for extreme space-efficient index structures motivated us to develop INSTRUCT (INdexing STrings by Re-Using Common Triplets).

By the time of a presentation (below) on the technique, the authors apparently rethought the name, settling on:

SPACE-EFFICIENT MANAGEMENT OF TEXT USING INDEXED KEYS” (SEManTIKs)

Neither one really “rolls off the tongue,” but I suspect searching for the second may be somewhat easier.

I say that, but “semantiks” turns out to be a women’s clothing line and at least one popular search engine offers to correct to “semantics.” I am sure a “gathered scoop neck” and a “flattering boot-cut silhouette” are all quite interesting but not really on point.

The slide presentation lists some fourteen (14) other approaches that can be compared to the one developed by the authors. (I am assuming the master’s thesis by Dutta has the details on the comparisons with the other techniques. I haven’t found it online but have written to request a copy.)

This work demonstrates that we are nowhere near the end of improvements for indexing and search.

See also the presentation: SPACE-EFFICIENT MANAGEMENT OF STRING DATABASES BY REUSING COMMON CHARACTERS by Sourav Dutta.
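
To make the triplet idea concrete, here is a toy rendering of my own. It is emphatically not the authors’ INSTRUCT structure: it skips their bit-vector storage sharing and their prefix/suffix machinery, and just shows how indexing strings by overlapping character triplets turns substring search into triplet-set intersection plus verification:

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.HashMap;
  import java.util.HashSet;
  import java.util.List;
  import java.util.Map;
  import java.util.Set;

  // Toy triplet index (not the authors' code): strings are indexed by their
  // overlapping character triplets; a substring query intersects the posting
  // sets of its triplets and then verifies the surviving candidates.
  public class TripletIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final List<String> strings = new ArrayList<>();

    public void add(String s) {
      int id = strings.size();
      strings.add(s);
      for (String t : tripletsOf(s)) {
        postings.computeIfAbsent(t, k -> new HashSet<>()).add(id);
      }
    }

    public List<String> substringSearch(String q) {
      // Queries shorter than three characters would need a fallback scan (omitted).
      Set<Integer> candidates = null;
      for (String t : tripletsOf(q)) {
        Set<Integer> ids = postings.getOrDefault(t, Collections.emptySet());
        candidates = (candidates == null) ? new HashSet<>(ids) : candidates;
        candidates.retainAll(ids);
        if (candidates.isEmpty()) return Collections.emptyList();
      }
      List<String> result = new ArrayList<>();
      if (candidates != null) {
        for (int id : candidates) {
          if (strings.get(id).contains(q)) result.add(strings.get(id)); // verify
        }
      }
      return result;
    }

    private static List<String> tripletsOf(String s) {
      List<String> out = new ArrayList<>();
      for (int i = 0; i + 3 <= s.length(); i++) out.add(s.substring(i, i + 3));
      return out;
    }
  }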

July 4, 2012

Batch Importer – Part 3 [Neo4j]

Filed under: Indexing,Neo4j — Patrick Durusau @ 3:29 pm

Batch Importer – Part 3 [Neo4j] by Max De Marzi.

From the post:

At the end of February, we took a look at Michael Hunger’s Batch Importer. It is a great tool to load millions of nodes and relationships into Neo4j quickly. The only thing it was missing was Indexing… I say was, because I just submitted a pull request to add this feature. Let’s go through how it was done so you get an idea of what the Neo4j Batch Import API looks like, and in the next blog post I’ll show you how to generate data to take advantage of it.

Another awesome post on Neo4j from Max De Marzi.

Definitely a series to follow.

In case you don’t have the links handy:

Batch Importer – Part 2

Batch Importer – Part 1
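
If you want a feel for the underlying batch API Max is wrapping, here is a bare-bones sketch. It assumes the Neo4j 1.8-era batch insertion classes; this is not Max’s importer code, and the store path, index name and properties are hypothetical:

  import java.util.Map;
  import org.neo4j.helpers.collection.MapUtil;
  import org.neo4j.unsafe.batchinsert.BatchInserter;
  import org.neo4j.unsafe.batchinsert.BatchInserterIndex;
  import org.neo4j.unsafe.batchinsert.BatchInserterIndexProvider;
  import org.neo4j.unsafe.batchinsert.BatchInserters;
  import org.neo4j.unsafe.batchinsert.LuceneBatchInserterIndexProvider;

  public class BatchIndexSketch {
    public static void main(String[] args) {
      BatchInserter inserter = BatchInserters.inserter("target/batch.db");
      BatchInserterIndexProvider indexes = new LuceneBatchInserterIndexProvider(inserter);
      BatchInserterIndex users = indexes.nodeIndex("users", MapUtil.stringMap("type", "exact"));

      Map<String, Object> props = MapUtil.map("name", "Max");
      long node = inserter.createNode(props);   // create the node outside transactions
      users.add(node, props);                   // index it in the same pass

      users.flush();                            // make entries visible to lookups
      indexes.shutdown();
      inserter.shutdown();
    }
  }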

