Archive for the ‘Indexing’ Category

The Cartoon Bank

Friday, May 12th, 2017

The Cartoon Bank by the Condé Nast Collection.

While searching for a cartoon depicting Sean Spicer at a White House news briefing, I encountered The Cartoon Bank.

A great source of instantly recognizable cartoons, but I’m still searching for one I remember from decades ago. 😉

Lucene/Solr 6.0 Hits The Streets! (There goes the weekend!)

Friday, April 8th, 2016

From the Lucene PMC:

The Lucene PMC is pleased to announce the release of Apache Lucene 6.0.0 and Apache Solr 6.0.0

Lucene and Solr can both be downloaded from the Apache Lucene and Solr project sites.

Highlights of this Lucene release include:

  • Java 8 is the minimum Java version required.
  • Dimensional points, replacing legacy numeric fields, provides fast and space-efficient support for both single- and multi-dimension range and shape filtering. This includes numeric (int, float, long, double), InetAddress, BigInteger and binary range filtering, as well as geo-spatial shape search over indexed 2D LatLonPoints. See this blog post for details. Dependent classes and modules (e.g., MemoryIndex, Spatial Strategies, Join module) have been refactored to use new point types.
  • Lucene classification module now works on Lucene Documents using a KNearestNeighborClassifier or SimpleNaiveBayesClassifier.
  • The spatial module no longer depends on third-party libraries. Previous spatial classes have been moved to a new spatial-extras module.
  • Spatial4j has been updated to a new 0.6 version hosted by locationtech.
  • TermsQuery performance boost by a more aggressive default query caching policy.
  • IndexSearcher’s default Similarity is now changed to BM25Similarity.
  • Easier method of defining custom CharTokenizer instances.
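The switch to BM25 as IndexSearcher’s default Similarity is the highlight most users will notice. As a rough illustration, here is a minimal sketch of BM25 scoring over a toy corpus, using the usual k1/b defaults; Lucene’s actual implementation differs in details such as how length normalization is encoded:

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Minimal BM25: sum, over query terms, of IDF * saturated, length-normalized TF."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        if df == 0:
            continue  # term absent from corpus contributes nothing
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        tf = doc.count(term)
        norm = 1 - b + b * len(doc) / avgdl  # penalize longer-than-average docs
        score += idf * tf * (k1 + 1) / (tf + k1 * norm)
    return score

corpus = [["fast", "search", "index"], ["search", "engine"], ["cooking", "recipes"]]
scores = [bm25_score(["search"], d, corpus) for d in corpus]
```

Note how the shorter matching document scores higher than the longer one for the same term frequency; that length normalization is one of the reasons BM25 generally behaves better than the classic TF-IDF Similarity it replaces.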

Highlights of this Solr release include:

  • Improved defaults for “Similarity” used in Solr, in order to provide better default experience for new users.
  • Improved “Similarity” defaults for users upgrading: DefaultSimilarityFactory has been removed, implicit default Similarity has been changed to SchemaSimilarityFactory, and SchemaSimilarityFactory has been modified to use BM25Similarity as the default for field types that do not explicitly declare a Similarity.
  • Deprecated GET methods for schema are now accessible through the bulk API. The output has less details and is not backward compatible.
  • Users should set useDocValuesAsStored="false" to preserve sort order on multi-valued fields that have both stored="true" and docValues="true".
  • Formatted date-times are more consistent with ISO-8601. BC dates are now better supported since they are now formatted with a leading ‘-‘. AD years after 9999 have a leading ‘+’. Parse exceptions have been improved.
  • Deprecated SolrServer and subclasses have been removed, use SolrClient instead.
  • The deprecated configuration in solrconfig.xml has been removed. Users must remove it from solrconfig.xml.
  • SolrClient.shutdown() has been removed, use SolrClient.close() instead.
  • The deprecated zkCredientialsProvider element in solrcloud section of solr.xml is now removed. Use the correct spelling (zkCredentialsProvider) instead.
  • Added support for executing Parallel SQL queries across SolrCloud collections. Includes StreamExpression support and a new JDBC Driver for the SQL Interface.
  • New features and capabilities added to the streaming API.
  • Added support for SELECT DISTINCT queries to the SQL interface.
  • New GraphQuery to enable graph traversal as a query operator.
  • New support for Cross Data Center Replication consisting of active/passive replication for separate SolrClouds hosted in separate data centers.
  • Filter support added to Real-time get.
  • Column alias support added to the Parallel SQL Interface.
  • New command added to switch between non/secure mode in zookeeper.
  • Now possible to use IP fragments in replica placement rules.

For features new to Solr 6.0, be sure to consult the unreleased Solr reference manual. (unreleased as of 8 April 2016)

Happy searching!

600 websites about R [How to Avoid Duplicate Content?]

Sunday, November 8th, 2015

600 websites about R by Laetitia Van Cauwenberge.

From the post:

Anyone interested in categorizing them? It could be an interesting data science project, scraping these websites, extracting keywords, and categorizing them with a simple indexation or tagging algorithm. For instance, some of these blogs cater about stats, or Bayesian stats, or R libraries, or R training, or visualization, or anything else. This indexation technique was used here to classify 2,500 data science websites. For web crawling tutorials, click here or here.

BTW, Laetitia lists, with links, all 600 R sites.

How many of those R sites will you visit?

Or will you scan the list for your site or your favorite R site?

For that matter, how much duplicated content are you going to find at those R sites?

All have some unique content, but neither an index nor classification will help you find unique content.

Thinking of this as a potential data science experiment, we have a list of 600 sites with content related to R.

What would be your next step towards avoiding duplicated content?

By what criteria would you judge “success” in avoiding duplicate content?
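As a first step, near-duplicate detection with word shingles and Jaccard similarity is a common baseline. A minimal sketch (the page texts below are toy examples, not scraped data):

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a | b else 0.0

page1 = "getting started with R a tutorial for beginners"
page2 = "getting started with R a tutorial for new users"   # near-duplicate
page3 = "Bayesian statistics in R with Stan"                 # unrelated
sim_dup = jaccard(shingles(page1), shingles(page2))
sim_diff = jaccard(shingles(page1), shingles(page3))
```

A success criterion could then be stated numerically: for example, flag any pair of pages above some Jaccard threshold as duplicated content.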

One Million Contributors to the Huffington Post

Wednesday, July 1st, 2015

Arianna Huffington’s next million mark by Ken Doctor.

From the post:

Before the end of this year, HuffPost will release new tech and a new app, opening the floodgates for contributors. The goal: Add 900,000 contributors to Huffington Post’s 100,000 current ones. Yes, one million in total.

How fast would Arianna like to get to that number?

“One day,” she joked, as we discussed her latest project, code-named Donatello for the Renaissance sculptor. Lots of people got to be Huffington Post contributors through Arianna Huffington. They’d meet her at book signing, send an email and find themselves hooked up. “It’s one of my favorite things,” she told me Thursday. Now, though, that kind of retail recruitment may be a vestige.

“It’s been an essential part of our DNA,” she said, talking about the user contributions that once seemed to outnumber the A.P. stories and smaller original news staff’s work. “We’ve always been a hybrid platform,” a mix of pros and contributors.

So what enables the new strategy? Technology, naturally.

HuffPost’s new content management system is now being extended to work as a self-publishing platform as well. It will allow contributors to post directly from their smartphones, and add in video. Behind the scenes, a streamlined approval system is intended to reduce human (editor) intervention. Get approved once, then publish away, “while preserving the quality,” Huffington added.

Adding another 900,000 contributors to the Huffington Post is going to bump their content production substantially.

So, here’s the question: Searching the Huffington Post site is as bad as most other media sites. What is adding content from another 900,000 contributors going to do for that experience? Get worse? That’s my first bet.

On the other hand, what if authors could unknowingly create topic maps? For example, auto-tagging could offer Wikipedia links (one or more) for each entity in a story; for relationships, a drop-down menu with roles for the major relationship types (slept-with being available for inside the Beltway); along with auto-generated relationships to the author, events mentioned, and other content at the Huffington Post.

Don’t solve the indexing/search problem after the fact; create smarter data up front. Promote the content with better tagging and relationships. With 1 million unpaid contributors trying to get their contributions noticed, it’s a win-win situation.
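The auto-tagging idea can be sketched as a gazetteer lookup at publication time; everything below (the entities, link targets, and matching strategy) is illustrative, not a real entity-linking system:

```python
# Toy gazetteer: entity surface forms -> hypothetical Wikipedia article names.
GAZETTEER = {
    "sean spicer": "Sean_Spicer",
    "white house": "White_House",
    "huffington post": "The_Huffington_Post",
}

def auto_tag(text):
    """Suggest Wikipedia links for entities found in a story (simple substring scan)."""
    tags = {}
    lowered = text.lower()
    for surface, target in GAZETTEER.items():
        if surface in lowered:
            tags[surface] = "https://en.wikipedia.org/wiki/" + target
    return tags

story = "Sean Spicer briefed reporters at the White House today."
tags = auto_tag(story)
```

In a real pipeline the editor-facing UI would present these suggestions for confirmation, which is exactly the human/machine split that keeps quality up while removing editors from the critical path.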

Civil War Navies Bookworm

Tuesday, May 19th, 2015

Civil War Navies Bookworm by Abby Mullen.

From the post:

If you read my last post, you know that this semester I engaged in building a Bookworm using a government document collection. My professor challenged me to try my system for parsing the documents on a different, larger collection of government documents. The collection I chose to work with is the Official Records of the Union and Confederate Navies. My Barbary Bookworm took me all semester to build; this Civil War navies Bookworm took me less than a day. I learned things from making the first one!

This collection is significantly larger than the Barbary Wars collection—26 volumes, as opposed to 6. It encompasses roughly the same time span, but 13 times as many words. Though it is still technically feasible to read through all 26 volumes, this collection is perhaps a better candidate for distant reading than my first corpus.

The document collection is broken into geographical sections, the Atlantic Squadron, the West Gulf Blockading Squadron, and so on. Using the Bookworm allows us to look at the words in these documents sequentially by date instead of having to go back and forth between different volumes to get a sense of what was going on in the whole navy at any given time.

Before you ask:

The earlier post: Text Analysis on the Documents of the Barbary Wars

More details on Bookworm.

As with all ngram viewers, exercise caution in assuming a text string has uniform semantics across historical, ethnic, or cultural fault lines.
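At its core, the Bookworm-style “words over time” view boils down to counting a term per date bucket. A minimal sketch over a toy dated corpus (the documents below are invented stand-ins, not the Official Records):

```python
from collections import Counter

# Toy corpus: (year, text) pairs standing in for dated documents.
docs = [
    (1861, "blockade of the southern ports begins"),
    (1862, "ironclad monitor engages the blockade runners"),
    (1863, "blockade tightens along the atlantic squadron"),
    (1864, "squadron operations continue"),
]

def term_by_year(docs, term):
    """Count a term's occurrences per year, Bookworm-style."""
    counts = Counter()
    for year, text in docs:
        counts[year] += text.lower().split().count(term)
    return dict(counts)

usage = term_by_year(docs, "blockade")
```

The caution above applies directly: the counts track a string, and nothing in the counting guarantees the string meant the same thing in every year or region.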

Barkan, Bintliff, and Whisner’s Fundamentals of Legal Research, 10th

Monday, April 6th, 2015

Barkan, Bintliff, and Whisner’s Fundamentals of Legal Research, 10th by Steven M Barkan; Barbara Bintliff; Mary Whisner. (ISBN-13: 9781609300562)


This classic textbook has been updated to include the latest methods and resources. Fundamentals of Legal Research provides an authoritative introduction and guide to all aspects of legal research, integrating electronic and print sources. The Tenth Edition includes chapters on the true basics (case reporting, statutes, and so on) as well as more specialized chapters on legislative history, tax law, international law, and the law of the United Kingdom. A new chapter addresses Native American tribal law. Chapters on the research process, legal writing, and citation format help integrate legal research into the larger process of solving legal problems and communicating the solutions. This edition includes an updated glossary of research terms and revised tables and appendixes. Because of its depth and breadth, this text is well suited for advanced legal research classes; it is a book that students will want to retain for future use. Moreover, it has a place on librarians’ and attorneys’ ready reference shelves. Barkan, Bintliff and Whisner’s Assignments to Fundamentals of Legal Research complements the text.

I haven’t seen this volume in hard copy but if you are interested in learning what connections researchers are looking for with search tools, law is a great place to start.

The purpose of legal research isn’t to find the most popular “fact” (Google), or to find every term for a “fact” ever tweeted (Twitter), but rather to find facts and their relationships to other facts, which flesh out into a legal view of a situation in context.

If you think about it, putting legislation, legislative history, court records and decisions, along with non-primary sources online, is barely a start towards making that information “accessible.” A necessary first step but not sufficient for meaningful access.

Building a complete Tweet index

Sunday, April 5th, 2015

Building a complete Tweet index by Yi Zhuang.

Since it is Easter Sunday in many religious traditions, what could be more inspirational than “…a search service that efficiently indexes roughly half a trillion documents and serves queries with an average latency of under 100ms.“?

From the post:

Today [11/8/2014], we are pleased to announce that Twitter now indexes every public Tweet since 2006.

Since that first simple Tweet over eight years ago, hundreds of billions of Tweets have captured everyday human experiences and major historical events. Our search engine excelled at surfacing breaking news and events in real time, and our search index infrastructure reflected this strong emphasis on recency. But our long-standing goal has been to let people search through every Tweet ever published.

This new infrastructure enables many use cases, providing comprehensive results for entire TV and sports seasons, conferences (#TEDGlobal), industry discussions (#MobilePayments), places, businesses and long-lived hashtag conversations across topics, such as #JapanEarthquake, #Election2012, #ScotlandDecides, #HongKong, #Ferguson and many more. This change will be rolling out to users over the next few days.

In this post, we describe how we built a search service that efficiently indexes roughly half a trillion documents and serves queries with an average latency of under 100ms.

The most important factors in our design were:

  • Modularity: Twitter already had a real-time index (an inverted index containing about a week’s worth of recent Tweets). We shared source code and tests between the two indices where possible, which created a cleaner system in less time.
  • Scalability: The full index is more than 100 times larger than our real-time index and grows by several billion Tweets a week. Our fixed-size real-time index clusters are non-trivial to expand; adding capacity requires re-partitioning and significant operational overhead. We needed a system that expands in place gracefully.
  • Cost effectiveness: Our real-time index is fully stored in RAM for low latency and fast updates. However, using the same RAM technology for the full index would have been prohibitively expensive.
  • Simple interface: Partitioning is unavoidable at this scale. But we wanted a simple interface that hides the underlying partitions so that internal clients can treat the cluster as a single endpoint.
  • Incremental development: The goal of “indexing every Tweet” was not achieved in one quarter. The full index builds on previous foundational projects. In 2012, we built a small historical index of approximately two billion top Tweets, developing an offline data aggregation and preprocessing pipeline. In 2013, we expanded that index by an order of magnitude, evaluating and tuning SSD performance. In 2014, we built the full index with a multi-tier architecture, focusing on scalability and operability.
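The “simple interface” goal above can be sketched as a scatter-gather layer that hides the partitions behind a single call. Everything here (the partition layout, tuple shape, and merge rule) is a hypothetical illustration, not Twitter’s code:

```python
import heapq

# Hypothetical partitions: each holds (timestamp, tweet_id, text) tuples.
partitions = [
    [(95, "t5", "lucene rocks"), (40, "t2", "hello world")],
    [(90, "t4", "solr search"), (60, "t3", "search tips")],
    [(99, "t6", "index all tweets"), (10, "t1", "first tweet")],
]

def search_all(partitions, term, limit=3):
    """Scatter a query to every partition, gather hits, merge newest-first."""
    hits = []
    for part in partitions:                      # scatter phase
        hits.extend(h for h in part if term in h[2])
    return heapq.nlargest(limit, hits)           # gather: rank by recency

results = search_all(partitions, "search")
```

Internal clients see one endpoint (`search_all`); re-partitioning or adding capacity changes only what sits behind it.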

If you are interested in scaling search issues, this is a must read post!

Kudos to Twitter Engineering!

PS: Of course all we need now is a complete index to Hillary Clinton’s emails. The NSA probably has a copy.

You know, the NSA could keep the same name, National Security Agency, and take over providing backups and verification for all email and web traffic, including the cloud. Would have to work on who could request copies but that would resolve the issue of backups of the Internet rather neatly. No more deleted emails, tweets, etc.

That would be a useful function, as opposed to harvesting phone data on the premise that at some point in the future it might prove to be useful, despite having not proved useful in the past.

Apache Lucene 5.0.0

Sunday, February 22nd, 2015

Apache Lucene 5.0.0

For the impatient:

Lucene CHANGES.txt

From the post:

Highlights of the Lucene release include:

Stronger index safety

  • All file access now uses Java’s NIO.2 APIs which give Lucene stronger index safety in terms of better error handling and safer commits.
  • Every Lucene segment now stores a unique id per-segment and per-commit to aid in accurate replication of index files.
  • During merging, IndexWriter now always checks the incoming segments for corruption before merging. This can mean, on upgrading to 5.0.0, that merging may uncover long-standing latent corruption in an older 4.x index.

Reduced heap usage

  • Lucene now supports random-writable and advance-able sparse bitsets (RoaringDocIdSet and SparseFixedBitSet), so the heap required is in proportion to how many bits are set, not how many total documents exist in the index.
  • Heap usage during IndexWriter merging is also much lower with the new Lucene50Codec, since doc values and norms for the segments being merged are no longer fully loaded into heap for all fields; now they are loaded for the one field currently being merged, and then dropped.
  • The default norms format now uses sparse encoding when appropriate, so indices that enable norms for many sparse fields will see a large reduction in required heap at search time.
  • 5.0 has a new API to print a tree structure showing a recursive breakdown of which parts are using how much heap.
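The idea behind sparse bitsets such as SparseFixedBitSet can be sketched in a few lines: store only the machine words that actually contain set bits, so memory tracks the number of set bits rather than the number of documents. This toy class is illustrative only, not Lucene’s implementation:

```python
class SparseBitSet:
    """Toy sparse bitset: keeps only the 64-bit words that hold set bits,
    so memory grows with set bits, not with the total document count."""

    def __init__(self):
        self.words = {}  # word index -> 64-bit word

    def set(self, bit):
        self.words[bit >> 6] = self.words.get(bit >> 6, 0) | (1 << (bit & 63))

    def get(self, bit):
        return bool((self.words.get(bit >> 6, 0) >> (bit & 63)) & 1)

    def cardinality(self):
        return sum(bin(w).count("1") for w in self.words.values())

bits = SparseBitSet()
for doc_id in (3, 70, 1_000_000):
    bits.set(doc_id)
```

Three set bits cost three stored words here, even though the highest document id is a million; a dense bitset would allocate for the full id range.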

Other features

  • FieldCache is gone (moved to a dedicated UninvertingReader in the misc module). This means when you intend to sort on a field, you should index that field using doc values, which is much faster and less heap consuming than FieldCache.
  • Tokenizers and Analyzers no longer require Reader on init.
  • NormsFormat now gets its own dedicated NormsConsumer/Producer
  • SortedSetSortField, used to sort on a multi-valued field, is promoted from sandbox to Lucene’s core.
  • PostingsFormat now uses a “pull” API when writing postings, just like doc values. This is powerful because you can do things in your postings format that require making more than one pass through the postings such as iterating over all postings for each term to decide which compression format it should use.
  • New DateRangeField type enables Indexing and searching of date ranges, particularly multi-valued ones.
  • A new ExitableDirectoryReader extends FilterDirectoryReader and enables exiting requests that take too long to enumerate over terms.
  • Suggesters from multi-valued field can now be built as DocumentDictionary now enumerates each value separately in a multi-valued field.
  • ConcurrentMergeScheduler detects whether the index is on SSD or not and does a better job defaulting its settings. This only works on Linux for now; other OS’s will continue to use the previous defaults (tuned for spinning disks).
  • Auto-IO-throttling has been added to ConcurrentMergeScheduler, to rate limit IO writes for each merge depending on incoming merge rate.
  • CustomAnalyzer has been added that allows to configure analyzers like you do in Solr’s index schema. This class has a builder API to configure Tokenizers, TokenFilters, and CharFilters based on their SPI names and parameters as documented by the corresponding factories.
  • Memory index now supports payloads.
  • Added a filter cache with a usage tracking policy that caches filters based on frequency of use.
  • The default codec has an option to control BEST_SPEED or BEST_COMPRESSION for stored fields.
  • Stored fields are merged more efficiently, especially when upgrading from previous versions or using SortingMergePolicy
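The usage-tracking filter cache in the list above can be illustrated with a toy policy: cache a filter’s result only once the filter has been requested a minimum number of times. The threshold policy below is a hypothetical stand-in for Lucene’s actual heuristics:

```python
from collections import Counter

class FrequencyFilterCache:
    """Toy usage-tracking cache: a filter's doc-id set is cached only after
    the filter has been requested min_freq times (hypothetical policy)."""

    def __init__(self, min_freq=2):
        self.min_freq = min_freq
        self.usage = Counter()   # how often each filter has been requested
        self.cache = {}          # filter key -> cached doc-id set

    def get(self, filter_key, compute):
        self.usage[filter_key] += 1
        if filter_key in self.cache:
            return self.cache[filter_key]
        result = compute()       # run the filter against the index
        if self.usage[filter_key] >= self.min_freq:
            self.cache[filter_key] = result
        return result

cache = FrequencyFilterCache(min_freq=2)
cache.get("year:2015", lambda: {1, 2, 3})        # first use: computed, not cached
hit = cache.get("year:2015", lambda: {1, 2, 3})  # second use: computed and cached
```

The point of such a policy is to avoid spending cache memory on one-off filters while making repeated filters nearly free.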

More goodness to start your week! (Update)

Wednesday, January 28th, 2015

I first wrote about this site in a post dated October 17, 2011.

A customer story from Microsoft: WorldWide Science Alliance and Deep Web Technologies made me revisit the site.

My original test query was “partially observable Markov processes” which resulted in 453 “hits” from at least 3266 found (2011 results). Today, running the same query resulted in “…1,342 top results from at least 25,710 found.” The top ninety-seven (97) were displayed.

A current description of the system from the customer story:

In June 2010, Deep Web Technologies and the Alliance launched multilingual search and translation capabilities with the site, which today searches across more than 100 databases in more than 70 countries. Users worldwide can search databases and translate results in 10 languages: Arabic, Chinese, English, French, German, Japanese, Korean, Portuguese, Russian, and Spanish. The solution also takes advantage of the Microsoft Audio Video Indexing Service (MAVIS). In 2011, multimedia search capabilities were added so that users could retrieve speech-indexed content as well as text.

The site handles approximately 70,000 queries and 1 million page views each month, and all traffic, including that from automated crawlers and search engines, amounts to approximately 70 million transactions per year. When a user enters a search term, the site instantly provides results clustered by topic, country, author, date, and more. Results are ranked by relevance, and users can choose to look at papers, multimedia, or research data. Divided into tabs for easy usability, the interface also provides details about each result, including a summary, date, author, location, and whether the full text is available. Users can print the search results or attach them to an email. They can also set up an alert that notifies them when new material is available.

Automated searching and translation can’t give you the semantic nuances possible with human authoring, but they certainly can provide you with the source materials to build a specialized information resource with such semantics.

Very much a site to bookmark and use on a regular basis.

Links for subjects without them otherwise:

Deep Web Technologies

Microsoft Translator

Deep Learning: Methods and Applications

Tuesday, January 13th, 2015

Deep Learning: Methods and Applications by Li Deng and Dong Yu. (Li Deng and Dong Yu (2014), “Deep Learning: Methods and Applications”, Foundations and Trends® in Signal Processing: Vol. 7: No. 3–4, pp 197-387.)


This monograph provides an overview of general deep learning methodology and its applications to a variety of signal and information processing tasks. The application areas are chosen with the following three criteria in mind: (1) expertise or knowledge of the authors; (2) the application areas that have already been transformed by the successful use of deep learning technology, such as speech recognition and computer vision; and (3) the application areas that have the potential to be impacted significantly by deep learning and that have been experiencing research growth, including natural language and text processing, information retrieval, and multimodal information processing empowered by multi-task deep learning.


Keywords: Deep learning, Machine learning, Artificial intelligence, Neural networks, Deep neural networks, Deep stacking networks, Autoencoders, Supervised learning, Unsupervised learning, Hybrid deep networks, Object recognition, Computer vision, Natural language processing, Language models, Multi-task learning, Multi-modal processing

If you are looking for another rich review of the area of deep learning, you have found the right place. Resources, conferences, primary materials, etc. abound.

Don’t be thrown off by the pagination. This is issues 3 and 4 of the periodical Foundations and Trends® in Signal Processing. You are looking at the complete text.

Be sure to read Selected Applications in Information Retrieval (Section 9, pages 308-319). Section 9.2 starts with:

Here we discuss the “semantic hashing” approach for the application of deep autoencoders to document indexing and retrieval as published in [159, 314]. It is shown that the hidden variables in the final layer of a DBN not only are easy to infer after using an approximation based on feed-forward propagation, but they also give a better representation of each document, based on the word-count features, than the widely used latent semantic analysis and the traditional TF-IDF approach for information retrieval. Using the compact code produced by deep autoencoders, documents are mapped to memory addresses in such a way that semantically similar text documents are located at nearby addresses to facilitate rapid document retrieval. The mapping from a word-count vector to its compact code is highly efficient, requiring only a matrix multiplication and a subsequent sigmoid function evaluation for each hidden layer in the encoder part of the network.

That is only one of the applications detailed in this work. I do wonder if this will be the approach that breaks the “document” (as in this work for example) model of information retrieval? If I am searching for “deep learning” and “information retrieval,” a search result that returns these pages would be a great improvement over the entire document. (At the user’s option.)
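As a rough illustration of the compact-code idea, the sketch below substitutes random hyperplane projections for the trained deep autoencoder described in the monograph: similar word-count vectors tend to land on the same side of most hyperplanes, so they map to nearby binary codes (memory addresses). The dimensions and vectors are toy values:

```python
import random

random.seed(0)
DIMS, BITS = 8, 4
# Random hyperplanes stand in for the trained autoencoder's hidden layer.
planes = [[random.gauss(0, 1) for _ in range(DIMS)] for _ in range(BITS)]

def compact_code(word_counts):
    """Map a word-count vector to a short binary code: one bit per hyperplane,
    set by which side of the hyperplane the vector falls on."""
    return tuple(int(sum(p * x for p, x in zip(plane, word_counts)) > 0)
                 for plane in planes)

doc_a = [5, 3, 0, 0, 1, 0, 0, 0]
doc_b = [4, 3, 0, 0, 2, 0, 0, 0]   # a near-duplicate of doc_a
code_a, code_b = compact_code(doc_a), compact_code(doc_b)
hamming = sum(a != b for a, b in zip(code_a, code_b))  # code distance
```

The retrieval step the quote describes then reduces to looking up nearby addresses by Hamming distance, rather than scanning the whole collection.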

Before the literature on deep learning gets much more out of hand, now would be a good time to start building not only a corpus of the literature but a sub-document level topic map to ideas and motifs as they develop. That would be particularly useful as patents start to appear for applications of deep learning. (Not a volunteer or charitable venture.)

I first saw this in a tweet by StatFact.

Treasury Island: the film

Tuesday, November 25th, 2014

Treasury Island: the film by Lauren Willmott, Boyce Keay, and Beth Morrison.

From the post:

We are always looking to make the records we hold as accessible as possible, particularly those which you cannot search for by keyword in our catalogue, Discovery. And we are experimenting with new ways to do it.

The Treasury series, T1, is a great example of a series which holds a rich source of information but is complicated to search. T1 covers a wealth of subjects (from epidemics to horses) but people may overlook it as most of it is only described in Discovery as a range of numbers, meaning it can be difficult to search if you don’t know how to look. There are different processes for different periods dating back to 1557 so we chose to focus on records after 1852. Accessing these records requires various finding aids and multiple stages to access the papers. It’s a tricky process to explain in words so we thought we’d try demonstrating it.

We wanted to show people how to access these hidden treasures, by providing a visual aid that would work in conjunction with our written research guide. Armed with a tablet and a script, we got to work creating a video.

Our remit was:

  • to produce a video guide no more than four minutes long
  • to improve accessibility to these records through a simple, step-by-step process
  • to highlight what the finding aids and documents actually look like

These records can be useful to a whole range of researchers, from local historians to military historians to social historians, given that virtually every area of government action involved the Treasury at some stage. We hope this new video, which we intend to be watched in conjunction with the written research guide, will also be of use to any researchers who are new to the Treasury records.

Adding video guides to our written research guides are a new venture for us and so we are very keen to hear your feedback. Did you find it useful? Do you like the film format? Do you have any suggestions or improvements? Let us know by leaving a comment below!

This is a great illustration that data management isn’t something new. The Treasury Board has kept records since 1557 and has accumulated a rather extensive set of materials.

The written research guide looks interesting but since I am very unlikely to ever research Treasury Board records, I am unlikely to need it.

However, the authors have anticipated that someone might be interested in process of record keeping itself and so provided this additional reference:

Thomas L Heath, The Treasury (The Whitehall Series, 1927, GP Putnam’s Sons Ltd, London and New York)

That would be an interesting find!

I first saw this in a tweet by Andrew Janes.

Less Than Universal & Uniform Indexing

Wednesday, November 19th, 2014

In Suffix Trees and their Applications in String Algorithms, I pointed out that a subset of the terms for “suffix tree” resulted in About 1,830,000 results (0.22 seconds).

Not a very useful result, even for the most dedicated of graduate students. 😉

A better result would be an index entry for “suffix tree” that included results under its alternative names and enabled the user to quickly navigate to sub-entries under “suffix tree.”
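Such an index entry might look like the toy structure below, where alternative names resolve to one canonical entry with sub-entries. The alternative names and page numbers here are illustrative, not a real index:

```python
# Toy back-of-book index: canonical entries with alternative names
# and sub-entries, so every name leads to one place.
INDEX = {
    "suffix tree": {
        "also_known_as": ["PAT tree", "position tree"],  # illustrative aliases
        "subentries": {"construction": [12, 48], "pattern matching": [60]},
    },
}

def lookup(term):
    """Resolve a term, or any of its alternative names, to its index entry."""
    term = term.lower()
    for entry, data in INDEX.items():
        names = [entry] + [n.lower() for n in data["also_known_as"]]
        if term in names:
            return entry, data["subentries"]
    return None

entry, subs = lookup("PAT tree")
```

This is the essential difference from full-text search: the index, not the searcher, carries the burden of knowing that several names denote one subject.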

To illustrate the benefit from actual indexing, consider that “Suffix Trees and their Applications in String Algorithms” lists only three keywords: “Pattern matching, String algorithms, Suffix tree.” Would you look at this paper for techniques on software maintenance?

Probably not, which would be a mistake. Section 4 covers the use of “parameterized pattern matching” for software maintenance of large programs in a fair amount of depth. Certainly more so than it covers “multidimensional pattern matching,” which is mentioned in the abstract and in the conclusion but not elsewhere in the paper (“higher dimensions” comes up on page 3, but only in two sentences with references), despite being presented as a major theme of the paper.

A properly constructed index would break out both “parameterized pattern matching” and “software maintenance” as key subjects that occur in this paper. A bit easier to find than wading through 1,830,000 “results.”

Before anyone comments that such granular indexing would be too time consuming or expensive, recall the citation rates for computer science, 2000 – 2010:

Field             2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  All years
Computer science  7.17  7.66  7.93  5.35  3.99  3.51  2.51  3.26  2.13  0.98  0.15  3.75

From: Citation averages, 2000-2010, by fields and years

The reason for the declining numbers is that citations to papers from the year 2000 decline over time.

But the highest citation average, 7.93 in 2002, is far less than the total number of papers published in 2000.

At one point in journal publication history, manual indexing was universal. But that was before full text searching became a reality and the scientific publication rate exploded.


On publication rates, see The STM Report by Mark Ware and Michael Mabe.

Rather than an all-human indexing model (not possible, given the rate of publication and the cost) or an all-computer searching model (which leads to poor results, as described above), why not consider a bifurcated indexing/search model?

The well over 90% of CS publications that aren’t cited should be subject to computer-based indexing and search models. On the other hand, the meager 8% that are cited, perhaps on some sliding scale of citation counts, could be curated with human/machine-assisted indexing.

Human/machine-assisted indexing would increase access to material already selected by other readers. Perhaps even as a value-add product, as opposed to taking your chances with search access.

MeSH on Demand Update: How to Find Citations Related to Your Text

Wednesday, November 5th, 2014

MeSH on Demand Update: How to Find Citations Related to Your Text

From the post:

In May 2014, NLM introduced MeSH on Demand, a Web-based tool that suggests MeSH terms from your text such as an abstract or grant summary up to 10,000 characters using the MTI (Medical Text Indexer) software. For more background information, see the article, MeSH on Demand Tool: An Easy Way to Identify Relevant MeSH Terms.

New Feature

A new MeSH on Demand feature displays the PubMed ID (PMID) for the top ten related citations in PubMed that were also used in computing the MeSH term recommendations.

To access this new feature start from the MeSH on Demand homepage (see Figure 1), add your text, such as a project summary, into the box labeled “Text to be Processed.” Then, click the “Find MeSH Terms” button.

Results page:

mesh results

A clever way to deal with the problem of a searcher not knowing the specialized vocabulary of an indexing system.

Have you seen this method used outside of MeSH?
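The mapping from free text to a specialized vocabulary can be sketched with a toy synonym table; the terms below are illustrative, and this substring matching is far simpler than the MTI software MeSH on Demand actually uses:

```python
# Toy controlled vocabulary: preferred descriptors with entry (synonym) terms,
# loosely modeled on how MeSH maps free text to descriptors.
VOCAB = {
    "Myocardial Infarction": ["heart attack", "cardiac infarction"],
    "Hypertension": ["high blood pressure"],
}

def suggest_terms(text):
    """Suggest descriptors whose synonyms appear in a piece of free text."""
    lowered = text.lower()
    return [preferred for preferred, synonyms in VOCAB.items()
            if any(s in lowered for s in synonyms)]

abstract = "Patients with high blood pressure face elevated heart attack risk."
terms = suggest_terms(abstract)
```

The searcher writes in everyday language; the system answers in the vocabulary the index actually uses, which is exactly the bridge MeSH on Demand provides.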

Google and Mission Statements

Wednesday, November 5th, 2014

Google has ‘outgrown’ its 14-year old mission statement, says Larry Page by Samuel Gibbs.

From the post:

Google’s chief executive Larry Page has admitted that the company has outgrown its mission statement to “organise the world’s information and make it universally accessible and useful” from the launch of the company in 1998, but has said he doesn’t yet know how to redefine it.

Page insists that the company is still focused on the altruistic principles that it was founded on in 1998 with the original mission statement, when he and co-founder Sergey Brin were aiming big with “societal goals” to “organise the world’s information and make it universally accessible and useful”.

Questioned as to whether Google needs to alter its mission statement, which was twinned with the company mantra “don’t be evil, for the next stage of company growth in an interview with the Financial Times, Page responded: “We’re in a bit of uncharted territory. We’re trying to figure it out. How do we use all these resources … and have a much more positive impact on the world?”

This post came as a surprise to me because I was unaware that Google had solved the problem of “organis[ing] the world’s information and mak[ing] it universally accessible and useful.”

Perhaps so but it hasn’t made it to the server farm that sends results to me.

A quick search using Google on “cia” today produces a front page with resources on the Central Intelligence Agency, the Culinary Institute of America, Certified Internal Auditor (CIA) Certification and, allegedly, 224,000,000 more results.

If I search using “Central Intelligence Agency,” I get a “purer” stream of content on the Central Intelligence Agency, running from its official website to the Wikipedia article to ArtsBeat’s “Can’t Afford a Giacometti Sculpture? There’s Always the CIA’s bin Laden Action Figure.”

Even with a detailed query Google search results remind me of a line from Saigon Warrior that goes:

But the organization is a god damned disgrace

If Larry Page thinks Google has “organise[d] the world’s information and ma[de] it universally accessible and useful,” he needs a reality check.

True, Google has gone further than any other enterprise toward indexing some of the world’s information, but it has hardly indexed all of it, nor is what it has indexed usefully organized.

Why expand Google’s corporate mission when the easy part of the earlier mission has been accomplished and the hard part is about to start?

Perhaps some enterprising journalist will ask Page why Google is dodging the hard part of organizing information. Yes?

Death of Yahoo Directory

Sunday, October 26th, 2014

Progress Report: Continued Product Focus by Jay Rossiter, SVP, Cloud Platform Group.

From the post:

At Yahoo, focus is an important part of accomplishing our mission: to make the world’s daily habits more entertaining and inspiring. To achieve this focus, we have sunset more than 60 products and services over the past two years, and redirected those resources toward products that our users care most about and are aligned with our vision. With even more smart, innovative Yahoos focused on our core products – search, communications, digital magazines, and video – we can deliver the best for our users.

Directory: Yahoo was started nearly 20 years ago as a directory of websites that helped users explore the Internet. While we are still committed to connecting users with the information they’re passionate about, our business has evolved and at the end of 2014 (December 31), we will retire the Yahoo Directory. Advertisers will be upgraded to a new service; more details to be communicated directly.

Understandable but sad. Think of indexing a book that has expanded as rapidly as the Internet over the last twenty (20) years. Especially if new content might or might not bear any resemblance to existing content.

The Internet remains in serious need of a curated means of access to quality information. Almost any search returns links ranging from high to questionable quality.

Imagine if Yahoo segregated the top 500 computer science publishers, archives, societies, departments, and blogs into a block of searchable content. (The number 500 is wholly arbitrary; it could be any other number.) Users would pre-qualify themselves as interested in computer science materials and create a market segment for advertising purposes.

Users would get less trash in their results and advertisers would have pre-qualified targets.

A pre-curated search set might mean you would miss an important link, but realistically, few people read beyond the first twenty (20) links anyway. An analysis of search logs at PubMed shows that 80% of users choose a link from the first twenty results.

In theory you may have > 10,000 “hits,” but querying all of those for serving to a user is a waste of time.

I suspect it varies by domain, but twenty (20) high-quality “hits” from curated content would be a far cry from average search results now.
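The mechanics of such a curated block are simple: filter results down to a whitelist of pre-qualified sources before serving them. A minimal sketch, with a made-up domain list standing in for the hypothetical 500:

```python
# Sketch of curated search: keep only results from pre-qualified
# sources, capped at the twenty links users actually read.
# The domain set is invented for illustration.
from urllib.parse import urlparse

CURATED_CS = {"arxiv.org", "acm.org", "ieee.org"}  # hypothetical block

def curated_filter(results, allowed=CURATED_CS, limit=20):
    kept = [r for r in results
            if urlparse(r["url"]).hostname.removeprefix("www.") in allowed]
    return kept[:limit]

hits = [{"url": "https://arxiv.org/abs/1.2", "title": "Paper"},
        {"url": "https://spam.example.com/x", "title": "Junk"}]
print(curated_filter(hits))
# [{'url': 'https://arxiv.org/abs/1.2', 'title': 'Paper'}]
```

Ranking within the curated set could stay exactly as it is; the win is in what never reaches the user at all.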

I first saw this in Greg Linden’s Quick Links for Wednesday, October 01, 2014.

Compressed Text Indexes: From Theory to Practice!

Friday, October 3rd, 2014

Compressed Text Indexes: From Theory to Practice! by Paolo Ferragina, Rodrigo Gonzalez, Gonzalo Navarro, and Rossano Venturini.

From the abstract:
A compressed full-text self-index represents a text in a compressed form and still answers queries efficiently. This technology represents a breakthrough over the text indexing techniques of the previous decade, whose indexes required several times the size of the text. Although it is relatively new, this technology has matured up to a point where theoretical research is giving way to practical developments. Nonetheless this requires significant programming skills, a deep engineering effort, and a strong algorithmic background to dig into the research results. To date only isolated implementations and focused comparisons of compressed indexes have been reported, and they missed a common API, which prevented their re-use or deployment within other applications.

The goal of this paper is to fill this gap. First, we present the existing implementations of compressed indexes from a practitioner’s point of view. Second, we introduce the Pizza&Chili site, which offers tuned implementations and a standardized API for the most successful compressed full-text self-indexes, together with effective testbeds and scripts for their automatic validation and test. Third, we show the results of our extensive experiments on these codes with the aim of demonstrating the practical relevance of this novel and exciting technology.

A bit dated (2007) but definitely worth your attention. The “cited-by” results from the ACM Digital Library will bring you up to date.

BTW, I was pleased to find the Pizza&Chili Corpus: Compressed Indexes and their Testbeds, both Italian and Chilean mirrors are still online!

I have seen document links survive that long but rarely an online testbed.
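If you have never looked inside a compressed self-index, the core machinery — the Burrows-Wheeler transform plus backward search — is surprisingly small. Here is an unoptimized teaching sketch (real implementations replace the linear scans with compressed rank structures, which is where the engineering effort the authors mention goes):

```python
# Minimal FM-index-style backward search: count occurrences of a
# pattern using only the BWT of the text. Unoptimized teaching code;
# real self-indexes store rank/occurrence data in compressed form.
def bwt(text):
    text += "\0"                       # unique terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)

def count_occurrences(pattern, bwt_str):
    first_col = sorted(bwt_str)
    # C[c] = number of characters in the text smaller than c
    C = {c: first_col.index(c) for c in set(bwt_str)}
    occ = lambda c, i: bwt_str[:i].count(c)   # rank of c in bwt[:i]
    lo, hi = 0, len(bwt_str)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

b = bwt("abracadabra")
print(count_occurrences("abra", b))  # 2
```

Note that the original text never appears in `count_occurrences` — the BWT alone answers the query, which is the whole point of a self-index.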

Speedy Short and Long DNA Reads

Monday, August 25th, 2014

Acceleration of short and long DNA read mapping without loss of accuracy using suffix array by Joaquín Tárraga, et al. (Bioinformatics (2014) doi: 10.1093/bioinformatics/btu553)

From the abstract:
HPG Aligner applies suffix arrays for DNA read mapping. This implementation produces a highly sensitive and extremely fast mapping of DNA reads that scales up almost linearly with read length. The approach presented here is faster (over 20x for long reads) and more sensitive (over 98% in a wide range of read lengths) than the current, state-of-the-art mappers. HPG Aligner is not only an optimal alternative for current sequencers but also the only solution available to cope with longer reads and growing throughputs produced by forthcoming sequencing technologies.

Always nice to see an old friend, suffix arrays, in the news!

Source code:

For documentation and software:

I first saw this in a tweet by Bioinfocipf.
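For readers who haven't met the old friend, the read-mapping primitive is just binary search over a suffix array. A toy sketch (the naive construction below is quadratic; real aligners like HPG Aligner use far faster construction and handle mismatches):

```python
# Sketch of the suffix-array primitive behind read mapping: build the
# array naively and locate exact read matches by binary search over
# the sorted suffixes.
def build_suffix_array(genome):
    return sorted(range(len(genome)), key=lambda i: genome[i:])

def locate(read, genome, sa):
    def prefix(i):
        return genome[i:i + len(read)]
    # lower bound of the block of suffixes starting with `read`
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(sa[mid]) < read:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # upper bound
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(sa[mid]) <= read:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

genome = "ACGTACGTGACG"
sa = build_suffix_array(genome)
print(locate("ACG", genome, sa))  # [0, 4, 9]
```

Each lookup is O(|read| · log |genome|), which is why the approach scales so gracefully with read length.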

AverageExplorer: Interactive Exploration and Alignment of Visual Data Collections
Sunday, August 17th, 2014

AverageExplorer: Interactive Exploration and Alignment of Visual Data Collections, Jun-Yan Zhu, Yong Jae Lee, and Alexei Efros.

From the abstract:
This paper proposes an interactive framework that allows a user to rapidly explore and visualize a large image collection using the medium of average images. Average images have been gaining popularity as means of artistic expression and data visualization, but the creation of compelling examples is a surprisingly laborious and manual process. Our interactive, real-time system provides a way to summarize large amounts of visual data by weighted average(s) of an image collection, with the weights reflecting user-indicated importance. The aim is to capture not just the mean of the distribution, but a set of modes discovered via interactive exploration. We pose this exploration in terms of a user interactively “editing” the average image using various types of strokes, brushes and warps, similar to a normal image editor, with each user interaction providing a new constraint to update the average. New weighted averages can be spawned and edited either individually or jointly. Together, these tools allow the user to simultaneously perform two fundamental operations on visual data: user-guided clustering and user-guided alignment, within the same framework. We show that our system is useful for various computer vision and graphics applications.

Applying averaging to images, particularly in an interactive context with users, seems like a very suitable strategy.

What would it look like to have interactive merging of proxies based on data ranges controlled by the user?
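The paper's basic operation — a weighted average of images, with weights reflecting user-indicated importance — is easy to sketch. Images here are tiny grayscale grids (lists of rows) to keep the example dependency-free; the real system works on full-color photos and adds alignment:

```python
# Sketch: weighted average of images, with weights supplied by the
# user. Each "image" is a small grayscale grid for illustration.
def weighted_average(images, weights):
    total = sum(weights)
    rows, cols = len(images[0]), len(images[0][0])
    return [[sum(w * img[r][c] for img, w in zip(images, weights)) / total
             for c in range(cols)]
            for r in range(rows)]

a = [[0, 0], [0, 0]]
b = [[100, 100], [100, 100]]
print(weighted_average([a, b], [1, 3]))  # [[75.0, 75.0], [75.0, 75.0]]
```

Interactivity then amounts to re-running this with new weights after each user stroke or warp.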

Cooper Hewitt, Color Interface

Tuesday, July 29th, 2014

From the about page:

Cooper Hewitt, Smithsonian Design Museum is the only museum in the nation devoted exclusively to historic and contemporary design. The Museum presents compelling perspectives on the impact of design on daily life through active educational and curatorial programming.

It is the mission of Cooper Hewitt’s staff and Board of Trustees to advance the public understanding of design across the thirty centuries of human creativity represented by the Museum’s collection. The Museum was founded in 1897 by Amy, Eleanor, and Sarah Hewitt—granddaughters of industrialist Peter Cooper—as part of The Cooper Union for the Advancement of Science and Art. A branch of the Smithsonian since 1967, Cooper-Hewitt is housed in the landmark Andrew Carnegie Mansion on Fifth Avenue in New York City.

I thought some background might be helpful because the Cooper Hewitt has a new interface:


Color, or colour, is one of the attributes we’re interested in exploring for collection browsing. Bearing in mind that only a fraction of our collection currently has images, here’s a first pass.

Objects with images now have up to five representative colors attached to them. The colors have been selected by our robotic eye machines who scour each image in small chunks to create color averages. These have then been harvested and “snapped” to the grid of 120 different colors — derived from the CSS3 palette and naming conventions — below to make navigation a little easier.

My initial reaction was to recall the old library joke where a patron comes to the circulation desk and doesn’t know a book’s title or author, but does remember it had a blue cover. 😉 At which point you wish Basil from Fawlty Towers was manning the circulation desk. 😉

It may be a good idea with physical artifacts because color/colour is a fixed attribute that may be associated with a particular artifact.
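The “snapping” step the museum describes is just nearest-neighbor matching in RGB space. A sketch, with a three-color palette standing in for their 120 CSS3-derived colors:

```python
# Sketch of palette snapping: assign an extracted color to the nearest
# palette entry by squared RGB distance. The palette below is a tiny
# stand-in for the museum's 120-color CSS3-derived grid.
PALETTE = {  # hypothetical subset
    "red":   (255, 0, 0),
    "navy":  (0, 0, 128),
    "olive": (128, 128, 0),
}

def snap(rgb, palette=PALETTE):
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(rgb, c))
    return min(palette, key=lambda name: dist2(palette[name]))

print(snap((240, 20, 30)))   # red
print(snap((100, 110, 10)))  # olive
```

Snapping trades color fidelity for a navigable grid — 120 clickable facets instead of millions of unrepeatable averages.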

If you know the collection, you can amuse yourself by trying to guess what objects will be returned for particular colors.

BTW, the collection is interlinked by people, roles, periods, types, countries. Very impressive!

Don’t miss the resources for developers at: and their GitHub account.

I first saw this in a tweet by Lyn Marie B.

PS: The use of people, roles, objects, etc. for browsing has a topic map-like feel. Since their data and other resources are downloadable, more investigation will follow.

Digital Commonplace Book?

Saturday, July 26th, 2014

Rick Minerich reviews a precursor to a digital commonplace book in Sony Digital Paper DPT-S1 at Lambda Jam 2014.

It is limited to PDF files, in which you can highlight text and attach annotations (which can be exported), and you can use the DPT-S1 as a notepad.

To take the DPT-S1 a step further towards creating a commonplace book, it should:

  1. Export highlighted text with a reference to the text of origin
  2. Export annotated text with a reference to the text of origin
  3. Enable export target of note pages in the DPT-S1
  4. Enable pages that “roll” off the display (larger page sizes)
  5. Enable support of more formats
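Items 1 and 2 need little more than a record format in which every excerpt carries enough of a citation to find its origin. A minimal sketch (field names are invented for illustration):

```python
# Sketch of reference-preserving export (items 1 and 2 above): every
# highlight or note is stored with a citation back to its source.
# Field names are hypothetical.
import json
from dataclasses import dataclass, asdict

@dataclass
class Excerpt:
    text: str
    source_title: str
    source_page: int
    note: str = ""

def export_excerpts(excerpts):
    return json.dumps([asdict(e) for e in excerpts], indent=2)

clip = Excerpt("Commonplace books date to antiquity.",
               "Some PDF Title", 12, note="check primary source")
print(export_excerpts([clip]))
```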

The first application (software or hardware) with reference-preserving cut-n-paste from a variety of formats to the user’s note-taking format will be a killer app.

And one step closer to being a digital commonplace book.

BTW, one authorized re-seller for the DPT-S1 has this notice on their website:

PLEASE NOTE: As of now we are only authorized to sell the Sony DPT-S1 within the Entertainment Industry. This is a pilot program and we are NO LONGER selling to the general public.

We understand that this is frustrating to many as this is a VERY popular product, however at this time we can provide NO INFORMATION regarding sales to the general public. This is a non-negotiable aspect of our agreement with Sony and regrettably, any inquiries by the general public will not be answered. Thank you for your understanding.
(Text color as it appears on the website.)

I can think of other words than “frustrating.”

Hopefully the popularity of the current version will encourage Sony to cure some of its limitations and make it more widely available.

The Sony Digital Paper site.

Resellers for legal and financial, law library, entertainment, and “all other professions.”

Or perhaps someone else will overcome the current limitations of the DPT-S1 and Sony will regret its overly restrictive marketing policies.

I first saw this in a tweet by Adam Foltzer.

Neo4j Index Confusion

Friday, July 25th, 2014

Neo4j Index Confusion by Nigel Small.

From the post:

Since the release of Neo4j 2.0 and the introduction of schema indexes, I have had to answer an increasing number of questions arising from confusion between the two types of index now available: schema indexes and legacy indexes. For clarification, these are two completely different concepts and are not interchangable or compatible in any way. It is important, therefore, to make sure you know which you are using.

Nigel forgets to mention that legacy indexes are based on Lucene; schema indexes are not.

If you are interested in the technical details of the schema indexes, start with On Creating a MapDB Schema Index Provider for Neo4j 2.0 by Michael Hunger.

Michael says in his tests that the new indexing solution is faster than Lucene. Or more accurately, faster than Lucene as used in prior Neo4j versions.

How simple are your indexing needs?

Flax Clade PoC

Monday, July 14th, 2014

Flax Clade PoC by Tom Mortimer.

From the webpage:

Flax Clade PoC is a proof-of-concept open source taxonomy management and document classification system, based on Apache Solr. In its current state it should be considered pre-alpha. As open-source software you are welcome to try, use, copy and modify Clade as you like. We would love to hear any constructive suggestions you might have.

Tom Mortimer

Taxonomies and document classification

Clade taxonomies have a tree structure, with a single top-level category (e.g. in the example data, “Social Psychology”). There is no distinction between parent and child nodes (except that the former has children) and the hierachical structure of the taxonomy is completely orthogonal from the node data. The structure may be freely edited.

Each node represents a category, which is represented by a set of “keywords” (words or phrases) which should be present in a document belonging to that category. Not all the keywords have to be present – they are joined with Boolean OR rather than AND. A document may belong to multiple categories, which are ranked according to standard Solr (TF-IDF) scoring. It is also possible to exclude certain keywords from categories.

Clade will also suggest keywords to add to a category, based on the content of the documents already in the category. This feature is currently slow as it uses the standard Solr MoreLikeThis component to analyse a large number of documents. We plan to improve this for a future release by writing a custom Solr plugin.

Documents are stored in a standard Solr index and are categorised dynamically as taxonomy nodes are selected. There is currently no way of writing the categorisation results to the documents in SOLR, but see below for how to export the document categorisation to an XML or CSV file.

A very interesting project!

I am particularly interested in the dynamic categorisation when nodes are selected.
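The categorisation model described above is easy to mimic: a category matches a document if any of its keywords appears (Boolean OR), and a document may fall into several categories at once. A sketch, with a plain keyword count standing in for Solr's TF-IDF scoring:

```python
# Sketch of Clade-style categorisation: OR semantics over category
# keywords, multiple categories per document, ranked by a simple
# keyword-hit count (a stand-in for Solr's TF-IDF scoring).
def categorise(doc, taxonomy):
    words = set(doc.lower().split())
    matches = {}
    for category, keywords in taxonomy.items():
        hits = len(keywords & words)
        if hits:                      # OR: one keyword suffices
            matches[category] = hits
    return sorted(matches, key=matches.get, reverse=True)

taxonomy = {  # invented example nodes
    "Conformity": {"conformity", "compliance", "obedience"},
    "Persuasion": {"persuasion", "attitude", "influence"},
}
doc = "Studies of obedience and social influence under compliance pressure"
print(categorise(doc, taxonomy))  # ['Conformity', 'Persuasion']
```

Because categorisation is computed at query time, editing the taxonomy re-buckets documents instantly — which is exactly what makes the dynamic behavior interesting.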

CrossClj: cross-referencing the clojure ecosystem
Wednesday, May 14th, 2014

CrossClj: cross-referencing the clojure ecosystem

From the webpage:

CrossClj is a tool to explore the interconnected Clojure universe. As an example, you can find all the usages of the reduce function across all projects, or find all the functions called map. Or you can list all the projects using ring. You can also walk the source code across different projects.

Interesting search interface. You could lose some serious time just reading the project names. 😉

Makes me curious: could one list functions and treat other functions/operators in their scope as facets?


Testing Lucene’s index durability after crash or power loss

Saturday, April 12th, 2014

Testing Lucene’s index durability after crash or power loss by Mike McCandless.

From the post:

One of Lucene’s useful transactional features is index durability which ensures that, once you successfully call IndexWriter.commit, even if the OS or JVM crashes or power is lost, or you kill -KILL your JVM process, after rebooting, the index will be intact (not corrupt) and will reflect the last successful commit before the crash.

If anyone at your startup is writing an indexing engine, be sure to pass this post from Mike along.

Ask them for a demonstration of equal durability of the index before using their work instead of Lucene.

You have enough work to do without replicating (poorly) work that already has enterprise level reliability.
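The generic durability recipe behind a safe commit point is worth knowing even if you never implement one: write to a temporary file, fsync the data, atomically rename into place, then fsync the directory so the rename itself survives a crash. A sketch of the pattern (not Lucene's actual IndexWriter.commit internals, which do considerably more):

```python
# The classic durable-write pattern: tmp file + fsync + atomic rename
# + directory fsync. A sketch of the technique, not Lucene's code.
import os

def durable_write(path, data: bytes):
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())          # data on stable storage
    os.replace(tmp, path)             # atomic rename on POSIX
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)              # persist the rename itself
    finally:
        os.close(dir_fd)

durable_write("segments.bin", b"commit point 1")
print(open("segments.bin", "rb").read())  # b'commit point 1'
```

Skipping any of those steps is exactly the kind of subtle bug Mike's power-loss testing catches.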

Google Search Appliance and Libraries

Monday, March 24th, 2014

Using Google Search Appliance (GSA) to Search Digital Library Collections: A Case Study of the INIS Collection Search by Dobrica Savic.

From the post:

In February 2014, I gave a presentation at the conference on Faster, Smarter and Richer: Reshaping the library catalogue (FSR 2014), which was organized by the Associazione Italiana Biblioteche (AIB) and Biblioteca Apostolica Vaticana in Rome, Italy. My presentation focused on the experience of the International Nuclear Information System (INIS) in using Google Search Appliance (GSA) to search digital library collections at the International Atomic Energy Agency (IAEA). 

Libraries are facing many challenges today. In addition to diminished funding and increased user expectations, the use of classic library catalogues is becoming an additional challenge. Library users require fast and easy access to information resources, regardless of whether the format is paper or electronic. Google Search, with its speed and simplicity, has established a new standard for information retrieval which did not exist with previous generations of library search facilities. Put in a position of David versus Goliath, many small, and even larger libraries, are losing the battle to Google, letting many of its users utilize it rather than library catalogues.

The International Nuclear Information System (INIS)

The International Nuclear Information System (INIS) hosts one of the world's largest collections of published information on the peaceful uses of nuclear science and technology. It offers on-line access to a unique collection of 3.6 million bibliographic records and 483,000 full texts of non-conventional (grey) literature. This large digital library collection suffered from most of the well-known shortcomings of the classic library catalogue. Searching was complex and complicated, it required training in Boolean logic, full-text searching was not an option, and response time was slow. An opportune moment to improve the system came with the retirement of the previous catalogue software and the adoption of Google Search Appliance (GSA) as an organization-wide search engine standard.

To be completely honest, my first reaction wasn’t a favorable one.

But even the complete blog post does not do justice to the project in question.

Take a look at the slides, which include screen shots of the new interface before reaching an opinion.

Take this as a lesson on what your search interface should be offering by default.

There are always other screens you can fill with advanced features.

Possible Elimination of FR and CFR indexes (Pls Read, Forward, Act)

Saturday, March 22nd, 2014

Possible Elimination of FR and CFR indexes

I don’t think I have ever posted with (Pls Read, Forward, Act) in the headline, but this merits it.

From the post:

Please see the following message from Emily Feltren, Director of Government Relations for AALL, and contact her if you have any examples to share.

Hi Advocates—

Last week, the House Oversight and Government Reform Committee reported out the Federal Register Modernization Act (HR 4195). The bill, introduced the night before the mark up, changes the requirement to print the Federal Register and Code of Federal Regulations to “publish” them, eliminates the statutory requirement that the CFR be printed and bound, and eliminates the requirement to produce an index to the Federal Register and CFR. The Administrative Committee of the Federal Register governs how the FR and CFR are published and distributed to the public, and will continue to do so.

While the entire bill is troubling, I most urgently need examples of why the Federal Register and CFR indexes are useful and how you use them. Stories in the next week would be of the most benefit, but later examples will help, too. I already have a few excellent examples from our Print Usage Resource Log – thanks to all of you who submitted entries! But the more cases I can point to, the better.

Interestingly, the Office of the Federal Register itself touted the usefulness of its index when it announced the retooled index last year:

Thanks in advance for your help!

Emily Feltren
Director of Government Relations

American Association of Law Libraries

25 Massachusetts Avenue, NW, Suite 500

Washington, D.C. 20001


This is seriously bad news so I decided to look up the details.

Federal Register

Title 44, Section 1504 Federal Register, currently reads in part:

Documents required or authorized to be published by section 1505 of this title shall be printed and distributed immediately by the Government Printing Office in a serial publication designated the ”Federal Register.” The Public Printer shall make available the facilities of the Government Printing Office for the prompt printing and distribution of the Federal Register in the manner and at the times required by this chapter and the regulations prescribed under it. The contents of the daily issues shall be indexed and shall comprise all documents, required or authorized to be published, filed with the Office of the Federal Register up to the time of the day immediately preceding the day of distribution fixed by regulations under this chapter. (emphasis added)

By comparison, H.R. 4195 — 113th Congress (2013-2014) reads in relevant part:

The Public Printer shall make available the facilities of the Government Printing Office for the prompt publication of the Federal Register in the manner and at the times required by this chapter and the regulations prescribed under it. (Missing index language here.) The contents of the daily issues shall constitute all documents, required or authorized to be published, filed with the Office of the Federal Register up to the time of the day immediately preceding the day of publication fixed by regulations under this chapter.

Code of Federal Regulations (CFRs)

Title 44, Section 1510 Code of Federal Regulations, currently reads in part:

(b) A codification published under subsection (a) of this section shall be printed and bound in permanent form and shall be designated as the ”Code of Federal Regulations.” The Administrative Committee shall regulate the binding of the printed codifications into separate books with a view to practical usefulness and economical manufacture. Each book shall contain an explanation of its coverage and other aids to users that the Administrative Committee may require. A general index to the entire Code of Federal Regulations shall be separately printed and bound. (emphasis added)

By comparison, H.R. 4195 — 113th Congress (2013-2014) reads in relevant part:

(b) Code of Federal Regulations.–A codification prepared under subsection (a) of this section shall be published and shall be designated as the `Code of Federal Regulations’. The Administrative Committee shall regulate the manner and forms of publishing this codification. (Missing index language here.)

I would say that indexes for the Federal Register and the Code of Federal Regulations are history should this bill pass as written.

Is this a problem?

Consider the task of tracking the number of pages in the Federal Register versus the pages in the Code of Federal Regulations that may be impacted:

Federal Register – > 70,000 pages per year.

The page count for final general and permanent rules in the 50-title CFR seems less dramatic than that of the oft-cited Federal Register, which now tops 70,000 pages each year (it stood at 79,311 pages at year-end 2013, the fourth-highest level ever). The Federal Register contains lots of material besides final rules. (emphasis added) (New Data: Code of Federal Regulations Expanding, Faster Pace under Obama by Wayne Crews.)

Code of Federal Regulations – 175,496 pages (2013) plus 1,170 page index.

Now, new data from the National Archives shows that the CFR stands at 175,496 at year-end 2013, including the 1,170-page index. (emphasis added) (New Data: Code of Federal Regulations Expanding, Faster Pace under Obama by Wayne Crews.)

The bottom line: 175,496 pages are being impacted by more than 70,000 pages per year, published in a weekday publication.

We don’t need indexes to access that material?

Congress, I don’t think “access” means what you think it means.

PS: As a research guide, you are unlikely to do better than: A Research Guide to the Federal Register and the Code of Federal Regulations by Richard J. McKinney at the Law Librarians’ Society of Washington, DC website.

I first saw this in a tweet by Aaron Kirschenfeld.

Elasticsearch: The Definitive Guide

Friday, March 21st, 2014

Elasticsearch: The Definitive Guide (Draft)

From the Preface, who should read this book:

This book is for anybody who wants to put their data to work. It doesn’t matter whether you are starting a new project and have the flexibility to design the system from the ground up, or whether you need to give new life to a legacy system. Elasticsearch will help you to solve existing problems and open the way to new features that you haven’t yet considered.

This book is suitable for novices and experienced users alike. We expect you to have some programming background and, although not required, it would help to have used SQL and a relational database. We explain concepts from first principles, helping novices to gain a sure footing in the complex world of search.

The reader with a search background will also benefit from this book. Elasticsearch is a new technology which has some familiar concepts. The more experienced user will gain an understanding of how those concepts have been implemented and how they interact in the context of Elasticsearch. Even in the early chapters, there are nuggets of information that will be useful to the more advanced user.

Finally, maybe you are in DevOps. While the other departments are stuffing data into Elasticsearch as fast as they can, you’re the one charged with stopping their servers from bursting into flames. Elasticsearch scales effortlessly, as long as your users play within the rules. You need to know how to setup a stable cluster before going into production, then be able to recognise the warning signs at 3am in the morning in order to prevent catastrophe. The earlier chapters may be of less interest to you but the last part of the book is essential reading — all you need to know to avoid meltdown.

I fully understand the need, nay, compulsion for an author to say that everyone who is literate needs to read their book. And, if you are not literate, their book is a compelling reason to become literate! 😉

As the author of a book (two editions) and more than one standard, I can assure you an author’s need to reach everyone serves no one very well.

Potential readers range from novices to intermediate users to experts.

A book that targets all three will “waste” space on matter already known to experts but not to novices and/or intermediate users.

At the same time, space in a physical book being limited, some material relevant to the expert will be left out altogether.

I had that experience quite recently when the details of LukeRequestHandler (Solr) were described as:

Reports meta-information about a Solr index, including information about the number of terms, which fields are used, top terms in the index, and distributions of terms across the index. You may also request information on a per-document basis.

That’s it. Out of 600+ pages of text, that is all the information you will find on the LukeRequestHandler.
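For the record, the Luke handler is exposed over HTTP. A hedged sketch of building a request against it — the host and core name are placeholders for your setup, so check your own Solr's admin paths:

```python
# Sketch: build a URL for Solr's LukeRequestHandler, which reports
# index metadata (fields, term counts, top terms). Host and core
# name are placeholders; `numTerms` caps top terms per field.
from urllib.parse import urlencode

def luke_url(base="http://localhost:8983/solr/mycore", num_terms=10):
    params = urlencode({"wt": "json", "numTerms": num_terms})
    return f"{base}/admin/luke?{params}"

print(luke_url())
# http://localhost:8983/solr/mycore/admin/luke?wt=json&numTerms=10

# Against a live Solr you would then fetch it, e.g.:
#   import json, urllib.request
#   info = json.load(urllib.request.urlopen(luke_url()))
#   print(info["index"]["numDocs"])
```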

Fortunately I did find:

I don’t fault the author because several entire books could be written with the material they left out.

That is the hardest part of authoring, knowing what to leave out.

PS: Having said all that, I am looking forward to reading Elasticsearch: The Definitive Guide as it develops.

Full-Text-Indexing (FTS) in Neo4j 2.0

Wednesday, March 19th, 2014

Full-Text-Indexing (FTS) in Neo4j 2.0 by Michael Hunger.

From the post:

With Neo4j 2.0 we got automatic schema indexes based on labels and properties for exact lookups of nodes on property values.

Fulltext and other indexes (spatial, range) are on the roadmap but not addressed yet.

For fulltext indexes you still have to use legacy indexes.

As you probably don’t want to add nodes to an index manually, the existing “auto-index” mechanism should be a good fit.

To use that automatic index you have to configure the auto-index upfront to be a fulltext index and then secondly enable it in your settings.

Great coverage of full-text indexing in Neo4j 2.0.

Looking forward to spatial indexing. In the most common use case, think of it as locating assets on the ground relative to other actors, in real time.
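For reference, the legacy auto-index setup Michael walks through looks roughly like this. These are Neo4j 2.x-era settings and names — verify them against your version's manual before relying on them:

```
# conf/neo4j.properties (Neo4j 2.x era -- confirm names for your version)
node_auto_indexing=true
node_keys_indexable=title,description

# The auto-index must be created with fulltext config BEFORE it is
# first used, e.g. from embedded Java:
#   graphDb.index().forNodes("node_auto_index",
#       MapUtil.stringMap("type", "fulltext", "provider", "lucene"));
```

Get the ordering wrong and you end up with an exact-match index silently answering your fulltext queries, which is the confusion Michael's post untangles.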

Apache MarkMail

Friday, March 14th, 2014

Apache MarkMail

Just in case you don’t have your own index of the 10+ million messages in Apache mailing list archives, this is the site for you.


I ran across it today while debugging an error in a Solr config file.

If I could add one thing to MarkMail it would be software release date facets. Posts are not limited by release dates but I suspect a majority of posts between release dates are about the current release. Enough so that I would find it a useful facet.


Lucene 4 Essentials for Text Search and Indexing

Sunday, March 9th, 2014

Lucene 4 Essentials for Text Search and Indexing by Mitzi Morris.

From the post:

Here’s a short-ish introduction to the Lucene search engine which shows you how to use the current API to develop search over a collection of texts. Most of this post is excerpted from Text Processing in Java, Chapter 7, Text Search with Lucene.

Not too short! 😉

I have seen blurbs about Text Processing in Java but this post convinced me to put it on my wish list.


PS: As soon as a copy arrives I will start working on a review of it. If you want to see that happen sooner rather than later, ping me.