Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 16, 2011

Introduction to Random Indexing

Filed under: Indexing,Indirect Inference,Random Indexing — Patrick Durusau @ 7:04 pm

Introduction to Random Indexing by Magnus Sahlgren.

I thought this would be useful alongside Reflective Random Indexing and indirect inference….

Just a small sample of what you will find:

Note that this methodology constitutes a radically different way of conceptualizing how context vectors are constructed. In the “traditional” view, we first construct the co-occurrence matrix and then extract context vectors. In the Random Indexing approach, on the other hand, we view the process backwards, and first accumulate the context vectors. We may then construct a cooccurrence matrix by collecting the context vectors as rows of the matrix.
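
A minimal sketch of that idea in Python (my own illustration, not Sahlgren’s code; the dimensionality, sparsity, and toy documents are arbitrary, and documents serve as the contexts): each document gets a sparse random index vector, and a term’s context vector is accumulated as the sum of the index vectors of the documents it occurs in. Rows of an approximate co-occurrence matrix can then be read off the accumulated vectors.

    import numpy as np

    DIM, NONZERO = 1000, 10   # reduced dimensionality and ternary sparsity, chosen arbitrarily
    rng = np.random.default_rng(42)

    def random_index_vector():
        """Sparse ternary vector: a few +1/-1 entries, zeros elsewhere."""
        v = np.zeros(DIM)
        idx = rng.choice(DIM, NONZERO, replace=False)
        v[idx] = rng.choice([1.0, -1.0], NONZERO)
        return v

    docs = [
        "random indexing accumulates context vectors incrementally",
        "context vectors approximate a co-occurrence matrix",
    ]

    doc_index = {i: random_index_vector() for i in range(len(docs))}   # one index vector per context
    term_context = {}                                                  # accumulated term context vectors

    for i, doc in enumerate(docs):
        for term in doc.split():
            term_context.setdefault(term, np.zeros(DIM))
            term_context[term] += doc_index[i]          # "first accumulate the context vectors"

    # A row of the (approximate) co-occurrence matrix is just a term's accumulated vector;
    # term similarity is the cosine between two such rows.
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cos(term_context["context"], term_context["vectors"]))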

I like non-traditional approaches. Some work (like random indexing) and some don’t.

What new/non-traditional approaches have you tried in the last week? We learn as much (if not more) from failure as from success.

Reflective Random Indexing and indirect inference…

Reflective Random Indexing and indirect inference: A scalable method for discovery of implicit connections by Trevor Cohen, Roger Schvaneveldt, Dominic Widdows.

Abstract:

The discovery of implicit connections between terms that do not occur together in any scientific document underlies the model of literature-based knowledge discovery first proposed by Swanson. Corpus-derived statistical models of semantic distance such as Latent Semantic Analysis (LSA) have been evaluated previously as methods for the discovery of such implicit connections. However, LSA in particular is dependent on a computationally demanding method of dimension reduction as a means to obtain meaningful indirect inference, limiting its ability to scale to large text corpora. In this paper, we evaluate the ability of Random Indexing (RI), a scalable distributional model of word associations, to draw meaningful implicit relationships between terms in general and biomedical language. Proponents of this method have achieved comparable performance to LSA on several cognitive tasks while using a simpler and less computationally demanding method of dimension reduction than LSA employs. In this paper, we demonstrate that the original implementation of RI is ineffective at inferring meaningful indirect connections, and evaluate Reflective Random Indexing (RRI), an iterative variant of the method that is better able to perform indirect inference. RRI is shown to lead to more clearly related indirect connections and to outperform existing RI implementations in the prediction of future direct co-occurrence in the MEDLINE corpus.

The term “direct inference” is used for establishing a relationship between terms with a shared “bridging” term. That is, the terms don’t co-occur in a text, but a third term co-occurs with each of them. “Indirect inference,” that is, finding related terms with no shared “bridging” term, is the focus of this paper.
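
Here is a toy Python sketch of why the reflective cycle matters (my own illustration, not the authors’ code; the two documents, dimensions and sparsity are invented): “cat” and “carnivore” never co-occur below, so plain Random Indexing leaves their vectors essentially unrelated, while one reflective pass (rebuilding document vectors from term vectors and then term vectors from those document vectors) lets the connection flow through the shared term “mammal”.

    import numpy as np

    rng = np.random.default_rng(0)
    DIM = 500

    def rand_vec():
        v = np.zeros(DIM)
        idx = rng.choice(DIM, 8, replace=False)
        v[idx] = rng.choice([1.0, -1.0], 8)
        return v

    docs = [["cat", "mammal"], ["mammal", "carnivore"]]   # "cat" and "carnivore" never co-occur

    # Pass 1: plain Random Indexing - term vectors from per-document random index vectors.
    doc_rand = [rand_vec() for _ in docs]
    term = {}
    for d, terms in enumerate(docs):
        for t in terms:
            term[t] = term.get(t, np.zeros(DIM)) + doc_rand[d]

    # Reflective pass: document vectors from term vectors, then term vectors from document vectors.
    doc_vec = [sum(term[t] for t in terms) for terms in docs]
    term_rri = {}
    for d, terms in enumerate(docs):
        for t in terms:
            term_rri[t] = term_rri.get(t, np.zeros(DIM)) + doc_vec[d]

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print("plain RI  :", round(cos(term["cat"], term["carnivore"]), 3))          # near zero
    print("reflective:", round(cos(term_rri["cat"], term_rri["carnivore"]), 3))  # noticeably higher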

BTW, if you don’t have access to the Journal of Biomedical Informatics version, try the draft: Reflective Random Indexing and indirect inference: A scalable method for discovery of implicit connections

August 13, 2011

CAS Registry Number & The Semantic Web

Filed under: Cheminformatics,Identifiers,Indexing — Patrick Durusau @ 3:47 pm

CAS Registry Number

Another approach to the problem of identification: assign an arbitrary identifier for which you hold the key.

If you start early enough in a particular era, you can gain enough of an advantage to deter most competitors. Particularly if you curate the professional literature so that you can provide effective searching based on your (and others’) identifiers.

The similarity to the Semantic Web’s assignment of a URL to every subject is not accidental.

The main differences from the Semantic Web:

  1. Economically important activity was the focus of the project.
  2. Professional literature base with obvious value-add potential for research and production.
  3. Single-source curation of the identifiers (did not whine at others to create them).
  4. Identification where there was user demand to support the effort.

The Wikipedia page reports (in part):

CAS Registry Numbers are unique numerical identifiers assigned by the “Chemical Abstracts Service” to every chemical described in the open scientific literature (currently including those described from at least 1957 through the present) and including elements, isotopes, organic and inorganic compounds, organometallics, metals, alloys, coordination compounds, minerals, and salts; as well as standard mixtures, compounds, polymers; biological sequences including proteins & nucleic acids; nuclear particles, and nonstructurable materials (aka ‘UVCB’s- i.e., materials of Unknown, Variable Composition, or Biological origin). They are also referred to as CAS RNs, CAS Numbers, etc.

The Registry maintained by CAS is an authoritative collection of disclosed chemical substance information. Currently the CAS Registry identifies more than 56 million organic and inorganic substances and 62 million sequences, plus additional information about each substance; and the Registry is updated with an approximate 12,000 additional new substances daily.

Historically, chemicals have been identified by a wide variety of synonyms. Frequently these are arcane and constructed according to regional naming conventions relating to chemical formulae, structures or origins. Well-known chemicals may additionally be known via multiple generic, historical, commercial, and/or black-market names.

PS: The index is now at 61+ million substances.

July 28, 2011

Indexing in Cassandra

Filed under: Cassandra,Indexing — Patrick Durusau @ 6:55 pm

Indexing in Cassandra by Ed Anuff.

As if you haven’t noticed by now, I have a real weakness for indexing and indexing related material.

Interesting coverage of composite indexes.

July 27, 2011

Neo4j: Super-Nodes and Indexed Relationships

Filed under: Indexing,Lucene,Neo4j — Patrick Durusau @ 8:34 am

Neo4j: Super-Nodes and Indexed Relationships by Aleksa Vukotic.

From the post:

As part of my recent work for Open Credo, I have been involved in the project that was using Neo4J Graph database for application storage.

Neo4J is one of the first graph databases to appear on the global market. Being open source, in addition to its power and simplicity in supporting graph data model it represents good choice for production-ready graph database.

However, there has been one area I have struggled to get good-enough performance from Neo4j recently – super nodes.

Super nodes represent nodes with dense relationships (100K or more), which are quickly becoming bottlenecks in graph traversal algorithms when using Neo4J. I have tried many different approaches to get around this problem, but introduction of auto indexing in Neo4j 1.4 gave me an idea that I had success with. The approach I took is to try to fetch relationships of the super nodes using Lucene indexes, instead of using standard Neo APIs. In this entry I’ll share what I managed to achieve and how.

This looks very promising. Particularly the retrieval of only the relationships of interest for traversal. To me that suggests that we can keep indexes of relationships that may not be frequently consulted. I wonder if that means a facility to “expose” more or fewer relationships as the situation requires?
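
To make the pattern concrete, here is a language-agnostic sketch in Python (my own illustration; the post itself uses Neo4j 1.4’s Lucene-backed relationship indexes, which are not shown here): relationships of a dense node are indexed externally by node, type and property, so a traversal can fetch just the relationships of interest instead of iterating over all of them.

    from collections import defaultdict

    # External relationship index: (node_id, rel_type, key, value) -> set of relationship ids.
    # In the post this role is played by a Lucene index; a dict stands in for it here.
    rel_index = defaultdict(set)
    relationships = {}   # rel_id -> (start_node, end_node, rel_type, properties)

    def add_relationship(rel_id, start, end, rel_type, **props):
        relationships[rel_id] = (start, end, rel_type, props)
        for k, v in props.items():
            rel_index[(start, rel_type, k, v)].add(rel_id)

    def relationships_of_interest(node_id, rel_type, key, value):
        """Fetch only the matching relationships of a dense node, instead of scanning them all."""
        return [relationships[r] for r in rel_index[(node_id, rel_type, key, value)]]

    # A "super node" 0 with 100,000 outgoing FOLLOWS relationships, only a handful of them active.
    for i in range(1, 100_001):
        add_relationship(i, 0, i, "FOLLOWS", active=(i % 10_000 == 0))

    print(len(relationships_of_interest(0, "FOLLOWS", "active", True)))   # 10, not 100,000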

July 17, 2011

Building blocks of a scalable web crawler

Filed under: Indexing,NoSQL,Search Engines,Searching,SQL — Patrick Durusau @ 7:29 pm

Building blocks of a scalable web crawler Thesis by Marc Seeger. (2010)

Abstract:

The purpose of this thesis was the investigation and implementation of a good architecture for collecting, analysing and managing website data on a scale of millions of domains. The final project is able to automatically collect data about websites and analyse the content management system they are using.

To be able to do this efficiently, different possible storage back-ends were examined and a system was implemented that is able to gather and store data at a fast pace while still keeping it searchable.

This thesis is a collection of the lessons learned while working on the project combined with the necessary knowledge that went into architectural decisions. It presents an overview of the different infrastructure possibilities and general approaches and as well as explaining the choices that have been made for the implemented system.

From the conclusion:

The implemented architecture has been recorded processing up to 100 domains per second on a single server. At the end of the project the system gathered information about approximately 100 million domains. The collected data can be searched instantly and the automated generation of statistics is visualized in the internal web interface.

Most of your clients will have more modest information demands, but the lessons here will stand you in good stead with their systems too.

July 9, 2011

Indexing in Cassandra

Filed under: Cassandra,Indexing — Patrick Durusau @ 7:00 pm

Indexing in Cassandra

From the post:

I’m writing this up because there’s always quite a bit of discussion on both the Cassandra and Hector mailing lists about indexes and the best ways to use them. I’d written a previous post about Secondary indexes in Cassandra last July, but there are a few more options and considerations today. I’m going to do a quick run through of the different approaches for doing indexes in Cassandra so that you can more easily navigate these and determine what’s the best approach for your application.

Good article on indexes in Cassandra.

July 4, 2011

OrganiK Knowledge Management System

Filed under: Filters,Indexing,Knowledge Management,Recommendation,Text Analytics — Patrick Durusau @ 6:03 pm

OrganiK Knowledge Management System (wiki)

OrganiK Knowledge Management System (homepage)

I encountered the OrganiK project while searching for something else (naturally). 😉

From the homepage:

Objectives of the Project

The aim of the OrganiK project is to research and develop an innovative knowledge management system that enables the semantic fusion of enterprise social software applications. The system accumulates information that can be exchanged among one or several collaborating companies. This enables an effective management of organisational knowledge and can be adapted to functional requirements of smaller and knowledge-intensive companies.

Main distinguishing features

The set of OrganiK KM Client Interfaces comprises of a Wiki, a Blog, a Social Bookmarking and a Search Component that together constitute a Collaborative Workspace for SME knowledge workers. Each of the components consists of a Web-based client interface and a corresponding server engine.
The components that comprise the Business Logic Layer of the OrganiK KM Server are:

  • the Recommender System,
  • the Semantic Text Analyser,
  • the Collaborative Filtering Engine
  • the Full-text Indexer

Interesting project but the latest news item dates from 2008. Not encouraging.

I checked the source code and the most recent update was August, 2010. Much more encouraging.

I have written asking for more recent news.

July 1, 2011

Indexing The World Wide Web:…

Filed under: Indexing,Search Algorithms,Search Engines,Searching — Patrick Durusau @ 2:57 pm

Indexing The World Wide Web: The Journey So Far by Abhishek Das and Ankit Jain.

Abstract:

In this chapter, we describe the key indexing components of today’s web search engines. As the World Wide Web has grown, the systems and methods for indexing have changed significantly. We present the data structures used, the features extracted, the infrastructure needed, and the options available for designing a brand new search engine. We highlight techniques that improve relevance of results, discuss trade-offs to best utilize machine resources, and cover distributed processing concept in this context. In particular, we delve into the topics of indexing phrases instead of terms, storage in memory vs. on disk, and data partitioning. We will finish with some thoughts on information organization for the newly emerging data-forms.

A non-trivial survey of attempts at, and issues with, indexing the web. This is going to take a while to digest but it looks like a very good starting place to uncover what to try next.

Apache Lucene 3.3 / Solr 3.3

Filed under: Indexing,Lucene,Search Engines,Solr — Patrick Durusau @ 2:47 pm

Lucene 3.3 Announcement

Lucene Features:

  • The spellchecker module now includes suggest/auto-complete functionality, with three implementations: Jaspell, Ternary Trie, and Finite State.
  • Support for merging results from multiple shards, for both “normal” search results (TopDocs.merge) as well as grouped results using the grouping module (SearchGroup.merge, TopGroups.merge).
  • An optimized implementation of KStem, a less aggressive stemmer for English.
  • Single-pass grouping implementation based on block document indexing.
  • Improvements to MMapDirectory (now also the default implementation returned by FSDirectory.open on 64-bit Linux).
  • NRTManager simplifies handling near-real-time search with multiple search threads, allowing the application to control which indexing changes must be visible to which search requests.
  • TwoPhaseCommitTool facilitates performing a multi-resource two-phased commit, including IndexWriter.
  • The default merge policy, TieredMergePolicy, has a new method (set/getReclaimDeletesWeight) to control how aggressively it targets segments with deletions, and is now more aggressive than before by default.
  • PKIndexSplitter tool splits an index by a mid-point term.

Solr 3.3 Announcement

Solr Features:

  • Grouping / Field Collapsing
  • A new, automaton-based suggest/autocomplete implementation offering an order of magnitude smaller RAM consumption.
  • KStemFilterFactory, an optimized implementation of a less aggressive stemmer for English.
  • Solr defaults to a new, more efficient merge policy (TieredMergePolicy). See http://s.apache.org/merging for more information.
  • Important bugfixes, including extremely high RAM usage in spellchecking.
  • Bugfixes and improvements from Apache Lucene 3.3

June 22, 2011

Open Specification Interactive Pivot

Filed under: Indexing,Navigation,Silverlight — Patrick Durusau @ 6:39 pm

Open Specification Interactive Pivot

Uses Silverlight technology to provide navigation across most of Microsoft’s Open Specifications (more are coming).

I had to switch to my IE (version 8) browser to get it to work but I guess it really didn’t need a “best if viewed with IE * or later” warning label. 😉

Impressive work and not just for the search/browsing capabilities. The more such information becomes available, the easier it is to illustrate the varying semantics even within one corporate development domain.

Not that varying semantics are a bad thing; on the contrary, they are perfectly natural. But in some cases we may need to overcome them for particular purposes. The first step in that process is recognition of the varying semantics.

June 21, 2011

Heavy Committing: DocValues aka. Column Stride Fields in Lucene 4.0

Filed under: Indexing,Lucene,Search Engines — Patrick Durusau @ 7:10 pm

Heavy Committing: DocValues aka. Column Stride Fields in Lucene 4.0 by Simon Willnauer, Apache Lucene PMC.

Abstract:

Lucene 4.0 is on its way to deliver a tremendous amount of new features and improvements. Beside Real-Time Search & Flexible Indexing DocValues aka. Column Stride Fields is one of the “next generation” features. DocValues enable Lucene to efficiently store and retrieve type-safe Document & Value pairs in a column stride fashion either entirely memory resident random access or disk resident iterator based without the need to un-invert fields. It’s final goal is to provide a independently update-able per document storage for scoring, sorting or even filtering. This talk will introduce the current state of development, implementation details, its features and how DocValues have been integrated into Lucene’s Codec API for full extendability.

Excellent video!

June 7, 2011

Marketing Indexing

Filed under: Indexing,Marketing,Topic Maps — Patrick Durusau @ 6:25 pm

The episodic “oh, woe is topic maps! We aren’t as successful as ..(insert some semantic technology)..” posts are back on topicmapmail@infoloom.com. I don’t dispute that topic maps could improve their market share. I remember the “we’re #2, so we try harder” advertising campaign and take our present position as a reason to try harder, not to bewail our fate as ordained.

Let’s talk about how to market something closely related to topic maps, indexing.

I come to you with this great new idea, indexing. Instead of starting on page 1 and going through page n every time a reader wants to find information, the index points right to it. A real time saver.

You get excited and so we discuss two different marketing approaches:

1) We can do presentations, papers, demos, etc., on the theory of indexing and models of indexing, write software that does indexing (with a lot of effort), etc.

or,

2) We present a publisher/reviewer/reader with a copy of a book without an index, a copy of the same book with an index, and a list of ten subjects to find in the book.

Show of hands. Which one do you think would be more effective?

June 6, 2011

Apache Lucene 3.2 / Solr 3.2

Filed under: Indexing,Lucene,Search Engines,Solr — Patrick Durusau @ 1:54 pm

Apache Lucene 3.2 / Solr 3.2 released!

From the website:

Lucene can be downloaded from http://www.apache.org/dyn/closer.cgi/lucene/java/ and Solr can be downloaded from http://www.apache.org/dyn/closer.cgi/lucene/solr/

Highlights of the Lucene release include:

  • A new grouping module, under lucene/contrib/grouping, enables search results to be grouped by a single-valued indexed field
  • A new IndexUpgrader tool fully converts an old index to the current format.
  • A new Directory implementation, NRTCachingDirectory, caches small segments in RAM, to reduce the I/O load for applications with fast NRT reopen rates.
  • A new Collector implementation, CachingCollector, is able to gather search hits (document IDs and optionally also scores) and then replay them. This is useful for Collectors that require two or more passes to produce results.
  • Index a document block using IndexWriter’s new addDocuments or updateDocuments methods. These experimental APIs ensure that the block of documents will forever remain contiguous in the index, enabling interesting future features like grouping and joins.
  • A new default merge policy, TieredMergePolicy, which is more efficient due to being able to merge non-contiguous segments. See http://s.apache.org/merging for details.
  • NumericField is now returned correctly when you load a stored document (previously you received a normal Field back, with the numeric value converted to a string).
  • Deleted terms are now applied during flushing to the newly flushed segment, which is more efficient than having to later initialize a reader for that segment.

Highlights of the Solr release include:

  • Ability to specify overwrite and commitWithin as request parameters when using the JSON update format.
  • TermQParserPlugin, useful when generating filter queries from terms returned from field faceting or the terms component.
  • DebugComponent now supports using a NamedList to model Explanation objects in its responses instead of Explanation.toString.
  • Improvements to the UIMA and Carrot2 integrations.
  • Highlighting performance improvements.
  • A test-framework jar for easy testing of Solr extensions.
  • Bugfixes and improvements from Apache Lucene 3.2.

DiscoverText Free Tutorial Webinar

Filed under: Classifier,Indexing,Searching,Software — Patrick Durusau @ 1:53 pm

DiscoverText Free Tutorial Webinar

Tuesday June 7 at 12:00 PM EST (Noon)

From the webinar announcement:

This Webinar introduces new and existing DiscoverText users to the basic document ingest, search & code features, takes your questions, and demonstrates our newest tool, a machine-learning classifier that is currently in beta testing. This is also a chance to preview our “New Navigation” and advanced filters.

DiscoverText’s latest additions to our “Do it Yourself” platform can be easily trained to perform customized mood, sentiment and topic classification. Any custom classification scheme or topic model can be created and implemented by the user. You can also generate tag clouds and drill into the most frequently occurring terms or use advanced search and filters to create “buckets” of text.

The system makes it possible to capture, share and crowd source text data analysis in novel ways. For example, you can collect text content off Facebook, Twitter & YouTube, as well as other social media or RSS feeds. Dataset owners can assign their “peers” to coding tasks. It is simple to measure the reliability of two or more coder’s choices. A distinctive feature is the ability to adjudicate coder choices for training purposes or to report validity by code, coder or project.

So please join us Tuesday June 7 at 12:00 PM EST (Noon) for an interactive Webinar. Find out why sorting thousands of items from social media, email and electronic document repositories is easier than ever. Participants in the Webinar will be invited to become beta testers of the new classification application.

I haven’t tried the software, free version or otherwise but will try to attend the webinar and report back.

May 18, 2011

ICON Programming for Humanists, 2nd edition

Filed under: Data Mining,Indexing,Text Analytics,Text Extraction — Patrick Durusau @ 6:50 pm

ICON Programming for Humanists, 2nd edition

From the foreword to the first edition:

This book teaches the principles of Icon in a very task-oriented fashion. Someone commented that if you say “Pass the salt” in correct French in an American university you get an A. If you do the same thing in France you get the salt. There is an attempt to apply this thinking here. The emphasis is on projects which might interest the student of texts and language, and Icon features are instilled incidentally to this. Actual programs are exemplified and analyzed, since by imitation students can come to devise their own projects and programs to fulfill them. A number of the illustrations come naturally enough from the field of Stylistics which is particularly apt for computerized approaches.

I can’t say that the success of ICON is a recommendation for task-oriented teaching but as I recall the first edition, I thought it was effective.

Data mining of texts is an important skill in the construction of topic maps.

This is a very good introduction to that subject.

April 26, 2011

It’s All About the Librarian! New Paradigms in Enterprise Discovery and Awareness – Post

Filed under: Associations,Indexing,Librarian/Expert Searchers — Patrick Durusau @ 2:16 pm

It’s All About the Librarian! New Paradigms in Enterprise Discovery and Awareness

This is a must read post by Jeff Jonas.

I won’t spoil your fun but Jeff defines terms such as:

  • Context-less Card Catalogs
  • Semantically Reconciled Directories
  • Semantically Reconciled and Relationship Aware Directories

and a number of others.

Looks very much like he is interested in the same issues as topic maps.

Take the time to read it and see what you think.

April 22, 2011

Full-Text Indexing for Optimizing Selection Operations in Large-Scale Data Analytics

Filed under: Data Analysis,Hadoop,Indexing,MapReduce — Patrick Durusau @ 1:04 pm

Full-Text Indexing for Optimizing Selection Operations in Large-Scale Data Analytics by Jimmy Lin, Dmitriy Ryaboy, and Kevin Weil.

Abstract:

MapReduce, especially the Hadoop open-source implementation, has recently emerged as a popular framework for large-scale data analytics. Given the explosion of unstructured data begotten by social media and other web-based applications, we take the position that any modern analytics platform must support operations on free-text fields as first-class citizens. Toward this end, this paper addresses one inefficient aspect of Hadoop-based processing: the need to perform a full scan of the entire dataset, even in cases where it is clearly not necessary to do so. We show that it is possible to leverage a full-text index to optimize selection operations on text fields within records. The idea is simple and intuitive: the full-text index informs the Hadoop execution engine which compressed data blocks contain query terms of interest, and only those data blocks are decompressed and scanned. Experiments with a proof of concept show moderate improvements in end-to-end query running times and substantial savings in terms of cumulative processing time at the worker nodes. We present an analytical model and discuss a number of interesting challenges: some operational, others research in nature.
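
A stripped-down Python sketch of that idea (mine, not the paper’s code; the block size and records are invented): a block-level full-text index maps terms to the compressed data blocks that contain them, so a selection on a text field decompresses and scans only the matching blocks.

    import zlib
    from collections import defaultdict

    BLOCK_SIZE = 2                       # records per compressed block (tiny, for illustration)
    records = [
        "user tweeted about hadoop", "user tweeted about cats",
        "analytics platform update",  "full-text selection in hadoop",
        "unrelated status message",   "another unrelated message",
    ]

    # Write compressed blocks and build the block-level full-text index.
    blocks, block_index = [], defaultdict(set)
    for start in range(0, len(records), BLOCK_SIZE):
        block_id = len(blocks)
        chunk = records[start:start + BLOCK_SIZE]
        blocks.append(zlib.compress("\n".join(chunk).encode()))
        for rec in chunk:
            for term in rec.split():
                block_index[term].add(block_id)

    def select(term):
        """Decompress and scan only the blocks the index says contain the term."""
        hits = []
        for block_id in sorted(block_index.get(term, ())):
            for rec in zlib.decompress(blocks[block_id]).decode().split("\n"):
                if term in rec.split():
                    hits.append(rec)
        return hits

    print(select("hadoop"))   # touches blocks 0 and 1 only; block 2 is never decompressed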

I always hope when I see first-class citizen(s) in CS papers that it is going to be talking about data structures and/or metadata (hopefully both).

Alas, I was disappointed once again but the paper is an interesting one and will repay close study.

Oh, the reason I mention treating data structures and metadata as first-class citizens is that then I can avoid the my way, your way, or the highway sort of choices when it comes to metadata and formats.

Granted, some formats may be easier to use in some contexts, such as HDF5 (for space data), FITS (astronomical images), XML (for data and documents) or COBOL (for financial transactions), but if I can see formats as first-class citizens, then I can map between them.

Not in a conversion sense: I can see them as though they were whichever format I prefer. Extract data from them, write data to them, etc.

April 4, 2011

Lucene Scoring API

Filed under: Indexing,Lucene — Patrick Durusau @ 6:34 pm

Lucene Scoring API

The documentation for the Lucene scoring API makes for very interesting reading.

In more ways than one.

Important for anyone who wants to understand the scoring of documents by Lucene, which will influence the usefulness of searches in your particular domain.

But I think it is also important because it emphasizes that the scoring is for documents and not subjects.

It is a very useful thing to score documents, because it (hopefully) puts the ones most relevant to a search at or near the top of search results.

But isn’t that similar to the last mile problem with high speed internet delivery?

That is, it is one thing to get high-speed internet service to the local switching office. It is quite another to get it to each home, hence the last mile problem.

An indexing solution like Lucene can, maybe, get you to the right document for a search but that leaves you to go the last mile in terms of finding the subject of interest in the article.

And, just as importantly, relating that subject to other information about the same subject.

True enough, I was doing that very thing with print indexes and hard copy long before the arrival of widespread full-text indexes and on-demand versions of texts.

It seems like a terrible waste of time and resources for everyone interested in a particular subject to have to dig information out of documents and then that cycle is repeated every time someone looks up that subject and finds a particular document.

We all keep running the last semantic mile.

The question is what would motivate us to shorten that to say the last 1/2 semantic mile, or less?

March 29, 2011

Reverted Indexing

Filed under: Indexing,Information Retrieval,Query Expansion — Patrick Durusau @ 12:47 pm

Reverted Indexing

From the website:

Traditional interactive information retrieval systems function by creating inverted lists, or term indexes. For every term in the vocabulary, a list is created that contains the documents in which that term occurs and its frequency within each document. Retrieval algorithms then use these term frequencies alongside other collection statistics to identify matching documents for a query.

Term-based search, however, is just one example of interactive information seeking. Other examples include offering suggestions of documents similar to ones already found, or identifying effective query expansion terms that the user might wish to use. More generally, these fall into several categories: query term suggestion, relevance feedback, and pseudo-relevance feedback.

We can combine the inverted index with the notion of retrievability to create an efficient query expansion algorithm that is useful for a number of applications, such as query expansion and relevance (and pseudo-relevance) feedback. We call this kind of index a reverted index because rather than mapping terms onto documents, it maps document ids onto queries that retrieved the associated documents.
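
A small Python sketch of the reverted index as I read it (my own illustration, not the authors’ implementation; the documents and basis queries are invented): run a set of basis queries against an ordinary inverted index, index each retrieved document id by the queries that retrieved it, and then suggest expansion terms for a new query by looking up which basis queries retrieve its top documents.

    from collections import defaultdict

    docs = {
        1: "inverted index maps terms to documents",
        2: "reverted index maps documents to queries",
        3: "query expansion with relevance feedback",
    }

    # Ordinary inverted index: term -> {doc ids}.
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            inverted[term].add(doc_id)

    # Reverted index: doc id -> {basis queries that retrieve it}.
    basis_queries = ["index", "documents", "queries", "expansion", "feedback"]
    reverted = defaultdict(set)
    for q in basis_queries:
        for doc_id in inverted.get(q, ()):
            reverted[doc_id].add(q)

    def expansion_terms(query_term, top_k=2):
        """Suggest expansion terms: basis queries that retrieve the same documents."""
        suggestions = set()
        for doc_id in list(inverted.get(query_term, ()))[:top_k]:
            suggestions |= reverted[doc_id]
        return suggestions - {query_term}

    print(expansion_terms("index"))   # e.g. {'documents', 'queries'}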

As to its performance:

….the short answer is that our query expansion technique outperforms PL2 and Bose-Einstein algorithms (as implemented in Terrier) by 15-20% on several TREC collections. This is just a first stab at implementing and evaluating this indexing, but we are quite excited by the results.

An interesting example of innovative thinking about indexing.

With a useful result.

March 28, 2011

Watson – Indexing – Human vs. Computer

Filed under: Indexing,Searching — Patrick Durusau @ 10:07 am

In The importance of theories of knowledge: Indexing and information retrieval as an example [1], Birger Hjørland reviews a deeply flawed study by Lykke and Eslau, Using Thesauri in Enterprise Settings: Indexing or Query Expansion? [2], which concludes in part:

As human indexing is costly, it could be useful and productive to use the human indexer to assign other types of metadata such as contextual metadata, and leave the subject indexing to the computer. (Lykke and Eslau, p. 94)

Hjørland outlines a number of methodological shortcomings of the study which I won’t repeat here.

I would add to the concerns voiced by Hjørland the failure of the paper to account for known indexing issues such as those encountered in Blair and Maron’s An evaluation of retrieval effectiveness for a full-text document-retrieval system (see Size Really Does Matter…), which was published in 1985. If, more than twenty-five years later, some researchers are not yet aware of the complexities of indexing, one despairs of making genuine progress.

The Text REtrieval Conference (TREC) routinely discusses the complexities of indexing so it isn’t simply a matter of historical (I suppose 25 years qualifies as “historical” in a CS context) literature.

Lykke and Eslau don’t provide enough information to evaluate their findings but it appears they may have proven that it is possible for people to index so poorly that a computer search gives a better result.

Is that a Watson moment?


1. Hjørland, B. (2011). The importance of theories of knowledge: Indexing and information retrieval as an example. Journal of the American Society for Information Science & Technology, 62(1), 72-77.

2. Lykke, M., and Eslau, A.G. (2010). Using thesauri in enterprise settings: Indexing or query expansion? in B. Larsen, J.W. Schneider & F. Aström (Eds.), The Janus faced scholar. A festschrift in honor of Peter Ingwersen (pp. 87-97). Copenhagen: Royal School of Library and Information Science. (Special volume of the ISSI e-newsletter, Vol. 06-S, June 2010). Retrieved March 25, 2011, from http://www.issi-society.info/peteringwersen/pdf_online.pdf

March 21, 2011

Homonymous Authors

Filed under: Homonymous,Indexing — Patrick Durusau @ 8:53 am

A method for eliminating articles by homonymous authors from the large number of articles retrieved by author search.

Onodera, Natsuo, Mariko Iwasawa, Nobuyuki Midorikawa, Fuyuki Yoshikane, Kou Amano, Yutaka Ootani, Tadashi Kodama, Yasuhiko Kiyama, Hiroyuki Tsunoda, and Shizuka Yamazaki. 2011. “A method for eliminating articles by homonymous authors from the large number of articles retrieved by author search.” Journal of the American Society for Information Science & Technology 62, no. 4: 677-690.

Abstract:

This paper proposes a methodology which discriminates the articles by the target authors (‘true’ articles) from those by other homonymous authors (‘false’ articles). Author name searches for 2,595 ‘source’ authors in six subject fields retrieved about 629,000 articles. In order to extract true articles from the large amount of the retrieved articles, including many false ones, two filtering stages were applied. At the first stage any retrieved article was eliminated as false if either its affiliation addresses had little similarity to those of its source article or there was no citation relationship between the journal of the retrieved article and that of its source article. At the second stage, a sample of retrieved articles was subjected to manual judgment, and utilizing the judgment results, discrimination functions based on logistic regression were defined. These discrimination functions demonstrated both the recall ratio and the precision of about 95% and the accuracy (correct answer ratio) of 90-95%. Existence of common coauthor(s), address similarity, title words similarity, and interjournal citation relationships between the retrieved and source articles were found to be the effective discrimination predictors. Whether or not the source author was from a specific country was also one of the important predictors. Furthermore, it was shown that a retrieved article is almost certainly true if it was cited by, or cocited with, its source article. The method proposed in this study would be effective when dealing with a large number of articles whose subject fields and affiliation addresses vary widely.
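
A toy Python sketch of the second-stage discrimination function (my own illustration; the predictors follow the abstract, but the weights, bias and feature values are invented, whereas the study fits them by logistic regression on manually judged samples): score each retrieved article against its source article and keep it as a “true” article only if the score clears a threshold.

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Invented weights standing in for the fitted discrimination function.
    WEIGHTS = {"shared_coauthor": 2.5, "address_sim": 1.8, "title_sim": 1.2, "journal_citation": 1.5}
    BIAS = -3.0

    def is_true_article(features, threshold=0.5):
        """Return True if the retrieved article is judged to be by the source author."""
        score = BIAS + sum(WEIGHTS[k] * v for k, v in features.items())
        return sigmoid(score) >= threshold

    retrieved = {"shared_coauthor": 1.0, "address_sim": 0.7, "title_sim": 0.3, "journal_citation": 1.0}
    homonym   = {"shared_coauthor": 0.0, "address_sim": 0.1, "title_sim": 0.2, "journal_citation": 0.0}

    print(is_true_article(retrieved))   # True  - likely by the target author
    print(is_true_article(homonym))     # False - likely by a homonymous author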

Interesting study of heuristics that may be of assistance in creating topic maps from academic literature.

I suspect there are other “patterns” as it were in other forms of information that await discovery.

MG4J – Managing Gigabytes for Java

Filed under: Indexing,Search Engines,Searching — Patrick Durusau @ 8:52 am

MG4J – Managing Gigabytes for Java

From the website:

The main points of MG4J are:

  • Powerful indexing. Support for document collections and factories makes it possible to analyse, index and query consistently large document collections, providing easy-to-understand snippets that highlight relevant passages in the retrieved documents.
  • Efficiency. We do not provide meaningless data such as “we index x GiB per second” (with which configuration? which language? which data source?)—we invite you to try it. MG4J can index without effort the TREC GOV2 collection (document factories are provided to this purpose) and scales to hundreds of millions of documents.
  • Multi-index interval semantics. When you submit a query, MG4J returns, for each index, a list of intervals satisfying the query. This provides the base for several high-precision scorers and for very efficient implementation of sophisticated operators. The intervals are built in linear time using new research algorithms.
  • Expressive operators. MG4J goes far beyond the bag-of-words model, providing efficient implementation of phrase queries, proximity restrictions, ordered conjunction, and combined multiple-index queries. Each operator is represented internally by an abstract object, so you can easily plug in your favourite syntax.
  • Virtual fields. MG4J supports virtual fields—fields containing text for a different, virtual document; the typical example is anchor text, which must be attributed to the target document.
  • Flexibility. You can build much smaller indices by dropping term positions, or even term counts. It’s up to you. Several different types of codes can be chosen to balance efficiency and index size. Documents coming from a collection can be renumbered (e.g., to match a static rank or experiment with indexing techniques).
  • Openness. The document collection/factory interfaces provide an easy way to present your own data representation to MG4J, making it a breeze to set up a web-based search engine accessing directly your data. Every element along the path of query resolution (parsers, document-iterator builders, query engines, etc.) can be substituted with your own versions.
  • Distributed processing. Indices can be built for a collection split in several parts, and combined later. Combination of indices allows non-contiguous indices and even the same document can be split across different collections (e.g., when indexing anchor text).
  • Multithreading. Indices can be queried and scored concurrently.
  • Clustering. Indices can be clustered both lexically and documentally (possibly after a partitioning). The clustering system is completely open, and user-defined strategies decide how to combine documents from different sources. This architecture makes it possible, for instance, to load in RAM the part of an index that contains terms appearing more frequently in user queries.

March 18, 2011

Complex Indexing?

Filed under: Indexing,Subject Identity,Topic Maps — Patrick Durusau @ 6:52 pm

The post The Joy of Indexing made me think about the original use case for topic maps, the merging of indexes prepared by different authors.

Indexing that relies either on a token in the text (simple indexing) or on a contextual clue (the compound indexing mentioned in the Joy of Indexing post) falls short in terms of enabling the merging of indexes.

Why?

In my comments on the Joy of Indexing I mentioned that what we need is a subject indexing engine.

That is an engine that indexes the subjects that appear in a text and not merely the manner of their appearance.

(Jack Park, topic map advocate and my friend would say I am hand waving at this point so perhaps an example will help.)

Say that I have a text where I use the words George Washington.

That could be a reference to the first president of the United States or it could be a reference to George Washington rabbit (my wife is a children’s librarian).

A simple indexing engine could not distinguish one from the other.

A compound indexing engine might list one under Presidents and the other under Characters but without more in the example we don’t know for sure.

A complex indexing engine, that is, one that takes into account more than simply the token in the text (say, one that creates its entry from that token plus other attributes of the subject it represents), would not mistake a president for a rabbit or vice versa.

Take Lucene for example. For any word in a text, it records

The position increment, start, and end offsets and payload are the only additional metadata associated with the token that is recorded in the index.

That pretty much isolates the problem in a nutshell. If that is all the metadata we get, which isn’t much, the likelihood we are going to do any reliable subject matching is pretty low.

Not to single Lucene out, I think all the search engines operate pretty much the same way.

To return to our example, what if while indexing, when we encounter George Washington, instead of the bare token we record, respectively:

George Washington – Class = Mammalia

George Washington – Class = Mammalia

Hmmm, that didn’t help much did it?

How about:

George Washington – Class = Mammalia Order = Primate

George Washington – Class = Mammalia Order = Lagomorpha

So that I can distinguish these two cases but can also ask for all instances of class = Mammalia.
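
A small Python sketch of what such “complex” index entries might look like (my own illustration; the class and order values are the ones from the example above, and the document names are invented): each posting carries attributes of the subject alongside the token, so the index can be queried by token, by order, or by class.

    from collections import defaultdict

    # Each posting records the token plus attributes of the subject it represents.
    postings = [
        {"token": "George Washington", "doc": "continental_congress.txt", "pos": 1042,
         "class": "Mammalia", "order": "Primate"},
        {"token": "George Washington", "doc": "rabbit_story.txt", "pos": 17,
         "class": "Mammalia", "order": "Lagomorpha"},
    ]

    # Index by token and by each subject attribute.
    index = defaultdict(list)
    for p in postings:
        index[("token", p["token"])].append(p)
        index[("class", p["class"])].append(p)
        index[("order", p["order"])].append(p)

    # The same token, but the president and the rabbit stay distinct...
    print([p["doc"] for p in index[("order", "Primate")]])      # ['continental_congress.txt']
    print([p["doc"] for p in index[("order", "Lagomorpha")]])   # ['rabbit_story.txt']

    # ...while a query for all instances of class = Mammalia still gathers both.
    print([p["doc"] for p in index[("class", "Mammalia")]])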

Of course the trick is that no automated system is likely to make that sort of judgement reliably, at least left to its own devices.

But it doesn’t have to does it?

Imagine that I am interested in U.S. history and want to prepare an index of the Continental Congress proceedings. I could simply create an index by tokens, but that will encounter all the problems we know come from merging indexes, or from searching across tokens as seen by such indexes. See Google for example.

But, what if I indexed the Continental Congress proceedings using more complex tokens? Ones that had multiple properties that could be indexed for one subject and that could exist in relationship to other subjects?

That is, for some body of material, what if I declared the subjects to be identified and what would be known about them post-identification?

A declarative model of subject identity. (There are other, equally legitimate, models of identity, that I will be covering separately.)

More on the declarative model anon.

March 17, 2011

The Joy of Indexing

Filed under: Indexing,MongoDB,NoSQL,Subject Identity — Patrick Durusau @ 6:52 pm

The Joy of Indexing

Interesting blog post on indexing by Kyle Banker of MongoDB.

Recommended in part to understanding the limits of traditional indexing.

Ask yourself, what is the index in Kyle’s examples indexing?

Kyle says the example are indexing recipes but is that really true?

Or is it the case that the index is indexing the occurrence of a string at a location in the text?

Not exactly the same thing.

That is to say there is a difference between a token that appears in a text and a subject we think about when we see that token.

It is what enables us to say that two or more words that are spelled differently are synonyms.

Something other than the two words as strings is what we are relying on to make the claim they are synonyms.

A traditional indexing engine, of the sort described here, can only index the strings it encounters in the text.

What would be more useful would be an indexing engine that indexed the subjects in a text.

I think we would call such a subject-indexing engine a topic map engine. Yes?

Questions:

  1. Do you agree/disagree that a word indexing engine is not a subject indexing engine? (3-5 pages, no citations)
  2. What would you change about a word indexing engine (if anything) to make it a subject indexing engine? (3-5 pages, no citations)
  3. What texts/subjects would you use as test cases for your engine? (3-5 pages, citations of the test documents)

February 13, 2011

Apache Lucene 3.0 Tutorial

Filed under: Authoring Topic Maps,Indexing,Lucene — Patrick Durusau @ 1:34 pm

Apache Lucene 3.0 Tutorial by Bob Carpenter.

At 20 pages it isn’t your typical “Hello World” introduction. 😉

It should be the first document you hand a semi-technical person about Lucene.

Discovering the vocabulary of the documents/domain for which you are building a topic map is a critical first step.

Indexing documents gives you an important check on the accuracy and completeness of information you are given by domain “experts” and users.

There will be terms that are transparent to them and can only be clarified if you ask.

Text Analysis with LingPipe, Version 0.3

Filed under: Authoring Topic Maps,Indexing,LingPipe — Patrick Durusau @ 1:30 pm

Text Analysis with LingPipe 4 (Version 0.3) By Bob Carpenter.

On the importance of this book see: LingPipe Home.

February 4, 2011

Use The Index, Luke

Filed under: Indexing — Patrick Durusau @ 5:11 am

Use The Index, Luke (A Guide to SQL Performance by Markus Winand) is an interesting site devoted to improving the use of B-tree indexes in relational databases.

Improving the use of indexes in general is a good idea and, given the use of relational databases to persist topic maps, it seemed appropriate to mention it here.

It would be interesting to see comparisons of the uses of B-tree and other indexing structures for a known topic map.

Personally I suspect that the amount of local memory would make more of an impact than any algorithm, but that would depend on whether what is being measured is access to stored topic maps or the merging of topic maps. That is another open research question.

If this resource helps with planning persistence of your topic maps or if you have other comments about indexing and/or this resource, please post them.

January 25, 2011

NAQ Tree in Your Forest?

Effectiveness of NAQ-tree as index structure for similarity search in high-dimensional metric space by Ming Zhang and Reda Alhajj.

Keywords: Knn search, High dimensionality, Dimensionality reduction, Indexing, Similarity search

Abstract:

Similarity search (e.g., k-nearest neighbor search) in high-dimensional metric space is the key operation in many applications, such as multimedia databases, image retrieval and object recognition, among others. The high dimensionality and the huge size of the data set require an index structure to facilitate the search. State-of-the-art index structures are built by partitioning the data set based on distances to certain reference point(s). Using the index, search is confined to a small number of partitions. However, these methods either ignore the property of the data distribution (e.g., VP-tree and its variants) or produce non-disjoint partitions (e.g., M-tree and its variants, DBM-tree); these greatly affect the search efficiency. In this paper, we study the effectiveness of a new index structure, called Nested-Approximate-eQuivalence-class tree (NAQ-tree), which overcomes the above disadvantages. NAQ-tree is constructed by recursively dividing the data set into nested approximate equivalence classes. The conducted analysis and the reported comparative test results demonstrate the effectiveness of NAQ-tree in significantly improving the search efficiency.

Although I think the following paragraph from the paper is more interesting:

Consider a set of objects O = {o1 , o2 , . . . , on } and a set of attributes A = {a1 , a2 , . . . , ad }, we first divide the objects into groups based on the first attribute a1 , i.e., objects with same value of a1 are put in the same group; each group is an equivalence class [23] with respect to a1 . In other words, all objects in a group are indistinguishable by attribute a1 . We can refine the equivalence classes further by dividing each existing equivalence class into groups based on the second attribute a2 ; all objects in a refined equivalence class are indistinguishable by attributes a1 and a2 . This process may be repeated by adding one more attribute at a time until all the attributes are considered. Finally, we get a hierarchical set of equivalence classes, i.e., a hierarchical partitioning of the objects. This is roughly the basic idea of NAQ-tree, i.e., to partition the data space in our similarity search method. In other words, given a query object o, we can gradually reduce the search space by gradually considering the most relevant attributes.
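
A compact Python sketch of the partitioning idea in that paragraph (mine, not the paper’s NAQ-tree implementation; the objects and attributes are invented): recursively split the objects into equivalence classes one attribute at a time, producing the hierarchical partitioning that a search can then prune attribute by attribute.

    def partition(objects, attributes):
        """Recursively group objects into nested equivalence classes, one attribute at a time."""
        if not attributes:
            return objects
        attr, rest = attributes[0], attributes[1:]
        groups = {}
        for obj in objects:
            groups.setdefault(obj[attr], []).append(obj)
        return {value: partition(group, rest) for value, group in groups.items()}

    objects = [
        {"a1": "red",  "a2": "small", "name": "o1"},
        {"a1": "red",  "a2": "large", "name": "o2"},
        {"a1": "blue", "a2": "small", "name": "o3"},
    ]

    tree = partition(objects, ["a1", "a2"])
    # tree is {'red': {'small': [o1], 'large': [o2]}, 'blue': {'small': [o3]}} (objects abbreviated);
    # a query with a1 = 'red' restricts search to the 'red' subtree, adding a2 narrows it further.
    print(tree["red"]["small"][0]["name"])   # o1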

With the caveat that this technique is focused on metric spaces.

But I rather like the idea of reducing the search space by the attributes under consideration. Replace search space with similarity/sameness space and you will see what I mean. Still relevant for searching as well.

January 19, 2011

Curation is the New Search is the New Curation – Post

Filed under: Indexing,Search Engines,Search Interface,Searching — Patrick Durusau @ 1:22 pm

Curation is the New Search is the New Curation

Paul Kedrosky sees a return to curation as the next phase in searching. In part because search algorithms can be gamed… but read the post. He has an interesting take on the problem.

The one comment I would add is that curation will mean not everything is curated.

Should it be?

What criteria would you use for excluding material to be curated from your index of (insert your favorite topic)?

Proposition: It is an error to think everything that can be searched is worth indexing (or curation).
