Full-Text Search « Another Word For It

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

September 13, 2013

Client-side full-text search in CSS

Filed under: CSS3,Full-Text Search,Searching — Patrick Durusau @ 4:40 pm

Client-side full-text search in CSS by François Zaninotto.

Not really “full-text search” in any meaningful sense of the phrase.

But I can imagine it being very useful and the comments to his post about “appropriate” use of CSS are way off base.

The only value of CSS or Javascript or (fill in your favorite technology) is creation and/or delivery of content to a user.

Despite some naming issues, this has the potential to deliver content to users.

You may have other criteria that influence your choice of mechanisms, but “appropriate” should not be one of them.
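For anyone who has not read the post: as I understand it, the trick is to put the searchable text of each item into an attribute and let a CSS attribute substring selector hide every item that does not contain the query. The sketch below only builds such a rule as a string. The class name `searchable`, the attribute name `data-index` and the function `search_rule` are my assumptions for illustration, not necessarily what the post uses. It also shows why this is substring matching rather than full-text search.

```python
def search_rule(query: str, item_class: str = "searchable",
                attr: str = "data-index") -> str:
    """Return a CSS rule that hides items whose index attribute lacks the query.

    The class and attribute names are illustrative assumptions.
    """
    needle = query.strip().lower()
    if not needle:
        return ""  # empty query: emit no rule, so nothing is hidden
    # Escape backslashes and double quotes so the value stays a valid CSS string.
    escaped = needle.replace("\\", "\\\\").replace('"', '\\"')
    # [attr*="value"] matches when the attribute contains the substring.
    return f'.{item_class}:not([{attr}*="{escaped}"]) {{ display: none; }}'


if __name__ == "__main__":
    print(search_rule("Topic Maps"))
```

A page would write the returned rule into a style element each time the search box changes. There is no tokenizing, stemming or ranking, only "contains this string."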

Comments Off

September 27, 2012

Couchbase and Full-text Search: The Couchbase Transport for Elastic Search

Filed under: Couchbase,ElasticSearch,Full-Text Search,Searching — Patrick Durusau @ 3:36 pm

Couchbase and Full-text Search: The Couchbase Transport for Elastic Search

From the post:

Couchbase Server 2.0 adds powerful indexing and querying capabilities through its distributed map reduce implementation. But in addition to that many applications, particularly content applications also need full-text search capabilities. Today we are releasing a developer preview of the Couchbase Transport Plugin for Elastic Search. This plugin uses the new Cross Data Center Replication functionality which will be a part of Couchbase Server 2.0. Using this new transport, you can get started with Couchbase and ElasticSearch easily. This blog explains how you can have this integration up and running in minutes.

There goes the weekend! Already!

Comments Off

July 25, 2012

Using MySQL Full-Text Search in Entity Framework

Filed under: Full-Text Search,MySQL,Searching,Text Mining — Patrick Durusau @ 6:14 pm

Using MySQL Full-Text Search in Entity Framework

Another database/text search post not for the faint of heart.

MySQL database supports an advanced functionality of full-text search (FTS) and full-text indexing described comprehensively in the documentation:

Full-Text Search Functions (MySQL 5.5 stable release)

Full-Text Search Functions (MySQL 5.6 development release)

We decided to meet the needs of our users willing to take advantage of the full-text search in Entity Framework and implemented the full-text search functionality in our Devart dotConnect for MySQL ADO.NET Entity Framework provider.

Hard to say why Beyond Search picked up the Oracle post but left the MySQL one hanging.

I haven’t gone out and counted noses but I suspect there are a lot more installs of MySQL than Oracle 11g. Just my guess. Don’t buy or sell stock based on my guesses.

Comments Off

Using Oracle Full-Text Search in Entity Framework

Filed under: Full-Text Search,Oracle,Searching,Text Mining — Patrick Durusau @ 4:05 pm

Using Oracle Full-Text Search in Entity Framework

From the post:

Oracle database supports an advanced functionality of full-text search (FTS) called Oracle Text, which is described comprehensively in the documentation:

Oracle® Text Application Developer’s Guide 11g

Oracle® Text Reference 11g

We decided to meet the needs of our users willing to take advantage of the full-text search in Entity Framework and implemented the basic Oracle Text functionality in our Devart dotConnect for Oracle ADO.NET Entity Framework provider.

Just in case you run across a client using Oracle to store text data.

I first saw this at Beyond Search. (As Stephen implies, it is not a resource for casual data miners.)

Comments Off

December 4, 2010

Python Text Processing with NLTK Cookbook

Filed under: Clustering,Co-Words,Corpus Linguistics,Data Mining,Full-Text Search,Linguistic Metadata,Natural Language Processing,Text Analytics — Patrick Durusau @ 7:01 pm

Python Text Processing with NLTK Cookbook by Jacob Perkins.

Contents:

Chapter 1: Tokenizing Text and WordNet Basics

Chapter 2: Replacing and Correcting Words

Chapter 3: Creating Custom Corpora

Chapter 4: Part-of-Speech Tagging

Chapter 5: Extracting Chunks

Chapter 6: Transforming Chunks and Trees

Chapter 7: Text Classification

Chapter 8: Distributed Processing and Handling Large Datasets

Chapter 9: Parsing Specific Data

Appendix: Penn Treebank Part-of-Speech Tags

Index

A sample chapter, Chapter 3: Creating Custom Corpora, is available for downloading.

Please post a link to your review of this work.

Even better, send me a copy and I will post a review. (I’m listed on Amazon.)

Comments (4)

Zoie: Real-time search indexing

Filed under: Full-Text Search,Indexing,Lucene,Search Engines,Software — Patrick Durusau @ 10:04 am

Zoie: Real-time search indexing

Somehow appropriate that following the lead on Kafka would lead me to Zoie (and other goodies to be reported).

From the website:

Zoie is a real-time search and indexing system built on Apache Lucene.

Donated by LinkedIn.com on July 19, 2008, and has been deployed in a real-time large-scale consumer website: LinkedIn.com handling millions of searches as well as hundreds of thousands of updates daily.

News: Zoie 2.0.0 is released … – Compatible with Lucene 2.9.x.

In a real-time search/indexing system, a document is made available as soon as it is added to the index. This functionality is especially important to time-sensitive information such as news, job openings, tweets etc.
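A toy illustration of "available as soon as it is added," for readers who want something concrete. This is a minimal in-memory inverted index in Python, not Zoie's API or design. The class `TinyRealtimeIndex` and its methods are hypothetical names, and it ignores persistence, concurrency and index fragmentation, which are the hard parts.

```python
from __future__ import annotations

from collections import defaultdict


class TinyRealtimeIndex:
    """Toy inverted index: a document is searchable as soon as add() returns."""

    def __init__(self) -> None:
        self._postings = defaultdict(set)  # term -> set of document ids
        self._docs = {}                    # document id -> original text

    def add(self, doc_id: str, text: str) -> None:
        """Index a document; re-adding an existing id is treated as an update."""
        self.delete(doc_id)
        self._docs[doc_id] = text
        for term in text.lower().split():
            self._postings[term].add(doc_id)

    def delete(self, doc_id: str) -> None:
        """Remove a document and its postings; unknown ids are ignored."""
        text = self._docs.pop(doc_id, None)
        if text is None:
            return
        for term in text.lower().split():
            self._postings[term].discard(doc_id)

    def search(self, term: str) -> set[str]:
        """Return the ids of documents containing the (case-folded) term."""
        return set(self._postings.get(term.lower(), ()))


if __name__ == "__main__":
    index = TinyRealtimeIndex()
    index.add("job-1", "Search engineer opening")
    print(index.search("engineer"))  # visible immediately after add()
```

Because the postings are updated inside `add()`, there is no batch or commit step between adding a document and finding it.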

Design Goals:

Additions of documents must be made available to searchers immediately
Indexing must not affect search performance
Additions of documents must not fragment the index (which hurts search performance)
Deletes and/or updates of documents must not affect search performance.

In topic map terms:

Additions to topic map must be made available to searchers immediately
Indexing must not affect search performance
Additions to topic map must not fragment the index (which hurts search performance)
Deletes and/or updates of a topic map must not affect search performance.

I would say that #’s 3 and 4 are research questions at this point.

Additions, updates and deletions in a topic map may have unforeseen (unforeseeable?) consequences.

Such as causing:

merging to occur
merging to be undone
roles to be played
roles to not be played
association to be valid
association to be invalid

to name only a few.

It may be possible to formally prove the impact that certain events will have but I am not aware of any definitive analysis on the subject.

Suggestions?

Comments Off

November 25, 2010

A Node Indexing Scheme for Web Entity Retrieval

Filed under: Entity Resolution,Full-Text Search,Indexing,Lucene,RDF,Topic Maps — Patrick Durusau @ 6:15 am

A Node Indexing Scheme for Web Entity Retrieval

Author(s): Renaud Delbru, Nickolai Toupikov, Michele Catasta, Giovanni Tummarello

Keywords: entity, entity search, full-text search, semi-structured queries, top-k query, node indexing, incremental index updates, entity retrieval system, RDF, RDFa, Microformats

Abstract:

Now motivated also by the partial support of major search engines, hundreds of millions of documents are being published on the web embedding semi-structured data in RDF, RDFa and Microformats. This scenario calls for novel information search systems which provide effective means of retrieving relevant semi-structured information. In this paper, we present an “entity retrieval system” designed to provide entity search capabilities over datasets as large as the entire Web of Data. Our system supports full-text search, semi-structural queries and top-k query results while exhibiting a concise index and efficient incremental updates. We advocate the use of a node indexing scheme and show that it offers a good compromise between query expressiveness, query processing time and update complexity in comparison to three other indexing techniques. We then demonstrate how such system can effectively answer queries over 10 billion triples on a single commodity machine.

Consider the requirements for this project:

Support for the multiple formats which are used on the Web of Data;

Support for searching an entity description given its characteristics (entity centric search);

Support for context (provenance) of information: entity descriptions are given in the context of a website or a dataset;

Support for semi-structural queries with full-text search, top-k query results, scalability over shard clusters of commodity machines, efficient caching strategy and incremental index maintenance.

(emphasis added)

SIREn { Semantic Information Retrieval Engine }

Definitely a package to download, install and start to evaluate. More comments forthcoming.

Questions (more for topic map researchers)

To what extent can “entity description” = properties of topics, associations, occurrences?
Can XTM, etc., be regarded as “microformats” for the purposes of SIREn?
To what extent does SIREn meet or exceed query requirements for XTM/TMDM based topic maps?
Reports on use of SIREn by topic mappers?

Comments (1)

September 29, 2010

LingPipe

Filed under: Classification,Clustering,Entity Extraction,Full-Text Search,Searching — Patrick Durusau @ 7:06 am

LingPipe.

The tutorial listing for LingPipe is the best summary of its capabilities.

Its sandbox is another “must see” location.

There may be better introductions to linguistic processing but I haven’t seen them.

Comments (3)

May 3, 2010

Search User Interfaces: Chapter 1 (Part 1)

Filed under: Full-Text Search,Information Retrieval,Search Engines,Search Interface,Searching — Patrick Durusau @ 7:59 pm

Chapter 1, The Design of Search User Interfaces of Hearst’s Search User Interfaces, surveys searching and related issues from a user interface perspective.

I needed the reminders about the need for simplicity in search interfaces and the shift in search interface design. (sections 1.1 – 1.2) If you think you have a “simple” interface for your topic map, read those two sections. Then read them again.

The discussion of design principles for user interfaces (sections 1.3 – 1.4) is a good overview, and a good contrast between user-centered design and design by developers who decide what users need. (Which one did you use?)

Feedback from search interfaces (section 1.5) ranges from the use of two dimensional representation of items as icons (against) to highlighting query terms, sorting and query term suggestions (generally favorable).

Let’s work towards having interfaces that are as attractive to users as our topic map applications are good at semantic integration.

Comments Off

April 12, 2010

Topic Maps and the “Vocabulary Problem”

Filed under: Full-Text Search,Heterogeneous Data,Information Retrieval,Search Engines,Searching,Semantic Diversity,Vocabulary Mismatch — Patrick Durusau @ 3:09 pm

To situate topic maps in a traditional area of IR (information retrieval), try the “vocabulary problem.”

Furnas describes the “vocabulary problem” as follows:

Many functions of most large systems depend on users typing in the right words. New or intermittent users often use the wrong words and fail to get the actions or information they want. This is the vocabulary problem. It is a troublesome impediment in computer interactions both simple (file access and command entry) and complex (database query and natural language dialog).

In what follows we report evidence on the extent of the vocabulary problem, and propose both a diagnosis and a cure. The fundamental observation is that people use a surprisingly great variety of words to refer to the same thing. In fact, the data show that no single access word, however well chosen, can be expected to cover more than a small proportion of user’s attempts. Designers have almost always underestimated the problem and, by assigning far too few alternate entries to databases or services, created an unnecessary barrier to effective use. Simulations and direct experimental tests of several alternative solutions show that rich, probabilistically weighted indexes or alias lists can improve success rates by factors of three to five.

The Vocabulary Problem in Human-System Communication (1987)

Substitute topic maps for probabilistically weighted indexes or alias lists. (Techniques we are going to talk about in connection with topic maps authoring.)

Three to five times greater success is an incentive to use topic maps.
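A small sketch of the alias-list idea, with made-up data. The words, the targets and the function names (`build_lookup`, `resolve`, `success_rate`) are all hypothetical, and the aliases are unweighted, so this is simpler than the probabilistically weighted indexes Furnas describes.

```python
# Hypothetical alias lists: each target is reachable by several words.
ALIASES = {
    "delete": {"delete", "remove", "erase", "kill"},
    "copy": {"copy", "duplicate", "clone"},
}


def build_lookup(aliases):
    """Invert target -> words into word -> set of targets."""
    lookup = {}
    for target, words in aliases.items():
        for word in words:
            lookup.setdefault(word.lower(), set()).add(target)
    return lookup


def resolve(word, lookup):
    """Return every target the user's word may refer to (empty set if none)."""
    return lookup.get(word.lower(), set())


def success_rate(attempts, lookup):
    """Fraction of user attempts that reach at least one target."""
    hits = sum(1 for word in attempts if resolve(word, lookup))
    return hits / len(attempts)


if __name__ == "__main__":
    attempts = ["remove", "erase", "delete", "zap"]  # invented user input
    single_word = build_lookup({"delete": {"delete"}})
    with_aliases = build_lookup(ALIASES)
    print(success_rate(attempts, single_word))   # one name per target
    print(success_rate(attempts, with_aliases))  # several names per target
```

With this invented input, the single-name lookup answers one attempt in four and the alias lookup answers three in four. The numbers are only as good as the made-up data, but the mechanism is the one the abstract describes.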

Marketing Department Summary

Customers can’t buy what they can’t find. Topic Maps help customers find purchases, which increases sales. (Be sure to track pre and post topic maps sales results, so marketing can’t successfully claim the increases are due to their efforts.)

Comments Off

April 5, 2010

Are You Designing a 10% Solution?

Filed under: Full-Text Search,Heterogeneous Data,Recall,Search Engines — Patrick Durusau @ 8:28 pm

The most common feature on webpages is the search box. It is supposed to help readers find information, products, services; in other words, help the reader or your cash flow.

How effective is text searching? How often will your reader use the same word as your content authors for some object, product, service? Survey says: 10 to 20%!*

So the next time you insert a search box on a webpage, you or your client may be missing 80 to 90% of the potential readers or customers. Ouch!

Unlike the imaginary world of universal and unique identifiers, the odds of users choosing the same words have been established by actual research.

The data sets were:

verbs used to describe text-editing operations
descriptions of common objects, similar to PASSWORD ™ game
superordinate category names for swap-and-sale listings
main-course cooking recipes

There are a number of interesting aspects to the study that I will cover in future posts but the article offers the following assessment of text searching:

We found that random pairs of people use the same word for an object only 10 to 20 percent of the time.
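One way to see how a figure that low can arise: if each person independently picks word i for an object with probability p_i, the chance that two people pick the same word is the sum of the squared probabilities. The sketch below computes that sum. The distribution in it is invented for illustration and is not data from the study, and `agreement_probability` is a name I made up.

```python
def agreement_probability(word_probs):
    """Chance that two independent people choose the same word.

    word_probs holds the probability of each possible word and must sum to 1.
    """
    if abs(sum(word_probs) - 1.0) > 1e-9:
        raise ValueError("word probabilities must sum to 1")
    # Both people pick word i with probability p_i * p_i; sum over all words.
    return sum(p * p for p in word_probs)


if __name__ == "__main__":
    # Invented distribution: one fairly popular word and a long tail.
    probs = [0.30, 0.20, 0.15, 0.10, 0.10, 0.05, 0.05, 0.05]
    print(round(agreement_probability(probs), 4))
```

For this invented distribution the result is 0.18. Even when one word is chosen by 30% of people, two people agree less than one time in five.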

This research is relevant to all information retrieval systems: online stores, library catalogs, whether you are searching simple text, RDF or even topic maps. Ask yourself or your users: Is a 10% success rate really enough?

(There are ways to improve that 10% score. More on those to follow.)

*Furnas, G. W., Landauer, T. K., Gomez, L. M., Dumais, S. T., (1983) “Statistical semantics: Analysis of the potential performance of keyword information access systems.” Bell System Technical Journal, 62, 1753-1806. Reprinted in: Thomas, J.C., and Schneider, M.L, eds. (1984) Human Factors in Computer Systems. Norwood, New Jersey: Ablex Publishing Corp., 187-242.

Comments (3)

March 23, 2010

Full-Text Search “Logic”

Filed under: Full-Text Search — Patrick Durusau @ 6:38 pm

We justify full-text searching because users are unable to find a subject in an index.

Let’s see:

Problem

Users don’t know what terms an indexer used for a subject in an index.

Solution

Users search full-text not knowing what terms hundreds if not thousands of people used for a subject.

*****

It may just be me but that sounds like the problem went from bad to worse.

There may be two separate but related saving graces to full-text searching:

As I pointed out in Is 00.7% of Relevant Documents Enough? a user may get lucky and guess a popular term or terms for some subject.
It is very unlikely that any user will enter a full-text search query and get no results.

Some of the questions raised: Is a somewhat useful result more important than a better result? How to measure the distance between the two? How much effort is acceptable to users to obtain a better result?

If you know of any research along those lines please let me know about it.

My suspicion is that the gap between actual and user estimates of retrieval (Size Really Does Matter…) says something very fundamental about users. Something we need to account for in search engines and interfaces.

Comments Off