Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 6, 2013

Lire

Filed under: Image Processing,Image Recognition,Searching — Patrick Durusau @ 6:29 pm

Lire

From the webpage:

LIRE (Lucene Image Retrieval) is an open source library for content-based image retrieval, which means you can search for images that look similar. Besides providing multiple common and state-of-the-art retrieval mechanisms, LIRE allows for easy use on multiple platforms. LIRE is actively used for research, teaching and commercial applications. Due to its modular nature it can be used on process level (e.g. index images and search) as well as on image feature level. Developers and researchers can easily extend and modify LIRE to adapt it to their needs.

The developer wiki & blog are currently hosted on http://www.semanticmetadata.net

An online demo can be found at http://demo-itec.uni-klu.ac.at/liredemo/
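
At the process level, indexing and searching are only a few calls. A minimal Java sketch following the CEDD builder/searcher pattern from the LIRE examples (class and method names vary between LIRE versions, so treat this as an outline rather than code for any particular release):

import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;
import net.semanticmetadata.lire.DocumentBuilder;
import net.semanticmetadata.lire.DocumentBuilderFactory;
import net.semanticmetadata.lire.ImageSearcher;
import net.semanticmetadata.lire.ImageSearcherFactory;

public class LireSketch {
    public static void main(String[] args) throws Exception {
        // Index side: extract a CEDD colour/texture descriptor per image,
        // then hand the resulting Lucene Document to an IndexWriter.
        DocumentBuilder builder = DocumentBuilderFactory.getCEDDDocumentBuilder();
        BufferedImage img = ImageIO.read(new File("photos/0001.jpg"));
        // writer.addDocument(builder.createDocument(img, "photos/0001.jpg"));

        // Search side: rank indexed images by descriptor distance to a query image.
        BufferedImage query = ImageIO.read(new File("query.jpg"));
        ImageSearcher searcher = ImageSearcherFactory.createCEDDImageSearcher(10);
        // ImageSearchHits hits = searcher.search(query, indexReader);
        // hits.doc(i) and hits.score(i) walk the ten best matches.
    }
}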

Lire will be useful if you start collecting images of surveillance cameras or cars going into or out of known alphabet agency parking lots.

August 1, 2013

Open Source Search FTW!

Filed under: Lucene,Search Engines,Searching,Solr — Patrick Durusau @ 4:40 pm

Open Source Search FTW! by Grant Ingersoll.

Abstract:

Apache Lucene and Solr are the most widely deployed search technology on the planet, powering sites like Twitter, Wikipedia, Zappos and countless applications across a large array of domains. They are also free, open source, extensible and extremely scalable. Lucene and Solr also contain a large number of features for solving common information retrieval problems ranging from pluggable posting list compression and scoring algorithms to faceting and spell checking. Increasingly, Lucene and Solr also are being (ab)used to power applications going way beyond the search box. In this talk, we’ll explore the features and capabilities of Lucene and Solr 4.x, as well as look at how to (ab)use your search engine technology for fun and profit.

If you aren’t already studying search engines, perhaps these slides will convince you to do so.

When you think about it, search precedes all other computer processing.

July 30, 2013

Lucene 4 Performance Tuning

Filed under: Indexing,Lucene,Performance,Searching — Patrick Durusau @ 6:47 pm

From the description:

Apache Lucene has undergone a major overhaul, influencing many of its key characteristics dramatically. New features and modifications allow for new as well as fundamentally different ways of tuning the engine for best performance.

Tuning performance is essential for almost every Lucene-based application these days; search and performance are almost synonyms. Knowing the details of the underlying software provides the basic tools to get the best out of your application. Knowing the limitations can save you and your company a massive amount of time and money. This talk tries to explain the design decisions made in Lucene 4 compared to older versions and provides technical details on how those implementations and design decisions can help to improve the performance of your application. The talk will mainly focus on core features like:

  • Realtime & batch indexing
  • Filter and query performance
  • Highlighting
  • Custom scoring

The talk will contain a lot of technical details that require a basic understanding of Lucene, data structures and algorithms. You don’t need to be an expert to attend, but be prepared for some deep dives into Lucene. Attendees don’t need to be direct Lucene users; the fundamentals provided in this talk are also essential for Apache Solr or ElasticSearch users.

If you want to catch some of the highlights of Lucene 4, this is the presentation for you!

It will be hard not to go dig deeper in a number of areas.

The new codec features were particularly impressive!

July 24, 2013

Improve search relevancy…

Filed under: Relevance,Searching,Solr — Patrick Durusau @ 4:05 pm

Improve search relevancy by telling Solr exactly what you want by Doug Turnbull.

From the post:

To be successful, (e)dismax relies on avoiding a tricky problem with its scoring strategy. As we’ve discussed, dismax scores documents by taking the maximum score of all the fields that match a query. This is problematic as one field’s scores can’t easily be related to another’s. A good “text” match might have a score of 2, while a bad “title” score might be 10. Dismax doesn’t have a notion that “10” is bad for title, it only knows 10 > 2, so title matches dominate the final search results.

The best case for dismax is that there’s only one field that matches a query, so the resulting scoring reflects the consistency within that field. In short, dismax thrives with needle-in-a-haystack problems and does poorly with hay-in-a-haystack problems.

We need a different strategy for documents that have fields with a large amount of overlap. We’re trying to tell the difference between very similar pieces of hay. The task is similar to needing to find a good candidate for a job. If we wanted to query a search index of job candidates for “Solr Java Developer”, we would clearly match many different sections of our candidates’ resumes. Because of problems with dismax, we may end up with search results heavily sorted on the “objective” field.

(…)
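
Doug’s point is easy to check against the dismax formula itself: a document’s score is its maximum field score plus a “tie” fraction of the remaining field scores, so one field’s inflated scale dominates everything else. A small sketch (the formula follows Lucene’s DisjunctionMaxQuery; the scores are invented for illustration):

public class DismaxScore {
    // Lucene's DisjunctionMaxQuery: max(field scores) + tie * (sum of the others).
    static double dismax(double tie, double... fieldScores) {
        double max = 0, sum = 0;
        for (double s : fieldScores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tie * (sum - max);
    }

    public static void main(String[] args) {
        // A "good" text match scoring 2 loses to a "bad" title match scoring 10,
        // because dismax only knows 10 > 2, not that 10 is bad for title.
        System.out.println(dismax(0.1, 2.0, 10.0)); // 10.2
        System.out.println(dismax(0.1, 2.0));       // 2.0
    }
}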

Not unlike my comments yesterday about the similarity of searching and playing the lottery. The more you invest in the search, the more likely you are to get good results.

Doug analyzes what criteria data should meet in order to be a “good” result.

For a topic map, I would analyze what data a subject needs in order to be found by a typical request.

Both address the same problem, search, but from very different perspectives.

Exploring ElasticSearch

Filed under: ElasticSearch,Search Engines,Searching — Patrick Durusau @ 3:39 pm

Exploring ElasticSearch: A human-friendly tutorial for ElasticSearch, by Andrew Cholakian.

An incomplete tutorial on ElasticSearch.

However, unlike printed (dead tree) and PDF (dead electrons) tutorials, you can suggest additional topics, and I suspect that useful comments would be appreciated as well.

A “live” tutorial on popular software like ElasticSearch, that follows the software as it develops, could prove to be almost as popular as the software itself.

July 14, 2013

Looking ahead [Exploratory Merging?]

Filed under: Interface Research/Design,Merging,Searching — Patrick Durusau @ 6:31 pm

Looking ahead by Gene Golovchinsky.

From the post:

It is reasonably well-known that people who examine search results often don’t go past the first few hits, perhaps stopping at the “fold” or at the end of the first page. It’s a habit we’ve acquired due to high-quality results for precision-oriented information needs. Google has trained us well.

But this habit may not always be useful when confronted with uncommon, recall-oriented, information needs. That is, when doing research. Looking only at the top few documents places too much trust in the ranking algorithm. In our SIGIR 2013 paper, we investigated what happens when a light-weight preview mechanism gives searchers a glimpse at the distribution of documents — new, re-retrieved but not seen, and seen — in the query they are about to execute.

The preview divides the top 100 documents retrieved by a query into 10 bins, and builds a stacked bar chart that represents the three categories of documents. Each category is represented by a color. New documents are shown in teal, re-retrieved ones in the light blue shade, and documents the searcher has already seen in dark blue. The figures below show some examples:

(…)
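
The binning itself is simple to reproduce. A toy reconstruction (not the authors’ code) of the counts behind one stacked bar chart:

import java.util.List;

public class QueryPreview {
    enum Status { NEW, RETRIEVED_NOT_SEEN, SEEN }

    // Split the top 100 ranked documents into 10 bins of 10 and count each
    // category per bin; counts[bin][category] drives one stack of the chart.
    static int[][] bin(List<Status> top100) {
        int[][] counts = new int[10][3];
        for (int rank = 0; rank < top100.size() && rank < 100; rank++) {
            counts[rank / 10][top100.get(rank).ordinal()]++;
        }
        return counts;
    }
}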

The blog post is great but you really need to read the SIGIR paper in full.

Speaking of exploratory searching, is anyone working on exploratory merging?

That is, where a query containing a statement of synonymy or polysemy from a searcher results in exploratory merging of topics?

I am assuming that experts in a particular domain will see merging opportunities that eluded automatic processes.

Seems like a shame to waste their expertise, which could be captured to improve a topic map for future users.


The SIGIR paper:

Looking Ahead: Query Preview in Exploratory Search

Abstract:

Exploratory search is a complex, iterative information seeking activity that involves running multiple queries, finding and examining many documents. We introduced a query preview interface that visualizes the distribution of newly-retrieved and re-retrieved documents prior to showing the detailed query results. When evaluating the preview control with a control condition, we found effects on both people’s information seeking behavior and improved retrieval performance. People spent more time formulating a query and were more likely to explore search results more deeply, retrieved a more diverse set of documents, and found more different relevant documents when using the preview. With more time spent on query formulation, higher quality queries were produced and as a consequence the retrieval results improved; both average residual precision and recall were higher with the query preview present.

Sherlock’s Last Case

Filed under: Erlang,Searching,Similarity — Patrick Durusau @ 1:03 pm

Sherlock’s Last Case by Joe Armstrong.

Joe states the Sherlock problem as: given one X and millions of Yi’s, “Which Yi is nearer to X?”

For some measure of “nearer,” or as we prefer, similarity.

One solution is given in Programming Erlang: Software for a Concurrent World, 2nd ed., 2013, by Joe Armstrong.

Joe describes two possibly better solutions in this lecture.

Great lecture even if he omits a fundamental weakness in TF-IDF.

From the Wikipedia entry:

Suppose we have a set of English text documents and wish to determine which document is most relevant to the query “the brown cow”. A simple way to start out is by eliminating documents that do not contain all three words “the”, “brown”, and “cow”, but this still leaves many documents. To further distinguish them, we might count the number of times each term occurs in each document and sum them all together; the number of times a term occurs in a document is called its term frequency.

However, because the term “the” is so common, this will tend to incorrectly emphasize documents which happen to use the word “the” more frequently, without giving enough weight to the more meaningful terms “brown” and “cow”. The term “the” is not a good keyword to distinguish relevant and non-relevant documents and terms, unlike the less common words “brown” and “cow”. Hence an inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.

For example, TF-IDF would not find a document with “the brown heifer,” for a query of “the brown cow.”

TF-IDF does not account for relationships between terms, such as synonymy or polysemy.
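
The blind spot shows up in even a toy implementation. In the sketch below, which uses the textbook tf * log(N/df) weighting rather than Lucene’s exact formula, “cow” and “heifer” occupy separate dimensions with nothing connecting them, while “the” earns an idf of zero:

import java.util.*;

public class TfIdf {
    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            List.of("the", "brown", "cow"),
            List.of("the", "brown", "heifer"),
            List.of("the", "red", "hen"));
        // Document frequency: how many documents contain each term.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> d : docs)
            for (String t : new HashSet<>(d))
                df.merge(t, 1, Integer::sum);
        // "the" is in every document, so idf = log(3/3) = 0 and it counts for
        // nothing; "cow" and "heifer" get equal weight but remain unrelated
        // dimensions, so no synonymy is captured.
        for (List<String> d : docs) {
            for (String t : d) {
                double tf = Collections.frequency(d, t);
                System.out.printf("%s=%.2f ", t, tf * Math.log((double) docs.size() / df.get(t)));
            }
            System.out.println();
        }
    }
}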

Juan Ramos states as much in describing the limitations of TF-IDF in: Using TF-IDF to Determine Word Relevance in Document Queries:

Despite its strength, TF-IDF has its limitations. In terms of synonyms, notice that TF-IDF does not make the jump to the relationship between words. Going back to (Berger & Lafferty, 1999), if the user wanted to find information about, say, the word ‘priest’, TF-IDF would not consider documents that might be relevant to the query but instead use the word ‘reverend’. In our own experiment, TF-IDF could not equate the word ‘drug’ with its plural ‘drugs’, categorizing each instead as separate words and slightly decreasing the word’s wd value. For large document collections, this could present an escalating problem.

Ramos cites Information Retrieval as Statistical Translation by Adam Berger and John Lafferty to support his comments on synonymy or polysemy.

Berger and Lafferty treat synonymy and polysemy, issues that TF-IDF misses, as statistical translation issues:

Ultimately, document retrieval systems must be sophisticated enough to handle polysemy and synonymy: to know, for instance, that “pontiff” and “pope” are related terms. The field of statistical translation concerns itself with how to mine large text databases to automatically discover such semantic relations. Brown et al. [3, 4] showed, for instance, how a system can learn to associate French terms with their English translations, given only a collection of bilingual French-English sentences. We shall demonstrate how, in a similar fashion, an IR system can, from a collection of documents, automatically learn which terms are related and exploit these relations to better find and rank the documents it returns to the user.

Merging powered by the results of statistical translation?

The Berger and Lafferty paper is more than a decade old so I will be running the research forward.

July 8, 2013

100 Search Engines For Academic Research

Filed under: Search Engines,Searching — Patrick Durusau @ 7:43 pm

100 Search Engines For Academic Research

From the post:

Back in 2010, we shared with you 100 awesome search engines and research resources in our post: 100 Time-Saving Search Engines for Serious Scholars. It’s been an incredible resource, but now, it’s time for an update. Some services have moved on, others have been created, and we’ve found some new discoveries, too. Many of our original 100 are still going strong, but we’ve updated where necessary and added some of our new favorites, too. Check out our new, up-to-date collection to discover the very best search engine for finding the academic results you’re looking for.

(…)

When I saw the title for this post I assumed it was source code for search engines. 😉

Not so!

But don’t despair!

Consider all of them as possible comparisons for your topic map interface.

Or should I say the results delivered by your topic map interface?

Some are better than others but I am sure you can do better with a curated topic map.

Detecting Semantic Overlap and Discovering Precedents…

Detecting Semantic Overlap and Discovering Precedents in the Biodiversity Research Literature by Graeme Hirst, Nadia Talent, and Sara Scharf.

Abstract:

Scientific literature on biodiversity is longevous, but even when legacy publications are available online, researchers often fail to search it adequately or effectively for prior publications; consequently, new research may replicate, or fail to adequately take into account, previously published research. The mechanisms of the Semantic Web and methods developed in contemporary research in natural language processing could be used, in the near-term future, as the basis for a precedent-finding system that would take the text of an author’s early draft (or a submitted manuscript) and find potentially related ideas in published work. Methods would include text-similarity metrics that take different terminologies, synonymy, paraphrase, discourse relations, and structure of argumentation into account.

Footnote one (1) of the paper gives an idea of the problem the authors face:

Natural history scientists work in fragmented, highly distributed and parochial communities, each with domain specific requirements and methodologies (Scoble 2008). “Their output is heterogeneous, high volume and typically of low impact, but with a citation half-life that may run into centuries” (Smith et al. 2009). “The cited half-life of publications in taxonomy is longer than in any other scientific discipline, and the decay rate is longer than in any scientific discipline” (Moritz 2005). Unfortunately, we have been unable to identify the study that is the basis for Moritz’s remark.

The paper explores in detail issues that have daunted various search techniques, when the material is available in electronic format at all.

The authors make a general proposal for addressing these issues, with mention of the Semantic Web, but describe what they omit from their plan:

The other omission is semantic interpretation into a logical form, represented in XML, that draws on ontologies in the style of the original Berners-Lee, Hendler, and Lassila (2001) proposal for the Semantic Web. The problem with logical-form representation is that it implies a degree of precision in meaning that is not appropriate for the kind of matching we are proposing here. This is not to say that logical forms would be useless. On the contrary, they are employed by some approaches to paraphrase and textual entailment (section 4.1 above) and hence might appear in the system if only for that reason; but even so, they would form only one component of a broader and somewhat looser kind of semantic representation.

That’s the problem with the Semantic Web in a nutshell:

The problem with logical-form representation is that it implies a degree of precision in meaning that is not appropriate for the kind of matching we are proposing here.

What if I want to be logically precise sometimes but not others?

What if I want to be more precise in some places and less precise in others?

What if I want to have different degrees or types of imprecision?

With topic maps the question is: How im/precise do you want to be?

July 7, 2013

Mini Search Engine…

Filed under: Graphs,Neo4j,Search Engines,Searching — Patrick Durusau @ 1:13 pm

Mini Search Engine – Just the basics, using Neo4j, Crawler4j, Graphstream and Encog by Brian Du Preez.

From the post:

Continuing to chapter 4 of Programming Collective Intelligence (PCI), which is implementing a search engine.

I may have bitten off a little more than I should have in one exercise. Instead of using the normal relational database construct as used in the book, I figured, I always wanted to have a look at Neo4j, so now was the time. Just to say, this isn’t necessarily the ideal use case for a graph db, but how hard could it be to kill three birds with one stone?

Working through the tutorials trying to reset my SQL Server, Oracle mindset took a little longer than expected, but thankfully there are some great resources around Neo4j.

Just a couple:

  • neo4j – learn
  • Graph theory for busy developers
  • Graphdatabases

Since I just wanted to run this as a little exercise, I decided to go for an in-memory implementation and not run it as a service on my machine. In hindsight this was probably a mistake, as the tools and web interface would have helped me visualise my data graph quicker in the beginning.

The general search space is filled by major contenders.

But that leaves open opportunities for domain specific search services.

Law and medicine have specialized search engines. What commercially viable areas are missing them?

June 28, 2013

Apache Nutch v1.7 Released

Filed under: Nutch,Searching — Patrick Durusau @ 2:12 pm

Apache Nutch v1.7 Released

Main new feature is a pluggable indexing architecture that supports both Apache Solr and ElasticSearch.

Enjoy!

June 25, 2013

Apache Solr Reference Guide (Solr v4.3)

Filed under: Searching,Solr — Patrick Durusau @ 5:31 pm

Apache Solr Reference Guide (Solr v4.3) by Cassandra Targett.

From the TOC page:

Getting Started: This section guides you through the installation and setup of Solr.

Using the Solr Administration User Interface: This section introduces the Solr Web-based user interface. From your browser you can view configuration files, submit queries, view logfile settings and Java environment settings, and monitor and control distributed configurations.

Documents, Fields, and Schema Design: This section describes how Solr organizes its data for indexing. It explains how a Solr schema defines the fields and field types which Solr uses to organize data within the document files it indexes.

Understanding Analyzers, Tokenizers, and Filters: This section explains how Solr prepares text for indexing and searching. Analyzers parse text and produce a stream of tokens, lexical units used for indexing and searching. Tokenizers break field data down into tokens. Filters perform other transformational or selective work on token streams.

Indexing and Basic Data Operations: This section describes the indexing process and basic index operations, such as commit, optimize, and rollback.

Searching: This section presents an overview of the search process in Solr. It describes the main components used in searches, including request handlers, query parsers, and response writers. It lists the query parameters that can be passed to Solr, and it describes features such as boosting and faceting, which can be used to fine-tune search results.

The Well-Configured Solr Instance: This section discusses performance tuning for Solr. It begins with an overview of the solrconfig.xml file, then tells you how to configure cores with solr.xml, how to configure the Lucene index writer, and more.

Managing Solr: This section discusses important topics for running and monitoring Solr. It describes running Solr in the Apache Tomcat servlet runner and Web server. Other topics include how to back up a Solr instance, and how to run Solr with Java Management Extensions (JMX).

SolrCloud: This section describes the newest and most exciting of Solr’s new features, SolrCloud, which provides comprehensive distributed capabilities.

Legacy Scaling and Distribution: This section tells you how to grow a Solr distribution by dividing a large index into sections called shards, which are then distributed across multiple servers, or by replicating a single index across multiple servers.

Client APIs: This section tells you how to access Solr through various client APIs, including JavaScript, JSON, and Ruby.

Well, I know what I am going to be reading in the immediate future. 😉

Hadoop for Everyone: Inside Cloudera Search

Filed under: Cloudera,Hadoop,Search Engines,Searching — Patrick Durusau @ 12:26 pm

Hadoop for Everyone: Inside Cloudera Search by Eva Andreasson.

From the post:

CDH, Cloudera’s 100% open source distribution of Apache Hadoop and related projects, has successfully enabled Big Data processing for many years. The typical approach is to ingest a large set of a wide variety of data into HDFS or Apache HBase for cost-efficient storage and flexible, scalable processing. Over time, various tools to allow for easier access have emerged — so you can now interact with Hadoop through various programming methods and the very familiar structured query capabilities of SQL.

However, many users with less interest in programmatic interaction have been shut out of the value that Hadoop creates from Big Data. And teams trying to achieve more innovative processing struggle with a time-efficient way to interact with, and explore, the data in Hadoop or HBase.

Helping these users find the data they need without the need for Java, SQL, or scripting languages inspired integrating full-text search functionality, via Cloudera Search (currently in beta), with the powerful processing platform of CDH. The idea of using search on the same platform as other workloads is the key — you no longer have to move data around to satisfy your business needs, as data and indices are stored in the same scalable and cost-efficient platform. You can also not only find what you are looking for, but within the same infrastructure actually “do” things with your data. Cloudera Search brings simplicity and efficiency for large and growing data sets that need to enable mission-critical staff, as well as the average user, to find a needle in an unstructured haystack!

As a workload natively integrated with CDH, Cloudera Search benefits from the same security model, access to the same data pool, and cost-efficient storage. In addition, it is added to the services monitored and managed by Cloudera Manager on the cluster, providing a unified production visibility and rich cluster management – a priceless tool for any cluster admin.

In the rest of this post, I’ll describe some of Cloudera Search’s most important features.

You have heard the buzz about Cloudera Search, now get a quick list of facts and pointers to more resources!

The most significant fact?

Cloudera Search uses Apache Solr.

If you are looking for search capabilities, what more need I say?

June 22, 2013

Tips for Tuning Solr Search: No Coding Required [June 25, 2013]

Filed under: Search Engines,Searching,Solr — Patrick Durusau @ 4:40 pm

Tips for Tuning Solr Search: No Coding Required

Date & time: Tuesday, June 25, 2013 01:00 PM EDT
Duration: 60 min
Speakers: Nick Veenhof, Senior Search Engineer, Acquia

Description:

Helping online visitors easily find what they’re looking for is key to a website’s success. In this webinar, you’ll learn how to improve search in ways that don’t require any coding or code changes. We’ll show you easy modifications to tune up the relevancy to more advanced topics, such as altering the display or configuring advanced facets.

Acquia’s Senior Search Engineer, Nick Veenhof, will guide you step by step through improving the search functionality of a website, using an in-house version of an actual conference site.

Some of the search topics we’ll demonstrate include:

  • Clean faceted URLs
  • Adding sliders, checkboxes, sorting and more to your facets
  • Complete customization of your search displays using Display Suite
  • Tuning relevancy by using Solr optimization

This webinar will make use of the Facet API module suite in combination with the Apache Solr Search Integration module suite. We’ll also use some generic modules to improve the search results that are independent of the search technology that is used. All of the examples shown are fully supported by Acquia Search.

I haven’t seen a webinar from Acquia so I am going to take a chance and attend.

Some webinars are pure gold; others, well, extended infomercials at best.

Will be reporting back on the experience!


First complaint: Why the long registration form for the webinar? Phone number? What? Is your marketing department going to pester me into buying your product or service?

If you want to offer a webinar, name and email should be enough. You need to know how many attendees to allow for but more than that is a waste of your time and mine.

June 10, 2013

Advanced Suggest-As-You-Type With Solr

Filed under: Indexing,Searching,Solr — Patrick Durusau @ 10:10 am

Advanced Suggest-As-You-Type With Solr by John Berryman.

From the post:

In my previous post, I talked about implementing Search-As-You-Type using Solr. In this post I’ll cover a closely related functionality called Suggest-As-You-Type.

Here’s the use case: A user comes to your search-driven website to find something. And it is your goal to be as helpful as possible. Part of this is by making term suggestions as they type. When you make these suggestions, it is critical to make sure that your suggestion leads to search results. If you make a suggestion of a word just because it is somewhere in your index, but it is inconsistent with the other terms that the user has typed, then the user is going to get a results page full of white space and you’re going to get another dissatisfied customer!

A lot of search teams jump at the Solr suggester component because, after all, this is what it was built for. However I haven’t found a way to configure the suggester so that it suggests only completions that correspond to search results. Rather, it is based upon a dictionary lookup that is agnostic of what the user is currently searching for. (Please someone tell me if I’m wrong!) In any case, getting the suggester working takes a bit of configuration. Why not use a solution that is based upon the normal, out-of-the-box Solr setup? Here’s how:

(…)

Topic map authoring is what jumps to my mind as a use case for suggest-as-you-type. Particularly if you use fields for particular types of topics, making the suggestions more focused and manageable.

Good for search as well, for the same reasons.
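
The general shape of such a solution is worth sketching. This SolrJ fragment is my own reconstruction, not necessarily John’s recipe (field name and terms invented): run the terms the user has committed to as the query, then facet on the same field with the partial term as a facet.prefix, so every suggestion is guaranteed at least one matching document.

import org.apache.solr.client.solrj.SolrQuery;

public class SuggestSketch {
    // User has typed "solr dev": "solr" is committed, "dev" is still partial.
    static SolrQuery suggestionsFor() {
        SolrQuery q = new SolrQuery("text:solr"); // suggest only within real matches
        q.setRows(0);                   // we want facet counts, not documents
        q.setFacet(true);
        q.addFacetField("text");
        q.setFacetPrefix("dev");        // completions of the partial term
        q.setFacetMinCount(1);          // every suggestion has at least one hit
        return q;
    }
}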

John offers several cautions near the end of his post, but the final one is quite amusing:

Inappropriate Content: Be very cautious about content of the fields being used for suggestions. For instance, if the content has misspellings, so will the suggestions. And don’t include user comments unless you want to endorse their opinions and choice of language as your search suggestions!

I don’t think of auto-suggestions as “endorsing” anything. They are purely mechanical assistance for the user.

If some term or opinion offends a user, they don’t have to choose to follow it.

At least in my view, technology should not be used to build intellectual tombs for users. Intellectual tombs that protect them from thoughts or expressions different from their own.

Search-As-You-Type With Solr

Filed under: Lucene,Searching,Solr — Patrick Durusau @ 9:53 am

Search-As-You-Type With Solr by John Berryman.

From the post:

In my previous post, I talked about implementing Suggest-As-You-Type using Solr. In this post I’ll cover a closely related functionality called Search-As-You-Type.

Several years back, Google introduced an interesting new interface for their search called Search-As-You-Type. Basically, as you type in the search box, the result set is continually updated with better and better search results. By this point, everyone is used to Google’s Search-As-You-Type, but for some reason I have yet to see any of our clients use this interface. So I thought it would be cool to take a stab at this with Solr.

Let’s get started. First things first, download Solr and spin up Solr’s example.

cd solr-4.2.0/example
java -jar start.jar

Next click this link and POOF! you will have the following documents indexed:

  • There’s nothing better than a shiny red apple on hot summer day.
  • Eat an apple!
  • I prefer a Grannie Smith apple over Fuji.
  • Apricots is kinda like a peach minus the fuzz.

(Kinda cool how that link works, isn’t it?) Now let’s work on the strategy. Let’s assume that the user is going to search for “apple”. When the user types “a”, what should we do? In a normal index, there’s a buzillion things that start with “a”, so maybe we should just do nothing. Next, “ap”: depending upon how large your index is, two letters may be a reasonably small enough set to start providing feedback to your users. The goal is to provide Solr with appropriate information so that it continuously comes back with the best results possible.
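
One simple way to act on that strategy is to treat the final, still-being-typed token as a prefix and everything before it as ordinary terms. A minimal sketch (my own, not necessarily the recipe John uses):

public class SearchAsYouType {
    // "ap"      -> "ap*"
    // "apple g" -> "apple g*"
    static String asYouTypeQuery(String typed) {
        String[] tokens = typed.trim().toLowerCase().split("\\s+");
        StringBuilder q = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            q.append(tokens[i]);
            q.append(i == tokens.length - 1 ? "*" : " "); // last token becomes a prefix query
        }
        return q.toString();
    }

    public static void main(String[] args) {
        System.out.println(asYouTypeQuery("ap"));      // ap*
        System.out.println(asYouTypeQuery("apple g")); // apple g*
    }
}

Sent as the q parameter, the trailing wildcard keeps the result list updating with every keystroke.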

Good demonstration that how you form a query makes a large difference in the result you get.

June 6, 2013

Search is Not a Solved Problem

Filed under: Lucene,Searching,Solr — Patrick Durusau @ 1:58 pm

From the description:

The brief idea behind this talk is that search is not a solved problem: there is still a big opportunity for building search (and finding?) capabilities for the kinds of questions that current products fail to solve. For example, why do search engines just return a list of sorted URLs, but give me no information about the themes that are consistent across them?

Hmmm, “…themes that are consistent across them?”

Do you think she means subjects across URLs?

😉

Important point: What people post isn’t the same content that they consume!

June 4, 2013

libsregex

Filed under: Regex,Searching — Patrick Durusau @ 2:22 pm

libsregex by Yichun Zhang.

From the homepage:

libsregex – A non-backtracking regex engine library for large data streams

And see:

Streaming regex matching and substitution by the sregex library by Yichun Zhang.

This looks quite good!

I first saw this at Nat Torkington’s Four short links: 4 June 2013.

June 3, 2013

Searching With Hierarchical Fields Using Solr

Filed under: Searching,Solr — Patrick Durusau @ 3:05 pm

Searching With Hierarchical Fields Using Solr by John Berryman.

From the post:

In our recent and continuing effort to make the world a better place, we have been working with the illustrious Waldo Jaquith on a project called StateDecoded. Basically, we’re making laws easily searchable and accessible by the layperson. Here, check out the predecessor project, Virginia Decoded. StateDecoded will be similar to the predecessor but with extended and matured search capabilities. And instead of just Virginia state law, with StateDecoded any municipality will be able to download the open source project, index their own laws, and give their citizens better visibility into the rules that govern them.

For this post, though, I want to focus upon one of the good Solr riddlers that we encountered related to the hierarchical nature of the documents being indexed. Laws are divided into sections, chapters, and paragraphs and we have documents at every level. In our Solr, this hierarchy is captured in a field labeled “section”. So for instance, here are 3 examples of this section field:

  • <field name="section">30</field> – A document that contains information specific to section 30.
  • <field name="section">30.4</field> – A document that contains information specific to section 30 chapter 4.
  • <field name="section">30.4.15</field> – A document that contains information specific to section 30 chapter 4 paragraph 15.

And our goal for this field is that if anyone searches for a particular section of law, they will be given the law most specific to their request, followed by the laws that are less specific. For instance, if a user searches for “30.4”, then the results should contain the documents for section 30, section 30.4, section 30.4.15, section 30.4.16, etc., and the first result should be for 30.4. Other documents such as 40.4 should not be returned.

(…)
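
One standard trick for this kind of field, though I won’t spoil whether it is the one John lands on: analyze the section number into its dotted prefixes, the token stream a path-hierarchy tokenizer produces. A sketch of the expansion:

import java.util.ArrayList;
import java.util.List;

public class SectionPrefixes {
    // "30.4.15" -> [30, 30.4, 30.4.15]. Indexing and querying through this
    // expansion makes a search for "30.4" match sections 30, 30.4 and
    // 30.4.15 (but never 40.4, which shares no token), and normal
    // field-length scoring puts the most specific match first.
    static List<String> prefixes(String section) {
        List<String> out = new ArrayList<>();
        String[] parts = section.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) sb.append('.');
            sb.append(parts[i]);
            out.add(sb.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(prefixes("30.4.15")); // [30, 30.4, 30.4.15]
    }
}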

Excellent riddler!

I suspect the same issue comes up in other contexts as well.

June 2, 2013

solrconfig.xml: …

Filed under: Search Engines,Searching,Solr — Patrick Durusau @ 11:58 am

solrconfig.xml: Understanding SearchComponents, RequestHandlers and Spellcheckers by Mark Bennett.

From the post:

I spend most of my configuration time in Solr’s schema.xml, but the solrconfig.xml is also a really powerful tool. I wanted to use my recent spellcheck configuration experience to review some aspects of this important file. Sure, solrconfig.xml lets you configure a bunch of mundane sounding things like caching policies and library load paths, but it also has some high-tech configuration “LEGO blocks” that you can mix and match and re-assemble into all kinds of interesting Solr setups.

What is spell checking if it isn’t validation of a name? 😉

If you like knowing the details, this is a great post!

May 27, 2013

Crawl-Anywhere

Filed under: Search Engines,Searching,Solr,Webcrawler — Patrick Durusau @ 1:24 pm

Crawl-Anywhere

From the webpage:

April 2013 – Starting with version 4.0, Crawl-Anywhere becomes an open-source project. The current version is 4.0.0-alpha.

Stable version 3.x is still available at http://www.crawl-anywhere.com/

(…)

Crawl Anywhere is mainly a web crawler. However, Crawl-Anywhere includes all the components needed to build a vertical search engine.

Project home page: http://www.crawl-anywhere.com/

A web crawler is a program that discovers and reads all HTML pages or documents (HTML, PDF, Office, …) on a web site in order, for example, to index these data and build a search engine (like Google). Wikipedia provides a great description of what a Web crawler is: http://en.wikipedia.org/wiki/Web_crawler.
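
The core loop of a crawler is small enough to sketch. A toy breadth-first crawl using the jsoup HTML parser (illustration only, not Crawl-Anywhere code; a real crawler adds robots.txt handling, politeness delays and parsing of non-HTML documents):

import java.util.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TinyCrawler {
    public static void main(String[] args) throws Exception {
        // Frontier of URLs to visit, seeded with a starting page.
        Deque<String> frontier = new ArrayDeque<>(List.of("http://example.com/"));
        Set<String> seen = new HashSet<>(frontier);
        for (int fetched = 0; !frontier.isEmpty() && fetched < 50; fetched++) {
            Document page = Jsoup.connect(frontier.poll()).get();
            // A real crawler would hand page.text() to the indexer here.
            for (Element a : page.select("a[href]")) {
                String next = a.absUrl("href"); // resolve relative links
                if (next.startsWith("http") && seen.add(next)) frontier.add(next);
            }
        }
    }
}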

If you are gathering “very valuable intel” as in Snow Crash, a search engine will help.

Not do the heavy lifting but help.

May 21, 2013

Beyond Enterprise Search…

Filed under: Linked Data,MarkLogic,Searching,Semantic Web — Patrick Durusau @ 2:49 pm

Beyond Enterprise Search… by adamfowleruk.

From the post:

Searching through all your content is fine – until you get a mountain of it with similar content, differentiated only by context. Then you’ll need to understand the meaning within the content. In this post I discuss how to do this using semantic techniques…

Organisations today have realised that for certain applications it is useful to have a consolidated search approach over several catalogues. This is most often the case when customers can interact with several parts of the company – sales, billing, service, delivery, fraud checks.

This approach is commonly called Enterprise Search, or Search and Discovery, where your content across several repositories is indexed in a separate search engine. Typically this indexing occurs some time after the content is added. In addition, it is not possible for a search engine to understand the full capabilities of every content system. This means complex mappings are needed between content, metadata and security. In some cases, this may be retrofitted with custom code as the systems do not support a common vocabulary around these aspects of information management.

Content Search

We are all used to content search, so much so that for today’s teenagers a search bar with a common (‘Google like’) grammar is expected. This simple yet powerful interface allows us to search for content (typically web pages and documents) that contains all the words or phrases that we need. Often this is broadened by the use of a thesaurus and word stemming (plays and played stem to the verb play), and combined with some form of weighting based on relative frequency within each unit of content.

Other techniques are also applied. Metadata is extracted or implied – author, date created, modified, security classification, Dublin Core descriptive data. Classification tools can be used (either at the content store or search indexing stages) to perform entity extraction (Cheese is a food stuff) and enrichment (Sheffield is a place with these geospatial co-ordinates). This provides a greater level of description of the term being searched for over and above simple word terms.

Using these techniques, additional search functionality can be provided. Search for all shops visible on a map using a bounding box, radius or polygon geospatial search. Return only documents where these words are within 6 words of each other. Perhaps weight some terms as more important than others, or optional.

These techniques are provided by many of the Enterprise class search engines out there today. Even Open Source tools like Lucene and Solr are catching up with this. They have provided access to information where before we had to rely on Information and Library Services staff to correctly classify incoming documents manually, as they did back in the paper bound days of yore.

Content search only gets you so far though.

I was amening with the best of them until Adam reached the part about MarkLogic 7 adding Semantic Web capabilities. 😉

I didn’t see any mention of linked data replicating the semantic diversity that currently exists in data stores.

Making data more accessible isn’t going to make it less diverse.

Although making data more accessible may drive the development of ways to manage semantic diversity.

So perhaps there is a useful side to linked data after all.

May 17, 2013

Solr 4, the NoSQL Search Server [Webinar]

Filed under: NoSQL,Searching,Solr — Patrick Durusau @ 4:46 pm

Solr 4, the NoSQL Search Server by Yonik Seeley

Date: Thursday, May 30, 2013
Time: 10:00am Pacific Time

From the description:

The long awaited Solr 4 release brings a large amount of new functionality that blurs the line between search engines and NoSQL databases. Now you can have your cake and search it too with Atomic updates, Versioning and Optimistic Concurrency, Durability, and Real-time Get!

Learn about new Solr NoSQL features and implementation details of how the distributed indexing of Solr Cloud was designed from the ground up to accommodate them.
Featured Presenter:

Yonik Seeley – Creator of Apache Solr and the Chief Open Source Architect and Co-Founder at LucidWorks. Mr. Seeley is an Apache Lucene/Solr PMC member and committer and an expert in distributed search systems architecture and performance. His work experience includes CNET Networks, BEA and Telcordia. He earned his M.S. in Computer Science from Stanford University.

This could be a real treat!

Notes on the webinar to follow.
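
To make “NoSQL features” concrete: an atomic update sends only the changed fields, and optimistic concurrency makes the update fail if the document’s _version_ has changed underneath you. A hedged SolrJ sketch (document id and field names invented):

import java.util.Map;
import org.apache.solr.common.SolrInputDocument;

public class Solr4NoSqlSketch {
    // Build an atomic update: modify two fields without resending the document.
    static SolrInputDocument atomicUpdate(long expectedVersion) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "book-42");
        doc.addField("price", Map.of("set", 9.99));  // "set" replaces the value
        doc.addField("tags", Map.of("add", "sale")); // "add" appends to a multi-valued field
        // Optimistic concurrency: the update is rejected with a conflict if the
        // stored _version_ no longer matches the one we read.
        doc.addField("_version_", expectedVersion);
        return doc;
    }
}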

May 16, 2013

Automated Archival and Visual Analysis of Tweets…

Filed under: Searching,Tweets — Patrick Durusau @ 7:24 pm

Automated Archival and Visual Analysis of Tweets Mentioning #bog13, Bioinformatics, #rstats, and Others by Stephen Turner.

From the post:

Ever since Twitter gamed its own API and killed off great services like IFTTT triggers, I’ve been looking for a way to automatically archive tweets containing certain search terms of interest to me. Twitter’s built-in search is limited, and I wanted to archive interesting tweets for future reference and to start playing around with some basic text / trend analysis.

Enter t – the twitter command-line interface. t is a command-line power tool for doing all sorts of powerful Twitter queries using the command line. See t‘s documentation for examples.

I wrote this script that uses the t utility to search Twitter separately for a set of specified keywords, and append those results to a file. The comments at the end of the script also show you how to commit changes to a git repository, push to GitHub, and automate the entire process to run twice a day with a cron job. Here’s the code as of May 14, 2013:

Stephen promises in his post that the script updates automatically and you may find “unsavory” tweets.

I didn’t, but that may be a matter of happenstance or sensitivity. 😉

May 15, 2013

Keyword Search, Plus a Little Magic

Filed under: Keywords,Search Behavior,Searching — Patrick Durusau @ 3:34 pm

Keyword Search, Plus a Little Magic by Geoffrey Pullum.

From the post:

I promised last week that I would discuss three developments that turned almost-useless language-connected technological capabilities into something seriously useful. The one I want to introduce first was introduced by Google toward the end of the 1990s, and it changed our whole lives, largely eliminating the need for having full sentences parsed and translated into database query language.

The hunch that the founders of Google bet on was that simple keyword search could be made vastly more useful by taking the entire set of pages containing all of the list of search words and not just returning it as the result but rather ranking its members by influentiality and showing the most influential first. What a page contains is not the only relevant thing about it: As with any academic publication, who values it and refers to it is also important. And that is (at least to some extent) revealed in the link structure of the Web.
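
The “influentiality” ranking Geoffrey describes is, of course, PageRank, and its heart fits in a few lines: a page’s score is a damped sum of the scores of the pages linking to it, iterated to a fixed point. A toy power-iteration sketch (the three-page web is invented for illustration):

public class ToyPageRank {
    public static void main(String[] args) {
        // links[i] = the pages that page i links to: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}
        int[][] links = { {1, 2}, {2}, {0} };
        int n = links.length;
        double d = 0.85; // damping factor
        double[] rank = new double[n];
        java.util.Arrays.fill(rank, 1.0 / n);
        for (int iter = 0; iter < 50; iter++) {
            double[] next = new double[n];
            java.util.Arrays.fill(next, (1 - d) / n);
            for (int i = 0; i < n; i++)
                for (int j : links[i])
                    next[j] += d * rank[i] / links[i].length; // share score over out-links
            rank = next;
        }
        // Page 2 ranks highest: both of the other pages point at it.
        System.out.println(java.util.Arrays.toString(rank));
    }
}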

In his first post, which wasn’t sympathetic to natural language processing, Geoffrey baited his critics into fits of frenzied refutation.

Fits of refutation that failed to note Geoffrey hadn’t completed his posts on natural language processing.

Take the keyword search posting for instance.

I won’t spoil the surprise for you but the fourth fact that Geoffrey says Google relies upon could have serious legs for topic map authoring and interface design.

And not a little insight into what we call natural language processing.

More posts are to follow in this series.

I suggest we savor each one as it appears and after reflection on the whole, sally forth onto the field of verbal combat.

May 13, 2013

Seventh ACM International Conference on Web Search and Data Mining

Filed under: Conferences,Data Mining,Searching,WWW — Patrick Durusau @ 10:08 am

WSDM 2014 : Seventh ACM International Conference on Web Search and Data Mining

Abstract submission deadline: August 19, 2013
Paper submission deadline: August 26, 2013
Tutorial proposals due: September 9, 2013
Tutorial and paper acceptance notifications: November 25, 2013
Tutorials: February 24, 2014
Main Conference: February 25-28, 2014

From the call for papers:

WSDM (pronounced “wisdom”) is one of the premier conferences covering research in the areas of search and data mining on the Web. The Seventh ACM WSDM Conference will take place in New York City, USA during February 25-28, 2014.

WSDM publishes original, high-quality papers related to search and data mining on the Web and the Social Web, with an emphasis on practical but principled novel models of search, retrieval and data mining, algorithm design and analysis, economic implications, and in-depth experimental analysis of accuracy and performance.

WSDM 2014 is a highly selective, single track meeting that includes invited talks as well as refereed full papers. Topics covered include but are not limited to:

(…)

Papers emphasizing novel algorithmic approaches are particularly encouraged, as are empirical/analytical studies of specific data mining problems in other scientific disciplines, in business, engineering, or other application domains. Application-oriented papers that make innovative technical contributions to research are welcome. Visionary papers on new and emerging topics are also welcome.

Authors are explicitly discouraged from submitting papers that do not present clearly their contribution with respect to previous works, that contain only incremental results, and that do not provide significant advances over existing approaches.

Sets a high bar but one that can be met.

Would be very nice PR to have a topic map paper among those accepted.

May 10, 2013

Enigma

Filed under: Natural Language Processing,Ontology,Public Data,Searching — Patrick Durusau @ 6:29 pm

Enigma

I suppose it had to happen. With all the noise about public data sets that someone would create a startup to search them. 😉

Not a lot of detail at the site but you can sign up for a free trial.

Features:

100,000+ Public Data Sources: Access everything from import bills of lading, to aircraft ownership, lobbying activity, real estate assessments, spectrum licenses, financial filings, liens, government spending contracts and much, much more.

Augment Your Data: Get a more complete picture of investments, customers, partners, and suppliers. Discover unseen correlations between events, geographies and transactions.

API Access: Get direct access to the data sets, relational engine and NLP technologies that power Enigma.

Request Custom Data: Can’t find a data set anywhere else? Need to synthesize data from disparate sources? We are here to help.

Discover While You Work: Never miss a critical piece of information. Enigma uncovers entities in context, adding intelligence and insight to your daily workflow.

Powerful Context Filters: Our vast collection of public data sits atop a proprietary data ontology. Filter results by topics, tags and source to quickly refine and scope your query.

Focus on the Data: Immerse yourself in the details. Data is presented in its raw form, full screen and without distraction.

Curated Metadata: Source data is often unorganized and poorly documented. Our domain experts focus on sanitizing, organizing and annotating the data.

Easy Filtering: Rapidly prototype hypotheses by refining and shaping data sets in context. Filter tools allow the sorting, refining, and mathematical manipulation of data sets.

The “proprietary data ontology” jumps out at me as an obvious question. Do users get to know what the ontology is?

Not to mention the “our domain experts focus on sanitizing…” claim. That works for some cases, legal research for example. But I am not sure that “your” experts work as well as “my” experts for less focused areas.

Looking forward to learning more about Enigma!

Moloch

Filed under: Cybersecurity,Searching — Patrick Durusau @ 5:00 pm

Moloch

From the webpage:

Moloch is an open source, large scale IPv4 packet capturing (PCAP), indexing and database system. A simple web interface is provided for PCAP browsing, searching, and exporting. APIs are exposed that allow PCAP data and JSON-formatted session data to be downloaded directly. Simple security is implemented by using HTTPS and HTTP digest password support or by using apache in front. Moloch is not meant to replace IDS engines but instead work along side them to store and index all the network traffic in standard PCAP format, providing fast access. Moloch is built to be deployed across many systems and can scale to handle multiple gigabits/sec of traffic.

Where do you think you are most likely to find dirty laundry?

In data you have been given permission to see?

Or, in data that others don’t want you to see?

Times up! 😉

I first saw this in Nat Torkington’s Four short links: 8 May 2013.

May 8, 2013

How Impoverished is the “current world of search?”

Filed under: Context,Indexing,Search Engines,Searching — Patrick Durusau @ 12:34 pm

Internet Content Is Looking for You

From the post:

Where you are and what you’re doing increasingly play key roles in how you search the Internet. In fact, your search may just conduct itself.

This concept, called “contextual search,” is improving so gradually the changes often go unnoticed, and we may soon forget what the world was like without it, according to Brian Proffitt, a technology expert and adjunct instructor of management in the University of Notre Dame’s Mendoza College of Business.

Contextual search describes the capability for search engines to recognize a multitude of factors beyond just the search text for which a user is seeking. These additional criteria form the “context” in which the search is run. Recently, contextual search has been getting a lot of attention due to interest from Google.

(…)

“You no longer have to search for content, content can search for you, which flips the world of search completely on its head,” says Proffitt, who is the author of 24 books on mobile technology and personal computing and serves as an editor and daily contributor for ReadWrite.com.

“Basically, search engines examine your request and try to figure out what it is you really want,” Proffitt says. “The better the guess, the better the perceived value of the search engine. In the days before computing was made completely mobile by smartphones, tablets and netbooks, searches were only aided by previous searches.

(…)

Context can include more than location and time. Search engines will also account for other users’ searches made in the same place and even the known interests of the user.

If time and location plus prior searches are context that “…flips the world of search completely on its head…”, imagine what a traditional index must do.

A traditional index is created by a person who has subject matter knowledge beyond the average reader’s and so is able to point to connections and facts (context) previously unknown to the user.

The “…current world of search…” is truly impoverished for time and location to have that much impact.

May 3, 2013

Is Search a Thing of the Past

Filed under: Marketing,Searching,Topic Maps — Patrick Durusau @ 4:12 pm

Is Search a Thing of the Past by April Holmes.

April covers a survey of 2277 private technology firms that were acquired in 2012.

See her post for the details but the bottom line was:

None of them were search companies.

I can’t remember anyone ever saying they had a “great” search experience.

Can you?

If not, what would you want to replace present search interfaces? (Leaving technical feasibility aside for the moment.)
