Another Word For It
Patrick Durusau on Topic Maps and Semantic Diversity

December 13, 2010

Search Potpourri: Pasties

Filed under: Humor,Search Engines — Patrick Durusau @ 11:51 am

Today’s search term: pasties

1-3: Err, nipple covers

4: a filled pastry case

5-8: nipple covers

9: a filled pastry case delivered

10: nipple covers

The other results alternate between those two subjects, for the most part.

With the exception that airport scanners have created a market for flying pasties.

Can anyone suggest a search engine that doesn’t return both in the first page of “hits?”

December 11, 2010

Decomposer

Filed under: Matrix,Search Engines,Vectors — Patrick Durusau @ 1:19 pm

Decomposer

From the website:

Matrix algebra underpins the way many Big Data algorithms and data structures are composed: full-text search can be viewed as doing matrix multiplication of the term-document matrix by the query vector (giving a vector over documents where the components are the relevance score), computing co-occurrences in a collaborative filtering context (people who viewed X also viewed Y, or ratings-based CF like the Netflix Prize contest) is squaring the user-item interaction matrix, users who are k-degrees separated from each other in a social network or web-graph can be found by looking at the k-fold product of the graph adjacency matrix, and the list goes on (and these are all cases where the linear structure of the matrix is preserved!)
….
Currently implemented: Singular Value Decomposition using the Asymmetric Generalized Hebbian Algorithm outlined in Genevieve Gorrell & Brandyn Webb’s paper and there is a Lanczos implementation, both single-threaded, and in the contrib/hadoop subdirectory, as a hadoop map-reduce (series of) job(s). Coming soon: stochastic decomposition.

This code is in the process of being absorbed into the Apache Mahout Machine Learning Project.

Useful for learning to use search technology, but also for recognizing, at a very fundamental level, the limitations of that technology.

Document and query vectors are constructed without regard to the semantics of their components.

Using co-occurrence, for example, doesn’t give a search engine greater access to the semantics of the terms in question.

It simply makes the vector longer, so matches are less frequent and, hopefully, less frequent means more precise.

That may or may not be the case. It also doesn’t account for cases where the vectors are different but the subject in question is the same.
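
To make that point concrete, here is a minimal sketch (toy data and numpy are my choices, not Decomposer’s code) of full-text search as multiplication of a term-document matrix by a query vector. Notice how a query for "car" scores zero against the document that says "automobile":

    import numpy as np

    # Vocabulary defines the components of every document and query vector.
    vocab = ["car", "automobile", "engine", "tire"]

    # Term-document matrix: rows = terms, columns = documents.
    # Document 0 mentions "car engine"; document 1 mentions "automobile tire".
    A = np.array([
        [1, 0],  # car
        [0, 1],  # automobile
        [1, 0],  # engine
        [0, 1],  # tire
    ])

    # Query "car" as a one-hot vector over the same vocabulary.
    q = np.array([1, 0, 0, 0])

    # Relevance scores = A^T q, one component per document.
    print(A.T @ q)  # [1 0] -- document 1 is invisible, though it is about cars

The vectors match or they don’t; nothing in the arithmetic knows that "car" and "automobile" name the same subject.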

Search Potpourri: Madonna

Filed under: Humor,Search Engines — Patrick Durusau @ 11:41 am

Today’s search term: Madonna

1-9: You peeked! Yes, the material girl

10: http://www.madonna.edu (one of these is not like the others)

11-16: more material girl

17: where to have fun with a material girl, the Madonna Inn

18-28: even more material girl

29: a general-purpose differential equation solver, Berkeley Madonna

30: back to the material girl

It may be my Roman Catholic background, or perhaps the season, but I would have expected Madonna, as in Madonna and child, to be in the top 30.

Would you believe that when I used the phrase Madonna and Child, not only did I get the more traditional Madonna, but:

10. Material girl launches a clothing line

11. Video of Madonna and child, as in the Material girl and her child

Attention All Search Engines!

You are grouping results into blocks anyway, so why not serve them up that way?

Give one, maybe two links with some text for each block. Then users can choose one of those links or the block.

Sort of like “more like this,” but better able to offer the user meaningful alternatives.

Plus, you can charge more to be the link that shows up in the block as opposed to simply being higher in an aggregation of err…links.

No charge. I use search engines a good bit and every improvement makes my life easier.

December 10, 2010

Decoding Searcher Intent: Is “MS” Microsoft Or Multiple Sclerosis? – Post

Filed under: Authoring Topic Maps,Interface Research/Design,Search Engines,Searching — Patrick Durusau @ 7:35 pm

Decoding Searcher Intent: Is “MS” Microsoft Or Multiple Sclerosis? is a great post from searchengineland.com.

Although focused on user behavior as a guide to optimizing content for search engines, the same analysis is relevant to the construction of topic maps.

A topic map for software help files is very unlikely to treat “MS” as anything other than Microsoft.

Even if those files contain a reference to Multiple Sclerosis, written as “MS.”

Why?

Because every topic map will concentrate its identification of subjects and relationships between subjects where there is the greatest return on investment.

Just as we have documentation rot now, there will be topic map rot as some subjects near the boundary of what is being maintained.

And some subjects won’t be identified or maintained at all.

Perhaps another class of digital have-nots.
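
As a back-of-the-envelope illustration of intent decoding (the subject identifiers and context terms below are hypothetical, not from the post), deciding which subject an “MS” query identifies can be as crude as counting context-term overlap:

    # Hypothetical subject identifiers and context terms; a real system
    # would mine the context terms from query logs.
    CONTEXT = {
        "http://psi.example.org/microsoft": {"windows", "office", "excel", "server"},
        "http://psi.example.org/multiple-sclerosis": {"symptoms", "treatment", "relapse"},
    }

    def subject_for(query):
        """Map an ambiguous 'MS' query to a subject identifier by context overlap."""
        terms = set(query.lower().split()) - {"ms"}
        best = max(CONTEXT, key=lambda psi: len(CONTEXT[psi] & terms))
        return best if CONTEXT[best] & terms else None

    print(subject_for("MS excel pivot table"))  # the Microsoft PSI
    print(subject_for("MS relapse treatment"))  # the multiple sclerosis PSI
    print(subject_for("MS"))                    # None: no context, no decision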

Questions:

  1. Read the post and prepare a one page summary of its main points.
  2. What other log analysis would you use in designing a topic map? (3-5 pages, citations)
  3. Should a majority of user behavior/expectations drive topic map design? (3-5 pages, no citations)

Search Potpourri: …My Breast Fell Out

Filed under: Humor,Search Engines,Search Potpourri — Patrick Durusau @ 8:51 am

Today’s search term: breast

  1. Breast Implants $2500
  2. Breast – Wikipedia, the free encyclopedia
  3. Naked and Funny. Opps! My Breast Fell Out (video)
  4. Feel My Breasts (video)
  5. Breasts – sexual or for breastfeeding babies?
  6. AfraidtoAsk.com >> SIZE & SHAPE
  7. Show prosthetic breast, woman told at airport
  8. Fake Doctor Jailed For Giving Breast Exams In Bars

I suppose one could argue this result set offers something for everyone.

But, the rest of the results, at least up to the first 50, were as uneven as the first ten.

Hardly encouraging, say, for someone seeking serious medical information.

Clustering similar entities into collections would be one way to improve upon this result.

The related search function does that to a degree. But only to a degree.

More detailed navigation would be a good thing.

Perhaps high level collection views that can be “zoomed” into for more detailed browsing.
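
For the curious, clustering a result set is not exotic. A minimal sketch with scikit-learn (toy snippets standing in for the results above; whether medical and novelty results actually separate depends on the features, this only shows the mechanics):

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy snippets standing in for the mixed result set above.
    snippets = [
        "breast implants cosmetic surgery pricing",
        "breast cancer screening mammogram guidelines",
        "funny video breast wardrobe malfunction",
        "breastfeeding benefits for babies and mothers",
        "breast exam medical advice from a physician",
    ]

    X = TfidfVectorizer(stop_words="english").fit_transform(snippets)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    for label, snippet in sorted(zip(labels, snippets)):
        print(label, snippet)  # collections a user could zoom into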

*****
Send your favorite search term(s)/phrases and a suggested search engine (remains anonymous for reporting) to: patrick@durusau.net.

Before anyone complains that the search term was unfair: there were search engines that returned less varied results.

December 9, 2010

Mining of Massive Datasets – eBook

Mining of Massive Datasets

Jeff Dalton, Jeff’s Search Engine Caffè, reports a new data mining book by Anand Rajaraman and Jeffrey D. Ullman (yes, that Jeffrey D. Ullman, think “dragon book”).

A free eBook no less.

Read Jeff’s post on your way to get a copy.

Look for more comments as I read through it.

Has anyone written a comparison of the recent search engine titles? Just curious.


Update: New version out in hard copy and e-book remains available. See: Mining Massive Data Sets – Update

December 8, 2010

Barriers to Entry in Search Getting Smaller – Post

Filed under: Indexing,Interface Research/Design,Search Engines,Search Interface,Searching — Patrick Durusau @ 9:49 am

Barriers to Entry in Search Getting Smaller

Jeff Dalton, Jeff’s Search Engine Caffè, makes a good argument that the barriers to entering the search market are getting smaller.

Jeff observes that blekko can succeed with a small number of servers only because its search demand is low.

True, but how many intra-company or litigation search engines are going to have web-sized user demands?

Start-ups need not try to match Google in its own space, but can carve out interesting and economically rewarding niches of their own.

Particularly if those niches involve mapping semantically diverse resources into useful search results for their users.

For example, biomedical researchers probably have little interest in catalog entries that happen to match gene names. Or any of the other common mismatches offered by whole-web search services.

In some ways, whole-web search services have created their own problem and then attempted to solve it.

My research interests are in information retrieval, broadly defined, so a search engine limited to library schools, CS programs (their faculty and students), the usual suspects for CS collections, and library/CS/engineering organizations, with semantic mapping, would suit me just fine.

Noting that the semantic mismatch problem persists even with a narrowing of resources, but the benefit of each mapping is incrementally greater.

Questions:

  1. What resources are relevant to your research interests? (3-5 pages, web or other citations)
  2. Create a Google account to create your own custom search engine and populate it with your resources.
  3. Develop and execute 20 queries against your search engine and Google, Bing and one other search engine of your choice. Evaluate and report the results of those queries.
  4. Would semantic mapping such as we have discussed for topic maps be more or less helpful with your custom search engine versus the others you tried? (3-5 pages, no citations)

December 4, 2010

Zoie: Real-time search indexing

Filed under: Full-Text Search,Indexing,Lucene,Search Engines,Software — Patrick Durusau @ 10:04 am

Zoie: Real-time search indexing

Somehow appropriate that following the lead on Kafka would lead me to Zoie (and other goodies to be reported).

From the website:

Zoie is a real-time search and indexing system built on Apache Lucene.

Donated by LinkedIn.com on July 19, 2008, and has been deployed in a real-time large-scale consumer website: LinkedIn.com handling millions of searches as well as hundreds of thousands of updates daily.

News: Zoie 2.0.0 is released … – Compatible with Lucene 2.9.x.

In a real-time search/indexing system, a document is made available as soon as it is added to the index. This functionality is especially important to time-sensitive information such as news, job openings, tweets etc.

Design Goals:

  • Additions of documents must be made available to searchers immediately
  • Indexing must not affect search performance
  • Additions of documents must not fragment the index (which hurts search performance)
  • Deletes and/or updates of documents must not affect search performance.

In topic map terms:

  • Additions to a topic map must be made available to searchers immediately
  • Indexing must not affect search performance
  • Additions to a topic map must not fragment the index (which hurts search performance)
  • Deletes and/or updates of a topic map must not affect search performance.

I would say that #’s 3 and 4 are research questions at this point.

Additions, updates and deletions in a topic map may have unforeseen (unforeseeable?) consequences.

Such as causing:

  • merging to occur
  • merging to be undone
  • roles to be played
  • roles to not be played
  • associations to be valid
  • associations to be invalid

to name only a few.
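
To make the first item on that list concrete, here is a minimal sketch (not TMDM-conformant, just the shape of the problem) of how a single addition can rewrite an existing topic, and with it whatever an index said about that topic:

    class Topic:
        def __init__(self, identifiers, names):
            self.identifiers = set(identifiers)
            self.names = set(names)

    def add_topic(topics, new):
        """Add a topic; merge with any topic sharing a subject identifier."""
        for t in topics:
            if t.identifiers & new.identifiers:
                t.identifiers |= new.identifiers
                t.names |= new.names
                return t  # merged: index entries for either topic are now stale
        topics.append(new)
        return new

    topics = []
    add_topic(topics, Topic({"http://psi.example.org/ms"}, {"Microsoft"}))
    merged = add_topic(topics, Topic({"http://psi.example.org/ms"}, {"MS"}))
    print(merged.names)  # {'Microsoft', 'MS'} -- one addition changed an old topic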

It may be possible to formally prove the impact that certain events will have but I am not aware of any definitive analysis on the subject.

Suggestions?

November 25, 2010

Sig.ma – Live views on the Web of Data

Filed under: Indexing,Information Retrieval,Lucene,Mapping,RDF,Search Engines,Semantic Web — Patrick Durusau @ 10:27 am

Sig.ma – Live views on the Web of Data

From the website:

In Sig.ma, elements such as large scale semantic web indexing, logic reasoning, data aggregation heuristics, pragmatic ontology alignments and, last but not least, user interaction and refinement, all play together to provide entity descriptions which become live, embeddable data mash ups.

Read one of the various versions of the article on Sig.ma for the technical details.

From the Web Technologies article cited on the homepage:

Sig.ma revolves around the creation of Entity Profiles. An entity profile – which in the Sig.ma dataflow is represented by the “data cache” storage (Fig. 3) – is a summary of an entity that is presented to the user in a visual interface, or which can be returned by the API as a rich JSON object or an RDF document. Entity profiles usually include information that is aggregated from more than one source. The basic structure of an entity profile is a set of key-value pairs that describe the entity. Entity profiles often refer to other entities, for example the profile of a person might refer to their publications.

No, this isn’t an implementation of the TMRM.

This is an implementation of one way to view entities for a particular type of data. A very exciting one but still limited to a particular data set.

This is a big step forward.

For example, it isn’t hard to imagine entity profiles against particular websites or data sets. Entity profiles that are maintained and leased for use with search engines like Sig.ma.

Or going a bit further and declaring a basis for identification of subjects, such as the existence of properties a…n in an RDF graph.
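
A sketch of the entity profile idea (hypothetical sources and keys, my code rather than Sig.ma’s): a set of key-value pairs aggregated from more than one source, with provenance kept per value:

    from collections import defaultdict

    # Hypothetical per-source descriptions of the same entity.
    sources = {
        "http://example.org/dblp": {"name": "J. Smith", "publication": "Paper A"},
        "http://example.org/homepage": {"name": "Jane Smith", "affiliation": "Univ. X"},
    }

    def entity_profile(sources):
        """Merge key-value pairs from many sources, keeping provenance."""
        profile = defaultdict(list)
        for source, pairs in sources.items():
            for key, value in pairs.items():
                profile[key].append((value, source))
        return dict(profile)

    for key, values in entity_profile(sources).items():
        print(key, values)  # "name" carries two values, each with its source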

Questions:

  1. Spend a couple of hours with Sig.ma researching library related questions. (Discussion)
  2. What did you like, dislike or find surprising about Sig.ma? (3-5 pages, no citations)
  3. Entity profiles for library science (Class project)

Sig.ma: Live Views on the web of data – bibliography issues

I normally start with a DOI here so you can see the article in question.

Not here.

Here’s why:

Sig.ma: Live views on the Web of Data, Journal of Web Semantics (10 pages)

Sig.ma: Live Views on the Web of Data, WWW ’10 Proceedings (demo, 4 pages)

Sig.ma: Live Views on the Web of Data (8 pages) http://richard.cyganiak.de/2008/papers/sigma-semwebchallenge2009.pdf

Sig.ma: Live Views on the Web of Data (4 pages) http://richard.cyganiak.de/2008/papers/sigma-demo-www2010.pdf

Sig.ma: Live Views on the Web of Data (25 pages) http://fooshed.net/paper/JWS2010.pdf

Before saying anything ugly, ;-), this is some of the most exciting research I have seen in a long time. I will cover that part of it in a following post. But, to the matter at hand, bibliographic control.

Five (5) different articles, two published in recognized journals, all with the same name? (The demo articles are the same but have different headers/footers and page numbers, and so would likely be indexed as different articles.)

I will be able to resolve any confusion by obtaining the article in question.

But that isn’t an excuse.

I, along with everyone else interested in this research, will waste a small part of our time resolving the confusion. Confusion that could have been avoided for everyone.

Not unlike everyone who runs the same search having to wade through the same Google glut.

With no way to pass on what we have resolved, for the benefit of others.

Questions:

  1. Help these authors out. How would you suggest they avoid this in the future? Use of the name is important. (3-5 pages, no citations)
  2. Help the library out. How will you deal with multiple papers with the same title, authors, pub year? (this isn’t uncommon) (3-5 pages, citations optional)
  3. How would you use topic maps to resolve this issue? (3-5 pages, no citations)

November 2, 2010

Advances in Clustering Search

Filed under: Search Engines,Searching — Patrick Durusau @ 6:04 am

Advances in Clustering Search

Authors: Tarcisio Souza Costa, Alexandre César Muniz Oliveira, Luiz Antonio Nogueira Lorena

Keywords: Clustering Search, search subspaces, combinatorial optimisation, population metaheuristics, evolutionary algorithms

Abstract:

The Clustering Search (*CS) has been proposed as a generic way of combining search metaheuristics with clustering to detect promising search areas before applying local search procedures. The clustering process may keep representative solutions associated to different search subspaces. Although recent applications have reached success in combinatorial optimisation problems, nothing new has arisen concerning diversification issues when population metaheuristics, such as evolutionary algorithms, are being employed. In this work, recent advances in the *CS are commented on and new features are proposed, including the possibility of keeping the population diversified for more generations.

The use of different search strategies for clusters of interest was the most interesting aspect of this article.

Not that big of a step to imagine different subject or association recognition routines depending upon the type of cluster.

October 30, 2010

8 Keys to Findability

8 Keys to Findability mentions in closing:

The average number of search terms is about 1.7 words, which is not a lot when searching across millions of documents. Therefore, a conversation type of experience where users can get feedback from the results and refine their search makes for the most effective search results.

I have a different take on that factoid.

The average user needs only 1.7 words to identify a subject of interest to them.

Why the gap between 1.7 words and the number of words required for “effective search results?”

Why ask?

Returning millions of “hits” on 1.7 words is meaningless.

Returning the ten most relevant “hits” on 1.7 words is a G***** killer.

October 28, 2010

LDSpider

Filed under: Linked Data,Search Engines,Searching,Semantic Web — Patrick Durusau @ 5:11 am

LDSpider.

From the website:

The LDSpider project aims to build a web crawling framework for the linked data web. Requirements and challenges for crawling the linked data web are different from regular web crawling, thus this project offers a web crawler adapted to traverse and harvest sources and instances from the linked data web. We offer a single jar which can be easily integrated into your own applications.

Features:

  • Content Handlers for different formats
  • Different crawling strategies
  • Crawling scope
  • Output formats

Content handlers, crawling strategies, crawling scope, output formats: all standard crawling features. Adapted to linked data formats, but those formats should be accessible to any crawler.
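
For comparison, here is roughly what a generic linked data crawl looks like: a minimal sketch using rdflib (my choice of library; this is not LDSpider’s code), relying on content negotiation to fetch an RDF serialization and following object URIs:

    from collections import deque
    from rdflib import Graph, URIRef

    def crawl(seed, limit=10):
        """Tiny linked data crawler: dereference URIs, harvest triples, follow links."""
        seen, queue, store = set(), deque([seed]), Graph()
        while queue and len(seen) < limit:
            uri = queue.popleft()
            if uri in seen:
                continue
            seen.add(uri)
            try:
                g = Graph().parse(uri)  # content negotiation for an RDF format
            except Exception:
                continue  # not dereferenceable as RDF; a real crawler would log it
            store += g
            for _, _, obj in g:
                if isinstance(obj, URIRef):
                    queue.append(str(obj))
        return store

    # store = crawl("http://dbpedia.org/resource/Berlin")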

A welcome addition, since we are all going to encounter linked data, but I am missing what is different.

If you see it, please post a comment.

Questions:

  1. What semantic requirements should a web crawler have?
  2. How does this web crawler compare to your requirements?
  3. What one capacity would you add to this crawler?
  4. What other web crawlers should be used for comparison?

October 25, 2010

The Short Comings of Full-Text Searching

The Short Comings of Full-Text Searching by Jeffrey Beall from the University of Colorado Denver.

  1. The synonym problem.
  2. Obsolete terms.
  3. The homonym problem.
  4. Spamming.
  5. Inability to narrow searches by facets.
  6. Inability to sort search results.
  7. The aboutness problem.
  8. Figurative language.
  9. Search words not in web page.
  10. Abstract topics.
  11. Paired topics.
  12. Word lists.
  13. The Dark Web.
  14. Non-textual things.

Questions:

  1. Watch the slide presentation.
  2. Can you give three examples of each shortcoming? (excluding #5 and #6, which strike me as interface issues, not searching issues)
  3. How would you “solve” the word list issue? (Don’t assume quantum computing, etc. There are simpler answers.)
  4. Is metadata the only approach for “non-textual things?” Can you cite 3 papers offering other approaches?

October 6, 2010

Mining Historic Query Trails to Label Long and Rare Search Engine Queries

Filed under: Authoring Topic Maps,Data Mining,Entity Extraction,Search Engines,Searching — Patrick Durusau @ 7:05 am

Mining Historic Query Trails to Label Long and Rare Search Engine Queries

Authors: Peter Bailey, Ryen W. White, Han Liu, Giridhar Kumaran

Keywords: Long queries, query labeling

Abstract:

Web search engines can perform poorly for long queries (i.e., those containing four or more terms), in part because of their high level of query specificity. The automatic assignment of labels to long queries can capture aspects of a user’s search intent that may not be apparent from the terms in the query. This affords search result matching or reranking based on queries and labels rather than the query text alone. Query labels can be derived from interaction logs generated from many users’ search result clicks or from query trails comprising the chain of URLs visited following query submission. However, since long queries are typically rare, they are difficult to label in this way because little or no historic log data exists for them. A subset of these queries may be amenable to labeling by detecting similarities between parts of a long and rare query and the queries which appear in logs. In this article, we present the comparison of four similarity algorithms for the automatic assignment of Open Directory Project category labels to long and rare queries, based solely on matching against similar satisfied query trails extracted from log data. Our findings show that although the similarity-matching algorithms we investigated have tradeoffs in terms of coverage and accuracy, one algorithm that bases similarity on a popular search result ranking function (effectively regarding potentially-similar queries as “documents”) outperforms the others. We find that it is possible to correctly predict the top label better than one in five times, even when no past query trail exactly matches the long and rare query. We show that these labels can be used to reorder top-ranked search results leading to a significant improvement in retrieval performance over baselines that do not utilize query labeling, but instead rank results using content-matching or click-through logs. The outcomes of our research have implications for search providers attempting to provide users with highly-relevant search results for long queries.

(Apologies for repeating the long abstract but this needs wider notice.)

What the authors call “label prediction algorithms” is a step toward mining data for subjects.

The research may also improve search results through the use of labels for ranking.
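
As a toy version of the idea (hypothetical log entries, and plain TF-IDF cosine standing in for the search-result ranking function the paper actually uses), label a long, rare query by its nearest logged query:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Hypothetical labeled query log (query -> ODP-style category label).
    log = [
        ("cheap flights to paris france", "Recreation/Travel"),
        ("python list comprehension examples", "Computers/Programming"),
        ("early symptoms of multiple sclerosis", "Health/Conditions"),
    ]

    vectorizer = TfidfVectorizer().fit([query for query, _ in log])
    log_matrix = vectorizer.transform([query for query, _ in log])

    def predict_label(long_query):
        """Label a long, rare query by its most similar logged query."""
        scores = cosine_similarity(vectorizer.transform([long_query]), log_matrix)[0]
        return log[scores.argmax()][1]

    print(predict_label("very cheap last minute flights to paris"))  # Recreation/Travel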

October 2, 2010

Facilitating exploratory search by model-based navigational cues

Filed under: Interface Research/Design,Search Engines,Search Interface,Searching — Patrick Durusau @ 4:34 am

Facilitating exploratory search by model-based navigational cues

Authors: Wai-Tat Fu, Thomas G. Kannampallil, Ruogu Kang

Keywords: exploratory learning, knowledge exchange, semantic imitation, SNIF-ACT, social tagging

Abstract:

We present an extension of a computational cognitive model of social tagging and exploratory search called the semantic imitation model. The model assumes a probabilistic representation of semantics for both internal and external knowledge, and utilizes social tags as navigational cues during exploratory search. We used the model to generate a measure of information scent that controls exploratory search behavior, and simulated the effects of multiple presentations of navigational cues on both simple information retrieval and exploratory search performance based on a previous model called SNIF-ACT. We found that search performance can be significantly improved by these model-based presentations of navigational cues for both experts and novices. The result suggested that exploratory search performance depends critically on the match between internal knowledge (domain expertise) and external knowledge structures (folksonomies). Results have significant implications on how social information systems should be designed to facilitate knowledge exchange among users with different background knowledge.

Not all users require (or can use) the same cues.

Something to think about when designing the interface, for topic maps or elsewhere.

DocuBrowse: faceted searching, browsing, and recommendations in an enterprise context

DocuBrowse: faceted searching, browsing, and recommendations in an enterprise context

Authors: Andreas Girgensohn, Frank Shipman, Francine Chen, Lynn Wilcox

Keywords: document management, document recommendation, document retrieval, document visualization, faceted search

Abstract:

Browsing and searching for documents in large, online enterprise document repositories are common activities. While internet search produces satisfying results for most user queries, enterprise search has not been as successful because of differences in document types and user requirements. To support users in finding the information they need in their online enterprise repository, we created DocuBrowse, a faceted document browsing and search system. Search results are presented within the user-created document hierarchy, showing only directories and documents matching selected facets and containing text query terms. In addition to file properties such as date and file size, automatically detected document types, or genres, serve as one of the search facets. Highlighting draws the user’s attention to the most promising directories and documents while thumbnail images and automatically identified keyphrases help select appropriate documents. DocuBrowse utilizes document similarities, browsing histories, and recommender system techniques to suggest additional promising documents for the current facet and content filters.

Watch the movie of this interface in action at the ACM page.

Then imagine it with collaboration and subject identity.

Towards a reputation-based model of social web search

Towards a reputation-based model of social web search

Authors: Kevin McNally, Michael P. O’Mahony, Barry Smyth, Maurice Coyle, Peter Briggs

Keywords: collaborative web search, heystaks, reputation model

Abstract:

While web search tasks are often inherently collaborative in nature, many search engines do not explicitly support collaboration during search. In this paper, we describe HeyStaks (www.heystaks.com), a system that provides a novel approach to collaborative web search. Designed to work with mainstream search engines such as Google, HeyStaks supports searchers by harnessing the experiences of others as the basis for result recommendations. Moreover, a key contribution of our work is to propose a reputation system for HeyStaks to model the value of individual searchers from a result recommendation perspective. In particular, we propose an algorithm to calculate reputation directly from user search activity and we provide encouraging results for our approach based on a preliminary analysis of user activity and reputation scores across a sample of HeyStaks users.

The reputation system proposed by the authors could easily underlie a collaborative approach to the creation of a topic map.

Think of collections not normally accessed by web search engines: The National Archives (U.S.) and similar document collections.

Reputation + trails + subject identity = Hard to Beat.

See www.heystaks.com as a starting point.

September 28, 2010

International Workshop on Similarity Search and Applications (SISAP)

Filed under: Indexing,Information Retrieval,Search Engines,Searching,Software — Patrick Durusau @ 4:47 pm

International Workshop on Similarity Search and Applications (SISAP)

Website:

The International Workshop on Similarity Search and Applications (SISAP) is a conference devoted to similarity searching, with emphasis on metric space searching. It aims to fill in the gap left by the various scientific venues devoted to similarity searching in spaces with coordinates, by providing a common forum for theoreticians and practitioners around the problem of similarity searching in general spaces (metric and non-metric) or using distance-based (as opposed to coordinate-based) techniques in general.

SISAP aims to become an ideal forum to exchange real-world, challenging and exciting examples of applications, new indexing techniques, common testbeds and benchmarks, source code, and up-to-date literature through a Web page serving the similarity searching community. Authors are expected to use the testbeds and code from the SISAP Web site for comparing new applications, databases, indexes and algorithms.

Proceedings from prior years, source code, sample data, a real gem of a site.
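
To ground the “distance-based, not coordinate-based” idea, a minimal sketch: similarity search that needs nothing but a distance function, here edit distance over strings. Metric indexes of the kind SISAP studies exist to prune this brute-force scan using only the triangle inequality:

    def edit_distance(a, b):
        """Levenshtein distance: a metric defined without any coordinates."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def nearest(query, items):
        """Brute-force nearest neighbor under an arbitrary metric."""
        return min(items, key=lambda item: edit_distance(query, item))

    print(nearest("similarty", ["similarity", "simulation", "disparity"]))  # similarity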

August 30, 2010

Is search passé?

Filed under: Interface Research/Design,Search Engines,Searching,Topic Maps — Patrick Durusau @ 4:50 pm

Is search passé? is an intriguing question asked in the Montague Institute Review for August, 2010. Unfortunately, not being a member, I can’t summarize their answers for you.

It really isn’t that hard to guess some of them. I blogged about Blair and Maron saying twenty-five years ago:

Stated succinctly, it is impossibly difficult for users to predict the exact words, word combinations, and phrases that are used by all (or most) relevant documents and only (or primarily) by those documents, as can be seen in the following examples.

Documents and texts haven’t changed in the last twenty-five years. If anything, the problem has gotten worse due to the volume and variety of material that is now available for searching.

This is a semantic and therefore human judgment problem. Algorithms and “clever” data structures can assist human users in making those judgments, but can’t replace them in the loop.

Imagine a search engine that seeks the assistance of users on semantic issues. As opposed to the skulking around of current search engines and sites. Why not just ask? Politely.

A user-fed search engine with a topic map backend. That could be very interesting.

August 2, 2010

xISBN (Web service)

Filed under: FRBR,Search Engines — Patrick Durusau @ 2:56 pm

xISBN (Web service) should be of interest to topic mappers.

From the website:

The xISBN Web service supplies ISBNs and other information associated with an individual intellectual work that is represented in WorldCat. Submit an ISBN to this service, and it returns a list of related ISBNs and selected metadata.

****

ISBNs are related to each other in WorldCat using an algorithm developed by OCLC Research. The algorithm restructures WorldCat bibliographic records to conform to the FRBR conceptual model for information objects. For instance, rather than requiring an end user to traverse multiple records that represent many different manifestations of a book—including printings, hardback or paperback editions or even filmed versions—”FRBRized” WorldCat information allows that user to review a core record that lists all manifestations.

The xISBN Web service queries database tables in WorldCat created by the FRBR algorithm.

I got there from e-Book Finder, a nifty site built on top of xISBN that tries to find electronic versions of books. Not necessarily free electronic versions, just electronic versions.
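
Back to the service itself: here is a sketch of calling it. The endpoint form and the JSON shape are assumptions from memory of OCLC’s documentation, so verify both against the site before relying on them:

    import json
    from urllib.request import urlopen

    def related_isbns(isbn):
        """Ask xISBN for other manifestations of the same FRBR work.
        Endpoint and response shape are assumptions; check the docs."""
        url = ("http://xisbn.worldcat.org/webservices/xid/isbn/"
               + isbn + "?method=getEditions&format=json")
        with urlopen(url) as response:
            data = json.load(response)
        return [entry["isbn"][0] for entry in data.get("list", [])]

    # related_isbns("0596002815")  # printings, editions, etc. of one work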

July 13, 2010

ASTERIX: A Highly Scalable Parallel Platform for Semistructured Data Management and Analysis – SITE

ASTERIX: A Highly Scalable Parallel Platform for Semistructured Data Management and Analysis is one of the projects behind the self-similarity and MapReduce posting.

From the project page:

The ASTERIX project is developing new technologies for ingesting, storing, managing, indexing, querying, analyzing, and subscribing to vast quantities of semi-structured information. The project is combining ideas from three distinct areas – semi-structured data, parallel databases, and data-intensive computing – to create a next-generation, open source software platform that scales by running on large, shared-nothing computing clusters.

Home of Hyrax: Demonstrating a New Foundation for Data-Parallel Computation, with “out-of-the-box support for common distributed communication patterns and set-oriented data operators.” (Need I say more?)

July 11, 2010

NTCIR (NII Test Collection for IR Systems) Project

Filed under: Conferences,Heterogeneous Data,Information Retrieval,Search Engines,Software — Patrick Durusau @ 7:47 am

NTCIR (NII Test Collection for IR Systems) Project focuses on information retrieval tasks in Japanese, Chinese, Korean, English and cross-lingual information retrieval.

From the project description:

For the laboratory-typed testing, we have placed emphasis on (1) information retrieval (IR) with Japanese or other Asian languages and (2) cross-lingual information retrieval. For the challenging issues, (3) shift from document retrieval to “information” retrieval and technologies to utilizing information in the documents, and (4) investigation for realistic evaluation, including evaluation methods for summarization, multigrade relevance judgments and single-numbered averageable measures for such judgments, evaluation methods suitable for retrieval and processing of particular document-genre and its usage of the user group of the genre and so on.

I know there are active topic map communities in both Japan and Korea. Perhaps this is a place to meet researchers working on issues closely related to those in topic maps and to discuss the contribution that topic maps have to offer.

Forum for Information Retrieval Evaluation (FIRE)

Filed under: Conferences,Heterogeneous Data,Information Retrieval,Search Engines,Software — Patrick Durusau @ 6:44 am

Forum for Information Retrieval Evaluation (FIRE) aims:

  • to encourage research in South Asian language Information Access technologies by providing reusable large-scale test collections for ILIR experiments
  • to explore new Information Retrieval / Access tasks that arise as our information needs evolve, and new needs emerge
  • to provide a common evaluation infrastructure for comparing the performance of different IR systems
  • to investigate evaluation methods for Information Access techniques and methods for constructing a reusable large-scale data set for ILIR experiments.

I know there is a lot of topic map development in South Asia and this looks like a great place to meet current researchers and to interest others in topic maps.

INEX: Initiative for Evaluation of XML Retrieval

Filed under: Conferences,Heterogeneous Data,Information Retrieval,Search Engines,Software — Patrick Durusau @ 6:30 am

INEX: Initiative for Evaluation of XML Retrieval is another must-see for serious topic map researchers.

No surprise that my first stop was the INEX Publications page, with proceedings from 2002 to date.

However, INEX offers an opportunity for evaluation of topic maps in the context of other solutions, provided that one or more of us participate in the initiative.

If you or your institution decide to participate, please let others in the community know. I for one would like to join such an effort.

July 8, 2010

Taking Your Tool Kit to the Next Level

Filed under: Data Mining,Information Retrieval,Search Engines — Patrick Durusau @ 7:53 pm

Online Mathematics Textbooks is a good stop if you want to take your tool kit to the next level.

Plug-n-play indexing and search engines will do a lot out of the box but aren’t going to distinguish you from the competition.

Understanding the underlying algorithms will help make the data mining you do to populate your topic map qualitatively different.

Here’s your chance to brush up on your math skills without monetary investment.

***
PS: At some point, maybe at TMRA, a group of us need to draft an outline for a topic maps curriculum. Would have to include topic maps, obviously, but would also need to include courses in Information Retrieval, User Interfaces, Natural Language Processing, Classification, Math, what else? Would need to have “minors” in some particular subject area.

June 30, 2010

Scientists Develop World’s Fastest Program to Find Patterns in Social Networks – News

Filed under: RDF,Search Engines,Searching — Patrick Durusau @ 6:56 pm

Scientists Develop World’s Fastest Program to Find Patterns in Social Networks.

Actually the paper title is: COSI: Cloud Oriented Subgraph Identification in Massive Social Networks

Either way, this looks important for topic map fans.

How important?

The authors:

show our framework works efficiently, answering many complex queries over a 778M edge real-world SN dataset derived from Flickr, LiveJournal, and Orkut in under one second.

That important!

If you think about topic maps less as hand-curated XML syntax artifacts and more as interactively and probabilistically created mappings into complex subject spaces, then the importance of this research becomes even clearer.
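
The speed comes from COSI’s cloud-based partitioning; the query itself is plain subgraph identification, which you can try in miniature with networkx (toy graph and library are my choices, not the paper’s code):

    import networkx as nx
    from networkx.algorithms import isomorphism

    # Toy social network; COSI answers such queries over 778M edges.
    G = nx.Graph([("ann", "bob"), ("bob", "cho"), ("ann", "cho"), ("cho", "dee")])

    # Query pattern: a triangle of mutual acquaintances.
    pattern = nx.Graph([(1, 2), (2, 3), (1, 3)])

    matcher = isomorphism.GraphMatcher(G, pattern)
    for mapping in matcher.subgraph_isomorphisms_iter():
        print(mapping)  # e.g. {'ann': 1, 'bob': 2, 'cho': 3}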

June 18, 2010

TMQL4J suite 2.6.3 Released

Filed under: Search Engines,TMQL,Topic Map Software — Patrick Durusau @ 8:31 am

The Topic Maps Lab is becoming a hotbed of topic map software development.

TMQL4J 2.6.3 was released this week with the following features:

  • New query factory – now it is possible to implement your own query types. If the query provides a transformation algorithm, it may be converted to a TMQL query and processed by the tmql4j engine.
  • New language processing – the two core modules (the lexical scanner and the parser) were rewritten to become more flexible and stable. The lexical scanner provides new methods to register your own language tokens (as a language extension) or your own non-canonical tokens.
  • Default prefix – the engine provides the functionality of defining a default prefix in the context of the runtime. The prefix can be used without a specific pattern in the context of a query.
  • New interfaces – the interfaces were reorganized to enable an intuitive usage and understanding of the engine itself.

Plus a plugin architecture with plugins for Tmql4Ontopia, TmqlDraft2010, and TopicMapModificationLanguage. See the announcement for the details.

See also TMQL4J Documentation and Tutorials.

Interested in your experiences with the interfaces which “…enable an intuitive usage and understanding of the engine itself.”

June 4, 2010

Hadoop-HBase-Lucene-Mahout-Nutch-Solr Digests

Filed under: Indexing,MapReduce,Search Engines,Software — Patrick Durusau @ 5:40 am

More interests than time?

Digests of developments in May 2010:

Hadoop

HBase

Lucene

Mahout

Nutch

Solr

Suggestions of other digest-type sources and/or comments on such sources are deeply appreciated.

May 25, 2010

A Mapmaker’s Manifesto

Filed under: Maps,Search Engines,Search Interface,Searching,Subject Identity,Usability — Patrick Durusau @ 3:48 pm

Search Patterns by Peter Morville and Jeffery Callender should be on your must-read list. Their “Mapmaker’s Manifesto” will give you an idea of why I like the book:

  1. Search is a problem too big to ignore.
  2. Browsing doesn’t scale, even on an iPhone.
  3. Size matters. Linear growth compels a step change in design.
  4. Simple, fast, and relevant are table stakes.
  5. One size won’t fit all. Search must adapt to context.
  6. Search is iterative, social, and multisensory.
  7. Increments aren’t enough. Even Google must innovate or die.
  8. It’s not just about findability. It’s not just about the Web.
  9. The challenge is radically multidisciplinary.
  10. We must engage engineers and executives in design.
  11. We can learn from the past. Library science is still relevant.
  12. We can learn from behavior. Interaction design affords actionable results.
  13. We can learn from one user. Analytics is enriched by ethnography.
  14. Some patterns, we should study and reuse.
  15. Some patterns, we should break like a bad habit.
  16. Search is a complex adaptive system.
  17. Emergence, cocreation, and self-organization are in play.
  18. To discover the seeds of change, go outside.
  19. In science, fiction, and search, the map invents the territory.
  20. The future isn’t just unwritten—it’s unsearched.

I also like Search Patterns because the authors concede there are vast unknowns, as opposed to saying: “If you just use our (insert paradigm/syntax/ontology/language) then all those nasty problems go away.”

I think we need to accept their invitation to face the vast unknowns head on.
