Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

January 13, 2011

Document Indexing – Wrong Level?

Filed under: Indexing,Search Engines — Patrick Durusau @ 8:16 am

I was reading the Jaccard distance treatment in Anand Rajaraman and Jeffrey D. Ullman's Mining of Massive Datasets when something that had been nagging at me became clearer.
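For readers who have not met it: the Jaccard distance between two sets is 1 minus the size of their intersection divided by the size of their union. A minimal sketch in Python, using toy word sets where the book uses shingles:

def jaccard_distance(a, b):
    # Jaccard distance: 1 - |A intersection B| / |A union B|
    if not a and not b:
        return 0.0  # two empty sets are conventionally identical
    return 1.0 - len(a & b) / len(a | b)

doc1 = {"topic", "maps", "index", "subject"}
doc2 = {"topic", "maps", "search", "engine"}
print(jaccard_distance(doc1, doc2))  # 2 shared terms of 6 total -> 0.666...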

Is document indexing the wrong level for indexing?

Take a traditional research paper as an example.

You would give me low marks if I handed in a paper with the following as one of my footnotes:

# Principia Mathematica, Volume 1

But that is a perfectly acceptable result for a search engine. I am pointed to an entire document as relevant to my search.

True enough but hardly very helpful.

Search engines can take me to a document but that still leaves all the hard work to me.

Not that I mind the hard work but that hard work is done over and over again, as each user encounters the document.

Seems terribly inefficient to have the same work done each time the document is returned.

Say, for example, that I am searching for the proof that 1 + 1 = 2. I should be able to create a representative for that subject that points every searcher to the same location, as opposed to each of them digging out that bit of information for themselves.

I have heard that bit of information assigned various locations in Principia Mathematica. I am acquiring a reprint so I can verify the location for myself, and will post it here.

Topic maps help because they are about subject indexing which I take to be different from document indexing.

A document index only tells you that somewhere in a document, one or more terms relevant to your search may be found. Not terribly helpful.

A subject index, on the other hand, particularly one made using a topic map, not only isolates the location of a subject but can also supply additional information about it, such as related subjects and where to find them.

January 10, 2011

Engineering basic algorithms of an in-memory text search engine

Filed under: Data Structures,Indexing,Search Engines — Patrick Durusau @ 4:37 pm

Engineering basic algorithms of an in-memory text search engine
Authors: Frederik Transier, Peter Sanders
Keywords: Inverted index, in-memory search engine, randomization

Abstract:

Inverted index data structures are the key to fast text search engines. We first investigate one of the predominant operations on inverted indexes, which asks for intersecting two sorted lists of document IDs of different lengths. We explore compression and performance of different inverted list data structures. In particular, we present Lookup, a new data structure that allows intersection in expected time linear in the smaller list.

Based on this result, we present the algorithmic core of a full-text database that allows fast Boolean queries, phrase queries, and document reporting using less space than the input text. The system uses a carefully choreographed combination of classical data compression techniques and inverted-index-based search data structures. Our experiments show that inverted indexes are preferable over purely suffix-array-based techniques for in-memory (English) text search engines.

A similar system is now running in practice in each core of the distributed database engine TREX of SAP.
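The headline claim is worth unpacking: intersection in expected time linear in the smaller list. Their Lookup structure is bucket-based and compressed; as a rough stand-in for the idea only, precompute a hash lookup per inverted list at indexing time, then probe it with each element of the smaller list:

class InvertedList:
    # Sorted doc IDs plus a hash lookup built once, at indexing time.
    # A toy stand-in for the paper's Lookup structure, not the real thing.
    def __init__(self, doc_ids):
        self.ids = sorted(doc_ids)
        self.members = set(self.ids)

def intersect(a, b):
    small, large = (a, b) if len(a.ids) <= len(b.ids) else (b, a)
    # One expected-O(1) probe per element of the smaller list.
    return [d for d in small.ids if d in large.members]

print(intersect(InvertedList([2, 5, 9]), InvertedList(range(0, 100, 3))))  # [9]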

An interesting comparison of inverted indexes with suffix-arrays.

I am troubled by the "reconstruct the input" aspects of the paper.

While it is understandable, and in some cases more efficient, for data to be held in a localized data structure, my question is: what do we do when data exceeds local storage capacity?

Think about the data held by Lexis/Nexis for example. Where would we put it while creating a custom data structure for its access?

There are data sets, important data sets, that have to be accessed in place.

And those data sets need to be addressed using topic maps.

*****
You may recall from the TAO paper by Steve Pepper the illustration of topics, associations and occurrences floating above a data set.

While topic map formats have been useful in many ways, they have distracted from the vision of topic maps as an information overlay as opposed to yet-another-format.

Formats are just that, formats. Pick one.

December 20, 2010

When OCR Goes Bad: Google’s Ngram Viewer & The F-Word – Post

Filed under: Humor,Indexing,Search Engines — Patrick Durusau @ 7:24 pm

When OCR Goes Bad: Google’s Ngram Viewer & The F-Word

Courtesy of Danny Sullivan at http://searchengineland.com, a highly amusing post on Google's Ngram Viewer.

Danny’s post only covers changing spelling and character rendering but serves to illustrate that the broader the time period covered, the greater the care needed to have results that make any sense at all.

Quite the post for the holidays!

Building blocks of a scalable webcrawler

Filed under: Indexing,Search Engines,Webcrawler — Patrick Durusau @ 4:41 am

Building blocks of a scalable webcrawler

From Marc Seeger’s post about his thesis:

This thesis documents my experiences trying to handle over 100 million sets of data while keeping them searchable. All of that happens while collecting and analyzing about 100 new domains per second. It covers topics from the different Ruby VMs (JRuby, Rubinius, YARV, MRI) to different storage backends (Riak, Cassandra, MongoDB, Redis, CouchDB, Tokyo Cabinet, MySQL, Postgres, …) and the data structures that they use in the background.

Questions:

  1. What components would need to be added to make this a semantic crawling project? (3-5 pages, citations)
  2. What scalability issues would semantic crawling introduce? (3-5 pages, citations)
  3. Design a configurable, scalable, semantic crawler. (Project)

December 11, 2010

Sensei

Filed under: Indexing,Lucene,NoSQL — Patrick Durusau @ 3:35 pm

Sensei

From the website:

Sensei is a distributed database that is designed to handle the following type of query:


SELECT f1,f2…fn FROM members
WHERE c1 AND c2 AND c3.. GROUP BY fx,fy,fz…
ORDER BY fa,fb…
LIMIT offset,count

Sensei relies on Zoie, and hence on Lucene, for indexing.

Another comparison for the development of TMQL, which of course will need to address semantic sameness.

December 9, 2010

Mining of Massive Datasets – eBook

Mining of Massive Datasets

Jeff Dalton, of Jeff’s Search Engine Caffè, reports a new data mining book by Anand Rajaraman and Jeffrey D. Ullman (yes, that Jeffrey D. Ullman, of “dragon book” fame).

A free eBook no less.

Read Jeff’s post on your way to get a copy.

Look for more comments as I read through it.

Has anyone written a comparison of the recent search engine titles? Just curious.


Update: A new version is out in hard copy and the e-book remains available. See: Mining Massive Data Sets – Update

December 8, 2010

Barriers to Entry in Search Getting Smaller – Post

Filed under: Indexing,Interface Research/Design,Search Engines,Search Interface,Searching — Patrick Durusau @ 9:49 am

Barriers to Entry in Search Getting Smaller

Jeff Dalton, Jeff’s Search Engine Caffè, makes a good argument that the barriers to entering the search market are getting smaller.

Jeff observes that blekko can succeed with a small number of servers only because its search demand is low.

True, but how many intra-company or litigation search engines are going to have web-sized user demands?

Start-ups need not try to match Google in its own space, but can carve out interesting and economically rewarding niches of their own.

Particularly if those niches involve mapping semantically diverse resources into useful search results for their users.

For example, biomedical researchers probably have little interest in catalog entries that happen to match gene names, or in any of the other common mis-matches offered by search-the-entire-web services.

In some ways, search-the-entire-web services have created their own problem and then attempted to solve it.

My research interests are in information retrieval, broadly defined, so a search engine limited to library schools, CS programs (their faculty and students), the usual suspects for CS collections, and library/CS/engineering organizations, with semantic mapping, would suit me just fine.

Note that the semantic mis-match problem persists even with a narrowed range of resources, but the benefit of each mapping is incrementally greater.

Questions:

  1. What resources are relevant to your research interests? (3-5 pages, web or other citations)
  2. Create a Google account to create your own custom search engine and populate it with your resources.
  3. Develop and execute 20 queries against your search engine and Google, Bing and one other search engine of your choice. Evaluate and report the results of those queries.
  4. Would semantic mapping such as we have discussed for topic maps be more or less helpful with your custom search engine versus the others you tried? (3-5 pages, no citations)

December 4, 2010

Zoie: Real-time search indexing

Filed under: Full-Text Search,Indexing,Lucene,Search Engines,Software — Patrick Durusau @ 10:04 am

Zoie: Real-time search indexing

Somehow appropriate that following the lead on Kafka would lead me to Zoie (and other goodies to be reported).

From the website:

Zoie is a real-time search and indexing system built on Apache Lucene.

Donated by LinkedIn.com on July 19, 2008, Zoie has been deployed in a real-time large-scale consumer website: LinkedIn.com, handling millions of searches as well as hundreds of thousands of updates daily.

News: Zoie 2.0.0 is released … – Compatible with Lucene 2.9.x.

In a real-time search/indexing system, a document is made available as soon as it is added to the index. This functionality is especially important to time-sensitive information such as news, job openings, tweets etc.
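That first sentence is the whole contract. A toy illustration in Python (Zoie itself is a Java/Lucene system, and nothing below resembles its actual architecture): "real-time" means there is no commit or refresh step between adding a document and finding it.

class ToyRealtimeIndex:
    # Toy in-memory inverted index: documents are searchable the
    # moment add() returns.
    def __init__(self):
        self.postings = {}  # term -> set of doc ids

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, term):
        return self.postings.get(term.lower(), set())

idx = ToyRealtimeIndex()
idx.add(1, "job opening for a search engineer")
assert 1 in idx.search("search")  # visible immediately, no refresh step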

Design Goals:

  • Additions of documents must be made available to searchers immediately
  • Indexing must not affect search performance
  • Additions of documents must not fragment the index (which hurts search performance)
  • Deletes and/or updates of documents must not affect search performance.

In topic map terms:

  • Additions to a topic map must be made available to searchers immediately
  • Indexing must not affect search performance
  • Additions to a topic map must not fragment the index (which hurts search performance)
  • Deletes and/or updates of a topic map must not affect search performance.

I would say that #’s 3 and 4 are research questions at this point.

Additions, updates and deletions in a topic map may have unforeseen (unforeseeable?) consequences.

Such as causing:

  • merging to occur
  • merging to be undone
  • roles to be played
  • roles to not be played
  • association to be valid
  • association to be invalid

to name only a few.

It may be possible to formally prove the impact that certain events will have but I am not aware of any definitive analysis on the subject.

Suggestions?

December 3, 2010

Dynamic Indexes?

I was writing the post about the New York Times graphics presentation when it occurred to me how close we are to dynamic indexes.

After all, gaming consoles now pack enough computing power to be export restricted.

What we now consider to be “runs,” static indexes and the like are computational artifacts.

They follow how we created indexes when they were done by hand.

What happens when the properties of what is being indexed, its identifications and merging rules can change on the fly and re-present itself to the user for further manipulation?

I don’t think the fundamental issues of index construction get any easier with dynamic indexes but how we answer them will determine how quickly we can make effective use of such indexes.

Whether crossing the line first to dynamic indexes will be a competitive advantage, only time will tell.

I would like for some VC to be interested in finding out.

Caveat to VCs. If someone pitches this as making indexes more quickly, that isn’t the point. “Quick” and “dynamic” aren’t the same thing. Related but different. Keep both hands on your wallet.

November 25, 2010

Sig.ma – Live views on the Web of Data

Filed under: Indexing,Information Retrieval,Lucene,Mapping,RDF,Search Engines,Semantic Web — Patrick Durusau @ 10:27 am

Sig.ma – Live views on the Web of Data

From the website:

In Sig.ma, elements such as large scale semantic web indexing, logic reasoning, data aggregation heuristics, pragmatic ontology alignments and, last but not least, user interaction and refinement, all play together to provide entity descriptions which become live, embeddable data mash ups.

Read one of the various versions of the article on Sig.ma for the technical details.

From the Web Technologies article cited on the homepage:

Sig.ma revolves around the creation of Entity Profiles. An entity profile – which in the Sig.ma dataflow is represented by the “data cache” storage (Fig. 3) – is a summary of an entity that is presented to the user in a visual interface, or which can be returned by the API as a rich JSON object or a RDF document. Entity profiles usually include information that is aggregated from more than one source. The basic structure of an entity profile is a set of key-value pairs that describe the entity. Entity profiles often refer to other entities, for example the profile of a person might refer to their publications.
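To make "a set of key-value pairs that describe the entity" concrete, here is a hypothetical entity profile as the API might return it; the field names and URIs are illustrative, not Sig.ma's actual schema:

entity_profile = {
    "entity": "http://example.org/person/jane-doe",
    "sources": [  # aggregated from more than one source
        "http://example.org/homepage.rdf",
        "http://example.org/dblp-export.rdf",
    ],
    "properties": {
        "name": ["Jane Doe"],
        "affiliation": ["Example University"],
        # a value may refer to another entity, e.g. a publication:
        "publication": ["http://example.org/paper/sigma-live-views"],
    },
}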

No, this isn’t an implementation of the TMRM.

This is an implementation of one way to view entities for a particular type of data. A very exciting one but still limited to a particular data set.

This is a big step forward.

For example, it isn’t hard to imagine entity profiles against particular websites or data sets. Entity profiles that are maintained and leased for use with search engines like Sig.ma.

Or going a bit further and declaring a basis for identification of subjects, such as the existence of properties a…n in an RDF graph.

Questions:

  1. Spend a couple of hours with Sig.ma researching library related questions. (Discussion)
  2. What did you like, dislike or find surprising about Sig.ma? (3-5 pages, no citations)
  3. Entity profiles for library science (Class project)

Sig.ma: Live Views on the web of data – bibliography issues

I normally start with a DOI here so you can see the article in question.

Not here.

Here’s why:

Sig.ma: Live views on the Web of Data, Journal of Web Semantics (10 pages)

Sig.ma: Live Views on the Web of Data, WWW ’10 Proceedings (demo, 4 pages)

Sig.ma: Live Views on the Web of Data (8 pages) http://richard.cyganiak.de/2008/papers/sigma-semwebchallenge2009.pdf

Sig.ma: Live Views on the Web of Data (4 pages) http://richard.cyganiak.de/2008/papers/sigma-demo-www2010.pdf

Sig.ma: Live Views on the Web of Data (25 pages) http://fooshed.net/paper/JWS2010.pdf

Before saying anything ugly, ;-), this is some of the most exciting research I have seen in a long time. I will cover that part of it in a following post. But, to the matter at hand, bibliographic control.

Five (5) different articles, two of them published in recognized journals, all with the same name? (The demo articles are the same but have different headers/footers and page numbers, and so would likely be indexed as different articles.)

I will be able to resolve any confusion by obtaining the article in question.

But that isn’t an excuse.

I, along with everyone else interested in this research, will waste a small part of our time resolving the confusion. Confusion that could have been avoided for everyone.

Not unlike everyone who does the same search having to wade through the same Google glut.

With no way to pass on what we have resolved, for the benefit of others.

Questions:

  1. Help these authors out. How would you suggest they avoid this in the future? Use of the name is important. (3-5 pages, no citations)
  2. Help the library out. How will you deal with multiple papers with the same title, authors, pub year? (this isn’t uncommon) (3-5 pages, citations optional)
  3. How would you use topic maps to resolve this issue? (3-5 pages, no citations)

A Node Indexing Scheme for Web Entity Retrieval

Filed under: Entity Resolution,Full-Text Search,Indexing,Lucene,RDF,Topic Maps — Patrick Durusau @ 6:15 am

A Node Indexing Scheme for Web Entity Retrieval
Authors: Renaud Delbru, Nickolai Toupikov, Michele Catasta, Giovanni Tummarello
Keywords: entity, entity search, full-text search, semi-structured queries, top-k query, node indexing, incremental index updates, entity retrieval system, RDF, RDFa, Microformats

Abstract:

Now motivated also by the partial support of major search engines, hundreds of millions of documents are being published on the web embedding semi-structured data in RDF, RDFa and Microformats. This scenario calls for novel information search systems which provide effective means of retrieving relevant semi-structured information. In this paper, we present an “entity retrieval system” designed to provide entity search capabilities over datasets as large as the entire Web of Data. Our system supports full-text search, semi-structural queries and top-k query results while exhibiting a concise index and efficient incremental updates. We advocate the use of a node indexing scheme and show that it offers a good compromise between query expressiveness, query processing time and update complexity in comparison to three other indexing techniques. We then demonstrate how such system can effectively answer queries over 10 billion triples on a single commodity machine.
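What a node indexing scheme buys you is easiest to see in miniature: each posting records not just which entity a term occurred in but which attribute, so full-text matching and structural constraints intersect in one pass. A toy sketch, not SIREn's actual labelling scheme:

from collections import defaultdict

index = defaultdict(list)  # term -> list of (entity, attribute) postings

def add(entity, attribute, value):
    for term in value.lower().split():
        index[term].append((entity, attribute))

add("e1", "name", "John Smith")
add("e1", "employer", "Acme Corp")
add("e2", "name", "John Doe")

def entities(term, attribute=None):
    return {e for (e, a) in index[term] if attribute in (None, a)}

# Semi-structural query: name contains "john" AND employer contains "acme"
print(entities("john", "name") & entities("acme", "employer"))  # {'e1'}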

Consider the requirements for this project:

  1. Support for the multiple formats which are used on the Web of Data;
  2. Support for searching an entity description given its characteristics (entity centric search);
  3. Support for context (provenance) of information: entity descriptions are given in the context of a website or a dataset;
  4. Support for semi-structural queries with full-text search, top-k query results, scalability over shard clusters of commodity machines, efficient caching strategy and incremental index maintenance.

(emphasis added)

SIREn { Semantic Information Retrieval Engine }

Definitely a package to download, install and start to evaluate. More comments forthcoming.

Questions (more for topic map researchers)

  1. To what extent can “entity description” = properties of topics, associations, occurrences?
  2. Can XTM, etc., be regarded as “microformats” for the purposes of SIREn?
  3. To what extent does SIREn meet or exceed query requirements for XTM/TMDM based topic maps?
  4. Reports on use of SIREn by topic mappers?

November 22, 2010

A Fun Application of Compact Data Structures to Indexing Geographic Data

Filed under: Geographic Information Retrieval,Indexing,Spatial Index — Patrick Durusau @ 6:07 am

A Fun Application of Compact Data Structures to Indexing Geographic Data
Author(s): Nieves R. Brisaboa, Miguel R. Luaces, Gonzalo Navarro, Diego Seco
Keywords: geographic data, MBR, range query, wavelet tree

Abstract:

The way memory hierarchy has evolved in recent decades has opened new challenges in the development of indexing structures in general and spatial access methods in particular. In this paper we propose an original approach to represent geographic data based on compact data structures used in other fields such as text or image compression. A wavelet tree-based structure allows us to represent minimum bounding rectangles solving geographic range queries in logarithmic time. A comparison with classical spatial indexes, such as the R-tree, shows that our structure can be considered as a fun, yet seriously competitive, alternative to these classical approaches.
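For readers new to the vocabulary: an MBR (minimum bounding rectangle) is the smallest axis-aligned rectangle enclosing an object, and a range query asks which MBRs intersect a query window. The linear-scan baseline that the paper's wavelet-tree structure improves on (to logarithmic time) is only a few lines:

from typing import NamedTuple

class MBR(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float

def intersects(a, b):
    # Axis-aligned rectangles intersect iff they overlap on both axes.
    return (a.xmin <= b.xmax and b.xmin <= a.xmax and
            a.ymin <= b.ymax and b.ymin <= a.ymax)

def range_query(mbrs, window):
    # O(n) scan; the paper answers the same query in logarithmic time.
    return [m for m in mbrs if intersects(m, window)]

data = [MBR(0, 0, 2, 2), MBR(5, 5, 7, 9), MBR(10, 10, 12, 12)]
print(range_query(data, MBR(1, 1, 6, 6)))  # first two intersect the window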

I must confess that after reading this article more than once, I still puzzle over: “Our experiments, featuring GIS-like scenarios, show that our index is a relevant and funnier alternative to classical spatial indexes, such as the R-tree ….”

I admit to being drawn to esoteric and even odd solutions but I would not describe most of them as being “funnier” than an R-tree.

For all that, the article will be useful to anyone developing topic maps for use with spatial indexes and is a good introduction to wavelet trees.

Questions:

  1. Create an annotated bibliography of spatial indexes. (date limit, last five (5) years)
  2. Create an annotated bibliography of spatial data resources. (date limit, last five (5) years)
  3. How would you use MBRs (Minimum Bounding Rectangles) for merging purposes in a topic map? (3-5 pages, no citations)

November 15, 2010

Towards Index-based Similarity Search for Protein Structure Databases

Filed under: Bioinformatics,Biomedical,Indexing,Similarity — Patrick Durusau @ 5:00 am

Towards Index-based Similarity Search for Protein Structure Databases
Authors: Orhan Çamoǧlu, Tamer Kahveci, Ambuj K. Singh
Keywords: Protein structures, feature vectors, indexing, dataset join

Abstract:

We propose two methods for finding similarities in protein structure databases. Our techniques extract feature vectors on triplets of SSEs (Secondary Structure Elements) of proteins. These feature vectors are then indexed using a multidimensional index structure. Our first technique considers the problem of finding proteins similar to a given query protein in a protein dataset. This technique quickly finds promising proteins using the index structure. These proteins are then aligned to the query protein using a popular pairwise alignment tool such as VAST. We also develop a novel statistical model to estimate the goodness of a match using the SSEs. Our second technique considers the problem of joining two protein datasets to find an all-to-all similarity. Experimental results show that our techniques improve the pruning time of VAST 3 to 3.5 times while keeping the sensitivity similar.
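The indexing pattern itself is generic: extract fixed-length feature vectors, put them in a multidimensional index, and answer similarity queries as nearest-neighbor searches, passing only the survivors to an expensive aligner. A sketch with a k-d tree standing in for whatever index structure the authors used; the SSE-triplet feature extraction is the paper's real contribution and is faked here with random data:

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
db_vectors = rng.random((1000, 6))   # pretend: 1000 SSE triplets, 6-D features
tree = cKDTree(db_vectors)           # the multidimensional index

query = rng.random(6)
dists, idxs = tree.query(query, k=5) # 5 most similar triplets
print(idxs)  # candidates to hand to a pairwise aligner such as VAST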

Unless you want to do a project on bioinformatics indexing and topic maps, this paper probably isn’t of much interest.

I include it as an illustration of fashioning a domain-specific index and, for those who are interested, of what subjects and their definitions lurk therein.

Questions (for those who want to pursue both topic maps and bioinformatics):

  1. Isolate all the “we chose” aspects of the paper. What results would have been different with other choices? The “we obtained best results…” is unsatisfying. In what sense “best results?”
  2. What aspects of this process would be amenable to use of a topic map?
  3. What about the results (if anything) would have to be different to make these results meaningful in a topic map to be merged with results by other researchers?

November 10, 2010

MongoDB Indexes and Indexing – Post

Filed under: Indexing,MongoDB,NoSQL — Patrick Durusau @ 2:25 pm

MongoDB Indexes and Indexing and MongoDB Indexing: An Optimization Primer from Alex Popescu provide great coverage of indexing and indexing issues.

Funny how topic maps started with indexing, revolve around the semantic issues of indexes/indexing and have to rely on indexing for reasonable performance.

Will have to see what other indexing resources I can dig up.

Enjoy the videos!

November 7, 2010

Orthogonal Nonnegative Matrix Tri-factorization for Semi-supervised Document Co-clustering

Filed under: Clustering,Indexing — Patrick Durusau @ 8:26 pm

Orthogonal Nonnegative Matrix Tri-factorization for Semi-supervised Document Co-clustering
Authors: Huifang Ma, Weizhong Zhao, Qing Tan and Zhongzhi Shi
Keywords: Semi-supervised Clustering, Pairwise Constraints, Word-Level Constraints, Nonnegative Matrix tri-Factorization

Abstract:

Semi-supervised clustering is often viewed as using labeled data to aid the clustering process. However, existing algorithms fail to consider dual constraints between data points (e.g. documents) and features (e.g. words). To address this problem, in this paper, we propose a novel semi-supervised document co-clustering model OSS-NMF via orthogonal nonnegative matrix tri-factorization. Our model incorporates prior knowledge both on document and word side to aid the new word-category and document-cluster matrices construction. Besides, we prove the correctness and convergence of our model to demonstrate its mathematical rigor. Our experimental evaluations show that the proposed document clustering model presents remarkable performance improvements with certain constraints.
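For orientation, the two-factor ancestor of this model is easy to state: factor a nonnegative term-document matrix X into nonnegative W and H by the Lee-Seung multiplicative updates. The paper's OSS-NMF is a tri-factorization (three factors, with orthogonality plus the document- and word-side constraints described above); the sketch below is only the plain version:

import numpy as np

def nmf(X, k, iters=200, eps=1e-9):
    # Plain two-factor NMF via Lee-Seung multiplicative updates.
    m, n = X.shape
    rng = np.random.default_rng(0)
    W, H = rng.random((m, k)), rng.random((k, n))
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

X = np.random.default_rng(1).random((20, 30))  # toy term-document matrix
W, H = nmf(X, k=4)
print(np.linalg.norm(X - W @ H))  # reconstruction error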

Questions:

  1. Relies on user input, but is the user input transferable? Or is it document/collection specific? (3-5 pages, no citations)
  2. Is document level retrieval too coarse? (discussion)
  3. Subset selection, understandable for testing/development. Doesn’t it seem odd no tests were done against entire collections? (discussion)
  4. What of the exclusion of words that occur less than 3 times? Aren’t infrequent terms more likely to be significant? (3-5 pages, no citations)

October 24, 2010

Indexing Nature: Carl Linnaeus (1707-1778) and His Fact-Gathering Strategies

Filed under: Indexing,Information Retrieval,Interface Research/Design,Ontology — Patrick Durusau @ 9:52 am

Indexing Nature: Carl Linnaeus (1707-1778) and His Fact-Gathering Strategies Authors: Staffan Müller-Wille & Sara Scharf (Working Papers on The Nature of Evidence: How Well Do ‘Facts’ Travel? No. 36/08)

Interesting article that traces the strategies used by Linnaeus when confronted with the “first bio-information crisis” as the authors term it.

Questions:

  1. In what ways do ontologies resemble the bound library catalogs of the early 18th century?
  2. Do computers make ontologies any less like those bound library catalogs?
  3. Short report (3-5 pages, with citations) on transition of libraries from bound catalogs to index cards.
  4. Linnaeus’s colleagues weren’t idle. What other strategies, successful or otherwise, were in use? (project)

October 22, 2010

Rethinking Library Linking: Breathing New Life into OpenURL

Filed under: Cataloging,Indexing,OpenURL,Subject Identity,Topic Maps — Patrick Durusau @ 7:26 am

Rethinking Library Linking: Breathing New Life into OpenURL Authors: Cindi Trainor and Jason Price

Abstract:

OpenURL was devised to solve the “appropriate copy problem.” As online content proliferated, it became possible for libraries to obtain the same content from multiple locales: directly from publishers and subscription agents; indirectly through licensing citation databases that contain full text; and, increasingly, from free online sources. Before the advent of OpenURL, the only way to know whether a journal was held by the library was to search multiple resources. An OpenURL link resolver accepts links from library citation databases (sources) and returns to the user a menu of choices (targets) that may include links to full text, the library catalog, and other related services (figure 1). Key to understanding OpenURL is the concept of “context sensitive” linking: links to the same item will be different for users of different libraries, and are dependent on the library’s collections. This issue of Library Technology Reports provides practicing librarians with real-world examples and strategies for improving resolver usability and functionality in their own institutions.
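For readers who have not seen one: an OpenURL is a context object describing the item, serialized onto a resolver's base URL, and the same context yields different targets at different libraries. A sketch, with a hypothetical resolver address; the key names follow the Z39.88-2004 KEV convention:

from urllib.parse import urlencode

base = "https://resolver.example.edu/openurl"  # hypothetical; each library runs its own
ctx = {
    "url_ver": "Z39.88-2004",
    "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
    "rft.genre": "article",
    "rft.atitle": "Rethinking Library Linking: Breathing New Life into OpenURL",
    "rft.jtitle": "Library Technology Reports",
}
print(base + "?" + urlencode(ctx))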

Resources:

OpenURL (ANSI/NISO Z39.88-2004)

openURL@oclc.org archives

Questions:

  1. OCLC says of OpenURL:

    Remember the card catalog? Everything in a library was represented in the card catalog with one or more cards carrying bibliographic information. OpenURL is the internet equivalent of those index cards.

  2. True? (3-5 pages, no citations), or
  3. False? (3-5 pages, no citations)

Neo4j 1.2 Milestone 2 – Release

Filed under: Graphs,Indexing,Neo4j,Software — Patrick Durusau @ 6:02 am

Neo4j 1.2 Milestone 2 has been released!

Relevant to topic maps in general and TMQL in particular are the improvements to indexing and querying capabilities.

Neo4j uses Lucene as a back-end.

Would Neo4j be a good way to prototype proposals for TMQL?

It could be used to evaluate concerns about implementation difficulties.

And quite possibly to encourage the non-invention of new query syntaxes.

A side effect would be demonstrating that Neo4j could be used as a topic map platform.

October 19, 2010

Enhancing Graph Database Indexing by Suffix Tree Structure

Filed under: Graphs,Indexing,Suffix Tree — Patrick Durusau @ 8:02 am

Enhancing Graph Database Indexing by Suffix Tree Structure
Authors: Vincenzo Bonnici, Alfredo Ferro, Rosalba Giugno, Alfredo Pulvirenti, Dennis Shasha
Keywords: subgraph isomorphism, graph database search, indexing, suffix tree, molecular database

Abstract:

Biomedical and chemical databases are large and rapidly growing in size. Graphs naturally model such kinds of data. To fully exploit the wealth of information in these graph databases, scientists require systems that search for all occurrences of a query graph. To deal efficiently with graph searching, advanced methods for indexing, representation and matching of graphs have been proposed.

This paper presents GraphGrepSX. The system implements efficient graph searching algorithms together with an advanced filtering technique. GraphGrepSX is compared with SING, GraphFind, CTree and GCoding. Experiments show that GraphGrepSX outperforms the compared systems on a very large collection of molecular data. In particular, it reduces the size and the time for the construction of large database index and outperforms the most popular systems. (hyperlinks added.)
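The family GraphGrepSX belongs to shares one pattern: index small features of each graph (here, labelled paths, organized in a suffix tree), filter out database graphs that lack the query's features, and run the expensive subgraph-isomorphism check only on the survivors. A toy of the filter step, using a plain feature multiset instead of their suffix tree:

from collections import Counter

def path_features(adj, labels, max_nodes=3):
    # Multiset of labelled paths with up to max_nodes nodes.
    feats = Counter()
    def dfs(node, path, seen):
        feats[path] += 1
        if len(path) == max_nodes:
            return
        for nxt in adj[node]:
            if nxt not in seen:
                dfs(nxt, path + (labels[nxt],), seen | {nxt})
    for n in adj:
        dfs(n, (labels[n],), {n})
    return feats

def candidates(db_feats, query_feats):
    # Filter: a graph can contain the query only if it has at least as
    # many of every feature; survivors still need the verify step.
    return [g for g, f in db_feats.items()
            if all(f[k] >= v for k, v in query_feats.items())]

g1 = ({0: [1, 2], 1: [0, 2], 2: [0, 1]}, {0: "C", 1: "C", 2: "O"})  # C-C-O ring
g2 = ({0: [1], 1: [0]}, {0: "C", 1: "N"})                           # C-N edge
query = ({0: [1], 1: [0]}, {0: "C", 1: "O"})                        # C-O edge

db_feats = {"g1": path_features(*g1), "g2": path_features(*g2)}
print(candidates(db_feats, path_features(*query)))  # ['g1']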

Be aware that bioinformatics is at the cutting edge of search/retrieval technology. Pick up any proceedings volume for the last year to see what I mean.

A credible topic map is going to incorporate one or more of the techniques you will find there, plus semantic mapping based on those techniques.

Saying Topic-Association-Occurrence is only going to get you past the first two minutes of your presentation. You will need something familiar (to your audience) and domain specific to fill the rest of your time.

BTW, see the audience posting earlier today. Don’t guess what will interest your audience. Ask someone in that community what interests them.

October 7, 2010

WebGraph

Filed under: Graphs,Indexing,Navigation,Searching,Software — Patrick Durusau @ 5:56 am

WebGraph was mentioned in the article Fast and Compact Web Graph Representations.

Great work on the web graph, with software and data sets for exploring!

(Warning: If you like this sort of thing you will lose hours if not days here.)

Questions:

  1. Is the Web Graph different from a graph of a topic map?
  2. How would you go about researching question #1?
  3. Would your answer to #1 vary depending on the topic map you chose?
  4. Would the size of a topic map affect your answer?
  5. How would you test your answer to #4?
  6. What other aspects of graphs would you want to explore on topic maps?

October 6, 2010

Recognizing an Interchangeable Identifier

Filed under: Indexing,Semantics,Subject Identifiers,Subject Identity — Patrick Durusau @ 7:13 am

Subjects & Identifiers shows why we need interchangeable identifiers.

Q: How would you recognize an interchangeable identifier?

A: Oh, yeah, that’s right. Anything we can talk about has an identifier, so how to recognize an interchangeable identifier?

If two people agree on column headers for a database table, they have interchangeable identifiers for the columns, at least between the two of them.

There are two requirements for interchangeable identifiers:

  1. Identification as an identifier.
  2. Notice of the identifier.

Any token can be an identifier under some circumstances so identifiers must be identified for interchange.

Notice of an identifier is usually a matter of being part of a profession or discipline. Some term is an identifier because it was taught to you as one.

That works for local interchange, but public interchange requires publicly documented identifiers.

That’s it. Identify identifiers and document the identifiers publicly and you will have public interchangeable identifiers.
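In code terms the recipe really is that small. Two parties map their private tokens to publicly documented identifiers (the URIs below are hypothetical) and agreement becomes mechanically checkable:

mine = {"dob": "http://psi.example.org/date-of-birth",
        "fname": "http://psi.example.org/given-name"}
theirs = {"birth_date": "http://psi.example.org/date-of-birth",
          "first": "http://psi.example.org/given-name"}

# Column pairs that identify the same subject, found by identifier equality.
shared = {(a, b) for a, ia in mine.items()
                 for b, ib in theirs.items() if ia == ib}
print(shared)  # the two matching column pairs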

It can’t be that simple? Well, truthfully, it’s not.

More on public interchangeable identifiers forthcoming.

September 28, 2010

International Workshop on Similarity Search and Applications (SISAP)

Filed under: Indexing,Information Retrieval,Search Engines,Searching,Software — Patrick Durusau @ 4:47 pm

International Workshop on Similarity Search and Applications (SISAP)

Website:

The International Workshop on Similarity Search and Applications (SISAP) is a conference devoted to similarity searching, with emphasis on metric space searching. It aims to fill in the gap left by the various scientific venues devoted to similarity searching in spaces with coordinates, by providing a common forum for theoreticians and practitioners around the problem of similarity searching in general spaces (metric and non-metric) or using distance-based (as opposed to coordinate-based) techniques in general.

SISAP aims to become an ideal forum to exchange real-world, challenging and exciting examples of applications, new indexing techniques, common testbeds and benchmarks, source code, and up-to-date literature through a Web page serving the similarity searching community. Authors are expected to use the testbeds and code from the SISAP Web site for comparing new applications, databases, indexes and algorithms.

Proceedings from prior years, source code, sample data, a real gem of a site.

September 23, 2010

HUGO Gene Nomenclature Committee

Filed under: Bioinformatics,Biomedical,Data Mining,Entity Extraction,Indexing,Software — Patrick Durusau @ 8:32 am

HUGO Gene Nomenclature Committee, a committee assigning unique names to genes.

Become familiar with the HUGO site, then read: The success (or not) of HUGO nomenclature (Genome Biology, 2006).

Now read: Moara: a Java library for extracting and normalizing gene and protein mentions (BMC Bioinformatics 2010)

Q: How would you apply the techniques in the Moara article to build a topic map? Would you keep or discard normalization?

PS: Moara Project (software, etc.)

September 20, 2010

Catalogue & Index Blog

Filed under: Cataloging,Indexing — Patrick Durusau @ 2:40 pm

Catalogue & Index Blog.

Blog from the Chartered Institute of Library and Information Professionals (CILIP) Cataloging and Indexing Group.

News about cataloging, indexing and Cataloging and Indexing Group activities.

September 17, 2010

A Logical Account of Lying

Filed under: Classification,Indexing,Subject Identifiers — Patrick Durusau @ 2:46 pm

A Logical Account of Lying
Authors: Chiaki Sakama, Martin Caminada and Andreas Herzig
Keywords: lying, lies, argumentation systems, artificial intelligence, multiagent systems, intelligent agents.

Abstract:

This paper aims at providing a formal account of lying – a dishonest attitude of human beings. We first formulate lying under propositional modal logic and present basic properties for it. We then investigate why one engages in lying and how one reasons about lying. We distinguish between offensive and defensive lies, or deductive and abductive lies, based on intention behind the act. We also study two weak forms of dishonesty, bullshit and deception, and provide their logical features in contrast to lying. We finally argue dishonesty postulates that agents should try to satisfy for both moral and self-interested reasons. (emphasis in original)

Be the first to have your topic map distinguish between:

  • offensive lies
  • defensive lies
  • deductive lies
  • abductive lies (Someone tweet John Sowa please.)
  • deception
  • bullshit

Subj3ct.com has an identifier for the subject “bullshit,” http://dbpedia.org/resource/Bullshit, but it does not reflect this latest analysis.

September 10, 2010

LNCS Volume 6263: Data Warehousing and Knowledge Discovery

Filed under: Database,Graphs,Heterogeneous Data,Indexing,Merging — Patrick Durusau @ 8:20 pm

LNCS Volume 6263: Data Warehousing and Knowledge Discovery edited by Torben Bach Pedersen, Mukesh K. Mohania, A Min Tjoa, has a number of articles of interest to the topic map community.

Here are five (5) that caught my eye:

August 15, 2010

Index Merging

Filed under: Database,Indexing — Patrick Durusau @ 4:47 pm

Index Merging by Surajit Chaudhuri and Vivek Narasayya caught my eye for obvious reasons!

I must admit to some disappointment when I found it was collecting index columns and placing them together in a single table. I am sure that technique is quite valuable for data warehouses but isn’t what I think of when I use the phrase, “merging indexes.”

The article is well written and was worth reading. As I started to put it to one side, it occurred to me that perhaps I was too hasty in deciding it wasn’t relevant to topic maps.

What if I had a data warehouse with a “merged” index where collectively the columns supported queries based on subject identity? Or if I wanted to use a set of indexes from other applications (say Lucene for example), to query against for similar purposes?
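A toy of what that first question might look like: pick the columns that jointly establish subject identity, key a merged index on them, and one probe returns every row about the same subject, name variants included. Schema and values are hypothetical:

rows = [
    {"ssn": "123", "dob": "1970-01-01", "name": "J. Doe", "city": "Berlin"},
    {"ssn": "123", "dob": "1970-01-01", "name": "Jane Doe", "city": "Bonn"},
]
identity_cols = ("ssn", "dob")  # columns that together identify a subject

merged = {}
for r in rows:
    merged.setdefault(tuple(r[c] for c in identity_cols), []).append(r)

print(merged[("123", "1970-01-01")])  # both rows, despite the name variants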

Whether you are into .Net or not, you should add this one to your reading list.

August 13, 2010

Prescriptive vs. Adaptive Information Retrieval?

Filed under: Concept Hierarchies,Indexing,Information Retrieval,Thesaurus — Patrick Durusau @ 8:47 pm

Gary W. Strong and M. Carl Drott contend in A Thesaurus for End-User Indexing and Retrieval, Information Processing & Management, Vol. 22, No. 6, pp. 487-492, 1986, that:

A low-cost, practical information retrieval system, if it were to be designed, would require a thesaurus, but one in which end-users would be able to browse research topics by means of an organization that is concept-based rather than term-based as is the typical thesaurus.

… (while elsewhere they write:)

It is our hypothesis that, when the thesaurus can be envisioned by users as a simple, yet meaningful, organization of concepts, the entire information system is much more likely to be useable in an efficient manner by novice users. (emphasis added)

It puzzles me that experts are building a system of concepts for novices to use. Do you suspect experts have different views of the domains in question than novices? And approach their search for information with different assumptions?

Any concept system designed by an expert is a prescriptive information retrieval system. It represents their view of the domain and not that of a novice. Or rather it represents how the expert thinks a novice should navigate the field.

While the expert’s view may be useful for some purposes, such as socializing a novice into a particular view of the domain, it may be more useful for novices to use a novice’s view of the domain. To build that we would need to turn to novices in a domain. Perhaps through the use of adaptive information retrieval, IR that adapts to its user, rather than the other way around.

Adaptive information retrieval systems, I like that, ones that grow to be more like their users and less like their builders with every use.

August 4, 2010

Prefix Hash Tree: An Indexing Data Structure over Distributed Hash Tables – Post

Filed under: Indexing — Patrick Durusau @ 6:57 pm


Prefix Hash Tree: An Indexing Data Structure over Distributed Hash Tables was posted by Alex Popescu over at myNoSQL.

Definitely one for the reading list, but check out the copy at CiteSeer: Prefix Hash Tree: An Indexing Data Structure over Distributed Hash Tables, which has the advantage of citations, etc.
