Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 22, 2011

National Archives Digitization Tools Now on GitHub

Filed under: Archives,Files,Video — Patrick Durusau @ 3:18 pm

National Archives Digitization Tools Now on GitHub

From the post:

As part of our open government initiatives, the National Archives has begun to share applications developed in-house on GitHub, a social coding platform. GitHub is a service used by software developers to share and collaborate on software development projects and many open source development projects.

Over the last year and a half, our Digitization Services Branch has developed a number of software applications to facilitate digitization workflows. These applications have significantly increased our productivity and improved the accuracy and completeness of our digitization work.

We shared our experiences with these applications with colleagues at other institutions such as the Library of Congress and the Smithsonian Institution, and they expressed interest in trying these applications within their own digitization workflows. We have made two digitization applications, “File Analyzer and Metadata Harvester” and “Video Frame Analyzer” available on GitHub, and they are now available for use by other institutions and the public.

I suspect many government departments (U.S. and otherwise) have similar digitization workflow efforts underway. Perhaps greater publicity about these efforts will cause other departments to step forward.

DQM-Vocabulary

Filed under: Semantic Web,Vocabularies — Patrick Durusau @ 3:17 pm

DQM-Vocabulary announced by Christian Fürber:

The DQM-Vocabulary supports data quality management activities in Semantic Web architectures. Its major strength is the ability to represent data requirements, i.e. prescribed (individual) directives or consensual agreements that define the content and/or structure that constitute high quality data instances and values, so that computers can interpret the requirements and take further actions. Among other things, the DQM-Vocabulary supports the following tasks:

  • Automated creation of data quality monitoring and assessment reports based on previously specified data requirements
  • Exchange of data quality information and data requirements on web-scale
  • Automated consistency checks between data requirements

The DQM-Vocabulary is available under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 license at http://purl.org/dqm-vocabulary/v1/dqm

A primer with examples on how to use the DQM-Vocabulary can be found at http://purl.org/dqm-vocabulary

A mailing list for issues and questions around the DQM-Vocabulary can be found at http://groups.google.com/group/dqm-vocabulary

Interesting work, but it seems pretty obvious that only commercial interests will have an incentive to put in the time and effort to use such a system.

Which reminds me, how is this different from the OASIS Universal Business Language (UBL) activity? UBL has already been adopted in a number of countries, particularly for government contracts. They have specified the semantics that businesses need to automate some contractual matters.

I suppose more broadly, where is the commercial demand for the DQM-Vocabulary?

Are there identifiable activities that lack data quality management now, for which DQM will be a solution? If so, which ones?

If other data quality management solutions are in place, what advantages over current systems are offered by DQM? Are those sufficient to justify changing present systems?

Cloudera Training Videos

Filed under: Hadoop,HBase,Hive,MapReduce,Pig — Patrick Durusau @ 3:17 pm

Cloudera Training Videos

Cloudera has added several training videos on Hadoop and parts of the Hadoop ecosystem.

You will find:

  • Introduction to HBase – Todd Lipcon
  • Thinking at Scale
  • Introduction to Apache Pig
  • Introduction to Apache MapReduce and HDFS
  • Introduction to Apache Hive
  • Apache Hadoop Ecosystem
  • Hadoop Training Virtual Machine
  • Hadoop Training: Programming with Hadoop
  • Hadoop Training: MapReduce Algorithms

No direct links to the videos because new resources/videos will appear more quickly at the Cloudera site than I will be updating this list.

Now you have something to watch this weekend (Oct. 22-23, 2011) other than reports on and of the World Series! Enjoy!

A history of the world in 100 seconds

Filed under: Data Mining,Geographic Data,Visualization — Patrick Durusau @ 3:17 pm

A history of the world in 100 seconds by Gareth Lloyd.

From the post:

Many Wikipedia articles are tagged with geographic coordinates. Many have references to historic events. Cross referencing these two subsets and plotting them year on year adds up to a dynamic visualization of Wikipedia’s view of world history.

The ‘spotlight’ is an overlay on the video that tries to keep about 90% of the datapoints within the bright area. It takes a moving average of all the latitudes and longitudes over the past 50 or so years and centres on the mean coordinate. I love the way it opens up, first resembling medieval maps of “The World” which included only Europe and some of Asia, then encompassing “The New World” and finally resembling a modern map.

This is based on the thing that me and Tom Martin built at Matt Patterson’s History Hackday. To make it, I built a python SAX Parser that sliced and diced an xml dump of all wikipedia articles (30Gb) and pulled out 424,000 articles with coordinates and 35,000 references to events. We managed to pair up 14,238 events with locations, and Tom wrote some Java to fiddle about with the coordinates and output frames. I’ve hacked around some more to add animation, because, you know, why not?
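If you want to try the same extraction yourself, here is a bare-bones sketch of the SAX approach (my own simplification, not Gareth's code; it only recognizes the plain decimal {{coord|lat|lon}} template and assumes a locally downloaded articles dump):

  import bz2
  import re
  import xml.sax

  COORD_RE = re.compile(r"\{\{[Cc]oord\|(-?[0-9.]+)\|(-?[0-9.]+)")

  class GeoPageHandler(xml.sax.ContentHandler):
      """Collects (title, lat, lon) for every page whose wikitext carries a simple {{coord}} template."""
      def __init__(self):
          super().__init__()
          self.tag, self.title, self.text, self.results = "", "", [], []
      def startElement(self, name, attrs):
          self.tag = name
          if name == "page":
              self.title, self.text = "", []
      def characters(self, content):
          if self.tag == "title":
              self.title += content
          elif self.tag == "text":
              self.text.append(content)
      def endElement(self, name):
          self.tag = ""
          if name == "page":
              match = COORD_RE.search("".join(self.text))
              if match:
                  self.results.append((self.title, float(match.group(1)), float(match.group(2))))

  handler = GeoPageHandler()
  with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb") as dump:
      xml.sax.parse(dump, handler)
  print(len(handler.results), "pages with coordinates")

Pairing those coordinates with event references is where the real work (and the 14,238 matched events) comes in.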

I wanted to point this post out separately for several reasons.

First, it is a good example of re-use of existing data in a new and/or interesting way, which saves you from spending time collecting the original data yourself.

Second, Gareth provides both the source code and data so you can verify his results for yourself or decide that some other visualization suits your fancy.

Third, you should read some of the comments about this work. That sort of thing is going to occur no matter what resource or visualization you make available. If you had a super-Wiki with 10 million articles in the top ten languages of the world, some wag would complain that X language wasn’t represented. Not that they would contribute to making it available, but they have the time to complain that you didn’t.

Processing every Wikipedia article

Filed under: Data Mining,Data Source — Patrick Durusau @ 3:17 pm

Processing every Wikipedia article by Gareth Lloyd.

From the post:

I thought it might be worth writing a quick follow up to the Wikipedia Visualization piece. Being able to parse and process all of Wikipedia’s articles in a reasonable amount of time opens up fantastic opportunities for data mining and analysis. What’s more, it’s easy once you know how.

An alternative method for accessing and parsing Wikipedia data. I probably need to do a separate follow-up post on the visualization piece itself.

Enjoy!

Edelweiss

Filed under: Interface Research/Design,Ontology,Semantic Web — Patrick Durusau @ 3:17 pm

Edelweiss

From the website:

The research team Edelweiss aims at offering models, methods and techniques for supporting knowledge management and collaboration in virtual communities interacting with information resources through the Web. This research will result in an ergonomic graph-based and ontology-based platform.

Activity Report 2010
Located at INRIA Sophia Antipolis-Méditerranée, Edelweiss was previously known as Acacia.
Edelweiss stands for…

  • Exchanges : communication, diffusion, knowledge capitalization, reuse, learning.
  • Documents : texts, multimedia, XML.
  • Extraction : semi-automated information extraction from documents.
  • Languages : graphs, semantic web languages, logics.
  • Webs : architectures, diffusion, implementation.
  • Ergonomics : user interfaces, scenarios.
  • Interactions : interaction design, protocols, collaboration.
  • Semantics : ontologies, semantic annotations, formalisms, reasoning.
  • Servers : distributed services, distributed data, applications.

Good lists of projects, software, literature, etc.

Anyone care to share longer acronyms in actual use by projects?

How to clone Wikipedia and index it with Solr

Filed under: Indexing,Solr — Patrick Durusau @ 3:17 pm

How to clone Wikipedia and index it with Solr

Looks like it is going to be a Wikipedia sorta day! 😉 Seriously, Wikipedia is increasing in importance with every new page or edit. Not to mention that the cited posts will give you experience with a variety of approaches and tools for dealing with Wikipedia as a data set.

From the post:

A major milestone for ZimZaz: I have (finally) successfully cloned Wikipedia and indexed it with Solr. It took about six weeks in calendar time and felt like a lot more. If I had made no mistakes and had to learn nothing, I could have done it in less than a business week. In the spirit of documenting my work and helping others, here are the key steps along the way.

Kudos to the author for documenting what went wrong and what went right!
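If you have never pushed documents into Solr, the core indexing step is only a few lines. A rough sketch (assuming a local Solr with the JSON update handler enabled and id/title/text fields in the schema; paths vary by Solr version and core name, and the post's actual setup differs in its details):

  import requests

  SOLR_UPDATE = "http://localhost:8983/solr/update/json"   # adjust path for your core/version

  docs = [
      {"id": "Topic_map", "title": "Topic map", "text": "A topic map is a standard for knowledge representation ..."},
      {"id": "Apache_Solr", "title": "Apache Solr", "text": "Solr is an open source enterprise search platform ..."},
  ]

  # Send a batch of documents and commit so they become searchable immediately.
  requests.post(SOLR_UPDATE, params={"commit": "true"}, json=docs).raise_for_status()

  # A quick sanity check against the select handler.
  hits = requests.get("http://localhost:8983/solr/select",
                      params={"q": "text:topic", "wt": "json"}).json()
  print(hits["response"]["numFound"])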

Java Wikipedia Library (JWPL)

Filed under: Data Mining,Java,Software — Patrick Durusau @ 3:16 pm

Java Wikipedia Library (JWPL)

From the post:

Lately, Wikipedia has been recognized as a promising lexical semantic resource. If Wikipedia is to be used for large-scale NLP tasks, efficient programmatic access to the knowledge therein is required.

JWPL (Java Wikipedia Library) is an open-source, Java-based application programming interface that allows access to all information contained in Wikipedia. The high-performance Wikipedia API provides structured access to information nuggets like redirects, categories, articles and link structure. It is described in our LREC 2008 paper.

JWPL contains a Mediawiki Markup parser that can be used to further analyze the contents of a Wikipedia page. The parser can also be used stand-alone with other texts using MediaWiki markup.

Further, JWPL contains the tool JWPLDataMachine that can be used to create JWPL dumps from the publicly available dumps at download.wikimedia.org.

Wikipedia is a resource of growing interest. This toolkit may prove useful in mining it for topic map purposes.

Introducing fise, the Open Source RESTful Semantic Engine

Filed under: Entity Extraction,Entity Resolution,Language,Semantics,Taxonomy — Patrick Durusau @ 3:16 pm

Introducing fise, the Open Source RESTful Semantic Engine

From the post:

fise is now known as the Stanbol Enhancer component of the Apache Stanbol incubating project.

As a member of the IKS european project Nuxeo contributes to the development of an Open Source software project named fise whose goal is to help bring new and trendy semantic features to CMS by giving developers a stack of reusable HTTP semantic services to build upon.

Presenting the software in Q/A form:

What is a Semantic Engine?

A semantic engine is a software component that extracts the meaning of an electronic document to organize it as partially structured knowledge and not just as a piece of unstructured text content.

Current semantic engines can typically:

  • categorize documents (is this document written in English, Spanish, Chinese? is this an article that should be filed under the  Business, Lifestyle, Technology categories? …);
  • suggest meaningful tags from a controlled taxonomy and assert their relative importance with respect to the text content of the document;
  • find related documents in the local database or on the web;
  • extract and recognize mentions of known entities such as famous people, organizations, places, books, movies, genes, … and link the document to their knowledge base entries (like a biography for a famous person);
  • detect as-yet-unknown entities of the same aforementioned types to enrich the knowledge base;
  • extract knowledge assertions that are present in the text to fill up a knowledge base along with a reference to trace the origin of the assertion. Examples of such assertions could be the fact that a company is buying another along with the amount of the transaction, the release date of a movie, the new club of a football player…

During the last couple of years, many such engines have been made available through web-based APIs such as Open Calais, Zemanta and Evri, just to name a few. However, to our knowledge there aren't many such engines distributed under an Open Source license to be used offline, on your private IT infrastructure with your sensitive data.

Impressive work that I found through a later post on using this software on Wikipedia. See Mining Wikipedia with Hadoop and Pig for Natural Language Processing.

Mining Wikipedia with Hadoop and Pig for Natural Language Processing

Filed under: Hadoop,Natural Language Processing,Pig — Patrick Durusau @ 3:16 pm

Mining Wikipedia with Hadoop and Pig for Natural Language Processing

One problem with after-the-fact assignment of semantics to text is that the volume of text involved (usually) is too great for manual annotation.

This post walks you through the alternative of using automated annotation based upon Wikipedia content.

From the post:

Instead of manually annotating text, one should try to benefit from an existing annotated and publicly available text corpus that deals with a wide range of topics, namely Wikipedia.

Our approach is rather simple: the text body of Wikipedia articles is rich in internal links pointing to other Wikipedia articles. Some of those articles are referring to the entity classes we are interested in (e.g. person, countries, cities, …). Hence we just need to find a way to convert those links into entity class annotations on text sentences (without the Wikimarkup formatting syntax).

This is also an opportunity to try out cloud based computing if you are so inclined.
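To see the core idea without spinning up a Hadoop cluster, here is a toy Python version of the link-to-annotation step (the target-to-class mapping below is hypothetical; in the real pipeline it is derived from Wikipedia itself, and the heavy lifting is done in Pig):

  import re

  # [[Target]] or [[Target|anchor text]] -- the simplest forms of internal wiki links.
  LINK_RE = re.compile(r"\[\[([^\]|#]+)(?:\|([^\]]+))?\]\]")

  # Hypothetical mapping from link targets to entity classes of interest.
  TARGET_CLASS = {"Victor Hugo": "PERSON", "France": "COUNTRY", "Paris": "CITY"}

  def annotate(wikitext):
      """Strip internal links, recording (surface text, entity class) pairs as annotations."""
      annotations = []
      def repl(match):
          target = match.group(1).strip()
          surface = (match.group(2) or target).strip()
          cls = TARGET_CLASS.get(target)
          if cls:
              annotations.append((surface, cls))
          return surface
      return LINK_RE.sub(repl, wikitext), annotations

  print(annotate("[[Victor Hugo]] was born in [[Besançon]], [[France]]."))
  # ('Victor Hugo was born in Besançon, France.', [('Victor Hugo', 'PERSON'), ('France', 'COUNTRY')])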

October 21, 2011

Towards georeferencing archival collections

Towards georeferencing archival collections

From the post:

One of the most effective ways to associate objects in archival collections with related objects is with controlled access terms: personal, corporate, and family names; places; subjects. These associations are meaningless if chosen arbitrarily. With respect to machine processing, Thomas Jefferson and Jefferson, Thomas are not seen as the same individual when judging by the textual string alone. While EADitor has incorporated authorized headings from LCSH and local vocabulary (scraped from terms found in EAD files currently in the eXist database) almost since its inception, it has not until recently interacted with other controlled vocabulary services. Interacting with EAC-CPF and geographical services is high on the development priority list.

geonames.org

Over the last week, I have been working on incorporating geonames.org queries into the XForms application. Geonames provides stable URIs for more than 7.5 million place names internationally. XML representations of each place are accessible through various REST APIs. These XML datastreams also include the latitude and longitude, which will make it possible to georeference archival collections as a whole or individual items within collections (an item-level indexing strategy will be offered in EADitor as an alternative to traditional, collection-based indexing soon).

This looks very interesting.
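For the curious, a single GeoNames lookup is trivial to script. A sketch using the JSON flavor of the search API (EADitor consumes the XML representations instead, and "demo" must be replaced with a registered username):

  import requests

  resp = requests.get("http://api.geonames.org/searchJSON",
                      params={"q": "Charlottesville, Virginia", "maxRows": 1, "username": "demo"})
  place = resp.json()["geonames"][0]
  # The stable URI is built from the geonameId; lat/lng support georeferencing.
  print(place["geonameId"], place["lat"], place["lng"])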

Details:

EADitor project site (Google Code): http://code.google.com/p/eaditor/
Installation instructions (specific for Ubuntu but broadly applies to all Unix-based systems): http://code.google.com/p/eaditor/wiki/UbuntuInstallation
Google Group: http://groups.google.com/group/eaditor

RDFa 1.1 Lite

Filed under: RDFa,Semantic Web — Patrick Durusau @ 7:27 pm

RDFa 1.1 Lite

From the post:

Summary: RDFa 1.1 Lite is a simple subset of RDFa consisting of the following attributes: vocab, typeof, property, rel, about and prefix.

During the schema.org workshop, a proposal was put forth by RDFa’s resident hero, Ben Adida, for a stripped down version of RDFa 1.1, called RDFa 1.1 Lite. The RDFa syntax is often criticized as having too much functionality, leaving first-time authors confused about the more advanced features. This lighter version of RDFa will help authors easily jump into the Linked Data world. The goal was to create a very minimal subset that will work for 80% of the folks out there doing simple markup for things like search engines.

I was struck by the line “…that will work for 80% of the folks out there doing simple markup for things like search engines.”

OK, so instead of people authoring content for the web, RDFa 1.1 Lite targets 80% of SEOs?

Targeting people who try to game search engine algorithms? Not a terribly sympathetic group.

Document Management System with CouchDB

Filed under: CouchDB,Document Management — Patrick Durusau @ 7:27 pm

Document Management System with CouchDB

I mention this series of posts as a way to become acquainted with CouchDB, not as a tutorial on writing a document management system. Or at least not one for production use.
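If CouchDB is new to you, its HTTP API is small enough to show in a few lines. A sketch of the raw calls a document manager builds on (assuming a local CouchDB on the default port; this is not the tutorial's code):

  import requests

  COUCH = "http://localhost:5984"

  # Create a database and store a document of metadata.
  requests.put(COUCH + "/documents")
  resp = requests.put(COUCH + "/documents/quarterly-report",
                      json={"title": "Quarterly report", "author": "pdurusau", "tags": ["finance", "2011"]})
  rev = resp.json()["rev"]

  # Attach the file itself; CouchDB stores it alongside the JSON document.
  with open("report.pdf", "rb") as fh:
      requests.put(COUCH + "/documents/quarterly-report/report.pdf",
                   params={"rev": rev}, data=fh,
                   headers={"Content-Type": "application/pdf"})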

For my class:

  1. You don’t have to read the code; skip to the end of part 3 to the “simple” user interface. Make a list (one page or less) of what is missing from this “document management” system.
  2. What other document management systems are you familiar with? (If not any, check with me I will assign you one.) Make a one page feature list from the “other” document management system and mark which ones are present/absent in this system.

Not strictly a topic map issue but you are going to encounter people who say software is sufficient if it does X, particularly when you want Y. This is in part to prepare you to win those conversations.

ForceAtlas2

Filed under: Gephi,Graphs,Networks,Social Graphs,Social Networks — Patrick Durusau @ 7:27 pm

ForceAtlas2 (paper + appendices) by Mathieu Jacomy, Sebastien Heymann, Tommaso Venturini, and Mathieu Bastian.

Abstract:

ForceAtlas2 is a force vector algorithm proposed in the Gephi software, appreciated for its simplicity and for the readability of the networks it helps to visualize. This paper presents its distinctive features, its energy-model and the way it optimizes the “speed versus precision” approximation to allow quick convergence. We also claim that ForceAtlas2 is handy because the force vector principle is unaffected by optimizations, offering a smooth and accurate experience to users.

I knew I had to cite this paper when I read:

These earliest Gephi users were not fully satisfied with existing spatialization tools. We worked on empirical improvements and that’s how we created the first version of our own algorithm, ForceAtlas. Its particularity was a degree-dependant repulsion force that causes less visual cluttering. Since then we steadily added some features while trying to keep in touch with users’ needs. ForceAtlas2 is the result of this long process: a simple and straightforward algorithm, made to be useful for experts and profanes. (footnotes omitted, emphasis added)

Profanes. I like that! Well, rather I like the literacy that enables a writer to use that in a technical paper.

Highly recommended paper.
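To get a feel for the degree-dependent repulsion mentioned above, here is a toy, single-pair version (my reading of the formula in the paper, with none of the Barnes-Hut or adaptive-speed optimizations the authors describe):

  import math

  def repulsion(n1, n2, kr=10.0):
      """Force pushing n1 away from n2; hubs (high-degree nodes) repel harder, reducing clutter."""
      dx, dy = n1["x"] - n2["x"], n1["y"] - n2["y"]
      dist = math.hypot(dx, dy) or 1e-9                      # guard against overlapping nodes
      force = kr * (n1["degree"] + 1) * (n2["degree"] + 1) / dist
      return force * dx / dist, force * dy / dist

  hub = {"x": 0.0, "y": 0.0, "degree": 50}
  leaf = {"x": 1.0, "y": 0.0, "degree": 1}
  print(repulsion(hub, leaf))   # a strong push along the negative x axis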

Scala Videos (and ebook)

Filed under: Akka,Scala — Patrick Durusau @ 7:27 pm

Scala Videos (and ebook)

While looking for something else (isn’t that always the case?) I ran across this collection of Scala videos and a free ebook, Scala for the Impatient, at Typesafe.

Something to enjoy over the weekend!

CDH3 update 2 is released (Apache Hadoop)

Filed under: Hadoop,Hive,Mahout,MapReduce,Pig — Patrick Durusau @ 7:27 pm

CDH3 update 2 is released (Apache Hadoop)

From the post:

There are a number of improvements coming to CDH3 with update 2. Among them are:

  1. New features – Support for Apache Mahout (0.5). Apache Mahout is a popular machine learning library that makes it easier for users to perform analyses like collaborative filtering and k-means clustering on Hadoop. Also added in update 2 is expanded support for Apache Avro’s data file format. Users can:
  • load data into Avro data files in Hadoop via Sqoop or Flume
  • run MapReduce, Pig or Hive workloads on Avro data files
  • view the contents of Avro files from the Hue web client

This gives users the ability to use all the major features of the Hadoop stack without having to switch file formats. Avro file format provides added benefits over text because it is faster and more compact.

  2. Improvements (stability and performance) – HBase in particular has received a number of improvements that improve stability and recoverability. All HBase users are encouraged to use update 2.
  3. Bug fixes – 50+ bug fixes. The enumerated fixes and their corresponding Apache project jiras are provided in the release notes.

Update 2 is available in all the usual formats (RHEL, SLES, Ubuntu, Debian packages, tarballs, and SCM Express). Check out the installation docs for instructions. If you’re running components from the Cloudera Management Suite they will not be impacted by moving to update 2. The next update (update 3) for CDH3 is planned for January, 2012.

Thank you for supporting Apache Hadoop and thank you for supporting Cloudera.

Another aspect of Cloudera’s support for the Hadoop ecosystem is its Cloudera University.

Hypertable 0.9.5.1 Binary Packages

Filed under: Hypertable,NoSQL — Patrick Durusau @ 7:26 pm

Hypertable 0.9.5.1 Binary Packages

New release (up from 0.9.5.0) of Hypertable.

You can see the Release Notes. It is slow going but a large number of bugs have been fixed and new features added.

The Hypertable Manual.

I have the sense that the software has a lot of potential but the website doesn’t offer enough examples to make that case. In fact, you have to hunt for the manual (it is linked above and/or has a link on the downloads page). Users, even (especially?) developers, aren’t going to work very hard to evaluate a new and/or unknown product. Better marketing would help Hypertable.

Using MongoDB in Anger

Filed under: Database,Indexing,MongoDB,NoSQL — Patrick Durusau @ 7:26 pm

Using MongoDB in Anger

Tips on building high performance applications with MongoDB.

Covers four topics:

  • Schema design
  • Indexing
  • Concurrency
  • Durability

Excellent presentation!

One of the first presentations I have seen that recommends a book about another product. Well, High Performance MySQL and MongoDB in Action.
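Two of those topics, indexing and durability, are easy to make concrete. A small sketch with the current pymongo driver (the presentation predates this API, so treat the calls as illustrative rather than the speaker's recommendations):

  from pymongo import ASCENDING, MongoClient
  from pymongo.write_concern import WriteConcern

  client = MongoClient("mongodb://localhost:27017")
  db = client.appdata

  # Indexing: a compound index shaped like the query you actually run.
  db.events.create_index([("user_id", ASCENDING), ("created_at", ASCENDING)])

  # Durability: require journal acknowledgement for writes to this collection.
  durable = db.get_collection("events", write_concern=WriteConcern(w=1, j=True))
  durable.insert_one({"user_id": 42, "created_at": "2011-10-22", "kind": "login"})

  # This query can use the index above (verify with explain()).
  print(list(db.events.find({"user_id": 42}).sort("created_at", ASCENDING)))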

Build Hadoop from Source

Filed under: Hadoop,MapReduce,NoSQL — Patrick Durusau @ 7:26 pm

Build Hadoop from Source by Shashank Tiwari.

From the post:

If you are starting out with Hadoop, one of the best ways to get it working on your box is to build it from source. Using stable binary distributions is an option, but a rather risky one. You are likely to not stop at Hadoop common but go on to setting up Pig and Hive for analyzing data and may also give HBase a try. The Hadoop suite of tools suffers from a huge version mismatch and version confusion problem. So much so that many start out with Cloudera’s distribution, also known as CDH, simply because it solves this version confusion disorder.

Michael Noll’s well written blog post titled: Building an Hadoop 0.20.x version for HBase 0.90.2, serves as a great starting point for building the Hadoop stack from source. I would recommend you read it and follow along the steps stated in that article to build and install Hadoop common. Early on in the article you are told about a critical problem that HBase faces when run on top of a stable release version of Hadoop. HBase may lose data unless it is running on top of an HDFS with durable sync. This important feature is only available in the branch-0.20-append of the Hadoop source and not in any of the release versions.

Assuming you have successfully followed Michael’s guidelines, you should have the Hadoop jars built and available in a folder named ‘build’ within the folder that contains the Hadoop source. At this stage, it’s advisable to configure Hadoop and take a test drive.

A quick guide to “kicking the tires” as it were with part of the Hadoop eco-system.

I first saw this in the NoSQL Weekly Newsletter from http://www.NoSQLWeekly.com.

Why are programmers smarter than runners?

Filed under: CS Lectures,Programming — Patrick Durusau @ 7:26 pm

Simple Made Easy by Rich Hickey.

A very impressive presentation that answers the burning question:

Why are programmers smarter than runners?

Summary:

Rich Hickey emphasizes simplicity’s virtues over easiness’, showing that while many choose easiness they may end up with complexity, and the better way is to choose easiness along the simplicity path.

This talk will give you the tools and rhetoric to argue for simplicity in programming. I was particularly impressed by the guardrail argument with regard to testing suites. Watch the video and you will see what I mean.

October 20, 2011

We Are Not Alone!

Filed under: Fuzzy Sets,Vector Space Model (VSM),Vectors — Patrick Durusau @ 6:43 pm

While following some references I ran across: A proposal for transformation of topic-maps into similarities of topics (pdf) by Dr. Dominik Kuropka.

Abstract:

Newer information filtering and retrieval models like the Fuzzy Set Model or the Topic-based Vector Space Model consider term dependencies by means of numerical similarities between two terms. This leads to the question of from what, and how, these numerical values can be deduced. This paper proposes an algorithm for the transformation of topic-maps into numerical similarities of paired topics. Further, the relation of this work to the above-named information filtering and retrieval models is discussed.

Based in part on his paper Topic-Based Vector Space (2003).

This usage differs from ours in part because the work is designed to work at the document level in a traditional IR type context. “Topic maps,” in the ISO sense, are not limited to retrieval of documents or comparison by a particular method, however useful that method may be.

Still, it is good to get to know one’s neighbours so I will be sending him a note about our efforts.
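For readers new to the underlying idea of pairwise numerical similarity, the measure such vector-space models ultimately consume is the standard cosine similarity (this is the generic measure, not Kuropka's transformation from topic maps):

  import math

  def cosine_similarity(u, v):
      """Similarity of two weighted term/topic vectors: 1.0 = same direction, 0.0 = unrelated."""
      dot = sum(u[k] * v.get(k, 0.0) for k in u)
      norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
      return dot / norm if norm else 0.0

  # Toy vectors over shared features (e.g. weighted associations in a topic map).
  jaguar_cat = {"animal": 0.9, "rainforest": 0.7, "vehicle": 0.0}
  jaguar_car = {"animal": 0.1, "rainforest": 0.0, "vehicle": 0.95}
  print(cosine_similarity(jaguar_cat, jaguar_car))   # low similarity: distinct subjects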

Learning Richly Structured Representations From Weakly Annotated Data

Filed under: Artificial Intelligence,Computer Science,Machine Learning — Patrick Durusau @ 6:42 pm

Learning Richly Structured Representations From Weakly Annotated Data by Daphne Koller. (DeGroot Lecture, Carnegie Mellon University, October 14, 2011).

Abstract:

The solution to many complex problems requires that we build up a representation that spans multiple levels of abstraction. For example, to obtain a semantic scene understanding from an image, we need to detect and identify objects and assign pixels to objects, understand scene geometry, derive object pose, and reconstruct the relationships between different objects. Fully annotated data for learning richly structured models can only be obtained in very limited quantities; hence, for such applications and many others, we need to learn models from data where many of the relevant variables are unobserved. I will describe novel machine learning methods that can train models using weakly labeled data, thereby making use of much larger amounts of available data, with diverse levels of annotation. These models are inspired by ideas from human learning, in which the complexity of the learned models and the difficulty of the training instances tackled changes over the course of the learning process. We will demonstrate the applicability of these ideas to various problems, focusing on the problem of holistic computer vision.

If your topic map application involves computer vision, this is a must see video.

For text/data miners, are you faced with similar issues? Limited amounts of richly annotated training data?

I saw a slide, which I will run down later, that showed text running from plain text to text annotated with ontological data. I mention that because that isn’t what a user sees when they “read” a text. They see implied relationships, references to other subjects, other instances of a particular subject, and all of that passes in the instant of recognition.

Perhaps the problem of correct identification in text is one of too few dimensions rather than too many.

Getting Started with MMS

Filed under: MongoDB — Patrick Durusau @ 6:41 pm

Getting Started with MMS by Kristina Chodorow.

From the post:

Telling someone “You should set up monitoring” is kind of like telling someone “You should exercise 20 minutes three times a week.” Yes, you know you should, but your chair is so comfortable and you haven’t keeled over dead yet.

For years*, 10gen has been planning to do monitoring “right,” making it painless to monitor your database. Today, we released the MongoDB Monitoring Service: MMS.

MMS is free hosted monitoring for MongoDB. I’ve been using it to help out paying customers for a while, so I thought I’d do a quick post on useful stuff I’ve discovered (documentation is… uh… a little light, so far).

MongoDB folks will find this post quite useful.

AutoComplete with Suggestion Groups

Filed under: AutoComplete,Clustering,Interface Research/Design — Patrick Durusau @ 6:41 pm

AutoComplete with Suggestion Groups from Sematext.

From the post:

While Otis is talking about our new Search Analytics (it’s open and free now!) and Scalable Performance Monitoring (it’s also open and free now!) services at Lucene Eurocon in Barcelona Pascal, one of the new members of our multinational team at Sematext, is working on making improvements to our search-lucene.com and search-hadoop.com sites. One of the recent improvements is on the AutoComplete functionality we have there. If you’ve used these sites before, you may have noticed that the AutoComplete there now groups suggestions. In the screen capture below you can see several suggestion groups divided with pink lines. Suggestions can be grouped by any criterion, and here we have them grouped by the source of suggestions. The very first suggestion is from “mail # general”, which is our name for the “general” mailing list that some of the projects we index have. The next two suggestions are from “mail # user”, followed by two suggestions from “mail # dev”, and so on. On the left side of each suggestion you can see icons that signify the type of suggestion and help people more quickly focus on types of suggestions they are after.

Very nice!

Curious how you would distinguish “grouping suggestions” from faceted navigation? (I have a distinction in mind but curious about yours.)

Findability is just So Last Year

Filed under: Findability,Searching — Patrick Durusau @ 6:37 pm

Findability is just So Last Year by Tony Russell-Rose.

From the post:

Last week I attended the October edition of the London Enterprise Search meetup, which gave us (among other things) our usual monthly fix of great talks and follow up discussions. This time, one of the topics that particularly caught my attention was the question of how to measure the effectiveness of enterprise search. Several possible approaches were suggested, including measuring how frequently users can “find what they are looking for” within a fixed period of time (e.g. two minutes).

Now I’m not saying findability isn’t important, but in my opinion metrics like this really seem to miss the point. Leaving aside the methodological issues in defining exactly what is meant by “find what they are looking for”, they seem predicated on the notion that search is all about finding known items, as if to suggest that once they’re found, everyone can go home. In my experience, nothing could be further from the truth.

Not to ask it too sharply but are topic maps by themselves one-trick ponies?

A topic map can collocate information about a subject, along with associations to other subjects, perhaps with a bow on top, but then what?

What if there were an expectation of collocation of information as part of search engines and other information interfaces?

Topic maps as an embedded component in information interfaces?

In Going Head to Head with Google we saw how a speciality search engine could eat Google’s lunch on relevant search results. What about speciality collocation rather than “we omitted some duplicate results” type stuff?

You would not have to take over the click-through business all at once. As my father-in-law would say: “…just like eating lettuce, one leaf at a time.”

BioGene 1.1 – Information Tool for Biological Research for iPhone

Filed under: Bioinformatics,Search Algorithms,Searching — Patrick Durusau @ 6:37 pm

BioGene 1.1 – Information Tool for Biological Research for iPhone

From the website:

BioGene is an information tool for biological research. Use BioGene to learn about gene function. Enter a gene symbol or gene name, for example “CDK4” or “cyclin dependent kinase 4” and BioGene will retrieve its gene function and references into its function (GeneRIF).

The search/match criteria of Biogene is instructive:

Where does BioGene get its data?
BioGene provides primary information from Entrez Gene, a searchable database hosted by the NCBI.
What is a GeneRIF?
A GeneRIF is a functional annotation of a gene described in Entrez Gene. The annotation includes a link to a citation within PubMed which describes the function. Please see GeneRIF for more information.
How does BioGene search Entrez Gene?
BioGene attempts to match a query against a gene name (symbol). If no matching records are found, BioGene applies mild increments in permissiveness until a match is found. For example, if we are searching for the following single-term query, “trk”, BioGene will attempt the following sequence of queries in succession, stopping whenever one or more matching records is returned:

  • search for a gene name (symbol) that matches the exact sequence of characters “trk”
  • search for a gene name (symbol) that starts with the sequence of characters “trk”
  • search for a gene name (symbol) that contains the sequence of characters “trk” within a word
  • perform a free text search that matches the exact sequence of characters “trk”
  • perform a free text search that starts with the sequence of characters “trk”
  • perform a free text search that contains the sequence of characters “trk” within a word

In Entrez Gene parlance, for the following single-term query “trk”, the following sequence of queries is attempted:

  • trk[pref]
  • trk*[pref] OR trk[sym]
  • trk*[sym]
  • *trk*[sym]
  • trk[All Fields]
  • trk*[All Fields]
  • *trk*[All Fields]

If, however, we are searching for the following multi-term query, “protein kinase 4”, BioGene will attempt the following sequence of queries in succession, stopping whenever one or more matching records is returned:

  • search for a full gene name that matches the exact sequence of characters “protein kinase 4”
  • perform a free text search that contains every term in the multi-term query “protein kinase 4”
  • perform a free text search that contains one of the terms in the multi-term query “protein kinase 4”

In Entrez Gene parlance, for the following multi-term query “protein kinase 4”, the following sequence of queries is attempted:

  • protein+kinase+4[gene full name]
  • protein[All Fields] OR kinase[All Fields] OR 4[All Fields]

If BioGene detects one or more of the following character sequences within the query:

[   ]   *   AND   OR   NOT

it treats this as an “advanced” query and passes the query directly to Entrez Gene. In this situation, BioGene ignores the organism filter specified in the application settings and expects the user to embed this filter within the query.
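The fallback logic restates nicely in code. A sketch that simply builds the query strings in the order listed above (actually issuing them against Entrez Gene and stopping at the first non-empty result is left out, as is the advanced-query passthrough):

  def entrez_queries(query):
      """Return BioGene-style Entrez Gene queries, most restrictive first."""
      terms = query.split()
      if len(terms) == 1:
          t = terms[0]
          return [f"{t}[pref]",
                  f"{t}*[pref] OR {t}[sym]",
                  f"{t}*[sym]",
                  f"*{t}*[sym]",
                  f"{t}[All Fields]",
                  f"{t}*[All Fields]",
                  f"*{t}*[All Fields]"]
      return ["+".join(terms) + "[gene full name]",
              " OR ".join(f"{t}[All Fields]" for t in terms)]

  print(entrez_queries("trk"))
  print(entrez_queries("protein kinase 4"))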

BioGene gives you access to subjects by multiple identifiers and at least for a closed set of data, attempts to find a “best” match.

Review Entrez Gene.

What other identifiers suggest themselves as bases of integrating known sources of additional information?

(Only “known” sources can be meaningfully displayed/integrated for the user.)

Search Analytics for Your Site

Filed under: Authoring Topic Maps,Search Analytics — Patrick Durusau @ 6:36 pm

Search Analytics for Your Site

From the website:

Any organization that has a searchable web site or intranet is sitting on top of hugely valuable and usually under-exploited data: logs that capture what users are searching for, how often each query was searched, and how many results each query retrieved. Search queries are gold: they are real data that show us exactly what users are searching for in their own words. This book shows you how to use search analytics to carry on a conversation with your customers: listen to and understand their needs, and improve your content, navigation and search performance to meet those needs.

I haven’t read this book so don’t take this post as an endorsement or “buy” recommendation.

While watching the slide deck, it occurred to me that if search analytics could improve your website, why not use search analytics to develop the design and content of a topic map?

The design aspect, in the sense that the most prominent, easiest-to-find content is what is popular with users. That could even vary by time of day if you have a topic map that is accessible 24 x 7.

The content aspect, in the sense that what is included, what we say about it, and perhaps how it is findable are driven by search analysis.

If you were developing a topic map about Sarah Palin, perhaps searching for “dude” should return her husband as a topic. I can think of other nicknames but this isn’t a political blog.

Comments on this book or suggestions of other search analytics resources appreciated.

Large-scale Pure OO at the Irish Government

Filed under: Domain Driven Design,Naked Objects — Patrick Durusau @ 6:34 pm

Large-scale Pure OO at the Irish Government by Richard Pawson.

From the website:

Richard Pawson discusses a case study of a large pure OO project for the Irish government, presenting the challenges met, the reason for choosing pure OO, and lessons learned implementing it.

This is an important presentation mostly because of Pawson’s reliance on Domain Driven Design (Evans) and the benefits that were derived from that approach. I think you will find a lot of synergy with extracting from users the “facts” about their domains. Highly entertaining presentation.

Pawson’s “naked object” approach has both Java and .Net implementations:

Naked Objects – .Net at Codeplex

Apache Isis – Java at Apache Incubator

Perhaps not quite right for web-based or casual users, but for power-users of topic maps, this might have some promise. Thoughts?

Pawson talks about reuse of naked objects. How would you compose (impose?) a subject identifier for a “naked object?”

Tech Survivors: Geek Technologies…

Filed under: Language,Pattern Matching,Software — Patrick Durusau @ 6:33 pm

Tech Survivors: Geek Technologies That Still Survive 25 to 50 Years Later

Simply awesome!

Useful review for a couple of reasons:

First: New languages, formats, etc., will emerge, but legacy systems “….will be with you always.” (Or at least it will feel that way, so being able to interface with legacy systems (understand their semantics) is going to be important for a very long time.)

Second: What was it about these technologies that made them succeed? (I don’t have the answer or I would be at the USPTO filing every patent and variant of patent that I could think of. 😉 It is clearly non-obvious because no one else is there either.) Equally long-lived technologies are with us today; we just don’t know which ones.

Would not hurt to put this on your calendar to review every year or so. The more you know about new technologies, the more likely you are to spot a resemblance or pattern matching one of these technologies. Maybe.

The Habit of Change

Filed under: Data Mining — Patrick Durusau @ 4:18 am

The Habit of Change by David Alan Grier appears in the October 2011 issue of Computer.

An interesting account of the differences in how statisticians and computer science data miners view the processing of data, and of how new techniques crowd out old ones with little heed paid to prior results. The lack of heed for prior results is not entirely undeserved, since prior waves of change showed the same disdain for their predecessors.

Not that I agree with that disdain, but it is good to be reminded that if we paid a bit more attention to the past, perhaps we could make new mistakes rather than repeating old ones.

Note that we also replace old terminologies with new ones, which makes matching up old mistakes with current ones more difficult.

Apologies for the link above that takes you to a pay-per-view option.

If you like podcasts, try: The Known World: The Habit of Change being read by David Alan Grier.
