Archive for August, 2011

Semantic Web Journal – Vol. 2, Number 2 / 2011

Wednesday, August 31st, 2011

Semantic Web Journal – Vol. 2, Number 2 / 2011

Just in case you want to send someone the link to a particular article:

Semantic Web surveys and applications
DOI 10.3233/SW-2011-0047 Authors Pascal Hitzler and Krzysztof Janowicz

Taking flight with OWL2
DOI 10.3233/SW-2011-0048 Author Michel Dumontier

Comparison of reasoners for large ontologies in the OWL 2 EL profile
DOI 10.3233/SW-2011-0034 Authors Kathrin Dentler, Ronald Cornet, Annette ten Teije and Nicolette de Keizer

Approaches to visualising Linked Data: A survey
DOI 10.3233/SW-2011-0037 Authors Aba-Sah Dadzie and Matthew Rowe

Is Question Answering fit for the Semantic Web?: A survey
DOI 10.3233/SW-2011-0041 Authors Vanessa Lopez, Victoria Uren, Marta Sabou and Enrico Motta

FactForge: A fast track to the Web of data
DOI 10.3233/SW-2011-0040 Authors Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev and Ruslan Velkov

Couchbase Server 2.0 – Up and Running

Wednesday, August 31st, 2011

Couchbase Server 2.0 – Up and Running

A 5 minute video to get Couchbase Server 2.0 up and running.

Almost makes me wish I was a sysadmin again. They never met some of my users. 😉 Note I said almost made me wish. (shudder)

You are all brighter than that and so should have no problems with the five minute limit.

Don’t overlook the sign-up for the tech webinar series.

Curious about Couchbase Server 2.0?

Couchbase Server 2.0 Developer Release is now available! This new release combines the unmatched elastic data management capabilities of Membase Server with the distributed indexing and querying capabilities of Apache CouchDB to deliver the industry’s most powerful, bullet-proof NoSQL database technology.

Come to a series of weekly 30-minute webinars to learn more about the technical details of Couchbase Server 2.0.

This nine-week webinar series will cover:

-Couchbase Server 2.0 overview
-Indexing and querying basics
-SDKs/client libraries (including Moxi Server)
-Development/production View usage
-Advanced indexing and querying
-Clustering and monitoring
-Auto compaction
-Upgrading to 2.0 from Membase Server
-Cross data center replication

Whether you are currently running Couchbase or not, this could be interesting.

BigData University

Wednesday, August 31st, 2011

BigData University

From the website:

Why Register?

Easy and Affordable: Learning Hadoop and other Big Data technologies has never been more affordable! Many courses are FREE!

Latest industry trends: Acquire valuable skills and get updated about industry’s latest trends right here. Today!

Learn from the Experts! Big Data University offers education about Hadoop and other technologies by the industry’s best!

Learn at your Own Pace! Find everything right here when you need it and from wherever you are.

A demonstration of the power of social media. I start complaining about use of the word “big” (Big Learning 2011) and someone starts a university using “big” in the name. Random chance? I don’t think so. 😉

I’m signing up for the free Hadoop course.

BTW, did you notice that they have a “Creating a course in DB2 University” offering? (Yeah, I know, they forgot to change all the names. They will get around to it.)

Will have to see what the course software looks like. Could have possibilities for topic maps.

The VLTS Benchmark Suite

Wednesday, August 31st, 2011

The VLTS Benchmark Suite

From the webpage:

The VLTS acronym stands for “Very Large Transition Systems“.

The VLTS benchmark suite is a collection of Labelled Transition Systems (hereafter called benchmarks).

Each Labelled Transition System is a directed, connected graph, whose vertices are called states and whose edges are called transitions. There is one distinguished vertex called the initial state. Each transition is labelled by a character string called action or label. There is one distinguished label noted “i” that is used for so-called invisible transitions (also known as hidden transitions or tau-transitions).

The VLTS benchmarks have been obtained from various case studies about the modelling of communication protocols and concurrent systems. Many of these case studies correspond to real life, industrial systems.

If you aren’t already working with large graphs in your work on topic maps, you will be.
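To make the definition quoted above concrete, here is a minimal sketch of a labelled transition system in Python. The class name, method names, and the toy three-state protocol are my own assumptions for illustration; none of it comes from the VLTS suite or its file formats. Only the reserved label “i” for invisible transitions is taken from the description.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# Label the VLTS description reserves for invisible (tau) transitions.
INVISIBLE = "i"


@dataclass
class LTS:
    """Directed graph: vertices are states, edges are labelled transitions."""
    initial: int
    # Maps a state to its outgoing (label, target state) pairs.
    transitions: Dict[int, List[Tuple[str, int]]] = field(default_factory=dict)

    def add_transition(self, source: int, label: str, target: int) -> None:
        self.transitions.setdefault(source, []).append((label, target))
        self.transitions.setdefault(target, [])

    def reachable(self) -> Set[int]:
        """Breadth-first search from the distinguished initial state."""
        seen = {self.initial}
        queue = deque([self.initial])
        while queue:
            state = queue.popleft()
            for _label, target in self.transitions.get(state, []):
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
        return seen

    def visible_labels(self) -> Set[str]:
        """All labels except the invisible one."""
        return {
            label
            for edges in self.transitions.values()
            for label, _target in edges
            if label != INVISIBLE
        }


# Hypothetical toy protocol, not one of the VLTS benchmarks.
lts = LTS(initial=0)
lts.add_transition(0, "send", 1)
lts.add_transition(1, INVISIBLE, 2)
lts.add_transition(2, "ack", 0)
```

The real benchmarks run to millions of states, so an in-memory dictionary like this is only useful for getting a feel for the structure.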

HipG: Parallel Processing of Large-Scale Graphs

Wednesday, August 31st, 2011

HipG: Parallel Processing of Large-Scale Graphs


Distributed processing of real-world graphs is challenging due to their size and the inherent irregular structure of graph computations. We present HipG, a distributed framework that facilitates programming parallel graph algorithms by composing the parallel application automatically from the user-defined pieces of sequential work on graph nodes. To make the user code high-level, the framework provides a unified interface to executing methods on local and non-local graph nodes and an abstraction of exclusive execution. The graph computations are managed by logical objects called synchronizers, which we used, for example, to implement distributed divide-and-conquer decomposition into strongly connected components. The code written in HipG is independent of a particular graph representation, to the point that the graph can be created on-the-fly, i.e. by the algorithm that computes on this graph, which we used to implement a distributed model checker. HipG programs are in general short and elegant; they achieve good portability, memory utilization, and performance.

Graphs are stored in the SVC-II distributed graph format described in Compressed and Distributed File Formats for Labeled Transition Systems by Stefan Blom, Izak van Langevelde, and Bert Lisser. (Electronic Notes in Theoretical Computer Science Volume 89, Issue 1, September 2003, Pages 68-83 PDMC 2003, Parallel and Distributed Model Checking (Satellite Workshop of CAV ’03)) [The abstract is so vague as to be useless. I tried to find an “open” copy of the paper but failed. Can you point to one?]


From the implementation webpage:

HipG is a library for high-level parallel processing of large-scale graphs. HipG is implemented in Java and is designed for distributed-memory machine. Besides basic distributed graph algorithms it handles divide-and-conquer graph algorithms and algorithms that execute on graphs created on-the-fly. It is designed for clusters of machines, but can also be tested on desktops – all you need is a recent Java runtime environment. HipG is work in progress! (as of Apr’11)

Information As Art: 20 Stunning Examples Of Visualized Data

Wednesday, August 31st, 2011

Information As Art: 20 Stunning Examples Of Visualized Data by Brad Chacos.

From the website:

Numbers, percentages, bits of data; normally, we tend to look at these tidbits as information, useful for statistical analysis and not much more. Accounting isn’t sexy. Spreadsheet programmers don’t cultivate the same star power as lead programmers on video games. But numbers and raw data hold a unique and powerful allure their own – just ask John Carmack.

Unfortunately, if you aren’t one of those aforementioned accountants or spreadsheet programmers, seeing the art in numbers can be tough. Data visualization changes that. By changing the way we look at ratios and integers and statistical anomalies and giving us the power to actually see the relationship between sets of inputs, data visualization brings a sense of wonder and humanity back to statistical analysis. And no, we never thought we’d ever say anything like that. We blame that Carmack guy.

Don’t believe the hype? Check out the twenty examples below and we think you’ll concur that data can be art. There’s a bonus if you make it all the way to the end!

One commenter mentioned:

Processing, open source programming language for graphics, animation, etc.

Cinder, graphics library for C++.

OpenFrameworks, “an open source C++ toolkit for creative coding.”

As places to see more visualizations.

Visualization is important both as part of mining data, which you can then capture in a topic map, and as a means of delivering topic map based content to your users.

(from @StatFact, RT from Joab_Jackson)

Turtles all the way down

Wednesday, August 31st, 2011

Turtles all the way down

From the website:

Decisive breakthrough from IBM researchers in Haifa introduces efficient nested virtualization for x86 hypervisors

What is nested virtualization and who needs it? Classical virtualization takes a physical computer and turns it into multiple logical, or virtual, computers. Each virtual machine can then interact independently, run its own operating environment, and basically behave like a separate physical resource. Hypervisor software is the secret sauce that makes virtualization possible by sitting in between the hardware and the operating system. It manages how the operating system and applications access the hardware.

IBM researchers found an efficient way to take one x86 hypervisor and run other hypervisors on top of it. For virtualization, this means that a virtual machine can be ‘turned into’ many machines, each with the potential to have its own unique environment, configuration, operating system, or security measures—which can in turn each be divided into more logical computers, and so on. With this breakthrough, x86 processors can now run multiple ‘hypervisors’ stacked, in parallel, and of different types.

This nested virtualization using one hypervisor on top of another is reminiscent of a tale popularized by Stephen Hawking. A little old lady argued with a lecturing scientist and insisted that the world is really a flat plate supported on the back of a giant tortoise. When the scientist asked what the tortoise is standing on, the woman answered sharply “But it’s turtles all the way down!” Inspired by this vision, the researchers named their solution the Turtles Project: Design and Implementation of Nested Virtualization

This awesome advance has been incorporated into the latest Linux release.

This is what I like about IBM, fundamental advances in computer science that can be turned into services for users.

One obvious use of this advance would be to segregate merging models in separate virtual machines. I am sure there are others.

Big Learning 2011

Wednesday, August 31st, 2011

Big Learning 2011 : Big Learning: Algorithms, Systems, and Tools for Learning at Scale


When Dec 16, 2011 – Dec 17, 2011
Where Sierra Nevada, Spain
Submission Deadline Sep 30, 2011
Notification Due Oct 21, 2011
Final Version Due Nov 11, 2011

From the call:

Big Learning: Algorithms, Systems, and Tools for Learning at Scale

NIPS 2011 Workshop

Submissions are solicited for a two day workshop December 16-17 in Sierra Nevada, Spain.

This workshop will address tools, algorithms, systems, hardware, and real-world problem domains related to large-scale machine learning (“Big Learning”). The Big Learning setting has attracted intense interest with active research spanning diverse fields including machine learning, databases, parallel and distributed systems, parallel architectures, and programming languages and abstractions. This workshop will bring together experts across these diverse communities to discuss recent progress, share tools and software, identify pressing new challenges, and to exchange new ideas. Topics of interest include (but are not limited to):

It looks like an interesting conference but “big” doesn’t add anything.

To head off future “big” clutter, I hereby claim copyright, trademark, etc., protection under various galactic and inter-galactic treaties and laws for:

  • big blogging
  • big tweeting
  • big microformats
  • big IM
  • big IM’NOT
  • big smileys
  • big imaginary but not instantiated spaces
  • big cells
  • big things that are not cells
  • big words that look like CS at a distance
  • big …. well, I will be expanding this list with your non-obscene suggestions, provided you transfer ownership to me.

Interactive Maps With Polymaps, TileStache, and MongoDB

Wednesday, August 31st, 2011

Interactive Maps With Polymaps, TileStache, and MongoDB

For the impatient: Check out Interactive Map of Twitter Weight Loss Goals (very slick)

From Alex Popescu’s myNoSQL:

A three part tutorial on using MongoDB, PostgreSQL/PostGIS, and Javascript libraries for building interactive maps by Hans Kuder:

  • part 1: goals and building blocks
  • part 2: geo data, PostGIS, and TileStache
  • part 3: client side and MongoDB

Visiting part 1 for a larger taste of the project you find:

I’d been toying around with ideas for cool ancillary features for Goalfinch for a while, and finally settled on creating this interactive map of Twitter weight loss goals. I knew what I wanted: a Google-maps-style, draggable, zoomable, slick-looking map, with the ability to combine raster images and style-able vector data. And I didn’t want to use Flash. But as a complete geographic information sciences (GIS) neophyte, I had no idea where to start. Luckily there are some new technologies in this area that greatly simplified this project. I’m going to show you how they all fit together so you can create your own interactive maps for the browser.


The main components of the weight loss goals map are:

  1. Client-side Javascript that assembles the map from separate layers (using Polymaps)
  2. Server-based application that provides the data for each layer (TileStache, MongoDB, PostGIS, Pylons)
  3. Server-based Python code that runs periodically to search Twitter and update the weight loss goal data

I’ll cover each component separately in upcoming posts, but I’ll start with a high-level description of how the components work together for those of you who are new to web-based interactive maps.
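For those new to tiled web maps, the client/server split described above rests on addressing map tiles by zoom level and x/y position. The sketch below assumes the widely used Web Mercator z/x/y tiling scheme; the function names, the URL template, and the host `tiles.example.com` are hypothetical and are not taken from the tutorial.

```python
import math
from typing import Tuple


def lonlat_to_tile(lon_deg: float, lat_deg: float, zoom: int) -> Tuple[int, int]:
    """Return the (x, y) tile containing a point, assuming Web Mercator z/x/y tiles."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    x = int((lon_deg + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat_deg)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that points on the far edge still fall inside the grid.
    return min(n - 1, max(0, x)), min(n - 1, max(0, y))


def tile_url(template: str, zoom: int, x: int, y: int) -> str:
    """Fill a URL template that uses {Z}, {X} and {Y} placeholders."""
    return template.format(Z=zoom, X=x, Y=y)


# Hypothetical layer URL; a real deployment would point at its own tile server.
TEMPLATE = "http://tiles.example.com/goals/{Z}/{X}/{Y}.png"
x, y = lonlat_to_tile(0.0, 0.0, 1)
url = tile_url(TEMPLATE, 1, x, y)
```

The client requests only the tiles that cover the current viewport, which is what lets raster and vector layers be served independently and stacked in the browser.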

Let your imagination run wild with the interactive maps that you can assemble and populate with topic map based data.

Approaching and evaluating NoSQL

Wednesday, August 31st, 2011

Approaching and evaluating NoSQL by Mårten Gustafson.

From the webpage:

Brown bag lunch presentation at TUI / Fritidsresor about approaching and evaluating the NoSQL area. Embedded presentation below, downloadable as PDF and Keynote.

A “brown bag lunch” presentation that balances detail with ideas so listeners will follow it with interesting conversations and research.

The “use case” approach lends itself to exploring “why” someone would want to use a NoSQL database as opposed to the usual mantra that NoSQL databases are flexible, scalable and fast.

So? If my data format is fixed, not all that large (under a few terabytes), and I run batch reports, I may not need a NoSQL database. Could, depends on the facts/use cases. This presentation gets high marks for its “use case” approach.

LC Name Authority File Available as Linked Data

Tuesday, August 30th, 2011

LC Name Authority File Available as Linked Data

From Legal Informatics Blog:

The Library of Congress has made available the LC Name Authority File as Linked Data.

The data are available in several formats, including RDF/XML, N-Triples, and JSON.

Of particular interest to the legal informatics community is the fact that the Linked Data version of the LC Name Authority File includes records for names of very large numbers of government entities — as well as of other kinds of organizations, such as corporations, and individuals — of the U.S., Canada, the U.K., France, India, and many other nations. The file also includes many records for individual statutes.

Interesting post that focuses on law related authority records.

Social Network Analysis and Visualization

Tuesday, August 30th, 2011

Social Network Analysis and Visualization by William J. Turkel.

From the post:

In April 2008, I posted an article in my blog Digital History Hacks about visualizing the social network of NiCHE: Network in Canadian History & Environment as it was forming. We now use custom programs written in Mathematica to explore and visualize the activities of NiCHE members, and to assess our online communication strategies. Some of the data comes from our online directory, where members can contribute information about their research interests and activities. Some of it comes from our website server logs, and some of it is scraped from social networking sites like Twitter. A handful of examples are presented here, but the possibilities for this kind of analysis are nearly unbounded.

Some findings from exploring the data set:

People who are interested in fisheries seem not to be interested in landscape, and vice versa. Why not? A workshop that tried to bring both groups together to search for common ground might lead to new insights.

This graph suggests that NiCHE members who are interested in subjects that focus on material evidence over very long temporal durations are relatively marginal in the knowledge cluster, and may not be well connected even with one another.

From this figure it is easy to see that Darin Kinsey is the only person who has claimed to be interested in both landscapes and fisheries. If we did decide to hold a workshop on the intersection of those two topics, he might be the ideal person to help organize it.

This figure shows that the NiCHE Twitter audience includes a relatively dense network of scholars who identify themselves either as digital humanists or as Canadian / environmental historians or geographers. There is also a relatively large collection of followers who do not appear to have many connections with one another.

What’s hiding in your data set?

Databases – Humanities and Social Sciences

Tuesday, August 30th, 2011

Applications of Databases to Humanities and Social Sciences

NoSQL, Semantic Web and topic maps will mean little if you don’t understand the technical backdrop for those developments.

Library students take note: SQL databases are very common in libraries and academia in general so what you learn here will be doubly useful.

Demo App: PynIT!

Tuesday, August 30th, 2011

Demo App: PynIT!

A demo of py2neo:

There’s possibly no better way to demonstrate a code library such as py2neo than to show it in action. PynIT! is a simple Flask-based bookmarking/URL-shortening application which stores its bookmarks in a Neo4j database via (of course) py2neo.

Graph Processing versus Graph Databases

Tuesday, August 30th, 2011

Graph Processing versus Graph Databases

Jim Webber describes the different problems addressed by graph processing and graph databases. Worth reading so you will pick the correct tool for the problem you are facing.

Webber visualizes the following distinctions:

What Pregel and Hadoop have in common is their tendency towards the data analytics (OLAP) end of the spectrum, rather than being focussed on transaction processing. This is in stark contrast to graph databases like Neo4j which optimise storage and querying of connected data for online transaction processing (OLTP) scenarios – much like a regular RDBMS, only with a more expressive and powerful data model.

See the post for the graphic.

MongoDB 2.0.0-rc0

Tuesday, August 30th, 2011

MongoDB 2.0.0-rc0 was released 25 August 2011.

Check out the latest release or download a stable version at:

MongoDB homepage

Persistent Data Structures and Managed References

Monday, August 29th, 2011

Persistent Data Structures and Managed References: Clojure’s approach to Identity and State by Rich Hickey.

From the summary:

Rich Hickey’s presentation is organized around a number of programming concepts: identity, state and values. He explains how to represent composite objects as values and how to deal with change and state, as it is implemented in Clojure.

OK, it’s not recent, circa 2009, but it is quite interesting.

Some tidbits to entice you to watch the presentation:

Identity – A logical entity we associate with a series of causally related values (states) over time.

Represent objects as composite values

Persistent data structures preserve old values as immutable

Bit-partitioned hash tries 32-bit

Structural sharing – path copying

Persistent data structures provide efficient immutable composite values

When I saw the path copying operations that efficiently maintain immutable values I immediately thought of Steve Newcomb and Versavant. 😉
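If path copying is new to you, here is a minimal sketch of the idea in Python. Note the swap: it uses a plain binary search tree rather than the bit-partitioned hash tries mentioned in the presentation, and the names `Node`, `insert`, and `keys` are my own. An insert copies only the nodes on the path from the root to the change; every other subtree is shared between the old and new versions.

```python
from typing import List, NamedTuple, Optional


class Node(NamedTuple):
    """Immutable tree node; NamedTuple fields cannot be reassigned."""
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(node: Optional[Node], key: int) -> Node:
    """Return a new tree containing key; the input tree is left untouched."""
    if node is None:
        return Node(key)
    if key < node.key:
        # Copy this node, reuse the right subtree as-is.
        return node._replace(left=insert(node.left, key))
    if key > node.key:
        # Copy this node, reuse the left subtree as-is.
        return node._replace(right=insert(node.right, key))
    return node  # key already present: nothing to copy


def keys(node: Optional[Node]) -> List[int]:
    """In-order traversal."""
    if node is None:
        return []
    return keys(node.left) + [node.key] + keys(node.right)


v1 = insert(insert(insert(None, 5), 3), 8)
v2 = insert(v1, 9)  # new version; v1 remains valid and unchanged
```

Here `v2.left is v1.left` holds, which is the structural sharing: the old value survives intact and the new one costs only one copied path.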


Monday, August 29th, 2011


From the website:

Every word becomes a link to the most powerful services on the internet – Google, Wikipedia, translations, conversions and much more.

Is available as a plugin for Firefox, Chrome and Safari web browsers. A beta version is being tested for IE, Office and PDFs.

You can select a single word or a group of words or numbers.

It can also be licensed for use with a website and that enables you to customize the user’s experience.

Very high marks for a user friendly interface. Even casual users know how to select text, although what to do with it next may prove to be a big step. Still, “click on the icon” should be as easy to remember as “use the force Luke!,” at least with enough repetition.

I am curious about the degree of customization that is possible with a licensed copy for a website. Quite obviously thinking about using words on a website or some known set of websites as keys into a topic map backend.

This could prove to be a major step forward for all semantic-based services.

Very much a watch this space service.


Monday, August 29th, 2011


From the website:

QuaaxTM is a PHP ISO/IEC 13250 Topic Maps engine which implements PHPTMAPI. This enables developers to work against a standardized API. QuaaxTM uses MySQL with InnoDB as storage engine and benefits from transaction support and referential integrity.

Version 0.7.0 (from the change log):

PHPTMAPI (lib/phptmapi2.0)

  • Allow any datatype for parameter $value in TopicMapSystemFactory::setProperty() (was object only)
  • Changed code style: Added prefix “_” for private class members, set opening brackets for classes / interfaces and class / interface methods on new line


  • Added more tests to increase code coverage in the unit tests (reached >98% lines coverage for the files / classes in the src directory)
  • Defined all INT as UNSIGNED in the QuaaxTM database schema, switched TINYTEXT to equivalent VARCHAR(255) in qtm_variant, changed “INDEX (value(100))” to “INDEX (value(255))” in qtm_occurrence (schema is backward compatible: data from previous schema can be migrated seamlessly)
  • Replaced PropertyUtils by simple Array
  • Introduced a memcached based MySQL result cache (currently only available in AssociationImpl::getRoles(), AssociationImpl::getRoleTypes(), and TopicMapImpl::getAssociations())
  • Introduced MysqlMock for testing the result cache explicitly and enabled passing MysqlMock as TopicMapSystem property via TopicMapSystemFactoryImpl::setProperty()
  • Removed interface IScope from core
  • Changed code style: Added prefix “_” for private and protected class methods and class members, set opening brackets for classes and class methods on new line
  • Added documentation for all class members and class constants

RuSSIR/EDBT 2011 Summer School

Monday, August 29th, 2011

RuSSIR/EDBT 2011 Summer School

Machine learning contest with task description and training set data.

RuSSIR machine learning contest winners presentations

Contest tasks are described on Results are presented in the previous post:

Yura Perov:

Dmitry Kan and Ivan Golubev:

Nikita Zhiltsov:

Building Search App for Public Mailing Lists

Monday, August 29th, 2011

Building Search App for Public Mailing Lists in 15 Minutes with ElasticSearch by Lukáš Vlček.

You will need the slides to follow the presentation: Building Search App for Public Mailing Lists.

Very cool if fast presentation on building an email search application with ElasticSearch.

BTW, the link to BigDesk (A tiny monitoring tool for ElasticSearch clusters) is incorrect. Try:

IRE: Investigative Reporters and Editors

Monday, August 29th, 2011

IRE: Investigative Reporters and Editors

The IRE sponsors the census data that I pointed out at:

From the about page:

Investigative Reporters and Editors, Inc. is a grassroots nonprofit organization dedicated to improving the quality of investigative reporting.

IRE was formed in 1975 to create a forum in which journalists throughout the world could help each other by sharing story ideas, news gathering techniques and news sources.

Mission Statement

The mission of Investigative Reporters and Editors is to foster excellence in investigative journalism, which is essential to a free society. We accomplish this by:

  • Providing training, resources and a community of support to investigative journalists.
  • Promoting high professional standards.
  • Protecting the rights of investigative journalists.
  • Ensuring the future of IRE.

They are a membership based organization and for $70 (US) per year, you get access to a number of data sets that have been collected by the organization or culled from public sources. Doesn’t hurt to check for sources of data before you go to the trouble of extracting it yourself.

The other reason to mention them is that news organizations seem to like finding connections between people, between people and fraudulent activities, people and sex workers, and other connections that are the bread and butter of topic maps. Particularly when topic maps are combined and new connections become apparent.

So, this is a place where topic maps, or at least the results of using topic maps (not the same thing), may find a friendly reception.

Suggestions of other likely places to pitch either topic maps or the results of using topic maps most welcome!

Monday, August 29th, 2011

From the website:

Investigative Reporters and Editors is pleased to announce the next phase in our ongoing Census project, designed to provide journalists with a simpler way to access 2010 Census data so they can spend less time importing and managing the data and more time exploring and reporting the data. The project is the result of work by journalists from The Chicago Tribune, The New York Times, USA Today, CNN, the Spokesman-Review (Spokane, Wash.) and the University of Nebraska-Lincoln, funded through generous support from the Donald W. Reynolds Journalism Institute at the Missouri School of Journalism.

You can download bulk data as well as census data in JSON format.

You can browse data by:

Census tracts
Can vary in size but averages 4,000 people. Designed to remain relatively stable across decades to allow statistical comparisons. Boundaries defined by local officials using Census Bureau rules.
Places
1. What most people call cities or towns. A locality incorporated under state law that acts as a local government.
2. An unincorporated area that is well-known locally. Defined by state officials under Census Bureau rules and called a “census designated place.” “CDP” is added to the end of name.
Counties (parishes in LA)
The primary subdivisions of states. To cover the full country, this includes Virginia’s cities and Baltimore, St. Louis and Carson City, Nev., which sit outside counties; the District of Columbia; and the boroughs, census areas and related areas in Alaska.
County Subdivisions
There are 2 basic kinds: 1. In 29 states, they have at least some governmental powers and are called minor civil divisions (MCDs). Their names may include variations on “township,” “borough,” “district,” “precinct,” etc. In 12 of those 29 states, they operate as full-purpose local governments: CT, MA, ME, MI, MN, NH, NJ, NY, PA, RI, VT, WI. 2. In states where there are no MCDs, county subdivisions are primarily statistical entities known as census county divisions. Their names end in “CCD.”

[State and USA.]

Great source of census information for use with other data, even proprietary data in your topic map.

BigSheets or SCOBOL

Sunday, August 28th, 2011

BigSheets: extending business intelligence through web data

From the website:

BigSheets is an extension of the mashup paradigm that:

  • A component of IBM InfoSphere BigInsights solution
  • Integrates gigabytes, terabytes, or petabytes of unstructured data from web-based repositories
  • Collects a wide range of unstructured web data stemming from user-defined seed URLs
  • Extracts and Enriches that data using the unstructured information management architecture you choose (LanguageWare, OpenCalais, etc.)
  • Lets you Explore and Visualize this data in specific, user defined contexts. (such as ManyEyes)

I checked and it doesn’t look like BigSheets is included in the basic BigInsights edition (the free one).

Interesting to think of the problem scenarios for BigSheets as Jeopardy clues:

    • Research and analytics of structured databases result in dated information that cannot properly guide strategies or support decisions.
    • Ans: What is Californication?
    • The reach of your business intelligence data is limited to enterprise databases – providing only a one-sided view of the real business environment.
    • Ans: What is “Where the Wild Things Are?” (semantically diverse data)
    • Customer preferences and website activity are captured only through pre-packaged, outsourced web analytics. There is no way to do it yourself.
    • Ans: What is Semantic-COBOL or SCOBOL? (COBOL being the original DIY programming language for business types)

Serious question: Say you and I separately create data mashups using BigSheets. How do we merge those together so neither one of us has to repeat the work we did creating the mashups? So that the result is the accumulation of our insights?

NPSML Library – C – Machine Learning

Sunday, August 28th, 2011

Naval Postgraduate School Machine Learning Library (NPSML Library)

At present, a pre-release, C-based machine learning package. Do note the file format requirements.

Enhancing search results using machine learning

Sunday, August 28th, 2011

Enhancing search results using machine learning by Emmanuel Espina

From the introduction:

To introduce you in the topic let’s think about how the users are used to work with “information retrieval platforms” (I mean, search engines). The user enters your site, sees a little rectangular box with a button that reads “search” besides it, and figures out that he must think about some keywords to describe what he wants, write them in the search box and hit search. Despite we are all very used to this, a deeper analysis of the workings of this procedure leads to the conclusion that it is a quite unintuitive procedure. Before search engines, the action of “mentally extracting keywords” from concepts was not a so common activity.

It is something natural to categorize things, to classify the ideas or concepts, but extracting keywords is a different intellectual activity. While searching, the user must think like the search engine! The user must think “well, this machine will give me documents with the words I am going to enter, so which are the words that have the best chance to give me what I want” (emphasis added)

Hmmmm, but prior to full-text search, users learned how to think like the indexers who created the index they were using. Indexers were a first line of defense against unbounded information as indexes covered particular resources and had mechanisms to account for changing terminology. Not to mention domain specific vocabularies that users could master.

A second line of defense were librarians who not only mastered domain specific indexes but who could also move from one specialized finding aid to another, collating information as they went. The ability to transition from one finding aid to another is one that has yet to be duplicated by automatic means. In part because it depends on the resources available in a particular library.

Do read the article to see how the author proposes to use machine learning to improve search results.
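For readers who have not looked inside a search engine, the keyword matching that the quoted introduction describes ("documents with the words I am going to enter") can be sketched with a tiny inverted index. This is only an illustration of plain keyword lookup, not the machine learning approach the article proposes; the function names and the three sample documents are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set


def build_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each lower-cased word to the set of document ids containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents, made up for this sketch.
docs = {
    "d1": "topic maps merge subjects",
    "d2": "graph databases store connected data",
    "d3": "topic maps and graph databases",
}
index = build_index(docs)
hits = search(index, "graph databases")
```

The index knows nothing about concepts, only about the literal words, which is why the user ends up guessing which keywords the documents happen to contain.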

BTW, do you know of any sets of query responses that are publicly available?

10 Weeks to Lean Integration

Sunday, August 28th, 2011

10 Weeks to Lean Integration by John Schmidt.

From the post:

Lean Integration is a management system that emphasizes focusing on the customer, driving continuous improvements, and the elimination of waste in end-to-end data integration and application integration activities.

Lean practices are well-established in other disciplines such as manufacturing, supply-chain management, and software development to name just a few, but the application of Lean to the integration discipline is new.

Based on my research, no-one has tackled this topic directly in the form of a paper or book. But the world is a big place, so if some of you readers have come across prior works, please let me know. In the meantime, you heard it here first!

The complete list of posts:

Week 1: Introduction of Lean Integration (this posting)
Week 2: Eliminating waste
Week 3: Sustaining knowledge
Week 4: Planning for change
Week 5: Delivering fast
Week 6: Empowering the team
Week 7: Building in quality
Week 8: Optimizing the whole
Week 9: Deming’s 14 Points
Week 10: Practical Implementation Considerations

I don’t necessarily disagree with the notion of reducing variation in an enterprise. I do think integration solutions need to be flexible enough to adapt to variation encountered in the “wild” as it were.

I do appreciate John’s approach, which treats integration as more than a technical problem. Integration (like other projects) is as much an organizational issue as a technical one.

The Future of Hadoop

Sunday, August 28th, 2011

The Future of Hadoop – with Doug Cutting and Jeff Hammerbacher

From the description:

With a community of over 500 contributors, Apache Hadoop and related projects are evolving at an ever increasing rate. Join the co-creator of Apache Hadoop, Doug Cutting, and Cloudera’s Chief Scientist, Jeff Hammerbacher, for a discussion of the most exciting new features being developed by the Apache Hadoop community.

The primary focus of the webinar will be the evolution from the Apache Hadoop kernel to the complete Apache Bigtop platform. We’ll cover important changes in the kernel, especially high availability for HDFS and the separation of cluster resource management and MapReduce job scheduling.

We’ll discuss changes throughout the platform, including support for Snappy-based compression and the Avro data file format in all components, performance and security improvements across all components, and additional supported operating systems. Finally, we’ll discuss new additions to the platform, including Mahout for machine learning and HCatalog for metadata management, as well as important improvements to existing platform components like HBase and Hive.

Both the slides and the recording of this webinar are available but I would go for the recording.

One of the most informative and entertaining webinars I have seen, ever. Cites actual issue numbers and lays out how Hadoop is on the road to becoming a stack of applications that offer a range of data handling and analysis capabilities.

If you are interested in data processing/analysis at any scale, you need to see this webinar.

Road To A Distributed Search Engine

Sunday, August 28th, 2011

Road To A Distributed Search Engine by Shay Banon.

If you are looking for a crash course on the construction details of Elasticsearch, you are in the right place.

My only quibble, and this is common to all really good presentations (this is one of those), is that there isn’t a transcript to go along with it. There is so much information that I will have to watch it more than once to take it all in.

If you watch the presentation, do pay attention so you are not like the person who suggested that Solr and Elasticsearch were similar. 😉

OpenSource eDiscovery Engine

Saturday, August 27th, 2011

Gartner projects that eDiscovery will be a $1.5 Billion market by 2013.

An open source project that compares to or exceeds the capabilities of other solutions would be a very interesting prospect.

Particularly if the software had an inherent capability to merge eDiscovery results from multiple sources, say multiple plaintiffs’ attorneys who had started on litigation separately but now need to “merge” their discovery results.