Archive for March, 2013

Build a search engine in 20 minutes or less

Thursday, March 28th, 2013

Build a search engine in 20 minutes or less by Ben Ogorek.

I was skeptical but pleasantly surprised by the demonstration of the vector space model you will find there.

True, it doesn’t offer all the features of the latest Lucene/Solr releases, but it will give you a firm grounding in vector space models.


PS: One thing to keep in mind, semantics do not map to vector space. We can model word occurrences in vector space but occurrences are not semantics.
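For concreteness, the vector space model at the heart of the demonstration can be sketched in a few lines: documents become term-count vectors and queries are ranked by cosine similarity. A toy example of my own, not code from the post:

```python
import math
from collections import Counter

def tf_vector(text):
    """Term-frequency vector for a document (lowercased whitespace tokens)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog sat on the log",
    "d3": "semantics are not word counts",
}
query = tf_vector("cat on a mat")
ranked = sorted(docs, key=lambda d: cosine(query, tf_vector(docs[d])), reverse=True)
```

Note how d3 scores zero against the query: occurrence overlap, not meaning, drives the ranking, which is exactly the caveat in the PS above.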

…The Analytical Sandbox [Topic Map Sandbox?]

Thursday, March 28th, 2013

Analytics Best Practices: The Analytical Sandbox by Rick Sherman.

From the post:

So this situation sounds familiar, and you are wondering if you need an analytical sandbox…

The goal of an analytical sandbox is to enable business people to conduct discovery and situational analytics. This platform is targeted for business analysts and “power users” who are the go-to people that the entire business group uses when they need reporting help and answers. This target group is the analytical elite of the enterprise.

The analytical elite have been building their own makeshift sandboxes, referred to as data shadow systems or spreadmarts. The intent of the analytical sandbox is to provide the dedicated storage, tools and processing resources to eliminate the need for the data shadow systems.

Rick outlines what he thinks is needed for an analytical sandbox.

What would you include in a topic map sand box?

Cheminformatics Supplements

Thursday, March 28th, 2013

Cheminformatics Supplements

I ran across a pointer today to abstracts for the 8th German Conference on Chemoinformatics: 26 CIC-Workshop, from Chemistry Central.

I will pull several of the abstracts for fuller treatment, but whatever I choose, I am sure to miss the one abstract of most interest to you.

Moreover, the link at the top of this post takes you to all the “supplements” from Chemistry Central.

I am sure you will find a wealth of information.

Open Data for Africa Launched by AfDB

Thursday, March 28th, 2013

Open Data for Africa Launched by AfDB

From the post:

The African Development Bank Group has recently launched the ‘Open Data for Africa‘ as part of the bank’s goal to improve data management and dissemination in Africa. The Open Data for Africa is a user friendly tool for extracting data, creating and sharing own customized reports, and visualising data across themes, sectors and countries in tables, charts and maps. The platform currently holds data from 20 African countries : Algeria, Cameroon, Cape Verde, Democratic Republic of Congo, Ethiopia, Malawi, Morocco, Mozambique, Namibia, Nigeria, Ghana, Rwanda, Republic of Congo, Senegal, South Africa, South Sudan, Tanzania, Tunisia, Zambia and Zimbabwe.

Not a lot of resources but a beginning.

One trip to one country isn’t enough to form an accurate opinion of a continent but I must report my impression of South Africa from several years ago.

I was at a conference with mid-level government and academic types for a week.

In a country where “child head of household” is a real demographic category, I came away deeply impressed with the optimism of everyone I met.

You can just imagine the local news in the United States and/or Europe if a quarter of the population were dying.

Vows to “…never let this happen again…,” blah, blah, would choke the channels.

Not in South Africa. They readily admit to having a variety of serious issues but are equally serious about developing ways to meet those challenges.

If you want to see optimism in the face of stunning odds, I would strongly recommend a visit.

Biodiversity Heritage Library (BHL)

Thursday, March 28th, 2013

Biodiversity Heritage Library (BHL)

Best described by their own “about” page:

The Biodiversity Heritage Library (BHL) is a consortium of natural history and botanical libraries that cooperate to digitize and make accessible the legacy literature of biodiversity held in their collections and to make that literature available for open access and responsible use as a part of a global “biodiversity commons.” The BHL consortium works with the international taxonomic community, rights holders, and other interested parties to ensure that this biodiversity heritage is made available to a global audience through open access principles. In partnership with the Internet Archive and through local digitization efforts, the BHL has digitized millions of pages of taxonomic literature, representing tens of thousands of titles and over 100,000 volumes.

The published literature on biological diversity has limited global distribution; much of it is available in only a few select libraries in the developed world. These collections are of exceptional value because the domain of systematic biology depends, more than any other science, upon historic literature. Yet, this wealth of knowledge is available only to those few who can gain direct access to significant library collections. Literature about the biota existing in developing countries is often not available within their own borders. Biologists have long considered that access to the published literature is one of the chief impediments to the efficiency of research in the field. Free global access to digital literature repatriates information about the earth’s species to all parts of the world.

The BHL consortium members digitize the public domain books and journals held within their collections. To acquire additional content and promote free access to information, the BHL has obtained permission from publishers to digitize and make available significant biodiversity materials that are still under copyright.

Because of BHL’s success in digitizing a significant mass of biodiversity literature, the study of living organisms has become more efficient. The BHL Portal allows users to search the corpus by multiple access points, read the texts online, or download select pages or entire volumes as PDF files.

The BHL serves texts with information on over a million species names. Using UBio’s taxonomic name finding tools, researchers can bring together publications about species and find links to related content in the Encyclopedia of Life. Because of its commitment to open access, BHL provides a range of services and APIs which allow users to harvest source data files and reuse content for research purposes.

Since 2009, the BHL has expanded globally. The European Commission’s eContentPlus program has funded the BHL-Europe project, with 28 institutions, to assemble the European language literature. Additionally, the Chinese Academy of Sciences (BHL-China), the Atlas of Living Australia (BHL-Australia), Brazil (through BHL-SciELO) and the Bibliotheca Alexandrina have created national or regional BHL nodes. Global nodes are organizational structures that may or may not develop their own BHL portals. It is the goal of BHL to share and serve content through the BHL Portal developed and maintained at the Missouri Botanical Garden. These projects will work together to share content, protocols, services, and digital preservation practices.

A truly remarkable effort!

Would you believe they have a copy of “Aristotle’s History of Animals. In Ten Books,” translated by Richard Cresswell, for download as a PDF?

Tell me: how would you reconcile the terminology of Aristotle (or of Cresswell’s translation, for that matter) with modern terminology, both for species and their features, in order to enable navigation from this work to other works in the collection?

Moreover, how would you preserve that navigation for others to use?

Document-level granularity is better than not finding a document at all, but it is a far cry from efficient.

BHL-Europe web portal opens up…

Thursday, March 28th, 2013

BHL-Europe web portal opens up the world’s knowledge on biological diversity

From the post:

The goal of the Biodiversity Heritage Library for Europe (BHL-Europe) project is to make published biodiversity literature accessible to anyone who’s interested. The project will provide a multilingual access point (12 languages) for biodiversity content through the BHL-Europe web portal with specific biological functionalities for search and retrieval and through the EUROPEANA portal. Currently BHL-Europe involves 28 major natural history museums, botanical gardens and other cooperating institutions.

BHL-Europe is a 3 year project, funded by the European Commission under the eContentplus programme, as part of the i2010 policy.

Unlimited access to biological diversity information

The libraries of the European natural history museums and botanical gardens collectively hold the majority of the world’s published knowledge on the discovery and subsequent description of biological diversity. However, digital access to this knowledge is difficult.

The BHL project, launched in 2007 in the USA, is systematically attempting to address this problem. In May 2009 the ambitious and innovative EU project ‘Biodiversity Heritage Library for Europe’ (BHL-Europe) was launched. BHL-Europe is coordinated by the Museum für Naturkunde Berlin, Germany, and combines the efforts of 26 European and 2 American institutions. For the first time, the wider public, citizen scientists and decision makers will have unlimited access to this important source of information.

A project with enormous potential, although three (3) years seems a bit short.

Mentioned but without a link: the BHL project has digitized over 100,000 volumes, with information on more than one million species names.

Let’s do this the hard way [Topic Map Security]

Thursday, March 28th, 2013

Let’s do this the hard way by Edd Dumbill.

Discovery of high profile security vulnerabilities (Rails, MongoDB) caused Edd to pen this suggestion for software security:

But perhaps we are in need of an inversion of philosophy. Where Internet programming is concerned, everyone is quick to quote Postel’s law: “Be conservative in what you do, be liberal in what you accept from others.”

The fact of it is that being liberal in what you accept is really hard. You basically have two options: look carefully for only the information you need, which I think is the spirit of Postel’s law, or implement something powerful that will take care of many use cases. This latter strategy, though seemingly quicker and more future-proof, is what often leads to bugs and security holes, as unintended applications of powerful parsers manifest themselves.

My conclusion is this: use whatever language makes sense, but be systematically paranoid. Be liberal in what you accept, but conservative about what you believe.

Which raises the little noticed question of topic map security.

Suppose, for instance, you are using the TMDM model for a topic map and someone submits the topic map equivalent of “spam”: a topic that has the same subject identifier as some legitimate topic in your map, but is an ad to get you into “bikini shape.”

My inbox has seen a rash of those lately. I shudder to think what I would look like in “bikini shape.” It would be good for others, not so much for me. 😉

Or a topic that has a set of subject identifiers that causes merging between topics that should not be merged. Possibly overloading your system or at the very least, causing a disruption to your users.

There are no standard solutions to topic map security although I suspect some users/vendors have hand crafted their own.

To be taken seriously in these security conscious times, I think we need to extend the topic maps standard to provide for topic map security.

Suggestions and proposals welcome!


Elephant

Thursday, March 28th, 2013


From the webpage:

Elephant is an S3-backed key-value store with querying powered by Elastic Search. Your data is persisted on S3 as simple JSON documents, but you can instantly query it over HTTP.

Suddenly, your data becomes as durable as S3, as portable as JSON, and as queryable as HTTP. Enjoy!

I don’t recall seeing Elephant on the Database Landscape Map – February 2013. Do you?

Every database is thought, at least by its authors, to be different from all the others.

What dimensions would be the most useful ones for distinction/comparison?


I first saw this in Nat Torkington’s Four short links: 27 March 2013.

….Like A Child’s Story Book [Visual Storytelling]

Wednesday, March 27th, 2013

Articulating Your Content Strategy Like A Child’s Story Book by Michael Brito.

From the post:

I used to read “Love You Forever” to both of my girls when they were little. Even thinking about it today, I still get choked up. It’s really a heartfelt story. What I remember the most about it is that it uses imagery to tell a very significant story (as with most children’s books). The story is about a mother’s unconditional love for her son; and then chronicles her son’s life growing to an adult and starting his own family. The sad conclusion shows how he reciprocates his love to his mother who has grown to be an elderly woman. There are just a few sentences on each page but the story and illustration is powerful and you can even follow along without even reading the text.

Michael makes a great case for visual storytelling and includes a Slideshare presentation by Stefanos Karagos to underline his point.

Before you view the slides!

Ask yourself: what percentage of users have a great experience with your product?

The slides reveal what percent of users share your opinion.

I doubt you have noticed that I am really a “text” sort of person. 😉

The lesson here isn’t any more foreign to you than it is to me.

But I think the author has a very good point, assuming our goal is to communicate with others.

We can’t communicate with others as we would like for them to be.

At least not successfully.

Esri Geometry API

Wednesday, March 27th, 2013

Esri Geometry API

From the webpage:


The Esri Geometry API for Java can be used to enable spatial data processing in 3rd-party data-processing solutions. Developers of custom MapReduce-based applications for Hadoop can use this API for spatial processing of data in the Hadoop system. The API is also used by the Hive UDFs and could be used by developers building geometry functions for 3rd-party applications such as Cassandra, HBase, Storm and many other Java-based “big data” applications.


  • API methods to create simple geometries directly with the API, or by importing from supported formats: JSON, WKT, and Shape
  • API methods for spatial operations: union, difference, intersect, clip, cut, and buffer
  • API methods for topological relationship tests: equals, within, contains, crosses, and touches
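The Esri API itself is Java; to make the flavor of the spatial and topological operations listed above concrete without pulling in the library, here is a toy axis-aligned rectangle in Python. The `Rect` class and the county/parcel/road data are invented for illustration and are not the Esri API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle: a toy stand-in for a real geometry type."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def contains(self, other):
        """Topological 'contains': other lies entirely inside self."""
        return (self.xmin <= other.xmin and self.ymin <= other.ymin and
                self.xmax >= other.xmax and self.ymax >= other.ymax)

    def intersects(self, other):
        """True unless the rectangles are separated along some axis."""
        return not (self.xmax < other.xmin or other.xmax < self.xmin or
                    self.ymax < other.ymin or other.ymax < self.ymin)

    def intersection(self, other):
        """Clip: the overlapping rectangle, or None if disjoint."""
        if not self.intersects(other):
            return None
        return Rect(max(self.xmin, other.xmin), max(self.ymin, other.ymin),
                    min(self.xmax, other.xmax), min(self.ymax, other.ymax))

county = Rect(0, 0, 10, 10)
parcel = Rect(2, 2, 4, 4)
road = Rect(8, -5, 12, 5)
```

Real geometries (polygons with holes, curves, buffers) are vastly harder, which is the point of using a tested library like Esri’s rather than rolling your own.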

This looks particularly useful for mapping the rash of “public” data sets to facts on the ground.

Particularly if income levels, ethnicity, race, religion and other factors are taken into account.

Might give more bite to the “excess population,” aka the “47%” people speak so casually about.

Additional resources:

ArcGIS Geodata Resource Center

ArcGIS Blog


Apache Tajo

Wednesday, March 27th, 2013

Apache Tajo

From the webpage:


Tajo is a relational and distributed data warehouse system for Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation and ETL on large-data sets by leveraging advanced database techniques. It supports SQL standards. Tajo uses HDFS as a primary storage layer and has its own query engine which allows direct control of distributed execution and data flow. As a result, Tajo has a variety of query evaluation strategies and more optimization opportunities. In addition, Tajo will have a native columnar execution engine and its own optimizer.


  • Fast and low-latency query processing on SQL queries including projection, filter, group-by, sort, and join.
  • Rudimentary ETL that transforms one data format to another data format.
  • Support various file formats, such as CSV, RCFile, RowFile (a row store file), and Trevni.
  • Command line interface to allow users to submit SQL queries
  • Java API to enable clients to submit SQL queries to Tajo
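To see what a query from the feature list above does procedurally, here is a pure-Python equivalent of a small projection/filter/group-by/sort query, followed by an ETL step into another format. The table, columns, and data are made up; Tajo itself would run this as SQL over files in HDFS:

```python
import csv, io, json

# Toy rows in one of the file formats Tajo reads (CSV); the data is invented.
raw = """name,country,amount
alice,KR,10
bob,US,3
carol,KR,5
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Procedural equivalent of:
#   SELECT country, SUM(amount) AS total
#   FROM sales WHERE amount > 3
#   GROUP BY country ORDER BY total DESC
totals = {}
for r in rows:
    if int(r["amount"]) > 3:  # filter
        totals[r["country"]] = totals.get(r["country"], 0) + int(r["amount"])  # group-by + aggregate

ordered = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)  # sort

# The "rudimentary ETL" step: emit the same rows in another format (JSON lines).
json_lines = "\n".join(json.dumps({"country": c, "total": t}) for c, t in ordered)
```

A warehouse engine’s job is to do exactly this, distributed, over terabytes, without you writing the loops.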

If you ever wanted to get in on the ground floor of a data warehouse project, this could be your chance!

I first saw this at Apache Incubator: Tajo – a Relational and Distributed Data Warehouse for Hadoop by Alex Popescu.

Mumps: The Proto-Database…

Wednesday, March 27th, 2013

Mumps: The Proto-Database (Or How To Build Your Own NoSQL Database) by Rob Tweed.

From the post:

I think that one of the problems with Mumps as a database technology, and something that many people don’t like about the Mumps database is that it is a very basic and low-level engine, without any of the frills and value-added things that people expect from a database these days. A Mumps database doesn’t provide built-in indexing, for example, nor does it have any high-level query language (eg SQL, Map/Reduce) built in, though there are add-on products that can provide such capabilities.

On the other hand, a raw Mumps database, such as GT.M, is actually an interesting beast, as it turns out to provide everything you need to design and create your own NoSQL (or pretty much any other kind of) database. As I’ve discussed and mentioned a number of times in these articles, it’s a Universal NoSQL engine.

Why, you might ask, would you want to create your own NoSQL database? I’d possibly agree, but there hardly seems to be a week that goes by without someone doing exactly that and launching yet another NoSQL database. So, there’s clearly a perceived need or desire to do so.
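A GT.M “global” is essentially a persistent sparse array subscripted by keys, with no built-in indexing. A Python dict keyed by subscript tuples can stand in (non-persistently) to show why, and how, you build your own indexes on such a bare engine. The `^Person` data below is hypothetical:

```python
# A GT.M global is roughly a persistent sparse array subscripted by keys:
#   ^Person("id42","name") = "Ada"
# Model it (without persistence) as a dict keyed by subscript tuples.
db = {}

def set_node(*args):
    """set_node(sub1, sub2, ..., value): store a value under its subscripts."""
    *subscripts, value = args
    db[tuple(subscripts)] = value

# Data nodes -- the raw engine stores only these, nothing else.
set_node("Person", "id42", "name", "Ada")
set_node("Person", "id42", "city", "London")
set_node("Person", "id77", "name", "Rob")
set_node("Person", "id77", "city", "London")

# No built-in indexing: to query by city you maintain the index yourself,
# much as a Mumps application would maintain ^PersonByCity(city, id).
index = {}
for key, value in db.items():
    if key[0] == "Person" and key[2] == "city":
        index.setdefault(value, set()).add(key[1])

londoners = sorted(index["London"])
```

That “everything is yours to build” quality is what makes the post call Mumps a proto-database: the primitives are there, the policies are not.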

I first saw this at Mumps: The Proto-Database by Alex Popescu.

Alex asks:

The question I’d ask myself is not “why would I build another NoSQL database”, but rather “why none of the popular ones are built using Mumps?”.

I suspect the answer is the same as the answer to: why are popular NoSQL databases, such as MongoDB, re-inventing text indexing? (see MongoDB 2.4 Release)

Database Landscape Map – February 2013

Wednesday, March 27th, 2013

Database Landscape Map – February 2013 by 451 Research.

Database map

A truly awesome map of available databases.

Originated from Neither fish nor fowl: the rise of multi-model databases by Matthew Aslett.

Matthew writes:

One of the most complicated aspects of putting together our database landscape map was dealing with the growing number of (particularly NoSQL) databases that refuse to be pigeon-holed in any of the primary databases categories.

I have begun to refer to these as “multi-model databases” in recognition of the fact that they are able to take on the characteristics of multiple databases. In truth though there are probably two different groups of products that could be considered “multi-model”:

I think I understand the grouping from the key to the map but the ordering within groups, if meaningful, escapes me.

I am sure you will recognize most of the names but equally sure there will be some you can’t quite describe.


The Three Y’s of Topic Maps

Wednesday, March 27th, 2013

Thinking of ways that topic maps are the same or different from other information technologies.

Your Data: I think all information technologies would claim to handle your data. Some focus more on structured data than others, but in general, all handle “your data.”

Your Model: This is where topic maps and key/value stores depart from the Semantic Web.

You don’t get “your model” with the Semantic Web, you get a prefab logical model.

Contrast that with topic maps, where you can have FOL (first-order logic), SOL (second-order logic) or any other logic or non-logic you choose to have. It’s your model and it operates as you think it should. It could even be: “go ask Steve” for some operations.

The Semantic Web types will protest that not using their model means your data won’t work with their software. Which is one of their main reasons for touting their model: it works with their software.

Personally I prefer models that fit my use cases. As opposed to models whose first requirement is to work on a particular class of software.

Guess it depends on whether you want to further the well-being of Semantic Web software developers or your own.

Your Vocabulary: Another point where topic maps and key/value stores depart from the Semantic Web.

The vocabulary you choose for your model is your own.

Which is very likely to be more familiar and you can apply it more accurately.

How do topic maps differ from key/value stores?

Two things come to mind. First, topic maps have inherent machinery for the representation of relationships between subjects.

Not that you could not do that with a key/value store, but in some sense a key/value store is more primitive than a topic map. You would have to build up such structures for yourself.

Second, “as is,” key/value stores (at least the ones I have seen, which isn’t all of them), don’t have a well developed notion of subject identity.

That is, keys and values are both treated as primitives. If your key and my key aren’t the same, then they must be different. Or if they are the same, then they must be the same thing. The same goes for values.

That may not be a disadvantage in some cases where information aggregation or merging isn’t a requirement. But it is becoming harder and harder to think of use cases where aggregation/merging isn’t ever going to be an issue.
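A rough sketch of the merging rule at issue: topics whose subject-identifier sets overlap are treated as the same subject and collapse into one, which a key/value store’s primitive key equality cannot express. The topics and identifier URIs below are invented:

```python
def merge_topics(topics):
    """Merge topics transitively whenever they share a subject identifier.

    Each topic is a dict with sets of 'identifiers' and 'names'. A plain
    key/value store treats keys as opaque primitives; here identity is a
    set, and any overlap between sets triggers merging.
    """
    merged = []
    for topic in topics:
        group = {"identifiers": set(topic["identifiers"]),
                 "names": set(topic["names"])}
        rest = []
        for m in merged:
            if m["identifiers"] & group["identifiers"]:  # shared identifier -> same subject
                group["identifiers"] |= m["identifiers"]
                group["names"] |= m["names"]
            else:
                rest.append(m)
        merged = rest + [group]
    return merged

topics = [
    {"identifiers": {"http://example.com/paris"}, "names": {"Paris"}},
    {"identifiers": {"http://example.com/paris",
                     "http://example.com/lutetia"}, "names": {"Lutèce"}},
    {"identifiers": {"http://example.com/london"}, "names": {"London"}},
]
result = merge_topics(topics)
```

Here the two Paris topics collapse into one while London stays separate; note that a spam topic carrying a legitimate identifier would collapse in just as readily.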

I need to cover issues like the differences between topic maps and key/value stores more fully.

Would you be interested in longer pieces that could eventually form a book on topic maps?

Perhaps even by subscription?

Drake [Data Processing Workflow]

Wednesday, March 27th, 2013


From the webpage:

Drake is a simple-to-use, extensible, text-based data workflow tool that organizes command execution around data and its dependencies. Data processing steps are defined along with their inputs and outputs and Drake automatically resolves their dependencies and calculates:

  • which commands to execute (based on file timestamps)
  • in what order to execute the commands (based on dependencies)

Drake is similar to GNU Make, but designed especially for data workflow management. It has HDFS support, allows multiple inputs and outputs, and includes a host of features designed to help you bring sanity to your otherwise chaotic data processing workflows.
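Drake’s two calculations above (which steps are stale, and in what order to run them) can be sketched in a few lines of Python. The step names, files, and modification times below are invented; Drake reads real file timestamps and its own workflow file format:

```python
# Hypothetical workflow: step name -> (inputs, output), like Drake's "output <- inputs".
steps = {
    "clean":  (["raw.csv"], "clean.csv"),
    "counts": (["clean.csv"], "counts.csv"),
    "report": (["counts.csv"], "report.txt"),
}

# Pretend file-modification times (larger = newer); raw.csv was just updated.
mtime = {"raw.csv": 100, "clean.csv": 50, "counts.csv": 60, "report.txt": 70}

def plan(steps, mtime):
    """Return the stale steps, in dependency order."""
    order, stale = [], set()

    def visit(step):
        if step in order:
            return
        inputs, output = steps[step]
        # Visit the steps that produce our inputs first (dependency order).
        for s, (_, out) in steps.items():
            if out in inputs:
                visit(s)
        # A step is stale if an input is newer than its output,
        # or if an upstream step is itself going to rerun.
        upstream_outputs = {steps[s][1] for s in stale}
        if any(mtime[i] > mtime[output] for i in inputs) or \
           any(i in upstream_outputs for i in inputs):
            stale.add(step)
            order.append(step)

    for step in steps:
        visit(step)
    return order
```

With the times above, `plan(steps, mtime)` returns ["clean", "counts", "report"]: touching raw.csv makes everything downstream stale, which is exactly the behavior the Drake demo shows.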

The video demonstrating Drake is quite good.

Granted, my opinion may be influenced by the use of awk in the early examples. 😉

Definitely a tool for scripted production of topic maps.

I first saw this in a tweet by Chris Diehl.

Tensor Decompositions and Applications

Tuesday, March 26th, 2013

Tensor Decompositions and Applications by Tamara G. Kolda and Brett W. Bader.


This survey provides an overview of higher-order tensor decompositions, their applications, and available software. A tensor is a multidimensional or N-way array. Decompositions of higher-order tensors (i.e., N-way arrays with N ≥ 3) have applications in psychometrics, chemometrics, signal processing, numerical linear algebra, computer vision, numerical analysis, data mining, neuroscience, graph analysis, and elsewhere. Two particular tensor decompositions can be considered to be higher-order extensions of the matrix singular value decomposition: CANDECOMP/PARAFAC (CP) decomposes a tensor as a sum of rank-one tensors, and the Tucker decomposition is a higher-order form of principal component analysis. There are many other tensor decompositions, including INDSCAL, PARAFAC2, CANDELINC, DEDICOM, and PARATUCK2 as well as nonnegative variants of all of the above. The N-way Toolbox, Tensor Toolbox, and Multilinear Engine are examples of software packages for working with tensors.

At forty-five pages and with two hundred and forty-five (245) references, this is a broad survey of tensor decomposition, with numerous pointers to other surveys and more specialized works.
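To make the CP idea from the abstract concrete: a rank-R CP model writes a 3-way tensor as a sum of R outer products, T[i][j][k] = Σ_r A[r][i]·B[r][j]·C[r][k]. A pure-Python reconstruction sketch with made-up factors (the Tensor Toolbox and friends do the hard part, fitting the factors to data):

```python
def cp_reconstruct(A, B, C, shape):
    """Rebuild a 3-way tensor from CP factor lists:
    T[i][j][k] = sum over r of A[r][i] * B[r][j] * C[r][k]."""
    I, J, K = shape
    R = len(A)
    return [[[sum(A[r][i] * B[r][j] * C[r][k] for r in range(R))
              for k in range(K)]
             for j in range(J)]
            for i in range(I)]

# A rank-1 example: the outer product of three vectors.
a, b, c = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]
T = cp_reconstruct([a], [b], [c], (2, 2, 2))
```

Summing several such rank-one terms (R > 1) gives the general CP form the survey describes.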

I found this shortly after discovering the post I cover in: Tensors and Their Applications…

As I said in the earlier post, this has a lot of promise.

Although it isn’t yet clear to me how you would compare/contrast tensors with different dimensions and perhaps even a different number of dimensions.

Still, a lot of reading to do so perhaps I haven’t reached that point yet.

If you want to talk about the weather…

Tuesday, March 26th, 2013

Forecast for Developers

From the webpage:

The same API that powers Forecast.io and Dark Sky for iOS can provide accurate short-term and long-term weather predictions to your business, application, or crazy idea.

We’re developers too, and we like playing with new APIs, so we want you to be able to try ours hassle-free: all you need is an email address.

First thousand API calls a day are free.

Every 10,000 API calls after that are $1.
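Assuming the pricing above means $1 per started block of 10,000 calls beyond the free thousand per day (the quoted lines do not spell out how partial blocks are billed), a quick cost estimator:

```python
import math

def daily_cost_dollars(calls, free=1000, block=10000, price=1):
    """Estimated daily cost: first `free` calls are free, then `price`
    dollars per started block of `block` calls.
    (Assumption: partial blocks are billed as full blocks.)"""
    billable = max(0, calls - free)
    return math.ceil(billable / block) * price
```

Even a million calls a day would run about $100 under this reading, which is cheap enough for the “crazy idea” category.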

It could be useful/amusing to merge personal weather observations based on profile characteristics.

Like a recommendation system except for how you are going to experience the weather.

Our Internet Surveillance State [Intelligence Spam]

Tuesday, March 26th, 2013

Our Internet Surveillance State by Bruce Schneier.

Nothing like a good rant to get your blood pumping during a snap of cold weather! 😉

Bruce writes:

Maintaining privacy on the Internet is nearly impossible. If you forget even once to enable your protections, or click on the wrong link, or type the wrong thing, and you’ve permanently attached your name to whatever anonymous service you’re using. Monsegur slipped up once, and the FBI got him. If the director of the CIA can’t maintain his privacy on the Internet, we’ve got no hope.

In today’s world, governments and corporations are working together to keep things that way. Governments are happy to use the data corporations collect — occasionally demanding that they collect more and save it longer — to spy on us. And corporations are happy to buy data from governments. Together the powerful spy on the powerless, and they’re not going to give up their positions of power, despite what the people want.

And welcome to a world where all of this, and everything else that you do or is done on a computer, is saved, correlated, studied, passed around from company to company without your knowledge or consent; and where the government accesses it at will without a warrant.

Welcome to an Internet without privacy, and we’ve ended up here with hardly a fight.

I don’t disagree with anything Bruce writes but I do not counsel despair.

Nor would I suggest anyone stop using the “Internet, email, cell phones, web browser, social networking sites, search engines,” in order to avoid spying.

But remember that one of the reasons U.S. intelligence services have fallen on hard times is the increased reliance on “easy” data to collect.

Clipping articles from newspapers, or now copy-n-paste from emails and online zines, isn’t the same as having culturally aware human resources on the ground.

“Easy” data collection is far cheaper, but also less effective.

My suggestion is that everyone go “bare” and load up all listeners with as much junk as humanly possible.

Intelligence “spam” as it were.

Routinely threaten to murder fictitious characters in books or conspire to kidnap them. Terror plots, threats against Alderaan, for example.

Apparently even absurd threats, ‘One Definition of “Threat”,’ cannot be ignored.

A proliferation of fictional threats will leave them too little time to spy people going about their lawful activities.

BTW, not legal advice but I have heard that directly communicating any threat to any law enforcement agency is a crime. And not a good idea in any event.

Nor should you threaten any person or place or institution that isn’t entirely and provably fictional.

When someone who thinks mining social networking sites is a blow against terrorism overhears DC comic characters being threatened, that should be enough.

Tuesday, March 26th, 2013 by Daniel Edler and Martin Rosvall.

From the “about” page:

What do we do?

We develop mathematics, algorithms and software to simplify and highlight important structures in complex systems.

What are our goals?

To navigate and understand big data like we navigate and understand the real world by maps.

Suggest you start with the Apps.

Very impressive and has data available for loading.

You can also upload your own data.

Spend some time with Code and Publications as well.

I first saw this in a tweet by Chris@SocialTexture.

Analyzing Twitter Data with Apache Hadoop, Part 3:…

Tuesday, March 26th, 2013

Analyzing Twitter Data with Apache Hadoop, Part 3: Querying Semi-structured Data with Apache Hive by Jon Natkins.

From the post:

This is the third article in a series about analyzing Twitter data using some of the components of the Apache Hadoop ecosystem that are available in CDH (Cloudera’s open-source distribution of Apache Hadoop and related projects). If you’re looking for an introduction to the application and a high-level view, check out the first article in the series.

In the previous article in this series, we saw how Flume can be utilized to ingest data into Hadoop. However, that data is useless without some way to analyze the data. Personally, I come from the relational world, and SQL is a language that I speak fluently. Apache Hive provides an interface that allows users to easily access data in Hadoop via SQL. Hive compiles SQL statements into MapReduce jobs, and then executes them across a Hadoop cluster.

In this article, we’ll learn more about Hive, its strengths and weaknesses, and why Hive is the right choice for analyzing tweets in this application.
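For readers who have not seen what “compiles SQL statements into MapReduce jobs” amounts to: a GROUP BY/COUNT query becomes a map phase, a shuffle, and a reduce phase. A pure-Python sketch with made-up tweet records, not Hive’s actual plan:

```python
from collections import defaultdict

# Toy records standing in for tweets; only the grouping key matters here.
tweets = [{"user": "a"}, {"user": "b"}, {"user": "a"}, {"user": "a"}]

# What SELECT user, COUNT(*) FROM tweets GROUP BY user compiles down to:

# Map phase: emit a (key, 1) pair per record.
mapped = [(t["user"], 1) for t in tweets]

# Shuffle phase: group values by key (the Hadoop framework does this).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate each group -- here, COUNT(*) as a sum of 1s.
counts = {user: sum(values) for user, values in groups.items()}
```

Hive’s value is that you write the one-line SQL and it generates, schedules, and distributes the three phases across the cluster.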

I didn’t realize I had missed this part of the Hive series until I saw it mentioned in the Hue post.

Good introduction to Hive.

BTW, is Twitter data becoming the “hello world” of data mining?

How-to: Analyze Twitter Data with Hue

Tuesday, March 26th, 2013

How-to: Analyze Twitter Data with Hue by Romain Rigaux.

From the post:

Hue 2.2, the open source web-based interface that makes Apache Hadoop easier to use, lets you interact with Hadoop services from within your browser without having to go to a command-line interface. It features different applications like an Apache Hive editor and Apache Oozie dashboard and workflow builder.

This post is based on our “Analyzing Twitter Data with Hadoop” sample app and details how the same results can be achieved through Hue in a simpler way. Moreover, all the code and examples of the previous series have been updated to the recent CDH4.2 release.

The Hadoop ecosystem continues to improve!

Question: Is anyone keeping a current listing/map of the various components in the Hadoop ecosystem?


Dydra

Tuesday, March 26th, 2013


From the webpage:


Dydra is a cloud-based graph database. Whether you’re using existing social network APIs or want to build your own, Dydra treats your customers’ social graph as exactly that.

With Dydra, your data is natively stored as a property graph, directly representing the relationships in the underlying data.


With Dydra, you access and update your data via an industry-standard query language specifically designed for graph processing, SPARQL. It’s easy to use and we provide a handy in-browser query editor to help you learn.
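As for the skill level SPARQL asks of you: at bottom it is pattern matching, with variables, over (subject, predicate, object) triples. A toy in-memory matcher for a single triple pattern, with invented data; real SPARQL adds joins, filters, optionals, and more, and nothing here is Dydra-specific:

```python
# Toy triple store; a SPARQL query such as
#   SELECT ?who WHERE { ?who :knows :alice }
# is, at bottom, pattern matching over (subject, predicate, object) triples.
triples = [
    (":bob", ":knows", ":alice"),
    (":carol", ":knows", ":alice"),
    (":alice", ":knows", ":dan"),
]

def match(triples, pattern):
    """Yield variable bindings for one triple pattern; '?name' marks a variable.
    (A real engine also joins multiple patterns and checks repeated variables.)"""
    for triple in triples:
        binding = {}
        for pat, val in zip(pattern, triple):
            if pat.startswith("?"):
                binding[pat] = val
            elif pat != val:
                break
        else:
            yield binding

who_knows_alice = sorted(b["?who"] for b in match(triples, ("?who", ":knows", ":alice")))
```

If that model feels natural to you, the rest of SPARQL is mostly syntax and scale.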

Despite my misgivings about RDF (Simple Web Semantics), if you want to investigate RDF and SPARQL, Dydra would be a good way to get your feet wet.

You can get an idea of the skill level required by RDF/SPARQL.

Currently in beta, free with some resource limitations.

I particularly liked the line:

We manage every piece of the data store, including versioning, disaster recovery, performance, and more. You just use it.

RDF/SPARQL skills will remain a barrier, but Dydra does its best to make those the only barriers you will face. (And it has reduced some of those.)

Definitely worth your attention, whether you simply want to practice on RDF/SPARQL as a data source or have other uses for it.

I first saw this in a tweet by Stian Danenbarger.

Massive online data stream mining with R

Tuesday, March 26th, 2013

Massive online data stream mining with R

From the post:

A few weeks ago, the stream package was released on CRAN. It allows you to do real-time analytics on data streams. This can be very useful if you are working with large datasets which are already hard to fit in RAM completely, let alone build a statistical model on without running into RAM problems.

The stream package is currently focused on clustering algorithms available in MOA (Massive Online Analysis), and also eases interfacing with some clustering algorithms already available in R which are suited for data stream clustering. Classification algorithms based on MOA are on the to-do list. Currently available clustering algorithms are BIRCH, CluStream, ClusTree, DBSCAN, DenStream, Hierarchical, Kmeans and Threshold Nearest Neighbor.
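The appeal of stream clustering is that each point is seen once and discarded; nothing like the full dataset ever sits in RAM. Here is a toy one-dimensional online k-means update in Python, a minimal sketch of the idea, not the R package's API:

```python
# Toy online k-means in one dimension: each point is seen once and
# discarded; only k centers and their counts stay in memory.
# Illustration only -- the stream/MOA algorithms (CluStream,
# DenStream, ...) are far more sophisticated.

def online_kmeans(points, k):
    centers, counts = [], []
    for x in points:
        if len(centers) < k:          # first k points seed the centers
            centers.append(float(x))
            counts.append(1)
            continue
        # assign to the nearest center, then nudge it toward the point
        i = min(range(k), key=lambda j: abs(x - centers[j]))
        counts[i] += 1
        centers[i] += (x - centers[i]) / counts[i]
    return centers

# Two well-separated 1-D clusters, arriving interleaved as a "stream"
stream = [0.1, 9.8, 0.3, 10.2, -0.2, 9.9, 0.0, 10.1]
print(sorted(online_kmeans(stream, k=2)))  # centers near 0 and 10
```

Note the one-pass constraint: once a point has updated a center, it is gone, which is exactly the "information learned later in the stream" problem raised below.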

What if data were always encountered as a stream?

You could request a “re-streaming” of the data, but it is best to do the analysis in a single pass.

How would that impact your notion of subject identity?

How would you compensate for information learned later in the stream?

For the sake of 175?

Tuesday, March 26th, 2013

Out of Sight, Out of Mind

An interactive graphic depicting results of U.S. drone strikes in Pakistan since 2004.

In order to kill 47 targets, drone attacks have also killed 175 children, 535 civilians, 2349 “others.”

Highly effective graphic in a number of ways.

Try the Attacks, Victims, News and Info links in the upper left, or mouse over the individual attacks.

When your topic map presents information this effectively, you will be on the road to success!

PS: For policy wonks, only ten (10) innocents were required at Sodom and Gomorrah to avoid destruction.

Or to put it differently, would you murder on average more than three (3) children to kill one terrorist target? For President Obama, the answer is yes.

I first saw this at Nathan Yau’s Every known drone attack in Pakistan.

Wanted: Evaluators to Try MongoDB with Fractal Tree Indexing

Tuesday, March 26th, 2013

Wanted: Evaluators to Try MongoDB with Fractal Tree Indexing by Tim Callaghan.

From the post:

We recently resumed our discussion around bringing Fractal Tree indexes to MongoDB. This effort includes Tokutek’s interview with Jeff Kelly at Strata as well as my two recent tech blogs which describe the compression achieved on a generic MongoDB data set and performance improvements we measured using our implementation of Sysbench for MongoDB. I have a full line-up of benchmarks and blogs planned for the next few months, as our project continues. Many of these will be deeply technical and written by the Tokutek developers.

We have a group of evaluators running MongoDB with Fractal Tree Indexes, but more feedback is always better. So …

Do you want to participate in the process of bringing high compression and extreme performance gains to MongoDB? We’re looking for MongoDB experts to test our build on your real-world workloads and benchmarks. Evaluator feedback will be used in creating the product road map. Please email me at if interested.

You keep reading about the performance numbers on MongoDB.

Aren’t you curious if those numbers are true for your use case?

Here’s your opportunity to find out!

Master Indexing and the Unified View

Monday, March 25th, 2013

Master Indexing and the Unified View by David Loshin.

From the post:

1) Identity resolution – The master data environment catalogs the set of representations that each unique entity exhibits in the original source systems. Applying probabilistic aggregation and/or deterministic rules allows the system to determine that the data in two or more records refers to the same entity, even if the original contexts are different.

2) Data quality improvement – Linking records that share data about the same real-world entity enables the application of business rules to improve the quality characteristics of one or more of the linked records. This doesn’t specifically mean that a single “golden copy” record must be created to replace all instances of the entity’s data. Instead, depending on the scenario and quality requirements, the accessibility of the different sources and the ability to apply those business rules at the data user’s discretion will provide a consolidated view that best meets the data user’s requirements at the time the data is requested.

3) Inverted mapping – Because the scope of data linkage performed by the master index spans the breadth of both the original sources and the collection of data consumers, it holds a unique position to act as a map for a standardized canonical representation of a specific entity to the original source records that have been linked via the identity resolution processes.

In essence this allows you to use a master data index to support federated access to original source data while supporting the application of data quality rules upon delivery of the data.
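A deterministic version of step 1 can be sketched in a few lines: normalize each source record to a match key, then link records that share a key. The field names and normalization rule below are invented for illustration; real master indexes add probabilistic scoring on top of rules like this.

```python
from collections import defaultdict

# Minimal deterministic identity resolution: records from two source
# systems are linked when they share a normalized match key.
# All record contents here are invented for illustration.
records = [
    {"src": "crm",     "id": 1, "name": "Jane  Doe", "email": "JDOE@EXAMPLE.COM"},
    {"src": "billing", "id": 7, "name": "Doe, Jane", "email": "jdoe@example.com"},
    {"src": "crm",     "id": 2, "name": "John Roe",  "email": "jroe@example.com"},
]

def match_key(rec):
    # Here the key is just the lower-cased email; a real rule set
    # would combine several normalized attributes.
    return rec["email"].strip().lower()

# The master index is the inverted mapping of step 3: key -> source records
index = defaultdict(list)
for rec in records:
    index[match_key(rec)].append((rec["src"], rec["id"]))

print(index["jdoe@example.com"])    # → [('crm', 1), ('billing', 7)]
```

The index itself stores only links back to the sources, which is what supports federated access rather than a replacing “golden copy.”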

It’s been a long day but does David’s output have all the attributes of a topic map?

  1. Identity resolution – Two or more representations of the same subject
  2. Data quality improvement – Consolidated view of the data based on a subject and presented to the user
  3. Inverted mapping – Navigation based on a specific entity into original source records


5 Pitfalls To Avoid With Hadoop

Monday, March 25th, 2013

5 Pitfalls To Avoid With Hadoop by Syncsort, Inc.

From the registration page:

Hadoop is a great vehicle to extract value from Big Data. However, relying only on Hadoop and common scripting tools like Pig, Hive and Sqoop to achieve a complete ETL solution can hinder success.

Syncsort has worked with early adopter Hadoop customers to identify and solve the most common pitfalls organizations face when deploying ETL on Hadoop.

  1. Hadoop is not a data integration tool
  2. MapReduce programmers are hard to find
  3. Most data integration tools don’t run natively within Hadoop
  4. Hadoop may cost more than you think
  5. Elephants don’t thrive in isolation

Before you give up your email and phone number for the “free ebook,” be aware it is a promotional piece for Syncsort DMX-h.

Which isn’t a bad thing but if you are expecting something different, you will be disappointed.

The observations are trivially true and amount to Hadoop not having a user-facing interface, pre-written data integration routines, or the tools data integration users normally expect.

OK, a hammer doesn’t come with blueprints, nails, wood, etc., but those aren’t “pitfalls.”

It’s the nature of a hammer that those “extras” need to be supplied.

You can either do that piecemeal or you can use a single source (the equivalent of Syncsort DMX-h).

Syncsort should be on your short list of data integration options to consider but let’s avoid loose talk about Hadoop. There is enough of that in the uninformed mainstream media.

Implementing the RAKE Algorithm with NLTK

Monday, March 25th, 2013

Implementing the RAKE Algorithm with NLTK by Sujit Pal.

From the post:

The Rapid Automatic Keyword Extraction (RAKE) algorithm extracts keywords from text, by identifying runs of non-stopwords and then scoring these phrases across the document. It requires no training, the only input is a list of stop words for a given language, and a tokenizer that splits the text into sentences and sentences into words.

The RAKE algorithm is described in the book Text Mining: Applications and Theory by Michael W. Berry (free PDF). There is a (relatively) well-known Python implementation and a somewhat less well-known Java implementation.

I started looking for something along these lines because I needed to parse a block of text before vectorizing it and using the resulting features as input to a predictive model. Vectorizing text is quite easy with Scikit-Learn as shown in its Text Processing Tutorial. What I was trying to do was to cut down the noise by extracting keywords from the input text and passing a concatenation of the keywords into the vectorizer. It didn’t improve results by much in my cross-validation tests, however, so I ended up not using it. But keyword extraction can have other uses, so I decided to explore it a bit more.

I had started off using the Python implementation directly from my application code (by importing it as a module). I soon noticed that it was doing a lot of extra work because it was implemented in pure Python. I was using NLTK anyway for other stuff in this application, so it made sense to convert it to also use NLTK so I could hand off some of the work to NLTK’s built-in functions. So here is another RAKE implementation, this time using Python and NLTK.
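The core of RAKE is small enough to sketch without NLTK: split the text on stopwords and punctuation to get candidate phrases, score each word by degree/frequency, and score each phrase by the sum of its word scores. A toy stopword list is used below; Sujit's version delegates tokenization and stopwords to NLTK.

```python
import re
from collections import defaultdict

# Toy stopword list for illustration; a real one has a few hundred entries.
STOPWORDS = {"a", "an", "and", "the", "of", "is", "to", "for", "it", "from"}

def candidate_phrases(text):
    """Runs of consecutive non-stopwords are the candidate keywords."""
    words = re.findall(r"[a-zA-Z]+", text.lower())
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)
    return phrases

def rake(text):
    phrases = candidate_phrases(text)
    freq, degree = defaultdict(int), defaultdict(int)
    for ph in phrases:
        for w in ph:
            freq[w] += 1
            degree[w] += len(ph)   # w co-occurs with len(ph) words (incl. itself)
    word_score = {w: degree[w] / freq[w] for w in freq}
    scored = [(" ".join(ph), sum(word_score[w] for w in ph)) for ph in phrases]
    return sorted(scored, key=lambda t: -t[1])

text = "The RAKE algorithm extracts keywords from text, scoring candidate phrases."
for phrase, score in rake(text):
    print(f"{phrase}: {score:.1f}")
```

The degree/frequency ratio favors words that appear in longer phrases, which is why RAKE surfaces multi-word terms rather than single frequent words.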

Reminds me of the “statistically insignificant phrases” at Amazon. Or was that “statistically improbable phrases?”

If you search on “statistically improbable phrases,” you get twenty (20) “hits” under books at Amazon.

Could be a handy tool to quickly extract candidates for topics in a topic map.

The Tallinn Manual [Laws of War & Topic Maps]

Monday, March 25th, 2013

The Tallinn Manual

From the webpage:

The Tallinn Manual on the International Law Applicable to Cyber Warfare, written at the invitation of the Centre by an independent ‘International Group of Experts’, is the result of a three-year effort to examine how extant international law norms apply to this ‘new’ form of warfare. The Tallinn Manual pays particular attention to the jus ad bellum, the international law governing the resort to force by States as an instrument of their national policy, and the jus in bello, the international law regulating the conduct of armed conflict (also labelled the law of war, the law of armed conflict, or international humanitarian law). Related bodies of international law, such as the law of State responsibility and the law of the sea, are dealt with in the context of these topics.

The Tallinn Manual is not an official document, but instead an expression of opinions of a group of independent experts acting solely in their personal capacity. It does not represent the views of the Centre, our Sponsoring Nations, or NATO. It is also not meant to reflect NATO doctrine. Nor does it reflect the position of any organization or State represented by observers.

So you don’t run afoul of the laws of war with any of your topic map activities.

I first saw this in Nat Torkington’s Four short links: 22 March 2013.

I would normally credit his source but they say:

All rights reserved. This material may not be published, broadcast, rewritten or redistributed.

So I can’t tell you the name of the resource or its location. Sorry.

I did include the direct URL to the Tallinn Manual, which isn’t covered by their copyright.

PS: Remember “war crimes” are defined post-hoc by the victors so choose your side carefully.

Congratulations! You’re Running on OpenCalais 4.7!

Monday, March 25th, 2013

Congratulations! You’re Running on OpenCalais 4.7!

From the post:

This morning we upgraded OpenCalais to release 4.7. Our focus with 4.7 was on a significant improvement in the detection and disambiguation of companies as well as some behind-the-scenes tune-ups and bug fixes.

If your content contains company names you should already be seeing a significant improvement in detection and disambiguation. While company detection has always been very good in OpenCalais, now it’s great.

If you’re one of our high-volume commercial clients (1M+ transactions per day), we’ll be rolling out your upgrade toward the end of the month.

And, remember, you can always drop by the OpenCalais viewer for a quick test or exploration of OpenCalais with zero programming involved.

If you don’t already know OpenCalais:

From a user perspective it’s pretty simple: You hand the Web Service unstructured text (like news articles, blog postings, your term paper, etc.) and it returns semantic metadata in RDF format. What’s happening in the background is a little more complicated.

Using natural language processing and machine learning techniques, the Calais Web Service examines your text and locates the entities (people, places, products, etc.), facts (John Doe works for Acme Corporation) and events (Jane Doe was appointed as a Board member of Acme Corporation). Calais then processes the entities, facts and events extracted from the text and returns them to the caller in RDF format.
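Calling a service like this is just an HTTP POST of your raw text. The sketch below uses Python's stdlib only; the endpoint URL, header names and MIME types are placeholders for illustration, not the documented OpenCalais API, so consult the Calais docs for the real values.

```python
import urllib.request

# Sketch of calling a text-tagging web service like OpenCalais.
# ENDPOINT and the header names are hypothetical -- check the
# OpenCalais documentation for the actual endpoint and auth scheme.
ENDPOINT = "https://api.example.com/enrich"

def build_request(text, api_key):
    """Build a POST request carrying the raw text, asking for RDF back."""
    return urllib.request.Request(
        ENDPOINT,
        data=text.encode("utf-8"),
        headers={
            "X-api-key": api_key,            # hypothetical auth header
            "Content-Type": "text/raw",
            "Accept": "application/rdf+xml",
        },
    )

def tag_text(text, api_key):
    """Send the text, return the service's RDF response as a string."""
    with urllib.request.urlopen(build_request(text, api_key)) as resp:
        return resp.read().decode("utf-8")

# Usage (needs a real endpoint and key):
# rdf = tag_text("Jane Doe was appointed to the board of Acme.", "YOUR-KEY")
```

Your own code then parses the returned RDF for the entities, facts and events, the part OpenCalais cannot do for you: mapping them into your specialized model.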

Please also check out the Calais blog and forums to see where Calais is headed. Significant development activities include the ability for downstream content consumers to retrieve previously generated metadata using a Calais-provided GUID, additional input languages, and user-defined processing extensions.

Did I mention it is a free service up to 50,000 submissions a day? (see the license terms for details)

OpenCalais won’t capture every entity or relationship known to you but it will do a lot of the rote work for you. You can then fill in the specialized parts.