Archive for October, 2013

Integrating Nutch 1.7 with ElasticSearch

Saturday, October 26th, 2013

Integrating Nutch 1.7 with ElasticSearch

From the post:

With Nutch 1.7 the possibility for integrating with ElasticSearch became available. However setting up the integration turned out to be quite a treasure hunt for me. For anybody else wanting to achieve the same result without tearing out as much hair as I did please find some simple instructions on this page that hopefully will help you in getting Nutch to talk to ElasticSearch.

I’m assuming you have both Nutch and ElasticSearch running fine, by which I mean that Nutch does its crawl, fetch, parse thing and ElasticSearch is doing its indexing and searching magic, however not yet together.

All of the work involved is in Nutch and you need to edit nutch-site.xml in the conf directory to get things going. First off you need to activate the elasticsearch indexer plugin by adding the following line to nutch-site.xml:
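The excerpt cuts off before the configuration itself. As a reference point, here is a minimal sketch of the relevant nutch-site.xml entries, assuming the stock indexer-elasticsearch plugin that ships with Nutch 1.7; the property names and the exact plugin.includes list should be verified against your own conf/nutch-default.xml:

```xml
<!-- conf/nutch-site.xml — indexer-elasticsearch added to the plugin list -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elasticsearch|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <name>elastic.host</name>
  <value>localhost</value>
</property>
<property>
  <name>elastic.port</name>
  <value>9300</value>
</property>
<property>
  <name>elastic.cluster</name>
  <value>elasticsearch</value>
</property>
<property>
  <name>elastic.index</name>
  <value>nutch</value>
</property>
```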

A post that will be much appreciated by anyone who wants to integrate Nutch with ElasticSearch.

A large number of software issues are matters of configuration, once you know the configuration.

The explorers who find those configurations and share them with others are under appreciated.


Saturday, October 26th, 2013

Sylva: A Relaxed Schema Graph Database Management System

From the webpage:

Sylva [from the Old Spanish “silva”, a Renaissance type of book to organize knowledge] is a free, easy-to-use, flexible, and scalable database management system that helps you collect, collaborate, visualize and query large data sets.

No programming knowledge is required to use Sylva!

The data sets are stored according to the schemas (graphs) created by the user, and can be visualized as networks and as lists. Researchers have absolute freedom to grant different permissions to collaborators and to import and export their schemas and data sets.

Start using Sylva as soon as it becomes available!

Not available, yet, but putting it on a watch list.

The splash page reads as though graph operations can be restricted by a schema.

That would be an interesting capability.

In particular if workflow could be modeled by a schema.

Little Book of R for Bioinformatics!

Saturday, October 26th, 2013

Welcome to a Little Book of R for Bioinformatics! by Avril Coghlan.

From the webpage:

This is a simple introduction to bioinformatics, with a focus on genome analysis, using the R statistics software.

To encourage research into neglected tropical diseases such as leprosy, Chagas disease, trachoma, schistosomiasis etc., most of the examples in this booklet are for analysis of the genomes of the organisms that cause these diseases.

A Little Book of R For Bioinformatics, Release 01 (PDF)

Seventy-seven pages, including answers to exercises that should get you going with R in bioinformatics.

A Datalog Engine for GPUs

Saturday, October 26th, 2013

A Datalog Engine for GPUs by Carlos Alberto Martinez-Angeles, Ines Dutra, Vitor Santos Costa, Jorge Buenabad-Chavez. (Caution: The link points to a paper within the proceedings for Kiel Declarative Programming Days 2013. Lots of other good papers, but the file runs about 5.5 MB.)


We present the design and evaluation of a Datalog engine for execution in Graphics Processing Units (GPUs). The engine evaluates recursive and non-recursive Datalog queries using a bottom-up approach based on typical relational operators. It includes a memory management scheme that automatically swaps data between memory in the host platform (a multicore) and memory in the GPU in order to reduce the number of memory transfers. To evaluate the performance of the engine, three Datalog queries were run on the engine and on a single CPU in the multicore host. One query runs up to 200 times faster on the (GPU) engine than on the CPU.
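The bottom-up, semi-naive evaluation the abstract refers to can be sketched with the classic transitive-closure query. This is a plain CPU-side illustration of the evaluation strategy only; the paper’s engine implements it with relational operators running on the GPU.

```python
# Semi-naive bottom-up evaluation of:
#   path(X,Y) :- edge(X,Y).
#   path(X,Z) :- path(X,Y), edge(Y,Z).

def transitive_closure(edges):
    path = set(edges)      # base rule seeds the result
    delta = set(edges)     # facts derived in the last round
    while delta:
        # The semi-naive trick: join only the *new* facts against edge/2,
        # instead of re-deriving everything from scratch each round.
        new = {(x, z) for (x, y) in delta for (y2, z) in edges if y == y2}
        delta = new - path  # keep only genuinely new derivations
        path |= delta
    return path

edges = {(1, 2), (2, 3), (3, 4)}
print(sorted(transitive_closure(edges)))
# [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```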

Recalling that Lars Marius Garshol based tolog (a topic map query language) on Datalog.

Excel the Ultimate

Saturday, October 26th, 2013

Excel the Ultimate by Christophe Grand.

I know. That was my reaction when I saw the tweet from Christophe.

On my Clojure feed no less!

Still, I was recovering from working out in the yard. If I was confused when I read something about Excel, no harm, no foul. 😉

Went back to it after I rested up. You don’t want to be confused for this one.

It’s only twenty-four clicks to get through the slides. View them first.

Then take this link, where you will find “spreadmap.”

What is spreadmap?

Evil Clojure library to turn Excel spreadsheets into persistent reactive associative structures.

I concede that Excel spreadsheets do not give voice to the powerless, access to the disenfranchised, or work as a force for democracy around the world.

Neither does the NSA. What was your question again?

Massive Query Expansion by Exploiting Graph Knowledge Bases

Friday, October 25th, 2013

Massive Query Expansion by Exploiting Graph Knowledge Bases by Joan Guisado-Gámez, David Dominguez-Sal, Josep-Lluis Larriba-Pey.


Keyword based search engines have problems with term ambiguity and vocabulary mismatch. In this paper, we propose a query expansion technique that enriches queries expressed as keywords and short natural language descriptions. We present a new massive query expansion strategy that enriches queries using a knowledge base by identifying the query concepts, and adding relevant synonyms and semantically related terms. We propose two approaches: (i) lexical expansion that locates the relevant concepts in the knowledge base; and, (ii) topological expansion that analyzes the network of relations among the concepts, and suggests semantically related terms by path and community analysis of the knowledge graph. We perform our expansions by using two versions of the Wikipedia as knowledge base, concluding that the combination of both lexical and topological expansion provides improvements of the system’s precision up to more than 27%.

Heavy reading for the weekend but this paragraph caught my eye:

In this paper, we propose a novel query expansion which uses a hybrid input in the form of a query phrase and a context, which are a set of keywords and a short natural language description of the query phrase, respectively. Our method is based on the combination of both a lexical and topological analysis of the concepts related to the input in the knowledge base. We differ from previous works because we are not considering the links of each article individually, but we are mining the global link structure of the knowledge base to find related terms using graph mining techniques. With our technique we are able to identify: (i) the most relevant concepts and their synonyms, and (ii) a set of semantically related concepts. Most relevant concepts provide equivalent reformulations of the query that reduce the vocabulary mismatch. Semantically related concepts introduce many different terms that are likely to appear in a relevant document, which is useful to solve the lack of topic expertise and also disambiguate the keywords.
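The two expansion steps can be illustrated with a toy knowledge base. The paper mines Wikipedia’s full link structure with graph mining techniques; the entries and scoring below are invented purely to show the shape of the idea.

```python
# Toy sketch of the paper's two steps: lexical expansion (locate the right
# concept and its synonyms) and topological expansion (add semantically
# related neighbors from the knowledge graph).

knowledge_base = {
    "jaguar (animal)": {
        "synonyms": {"panthera onca"},
        "links": {"big cat", "rainforest"},
    },
    "jaguar (car)": {
        "synonyms": {"jaguar cars"},
        "links": {"automobile", "luxury car"},
    },
}

def expand(query_phrase, context_terms):
    # Lexical step: pick the concept whose graph neighborhood best
    # matches the short natural-language context.
    def overlap(concept):
        return len(knowledge_base[concept]["links"] & context_terms)
    concept = max(knowledge_base, key=overlap)
    entry = knowledge_base[concept]
    # Topological step: enrich the query with synonyms and related terms.
    return {query_phrase} | entry["synonyms"] | entry["links"]

print(sorted(expand("jaguar", {"rainforest", "predator"})))
# ['big cat', 'jaguar', 'panthera onca', 'rainforest']
```

Note how the context disambiguates “jaguar” toward the animal, addressing exactly the term-ambiguity problem the abstract describes.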

Wondering: since this works with Wikipedia’s hyperlinks, should the same hold for the references (rather than hyperlinks) of traditional publishing?

Say documents in Citeseer for example?

Nothing against Wikipedia but general knowledge doesn’t have a very high retail value.

Collection Aliasing:…

Friday, October 25th, 2013

Collection Aliasing: Near Real-Time Search for Really Big Data by Mark Miller.

From the post:

The rise of Big Data has been pushing search engines to handle ever-increasing amounts of data. While building Cloudera Search, one of the things we considered in Cloudera Engineering was how we would incorporate Apache Solr with Apache Hadoop in a way that would enable near-real-time indexing and searching on really big data.

Eventually, we built Cloudera Search on Solr and Apache Lucene, both of which have been adding features at an ever-faster pace to aid in handling more and more data. However, there is no silver bullet for dealing with extremely large-scale data. A common answer in the world of search is “it depends,” and that answer applies in large-scale search as well. The right architecture for your use case depends on many things, and your choice will generally be guided by the requirements and resources for your particular project.

We wanted to make sure that one simple scaling strategy that has been commonly used in the past for large amounts of time-series data would be fairly simple to set up with Cloudera Search. By “time-series data,” I mean logs, tweets, news articles, market data, and so on — data that is continuously being generated and is easily associated with a current timestamp.

One of the keys to this strategy is a feature that Cloudera recently contributed to Solr: collection aliasing. The approach involves using collection aliases to juggle collections in a very scalable little “dance.” The architecture has some limitations, but for the right use cases, it’s an extremely scalable option. I also think there are some areas of the dance that we can still add value to, but you can already do quite a bit with the current functionality.

A great post if you have really big data. 😉

Seriously, it is a great post and introduction to collection aliases.

On the other hand, I do wonder what routine Abbott and Costello would do with the variations on big, bigger, really big, etc., data.

Suggestions welcome!


Friday, October 25th, 2013

WhiteDB

From the webpage:

WhiteDB is a lightweight NoSQL database library written in C, operating fully in main memory. There is no server process. Data is read and written directly from/to shared memory, no sockets are used between WhiteDB and the application program.

This look like fun!

Not to mention being a reason to bump up to more memory. 😉

From Text to Truth:…

Friday, October 25th, 2013

From Text to Truth: Real-World Facets for Multilingual Search by Benson Margulies.


Solr’s ability to facet search results gives end-users a valuable way to drill down to what they want. But for unstructured documents, deriving facets such as the persons mentioned requires advanced analytics. Even if names can be extracted from documents, the user doesn’t want a “George Bush” facet that intermingles documents mentioning either the 41st or the 43rd U.S. President, nor does she want separate facets for “George W. Bush” or even “乔治·沃克·布什” (a Chinese translation) that are limited to just one string. We’ll explore the benefits and challenges of empowering Solr users with real-world facets.
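The fix Benson describes amounts to faceting on resolved entity IDs rather than surface strings, so every mention of the same real-world person lands in one bucket. A toy sketch, with an invented mention table and made-up IDs:

```python
# Facet on entity IDs, not strings: "George W. Bush" and its Chinese
# rendering resolve to the same entity, while "George H. W. Bush" stays
# separate. The mention table and IDs below are fabricated for illustration.

mention_to_entity = {
    "George H. W. Bush": "E1",
    "George W. Bush": "E2",
    "乔治·沃克·布什": "E2",     # Chinese rendering of George W. Bush
}

display_name = {"E1": "George H. W. Bush", "E2": "George W. Bush"}

def facet_counts(doc_mentions):
    counts = {}
    for mention in doc_mentions:
        entity = mention_to_entity.get(mention)
        if entity:  # skip mentions that could not be resolved
            name = display_name[entity]
            counts[name] = counts.get(name, 0) + 1
    return counts

docs = ["George W. Bush", "乔治·沃克·布什", "George H. W. Bush"]
print(facet_counts(docs))
# {'George W. Bush': 2, 'George H. W. Bush': 1}
```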

One of the better conference presentations I have seen in quite some time.

This is likely to change your mind about how you think about facets. Or at least how to construct them.

If you think of facets as the decoration you see at ecommerce sites, think again.


Apache Lucene and Solr 4.5.1 (bugfix)

Friday, October 25th, 2013

Apache Lucene and Solr 4.5.1

From the post:

Today the Apache Lucene and Solr PMC announced another version of the Apache Lucene library and Apache Solr search server, numbered 4.5.1. This is a minor bugfix release.

Download links for Apache Lucene 4.5.1 and Apache Solr 4.5.1, along with the release notes for each, are given in the post.

Without a “tech surge” no less.

Database and Query Analysis Tools for MySQL:…

Thursday, October 24th, 2013

Database and Query Analysis Tools for MySQL: Exploiting Hypertree and Hypergraph Decompositions by Selvameenal Chokkalingam.


A database is an organized collection of data. Database systems are widely used and have a broad range of applications. It is thus essential to find efficient database query evaluation techniques. In recent years, new theories and algorithms for database query optimization have been developed that exploit advanced graph theoretic concepts. In particular, the graph theoretic concepts of hypergraphs, hypergraph decompositions, and hypertree decompositions have played an important role in the recent research.

This thesis studies algorithms that employ hypergraph decompositions in order to detect the cyclic or acyclic degree of database schema, and describes implementations of those algorithms. The main contribution of this thesis is a collection of software tools for MySQL that exploit hypergraph properties associated with database schema and query structures.
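Detecting whether a database schema’s hypergraph is acyclic is commonly done with the GYO (Graham–Yu–Özsoyoğlu) reduction; the thesis’s tools may use a different formulation, but a minimal sketch of the idea looks like this:

```python
from collections import Counter

def is_acyclic(hyperedges):
    """GYO reduction: a hypergraph is acyclic iff repeatedly removing
    vertices unique to one edge, and edges contained in another edge,
    reduces the hypergraph to nothing."""
    edges = [set(e) for e in hyperedges]
    changed = True
    while changed:
        changed = False
        # Rule 1: drop vertices that occur in only one hyperedge.
        counts = Counter(v for e in edges for v in e)
        for e in edges:
            lonely = {v for v in e if counts[v] == 1}
            if lonely:
                e -= lonely
                changed = True
        # Rule 2: drop empty hyperedges and ones contained in another edge.
        for i, e in enumerate(edges):
            if not e or any(i != j and e <= f for j, f in enumerate(edges)):
                edges.pop(i)
                changed = True
                break
    return not edges

# R(a,b), S(b,c), T(c,d): a chain join, acyclic.
print(is_acyclic([{"a", "b"}, {"b", "c"}, {"c", "d"}]))   # True
# R(a,b), S(b,c), T(c,a): the triangle query, cyclic.
print(is_acyclic([{"a", "b"}, {"b", "c"}, {"c", "a"}]))   # False
```

Acyclic schemas are the interesting case because acyclic joins admit efficient evaluation (e.g., Yannakakis’s algorithm), which is why tools for detecting the cyclic or acyclic degree of a schema are useful at all.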

If you remember hypergraphs from database theory this may be the refresher for you.

I stumbled across it earlier today while running down references on hypergraphs.

Hypergraphs and Cellular Networks

Thursday, October 24th, 2013

Hypergraphs and Cellular Networks by Steffen Klamt, Utz-Uwe Haus, Fabian Theis. (Klamt S, Haus U-U, Theis F (2009) Hypergraphs and Cellular Networks. PLoS Comput Biol 5(5): e1000385. doi:10.1371/journal.pcbi.1000385)


The understanding of biological networks is a fundamental issue in computational biology. When analyzing topological properties of networks, one often tends to substitute the term “network” for “graph”, or uses both terms interchangeably. From a mathematical perspective, this is often not fully correct, because many functional relationships in biological networks are more complicated than what can be represented in graphs.

In general, graphs are combinatorial models for representing relationships (edges) between certain objects (nodes). In biology, the nodes typically describe proteins, metabolites, genes, or other biological entities, whereas the edges represent functional relationships or interactions between the nodes such as “binds to”, “catalyzes”, or “is converted to”. A key property of graphs is that every edge connects two nodes. Many biological processes, however, are characterized by more than two participating partners and are thus not bilateral. A metabolic reaction such as A+B→C+D (involving four species), or a protein complex consisting of more than two proteins, are typical examples. Hence, such multilateral relationships are not compatible with graph edges. As illustrated below, transformation to a graph representation is usually possible but may imply a loss of information that can lead to wrong interpretations afterward.

Hypergraphs offer a framework that helps to overcome such conceptual limitations. As the name indicates, hypergraphs generalize graphs by allowing edges to connect more than two nodes, which may facilitate a more precise representation of biological knowledge. Surprisingly, although hypergraphs occur ubiquitously when dealing with cellular networks, their notion is known to a much lesser extent than that of graphs, and sometimes they are used without explicit mention.

This contribution does by no means question the importance and wide applicability of graph theory for modeling biological processes. A multitude of studies proves that meaningful biological properties can be extracted from graph models (for a review see [1]). Instead, this contribution aims to increase the communities’ awareness of hypergraphs as a modeling framework for network analysis in cell biology. We will give an introduction to the notion of hypergraphs, thereby highlighting their differences from graphs and discussing examples of using hypergraph theory in biological network analysis. For this Perspective, we propose using hypergraph statistics of biological networks, where graph analysis is predominantly used but where a hypergraph interpretation may produce novel results, e.g., in the context of a protein complex hypergraph.

Like graphs, hypergraphs may be classified by distinguishing between undirected and directed hypergraphs, and, accordingly, we divide the introduction to hypergraphs given below into two major parts.
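The A+B→C+D example from the quoted passage can be made concrete: projecting the single hyperedge onto pairwise graph edges shows exactly the information loss the authors describe.

```python
# One metabolic reaction A + B -> C + D is a single directed hyperedge
# with four participants.
hyperedge = {"tail": {"A", "B"}, "head": {"C", "D"}}

# Graph projection: connect every substrate to every product pairwise.
graph_edges = {(s, p) for s in hyperedge["tail"] for p in hyperedge["head"]}
print(sorted(graph_edges))
# [('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D')]

# The four separate edges no longer record that all four species take part
# in ONE event: two independent reactions A -> C and B -> D would project
# onto a subset of the same edges, so the 4-ary relationship is lost.
```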

When I read:

Many biological processes, however, are characterized by more than two participating partners and are thus not bilateral.

I am reminded of Steve Newcomb’s multilateral models of subject identity.

Possibly overkill for some subject identity use cases and just right for other subject identity use cases. A matter of requirements.

This is a very good introduction to hypergraphs that concludes with forty-four (44) references to literature you may find useful.

Introductory Econometrics using Quandl and R

Thursday, October 24th, 2013

Introductory Econometrics using Quandl and R

From the webpage:

Econometrics applies mathematical and statistical tools to reveal measurable relationships in real world data. Though the ability to find relationships in quantitative data is useful in many professions and academic disciplines, knowledge and experience with econometric processes is limited to those with a college or university background in the subject. However, this does not need to be the case. Modern statistical software has become more powerful, but also much easier to use. In this tutorial, we introduce one of the most useful econometric tools: linear regression. Using detailed examples and clear explanations, we will begin our tutorial with the basic concepts of regression analysis and end with the more complex multiple regression model.
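The tutorial itself works in Quandl and R; for readers without either, the core of simple linear regression is just the closed-form least-squares estimates, sketched here in plain Python:

```python
# Simple linear regression y = slope * x + intercept, fit by ordinary
# least squares using the closed-form formulas.

def linear_regression(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
            / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Perfectly linear data: y = 2x + 1
slope, intercept = linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)   # 2.0 1.0
```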

Regression and multiple regression models are used in many domains. Picking up the technique and its terminology will help you recognize poor or deceitful use of it.

Data skepticism is a useful skill whether you are developing topic maps or not.

On Demand Memory Specialization for Distributed Graph Databases

Thursday, October 24th, 2013

On Demand Memory Specialization for Distributed Graph Databases by Xavier Martinez-Palau, David Dominguez-Sal, Reza Akbarinia, Patrick Valduriez, Josep Lluís Larriba-Pey.


In this paper, we propose the DN-tree that is a data structure to build lossy summaries of the frequent data access patterns of the queries in a distributed graph data management system. These compact representations allow us an efficient communication of the data structure in distributed systems. We exploit this data structure with a new Dynamic Data Partitioning strategy (DYDAP) that assigns the portions of the graph according to historical data access patterns, and guarantees a small network communication and a computational load balance in distributed graph queries. This method is able to adapt dynamically to new workloads and evolve when the query distribution changes. Our experiments show that DYDAP yields a throughput up to an order of magnitude higher than previous methods based on cache specialization, in a variety of scenarios, and the average response time of the system is divided by two.

Graph partitioning based on evolving and summarized query experience. Results in a min cut on far less than the whole graph.

A heavy read but promising approach.

Makes me curious about non-uniform merging priorities.

For example, there could be some topics that merge frequently, say those representing financial information. While others, say for historical figures, may merge infrequently.

Rather than treating all topics equally in terms of computational resources, some topics could be processed at higher priority than others. (I am assuming an appropriately defined SLA.)

Although the partitioning approach could be applicable to distributed topic maps as well.

I Mapreduced a Neo store:…

Thursday, October 24th, 2013

I Mapreduced a Neo store: Creating large Neo4j Databases with Hadoop by Kris Geusebroek. (Berlin Buzzwords 2013)

From the description:

When exploring very large raw datasets containing massive interconnected networks, it is sometimes helpful to extract your data, or a subset thereof, into a graph database like Neo4j. This allows you to easily explore and visualize networked data to discover meaningful patterns.

When your graph has 100M+ nodes and 1000M+ edges, using the regular Neo4j import tools will make the import very time-intensive (as in many hours to days).

In this talk, I’ll show you how we used Hadoop to scale the creation of very large Neo4j databases by distributing the load across a cluster and how we solved problems like creating sequential row ids and position-dependent records using a distributed framework like Hadoop.

If you find the slides hard to read (I did) you may want to try:

Combining Neo4J and Hadoop (part I) and,

Combining Neo4J and Hadoop (part II)

A recent update from Kris: I MapReduced a Neo4j store.

BTW, the code is on github.

Just in case you have any modest sized graphs that you want to play with in Neo4j. 😉

PS: I just found Kris’s slides:

The Gap Between Documents and Answers

Thursday, October 24th, 2013

I mentioned the webinar: Driving Knowledge-Worker Performance with Precision Search Results a few days ago in Findability As Value Proposition.

There was one nugget (among many) in the webinar before I lose sight of how important it is to topic maps and semantic technologies in general.

Dan Taylor (Earley and Associates) was presenting a maturation diagram for knowledge technologies.

See the presentation for the details, but what struck me was that on the left side (starting point) there were documents. On the right side (the goal) were answers.

Think about that for a moment.

When you search in Google or any other search engine, what do you get back? Pointers to documents, presentations, videos, etc.

What task remains? Digging out answers from those documents, presentations, videos.

A mature knowledge technology goes beyond what an average user is searching for (the Google model) and returns information based on a specific user for a particular domain, that is, an answer.

For the average user there may be no better option than to drop them off in the neighborhood of a correct answer. Or what may be a correct answer to the average user. No guarantees that you will find it.

The examples in the webinar are in specific domains where user queries can be modeled accurately enough to return answers (not documents).

Reminds me of TaxMap. You?

If you want to do a side by side comparison, try USC: Title 26 – Internal Revenue Code from the Legal Information Institute (Cornell).

Don’t get me wrong, the Cornell materials are great but they reflect the U.S. Code, nothing more or less. That is to say the text you find there isn’t engineered to provide answers. 😉

I will update this post with the webinar address as soon as it appears.

Are all the “Facts” in a Topic Map True? [Reporting on the NSA]

Thursday, October 24th, 2013

Topic maps are not constrained to report “true facts.”

The Topic Maps Data Model (TMDM, 5.3.1 Subjects and topics) states:

A subject can be anything whatsoever, regardless of whether it exists or has any other specific characteristics, about which anything whatsoever may be asserted by any means whatsoever. In particular, it is anything about which the creator of a topic map chooses to discourse. (emphasis in the original)

Which is fortunate for topic map authors who are tracking the false claims that NSA surveillance has prevented 54 terrorist attacks.

Claim on “Attacks Thwarted” by NSA Spreads Despite Lack of Evidence by Justin Elliott and Theodoric Meyer, reports:

Earlier this month, Sen. Patrick Leahy, D-Vt., pressed Alexander on the issue at a Senate Judiciary Committee hearing.

“Would you agree that the 54 cases that keep getting cited by the administration were not all plots, and of the 54, only 13 had some nexus to the U.S.?” Leahy said at the hearing. “Would you agree with that, yes or no?”

“Yes,” Alexander replied, without elaborating.

“We’ve heard over and over again the assertion that 54 terrorist plots were thwarted” by the two programs, Leahy told Alexander at the Judiciary Committee hearing this month. “That’s plainly wrong, but we still get it in letters to members of Congress, we get it in statements. These weren’t all plots and they weren’t all thwarted. The American people are getting left with the inaccurate impression of the effectiveness of NSA programs.”

To track the spread of false facts, see the excellent visualization in How the NSA’s Claim on Thwarted Terrorist Plots Has Spread by Sisi Wei, Theodoric Meyer and Justin Elliott.

With a topic map you could connect the spreaders of those lies with other lies they have spread on the same subject, other lies they have spread and their relationships to others who spread lies.

The NSA may be accidentally tracking terrorists every now and again.

What do you say to tracking the polluters of public policy discussions?

HDP 2.0 and its YARN-based architecture…delivered!

Wednesday, October 23rd, 2013

HDP 2.0 and its YARN-based architecture…delivered! by Shaun Connolly.

From the post:

Typical delivery of enterprise software involves a very controlled date with a secret roadmap designed to wow prospects, customers, press and analysts…or at least that is the way it usually works. Open source, however, changes this equation.

As described here, the vision for extending Hadoop beyond its batch-only roots in support of interactive and real-time workloads was set by Arun Murthy back in 2008. The initiation of YARN, the key technology for enabling this vision, started in earnest in 2011, was declared GA by the community in the recent Apache Hadoop 2.2 release, and is now delivered for mainstream enterprises and the broader commercial ecosystem with the release of Hortonworks Data Platform 2.0.

HDP 2.0, and its YARN foundation, is a huge milestone for the Hadoop market since it unlocks that vision of gathering all data in Hadoop and interacting with that data in many ways and with predictable performance levels… but you know this because Apache Hadoop 2.2 went GA last week.

If you know Shaun, do you think he knows that only 6.4% of projects costing $10 million or more succeed? (HealthCare.gov website ‘didn’t have a chance in hell’)

HDP 2.0 did take longer than expected.

But putting 18 children in a room isn’t going to produce an 18-year-old in a year.

Shaun does a great job in this post and points to other HDP 2.0 resources you will want to see.

BTW, maybe you should not mention the 6.4% success rate to Shaun. It might jinx how open source software development works and succeeds. 😉

Better (and More) Congressional Member Information

Wednesday, October 23rd, 2013

Better (and More) Congressional Member Information by Derek Willis.

From the post:

Over the summer, as lawmakers in Washington, D.C., haggled about the budget, energy, immigration and other issues, we’ve been at work upgrading the information about members contained in the Congress API.

Part of that process involved adding new elements to some of our member responses, but just as important is that with this update, the API is now part of a larger congressional data infrastructure effort that extends outside The Times.

First, we’ve added more detail to the member and member list responses, which now include a broader range of social media identifiers and other data that will help make it easier to connect to other sources of information. In particular, member responses now include Twitter and Facebook account names as well as the Facebook “id” used by the Graph API.

We’ve also added more details about lawmakers’ websites, including the address and the URL of the RSS feed, if one is present. There are a number of official congressional sites that do not use RSS, and if you want to retrieve press releases from them (and those sites with feeds), we have released a Ruby gem that does just that.

That is just a part of the improvements that Derek describes in his post.

The sharing of identifiers for members of Congress from different information systems is particularly interesting.

DynamiTE: Parallel Materialization of Dynamic RDF Data

Wednesday, October 23rd, 2013

DynamiTE: Parallel Materialization of Dynamic RDF Data by Jacopo Urbani, Alessandro Margara, Ceriel Jacobs, Frank van Harmelen, Henri Bal.


One of the main advantages of using semantically annotated data is that machines can reason on it, deriving implicit knowledge from explicit information. In this context, materializing every possible implicit derivation from a given input can be computationally expensive, especially when considering large data volumes.

Most of the solutions that address this problem rely on the assumption that the information is static, i.e., that it does not change, or changes very infrequently. However, the Web is extremely dynamic: online newspapers, blogs, social networks, etc., are frequently changed so that outdated information is removed and replaced with fresh data. This demands for a materialization that is not only scalable, but also reactive to changes.

In this paper, we consider the problem of incremental materialization, that is, how to update the materialized derivations when new data is added or removed. To this purpose, we consider the df RDFS fragment [12], and present a parallel system that implements a number of algorithms to quickly recalculate the derivation. In case new data is added, our system uses a parallel version of the well-known semi-naive evaluation of Datalog. In case of removals, we have implemented two algorithms, one based on previous theoretical work, and another one that is more efficient since it does not require a complete scan of the input.

We have evaluated the performance using a prototype system called DynamiTE, which organizes the knowledge bases with a number of indices to facilitate the query process and exploits parallelism to improve the performance. The results show that our methods are indeed capable to recalculate the derivation in a short time, opening the door to reasoning on much more dynamic data than is currently possible.
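One common way to support removals without a complete rescan of the input is to count derivations; the toy sketch below applies that idea to a single non-recursive RDFS-style rule. The paper’s algorithms handle the full recursive df fragment in parallel, so treat this only as an illustration of the counting idea, not of DynamiTE itself.

```python
from collections import Counter

# Incremental maintenance of the rule:
#   (x, type, c) and (c, subClassOf, d)  =>  (x, type, d)
# Each derived fact carries a count of distinct derivations, so a removal
# retracts a fact only when its last derivation disappears.

subclass = {("cat", "mammal"), ("mammal", "animal")}  # schema, static here
derived = Counter()   # derived fact -> number of derivations

def add_type(x, c):
    for (sub, sup) in subclass:
        if sub == c:
            derived[(x, sup)] += 1

def remove_type(x, c):
    for (sub, sup) in subclass:
        if sub == c:
            derived[(x, sup)] -= 1
            if derived[(x, sup)] == 0:
                del derived[(x, sup)]   # last derivation gone: retract

add_type("tom", "cat")
add_type("felix", "cat")
remove_type("tom", "cat")      # only tom's derivation is retracted
print(dict(derived))           # {('felix', 'mammal'): 1}
```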

Not “lite” reading but refreshing to see the dynamic nature of information being taken as a starting point.

Echoes of re-merging on the entry or deletion of information in a topic map. Yes?

Source code online: (Java)

I first saw this in a tweet by Marin Dimitrov.

X* 3.0 Proposed Recommendations

Tuesday, October 22nd, 2013

XQuery 3.0, XPath 3.0, Data Model, Functions and Operators and XSLT and XQuery Serialization 3.0

From the post:

The XML Query Working Group and the XSLT Working Group have published five Proposed Recommendations today:

Comments are welcome through 19 November. Learn more about the Extensible Markup Language (XML) Activity.

What’s today? October 22nd?

You almost have 30 days. 😉

Which one or more are you going to read?

I first saw this in a tweet by Jonathan Robie.

Solr-RA – Solr With RankingAlgorithm WARNING!

Tuesday, October 22nd, 2013

Solr-RA – Solr With RankingAlgorithm WARNING!

Google has become very aggressive with search completions and while trying to search for “solr,” Google completed it to Solr-RA.

On the homepage you will see:

Downloads (it is free) (emphasis added)

But there is also a rather prominent Software Agreement.

I can blow by EULAs with the best of them but I think you may want to read this license before you hit download. Seriously.

I am not going to give you legal advice on what I think it says or does not say.

All I want to do is give you a heads up to read the license for yourself. Then decide on your course of action.

I have no problems with commercial software.

But I do prefer to see “trial” or “limited license,” or some other notice of pending obligation on my part.

Please forward this about.

Active learning, almost black magic

Tuesday, October 22nd, 2013

Active learning, almost black magic by Lars Marius Garshol.

From the post:

I’ve written Duke, an engine for figuring out which records represent the same thing. It works fine, but people find it difficult to configure correctly, which is not so strange. Getting the configurations right requires estimating probabilities and choosing between comparators like Levenshtein, Jaro-Winkler, and Dice coefficient. Can we get the computer to do something people cannot? It sounds like black magic, but it’s actually pretty simple.

I implemented a genetic algorithm that can set up a good configuration automatically. The genetic algorithm works by making lots of configurations, then removing the worst and making more of the best. The configurations that are kept are tweaked randomly, and the process is repeated over and over again. It’s dead simple, but it works fine. The problem is: how is the algorithm to know which configurations are the best? The obvious solution is to have test data that tells you which records should be linked, and which ones should not be linked.

But that leaves us with a bootstrapping problem. If you can create a set of test data big enough for this to work, and find all the correct links in that set, then you’re fine. But how do you find all the links? You can use Duke, but if you can set up Duke well enough to do that you don’t need the genetic algorithm. Can you do it in other ways? Maybe, but that’s hard work, quite possibly harder than just battling through the difficulties and creating a configuration.

So, what to do? For a year or so I was stuck here. I had something that worked, but it wasn’t really useful to anyone.

Then I came across a paper where Axel Ngonga described how to solve this problem with active learning. Basically, the idea is to pick some record pairs that perhaps should be linked, and ask the user whether they should be linked or not. There’s an enormous number of pairs we could ask the user about, but most of these pairs provide very little information. The trick is to select those pairs which teach the algorithm the most.
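A minimal sketch of the two ideas combined — a genetic search over matching configurations, with active learning asking the user about the pair the current population disagrees on most. Everything here (the toy records, per-field weights, single threshold) is illustrative, not Duke's actual configuration model or API:

```python
import random

# Toy records: (name, city). A "configuration" is a per-field weight
# plus a match threshold -- a stand-in for Duke's comparator setup.
def similarity(a, b):
    return [1.0 if x == y else 0.0 for x, y in zip(a, b)]

def matches(config, a, b):
    weights, threshold = config
    sims = similarity(a, b)
    score = sum(w * s for w, s in zip(weights, sims)) / sum(weights)
    return score >= threshold

def random_config():
    return ([random.random() + 0.01 for _ in range(2)], random.random())

def fitness(config, labeled):
    # Fraction of user-labeled pairs this configuration gets right.
    return sum(matches(config, a, b) == truth for a, b, truth in labeled) / len(labeled)

def disagreement(population, pair):
    votes = sum(matches(c, *pair) for c in population)
    return min(votes, len(population) - votes)  # maximal when the vote is split 50/50

def evolve(pairs, oracle, generations=30, pop_size=20, questions=10):
    population = [random_config() for _ in range(pop_size)]
    labeled = []
    for _ in range(generations):
        if labeled:
            # Keep the best half, refill with randomly tweaked copies of them.
            population.sort(key=lambda c: fitness(c, labeled), reverse=True)
            best = population[: pop_size // 2]
            population = best + [
                ([w * random.uniform(0.8, 1.2) for w in ws],
                 min(1.0, t * random.uniform(0.8, 1.2)))
                for ws, t in best
            ]
        if len(labeled) < questions:
            # Active learning: ask the user about the most contested pair.
            pair = max(pairs, key=lambda p: disagreement(population, p))
            labeled.append((pair[0], pair[1], oracle(pair)))
    return max(population, key=lambda c: fitness(c, labeled))
```

The `oracle` callback stands in for the human answering "same thing or not?"; in a real setting that answer is typed in interactively.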

This is great stuff.

Particularly since I have a training problem that lacks a training set. 😉

Looking forward to trying this on “real-world problems” as Lars says. website ‘didn’t have a chance in hell’

Tuesday, October 22nd, 2013 website ‘didn’t have a chance in hell’ by Patrick Thibodeau.

From the post:

A majority of large IT projects fail to meet deadlines, are over budget and don’t make their users happy. Such is the case with

The U.S. is now racing to fix, the Affordable Care Act (ACA) website that launched Oct. 1, by bringing in new expertise to fix it.’s problems include site availability due to excessive loads and incorrect data recording, among other things.

President Barack Obama said Monday that there is “no excuse” for the problems at the site.

But his IT advisors shouldn’t be surprised — the success rate for large, multi-million dollar commercial and government IT projects is very low.

The Standish Group, which has a database of some 50,000 development projects, looked at the outcomes of multimillion dollar development projects and ran the numbers for Computerworld.

Of 3,555 projects from 2003 to 2012 that had labor costs of at least $10 million, only 6.4% were successful. The Standish data showed that 52% of the large projects were “challenged,” meaning they were over budget, behind schedule or didn’t meet user expectations. The remaining 41.4% were failures — they were either abandoned or started anew from scratch.

“They didn’t have a chance in hell,” said Jim Johnson, founder and chairman of Standish, of “There was no way they were going to get this right – they only had a 6% chance,” he said.

There is one reason that wasn’t offered for the failure.

Let me illustrate that reason.

In the Computer World article I quoted above, the article mentions the FBI tanking the $170 million virtual case initiative.

Contractor: SAIC.

Just last month I saw this notice:

XXXXXXXXX, San Diego, Calif., was awarded a $35,883,761 cost-plus-incentive-fee contract for software engineering, hardware, integration, technical support, and training requirements of the Integrated Strategic Planning and Analysis Network targeting function, including the areas of National Target Base production and National Desired Ground Zero List development. Work is being performed at Offutt Air Force Base, Neb., with an expected completion date of Sept. 30, 2018. This contract was a competitive acquisition and two offers were received. No funds have been obligated at time of award. The 55th Contracting Squadron at Offutt Air Force Base, Neb., is the contracting activity. (FA4600-13-D-0001) [From Defense News and Career Advice, September 19, 2013.]

Can you guess who the contractor is in that $35 million award?

If you guessed SAIC, you would be correct!

Where is the incentive to do a competent job on any contract?

If you fail on a government contract, you get to keep the money.

Not to mention that you are still in line for more $multi-million dollar contracts.

I’m not on that gravy train but I don’t think that is what bothers me.

Doing poor quality work, in software projects or anywhere else, diminishes all the practitioners in a particular profession.

The first step towards a solution is for government and industry to stop giving repeat business to software firms that fail.

If smaller firms can’t cope with the paperwork/supervision required by your layers of project management, that’s a clue you need to do internal house cleaning.

Remember the quote about insanity being defined as doing the same thing over and over and expecting a different result?

Titanic Machine Learning from Disaster (Kaggle Competition)

Tuesday, October 22nd, 2013

Titanic Machine Learning from Disaster (Kaggle Competition) by Andrew Conti.

From the post (and from the Kaggle page):

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this contest, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

This Kaggle Getting Started Competition provides an ideal starting place for people who may not have a lot of experience in data science and machine learning.

From Andrew’s post:

Goal for this Notebook:

Show a simple example of an analysis of the Titanic disaster in Python using a full complement of PyData utilities. This is aimed at those looking to get into the field or those who are already in the field and looking to see an example of an analysis done with Python.

This Notebook will show basic examples of:

Data Handling

  • Importing Data with Pandas
  • Cleaning Data
  • Exploring Data through Visualizations with Matplotlib

Data Analysis

  • Supervised Machine learning Techniques:
    • Logit Regression Model
    • Plotting results
  • Unsupervised Machine learning Techniques
    • Support Vector Machine (SVM) using 3 kernels
    • Basic Random Forest
    • Plotting results

Validation of the Analysis

  • K-folds cross-validation to validate results locally
  • Output the results from the IPython Notebook to Kaggle

Required Libraries:

This is wicked cool!

I first saw this in Kaggle Titanic Contest Tutorial by Danny Bickson.

PS: Don’t miss Andrew Conti’s new homepage.
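For a feel of the workflow the notebook outlines — fit a simple survival model, then check it with k-fold cross-validation — here is a pure-Python sketch on a tiny synthetic passenger table. The features and data are made up for illustration, not the Kaggle set, and the hand-rolled logistic regression stands in for the PyData stack:

```python
import math
import random

# Synthetic "passenger" rows: features are [bias, is_female, is_first_class].
# Made-up data that mimics the Kaggle pattern: women and first class fare better.
random.seed(0)
data = []
for _ in range(200):
    female = random.random() < 0.5
    first = random.random() < 0.3
    p_survive = 0.1 + 0.6 * female + 0.25 * first
    data.append(([1.0, float(female), float(first)], random.random() < p_survive))

def train(rows, epochs=300, lr=0.1):
    # Logistic regression fitted by plain stochastic gradient descent.
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, y in rows:
            z = sum(wi * xi for wi, xi in zip(w, x))
            pred = 1 / (1 + math.exp(-z))
            for i in range(len(w)):
                w[i] += lr * (y - pred) * x[i]
    return w

def accuracy(w, rows):
    correct = sum((sum(wi * xi for wi, xi in zip(w, x)) > 0) == y for x, y in rows)
    return correct / len(rows)

def kfold(rows, k=5):
    # Hold out every k-th row in turn, train on the rest, average the scores.
    scores = []
    for i in range(k):
        test = rows[i::k]
        trainset = [r for j, r in enumerate(rows) if j % k != i]
        scores.append(accuracy(train(trainset), test))
    return sum(scores) / k

print(f"5-fold accuracy: {kfold(data):.2f}")
```

The same shape of analysis — load, clean, fit, cross-validate — is what the notebook does with Pandas, Matplotlib and scikit-learn on the real data.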

New SOSP paper: a lightweight infrastructure for graph analytics

Monday, October 21st, 2013

New SOSP paper: a lightweight infrastructure for graph analytics by Danny Bickson.

Danny cites a couple of new graph papers that will be of interest:

I got this reference from my collaborator Aapo Kyrola, author of GraphChi.

A Lightweight Infrastructure for Graph Analytics. Donald Nguyen, Andrew Lenharth, Keshav Pingali (University of Texas at Austin), to appear in SOSP 2013.

It is an interesting paper which heavily compares to GraphLab, PowerGraph (GraphLab v2.1) and

One of the main claims is that dynamic and asynchronous scheduling can significantly speed up many graph algorithms (vs. bulk synchronous parallel model where all graph nodes are executed on each step).

Some concerns I have are regarding the focus on multicore settings, which makes everything much easier, and thus makes the comparison with PowerGraph less relevant.


Another relevant paper which improves on GraphLab is: Leveraging Memory Mapping for Fast and Scalable Graph Computation on a PC. Zhiyuan Lin, Duen Horng Chau, and U Kang, IEEE Big Data Workshop: Scalable Machine Learning: Theory and Applications, 2013. The basic idea is to speed graph loading using mmap() operation.

One of the things I like about Danny’s posts is that he is trying to improve graph processing for everyone, not cooking numbers for his particular solution.
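The headline contrast — dynamic, worklist-driven (asynchronous) scheduling versus bulk-synchronous rounds with a barrier between each step — can be sketched on BFS. This is a toy single-threaded illustration of the two scheduling styles, not the paper's Galois implementation:

```python
from collections import deque

# A small directed graph as an adjacency dict.
graph = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5], 4: [5], 5: []}

def bfs_bulk_synchronous(g, src):
    # BSP style: the whole frontier is processed in lockstep rounds,
    # with an implicit global barrier between rounds.
    dist = {src: 0}
    frontier = [src]
    while frontier:
        nxt = []
        for u in frontier:
            for v in g[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    nxt.append(v)
        frontier = nxt  # barrier: round i+1 starts only when round i is done
    return dist

def bfs_worklist(g, src):
    # Asynchronous style: a worklist of active nodes, no global barriers;
    # a node is (re)processed whenever its distance improves.
    dist = {src: 0}
    work = deque([src])
    while work:
        u = work.popleft()
        for v in g[u]:
            if dist[u] + 1 < dist.get(v, float("inf")):
                dist[v] = dist[u] + 1
                work.append(v)
    return dist
```

Both compute the same distances; the difference is that the worklist version never waits for a round to finish, which is where the speedups for many graph algorithms come from.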


Apache Hive 0.12: Stinger Phase Two… DELIVERED [Unlike Obamacare]

Monday, October 21st, 2013

Apache Hive 0.12: Stinger Phase Two… DELIVERED by Thejas Nair.

From the post:

Stinger is not a product. Stinger is a broad community based initiative to bring interactive query at petabyte scale to Hadoop. And today, as representatives of this open, community led effort we are very proud to announce delivery of Apache Hive 0.12, which represents the critical second phase of this project!

Only five months in the making, Apache Hive 0.12 comprises over 420 closed JIRA tickets contributed by ten companies, with nearly 150 thousand lines of code! This work is perfectly representative of our approach… it is a substantial release with major contributions from a wide group of talented engineers from Microsoft, Facebook, Yahoo and others.

Delivery of SQL-IN-Hadoop Marches On

The Stinger Initiative was announced in February and as promised, we have seen consistent regular delivery of new features and improvements as outlined in the Stinger plan. There are three roadmap vectors for Stinger: Speed, Scale and SQL. Each phase of the initiative advances on all three goals and this release provides a significant increase in SQL semantics, adding the VARCHAR and DATE datatypes and improving the performance of ORDER BY and GROUP BY. Several features to optimize queries have also been added.

We also contributed numerous “under the hood” improvements, i.e., refactoring code and making it easier to build on top of Hive – getting rid of some of the technical debt. This helps us deliver further optimizations in the long term, especially for the upcoming Apache Tez integration.

A complete list of the notable improvements included in the release is listed here and expect an updated performance benchmark soon!

It is so nice to see a successful software project!

And an open source one at that!

Unlike the no bid IT mega-failure that is Obamacare.

Maybe there is something to having a good infrastructure for code development as opposed to contractors billing by the phone call, lunch meeting and hour.

BTW, all the protests about the volume of users trying to register with Obamacare? More managerial incompetence.

When you are rolling out a system for potentially 300 million+ users, don’t you anticipate load as part of the requirements?

If you didn’t, there is the start of the trail of managerial incompetence in Obamacare.

Responsive maps with D3.js

Monday, October 21st, 2013

Responsive maps with D3.js by Nathan Yau.

Nathan points to three posts by Chris Amico on using D3.js to create responsive (to your device) maps, charts, and legends.


Work Efficient Parallel Algorithms for Large Graph Exploration

Monday, October 21st, 2013

Work Efficient Parallel Algorithms for Large Graph Exploration by Dip Sankar Banerjee, Shashank Sharma, Kishore Kothapalli.

Abstract:

Graph algorithms play a prominent role in several fields of sciences and engineering. Notable among them are graph traversal, finding the connected components of a graph, and computing shortest paths. There are several efficient implementations of the above problems on a variety of modern multiprocessor architectures. It can be noticed in recent times that the size of the graphs that correspond to real world data sets has been increasing. Parallelism offers only a limited succor to this situation as current parallel architectures have severe short-comings when deployed for most graph algorithms. At the same time, these graphs are also getting very sparse in nature. This calls for particular work efficient solutions aimed at processing large, sparse graphs on modern parallel architectures. In this paper, we introduce graph pruning as a technique that aims to reduce the size of the graph. Certain elements of the graph can be pruned depending on the nature of the computation. Once a solution is obtained for the pruned graph, the solution is extended to the entire graph. We apply the above technique on three fundamental graph algorithms: breadth first search (BFS), Connected Components (CC), and All Pairs Shortest Paths (APSP). To validate our technique, we implement our algorithms on a heterogeneous platform consisting of a multicore CPU and a GPU. On this platform, we achieve an average of 35% improvement compared to state-of-the-art solutions. Such an improvement has the potential to speed up other applications that rely on these algorithms.

This paper makes me curious if pruning could be a viable solution for processing very large topic maps?

I may have a map that includes all of Wikipedia but for some purposes, I have no interest in any of the Wikipedia data.

Or I may have different accounting entries that display depending upon the user accessing my accounting map. Totals, audit trails should be generated from one consistent set of entries.

Any experience or thoughts in that regard?
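As a rough sketch of the pruning idea (not the paper's algorithm): strip away degree-1 vertices, solve connected components on the smaller core, then extend the labels back out to the pruned vertices. The graph and helper names here are illustrative:

```python
def connected_components(adj):
    # Plain DFS-based connected components, run on the pruned core.
    label, seen = {}, set()
    for s in adj:
        if s in seen:
            continue
        stack = [s]
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            label[u] = s
            stack.extend(adj[u])
    return label

def pruned_components(adj):
    # Phase 1: iteratively prune degree-1 vertices, remembering the
    # neighbor each one hung off, so the core graph shrinks.
    adj = {u: set(vs) for u, vs in adj.items()}
    hung_off = {}
    leaves = [u for u in adj if len(adj[u]) == 1]
    while leaves:
        u = leaves.pop()
        if u not in adj or len(adj[u]) != 1:
            continue
        (v,) = adj[u]
        hung_off[u] = v
        adj[v].discard(u)
        del adj[u]
        if len(adj[v]) == 1:
            leaves.append(v)
    # Phase 2: solve the problem on the pruned core only.
    label = connected_components(adj)
    # Phase 3: extend the core solution back to the pruned vertices,
    # in reverse pruning order so each anchor's label is already known.
    for u in reversed(list(hung_off)):
        label[u] = label[hung_off[u]]
    return label
```

The same shape might apply to a large topic map: prune the portions (say, the Wikipedia-derived topics) irrelevant to the current computation, run it on the core, and extend results back only if needed.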

TripleRush: A Fast and Scalable Triple Store

Monday, October 21st, 2013

TripleRush: A Fast and Scalable Triple Store by Philip Stutz, Mihaela Verman, Lorenz Fischer, and Abraham Bernstein.

Abstract:

TripleRush is a parallel in-memory triple store designed to address the need for efficient graph stores that quickly answer queries over large-scale graph data. To that end it leverages a novel, graph-based architecture.

Specifically, TripleRush is built on our parallel and distributed graph processing framework Signal/Collect. The index structure is represented as a graph where each index vertex corresponds to a triple pattern. Partially matched copies of a query are routed in parallel along different paths of this index structure.

We show experimentally that TripleRush takes less than a third of the time to answer queries compared to the fastest of three state-of-the-art triple stores, when measuring time as the geometric mean of all queries for two benchmarks. On individual queries, TripleRush is up to three orders of magnitude faster than other triple stores.

If the abstract hasn’t already gotten your interest, consider the following:

The index graph we just described is different from traditional index structures, because it is designed for the efficient parallel routing of messages to triples that correspond to a given triple pattern. All vertices that form the index structure are active parallel processing elements that only interact via message passing.

That is the beginning to section “3.2 Query Processing.” It has a worked example that will repay a close reading.

The processing model outlined here is triple specific, but I don’t see any reason why the principles would not work for other graph structures.
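A toy sketch of the index-graph idea — one index entry per triple pattern (with wildcards), and partially bound copies of a query "routed" to the triples that match — in sequential Python. This only illustrates the shape of the routing; it is not Signal/Collect or the TripleRush implementation, and variables are marked with a leading "?" by convention here:

```python
from itertools import product

triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "likes", "carol"),
]

# Build the pattern index: every triple registers under all 8 patterns
# obtained by wildcarding subject/predicate/object (None = wildcard),
# mirroring one index vertex per triple pattern.
index = {}
for s, p, o in triples:
    for mask in product((True, False), repeat=3):
        key = tuple(v if keep else None for v, keep in zip((s, p, o), mask))
        index.setdefault(key, []).append((s, p, o))

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def match(pattern, binding):
    # Route a partially bound pattern to its index vertex; each stored
    # triple emits an extended copy of the binding (the "message").
    key = tuple(binding.get(t) if is_var(t) else t for t in pattern)
    for triple in index.get(key, []):
        new = dict(binding)
        for term, value in zip(pattern, triple):
            if is_var(term):
                new[term] = value
        yield new

def query(patterns):
    # Each pattern stage fans the surviving bindings out in "parallel".
    bindings = [{}]
    for pat in patterns:
        bindings = [b2 for b in bindings for b2 in match(pat, b)]
    return bindings
```

For example, `query([("alice", "knows", "?x"), ("?x", "knows", "?y")])` routes the first pattern to the `("alice", "knows", None)` index vertex, binds `?x`, and forwards the extended copy to the next pattern's vertex.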

This is going to the top of my reading list.

I first saw this in a tweet by Stefano Bertolo.