Archive for November, 2013

The Distributed Complexity of Large-scale Graph Processing

Tuesday, November 26th, 2013

The Distributed Complexity of Large-scale Graph Processing by Hartmut Klauck, Danupon Nanongkai, Gopal Pandurangan, Peter Robinson.


Motivated by the increasing need for fast distributed processing of large-scale graphs such as the Web graph and various social networks, we study a message-passing distributed computing model for graph processing and present lower bounds and algorithms for several graph problems. This work is inspired by recent large-scale graph processing systems (e.g., Pregel and Giraph) which are designed based on the message-passing model of distributed computing.

Our model consists of a point-to-point communication network of k machines interconnected by bandwidth-restricted links. Communicating data between the machines is the costly operation (as opposed to local computation). The network is used to process an arbitrary n-node input graph (typically k machines (a common implementation in many real world systems). Our goal is to study fundamental complexity bounds for solving graph problems in this model.

We present techniques for obtaining lower bounds on the distributed time complexity. Our lower bounds develop and use new bounds in random-partition communication complexity. We first show a lower bound of \Omega(n/k) rounds for computing a spanning tree (ST) of the input graph. This result also implies the same bound for other fundamental problems such as computing a minimum spanning tree (MST). We also show an \Omega(n/k^2) lower bound for connectivity, ST verification and other related problems.

We give algorithms for various fundamental graph problems in our model. We show that problems such as PageRank, MST, connectivity, and graph covering can be solved in \tilde{O}(n/k) time, whereas for shortest paths, we present algorithms that run in \tilde{O}(n/\sqrt{k}) time (for (1+\epsilon)-factor approx.) and in \tilde{O}(n/k) time (for O(\log n)-factor approx.) respectively.

The author’s state their main goal is:

…is to investigate the distributed time complexity, i.e., the number of distributed “rounds”, for solving various fundamental graph problems. The time complexity not only captures the (potential) speed up possible for a problem, but it also implicitly captures the communication cost of the algorithm as well, since links can transmit only a limited amount of bits per round; equivalently, we can view our model where instead of links, machines can send/receive only a limited amount of bits per round (cf. Section 1.1).

How would you investigate the number of “rounds,” to perform merging in a message passing topic map system?

With no one order of merging to reach a particular state, would you measure it statistically for some merging criteria N?

I first saw this in a tweet by Stefano Bertolo.

Ten Quick Tips for Using the Gene Ontology

Tuesday, November 26th, 2013

Ten Quick Tips for Using the Gene Ontology by Judith A. Blake.

From the post:

The Gene Ontology (GO) provides core biological knowledge representation for modern biologists, whether computationally or experimentally based. GO resources include biomedical ontologies that cover molecular domains of all life forms as well as extensive compilations of gene product annotations to these ontologies that provide largely species-neutral, comprehensive statements about what gene products do. Although extensively used in data analysis workflows, and widely incorporated into numerous data analysis platforms and applications, the general user of GO resources often misses fundamental distinctions about GO structures, GO annotations, and what can and can not be extrapolated from GO resources. Here are ten quick tips for using the Gene Ontology.

Tip 1: Know the Source of the GO Annotations You Use

Tip 2: Understand the Scope of GO Annotations

Tip 3: Consider Differences in Evidence Codes

Tip 4: Probe Completeness of GO Annotations

Tip 5: Understand the Complexity of the GO Structure

Tip 6: Choose Analysis Tools Carefully

Tip 7: Provide the Version of the Data/Tools Used

Tip 8: Seek Input from the GOC Community and Make Use of GOC Resources

Tip 9: Contribute to the GO

Tip 10: Acknowledge the Work of the GO Consortium

See Judith’s article for her comments and pointers under each tip.

The take away here is that an ontology may have the information you are looking for, but understanding what you have found is an entirely different matter.

For GO, follow Judith’s specific suggestions/tips, for any other ontology, take steps to understand the ontology before relying upon it.

I first saw this in a tweet by Stephen Turner.

The curse of NOARK

Tuesday, November 26th, 2013

The curse of NOARK by Lars Marius Garshol.

From the post:

I’m writing about a phenomenon that’s specifically Norwegian, but some things are easier to explain to foreigners, because we Norwegians have been conditioned to accept them. In this case I’m referring to the state of the art for archiving software in the Norwegian public sector, where everything revolves around the standard known as NOARK.

Let’s start with the beginning. Scandinavian chancelleries have a centuries-long tradition for a specific approach to archiving, which could be described as a kind of correspondence journal. Essentially, all incoming and outgoing mail, as well as important internal documents, were logged in a journal, with title, from, to, and date for each document. In addition, each document would be filed under a “sak”, which translates roughly as “case” or “matter under consideration”. Effectively, it’s a kind of tag which ties together one thread of documents relating to a specific matter.

The classic example is if the government receives a request of some sort, then produces some intermediate documents while processing it, and then sends a response. Perhaps there may even be couple of rounds of back-and-forth with the external party. This would be an archetypal “sak” (from now on referred to as “case”), and you can see how having all these documents in a single case file would be absolutely necessary for anyone responding to the case. In fact, it’s not dissimilar to the concept of an issue in an issue-tracking system.

In this post and its continuation in Archive web services: a missed opportunity Lars details the shortcomings of the NOARK standard.

To some degree specifically Norwegian but the problem of poor IT design is truly an international phenomena.

I haven’t made any suggestions since the U.S. is home to the virtual case management debacle, the incredible melting NSA data center, not to mention the non-functional health care IT system known as

Read these posts by Lars because you will encounter projects before mistakes similar to the ones Lars describes have been set in stone.

No guarantees of success but instead of providing semantic data management on top of a broken IT system, you could be providing semantic data management on top of a non-broken IT system.

Perhaps never a great IT system but I would settle for a non-broken one any day.

A Book About Bandit Algorithms

Tuesday, November 26th, 2013

A Book About Bandit Algorithms by Noel Welsh.

From the introduction:

…The basic bandit problem is this:

We are alone in a room full of slot machines (fruit machines, or one-armed bandits). At every tick of the clock we must choose a machine to play. When we pull the arm on a machine we get a reward – perhaps a river of coins tumbles from the machine, or perhaps nothing at all happens – but we don’t know what reward we will get till we try the arm. By pulling arms we can learn about the machines, and we must use this knowledge to choose the best arms to pull.

Though I’ve tried to make the problem concrete (and also show how the name multi-armed bandit arises) my description leaves many things undefined:

  • I haven’t said how the machines behave. Do they always reward at the same rate, or does the rate of reward change arbitrarily, or even capriciously?
  • Do we know anything about the machines before we start? Are some of them, for example, from the same manufacturer, so we might assume they behave similarly?
  • What exactly is the goal? Are we playing this game forever, and if so do we want to make the most money possible? Alternatively, we might want to find the best machine as quickly as possible, so we can go do something more interesting with our time instead.

All of these different variations are bandit problems that we will discuss. The essence of a bandit problem is meeting two criteria. Firstly, we only learn about actions when we take them. Thus bandits are partial information problems. We also require that our actions do not change the machines’ behaviour, though their behaviour may change for other reasons. This important restriction means we don’t have to consider the effect our past actions might have on the present, and makes bandit algorithms tractable to solve. If we drop this restriction we have a problem known as reinforcement learning.

Is merging a “partial information problem (PIP)?”

Or does that answer depend upon the complexity of the merging criteria?

If the status of merging as a PIP rests on the complexity of merging criteria, what is the boundary where merging becomes a PIP?

A book in progress so be sure to sign up for new content and assist Noel with comments.

I first saw this mentioned in a tweet by Lars Marius Garshol.

MAST Discovery Portal

Monday, November 25th, 2013

A New Way To Search, A New Way To Discover: MAST Discovery Portal Goes Live

From the post:

MAST is pleased to announce that the first release of our Discovery Portal is now available. The Discovery Portal is a one-stop web interface to access data from all of MAST’s supported missions, including HST, Kepler, GALEX, FUSE, IUE, EUVE, Swift, and XMM. Currently, users can search using resolvable target names or coordinates (RA and DEC). The returned data include preview plots of the data (images, spectra, or lightcurves), sortable columns, and advanced filtering options. An accompanying AstroViewer projects celestial sky backgrounds from DSS, GALEX, or SDSS on which to overlay footprints from your search results. A details panel allows you to see header information without downloading the file, visit external sites like interactive displays or MAST preview pages, and cross-search with the Virtual Observatory. In addition to searching MAST, users can also search the Virtual Observatory based on resolvable target names or coordinates, and download data from the VO directly through the Portal (Spitzer, 2MASS, WISE, ROSAT, etc.) You can quickly download data one row at a time, or add items to your Download Cart as you browse for download when finished, much like shopping online. Basic plotting tools allow you to visualize metadata from your search results. Users can also upload their own tables of targets (IDs and coordinates) for use within the Portal. Cross-matching can be done with all MAST data or any data available through the CDS at Strasbourg. All of these features interact with each other: you can use the charts to drag and select data points on a plot, whose footprints are highlighted in the AstroViewer and whose returned rows are brought to the top of your search results grid for further download or exploration.

Just a quick reminder that not every data mining project is concerned with recommendations of movies or mining reviews.

Seriously, astronomy has been dealing with “big data” long before it became a buzz word.

When you are looking for new techniques or insights into data exploration, check my posts under astroinformatics.

Integrating R with Cloudera Impala…

Monday, November 25th, 2013

Integrating R with Cloudera Impala for Real-Time Queries on Hadoop by Istvan Szegedi.

From the post:

Cloudera Impala supports low-latency, interactive queries on Hadoop data sets either stored in Hadoop Distributed File System (HDFS) or HBase, the distributed NoSQL database for Hadoop. Impala’s notion is to use Hadoop as a storage engine but move away from MapReduce algorithms. Instead, Impala uses distributed queries, a concept inherited from massive parallel processing databases. As a result, Impala supports SQL-like query languange (in the same way way as Apache Hive), but can execute the queries 10-100 times fasters than Hive that converts them into MapReduce. You can find more details on Impala in one of the previous posts.

R is one of the most popular open source statistical computing and graphical software. It can work with various data sources from comma separated files to web contents referred by URLs to relational databases to NoSQL (e.g. MongoDB or Cassandra) and Hadoop.

Thanks to the generic Impala ODBC driver, R can be integrated with Impala, too. The solution will provide fast, interactive queries running on top of Hadoop data sets and then the data can be further processed or visualized within R.

Have you noticed that newer technologies (Hadoop) are becoming accessible to more traditional tools (R)?

Which will move traditional tool users towards newer technologies.

The combining of the new with the traditional has a distinct odor.

I think it is called success. 😉

DBpedia as Tables [21st Century Interchange]

Monday, November 25th, 2013

DBpedia as Tables

From the webpage:

As some of the potential users of DBpedia might not be familiar with the RDF data model and the SPARQL query language, we provide some of the core DBpedia 3.9 data also in tabular form as Comma-Separated-Values (CSV) files, which can easily be processed using standard tools, such as spreadsheet applications, relational databases or data mining tools.

For each class in the DBpedia ontology (such as Person, Radio Station, Ice Hockey Player, or Band) we provide a single CSV file which contains all instances of this class. Each instance is described by its URI, an English label and a short abstract, the mapping-based infobox data describing the instance (extracted from the English edition of Wikipedia), and geo-coordinates.

Altogether we provide 530 CVS files in the form of a single ZIP file (size 3 GB compressed and 73.4 GB when uncompressed).

The ZIP file can be downloaded here (3 GB).


I have to admit that I applaud the move to release DBpedia as CSV files.

Despite my long time and continuing allegiance to XML, there are times when CSV is the optimum format for interchange.

You need to mark this date on your calendar.

I am curious what projects will appear using DBpedia data, based on the CSV version, in the next calendar year?

I first saw this in a tweet by Nicolas Torzec.

Faunus 4.1 Release

Monday, November 25th, 2013

Faunus 4.1 Release

I don’t find this change reflected in the 4.1 release notes but elsewhere Marko Rodriguez writes:

I tested the new code on a subset of the Friendster data (6 node Hadoop and 6 node Cassandra cluster).

    vertices: 7 minutes to write 39 million vertices at ~100mb/second from the Hadoop to the Cassandra cluster.

  • edges: 15 minutes to write 245 million edges at ~40mb/second from the Hadoop to the Cassandra cluster.

This is the fastest bulk load time I’ve seen to date. This means, DBPedia can be written in ~20 minutes! I’ve attached an annotated version of the Ganglia monitor to the email that shows the outgoing throughput for the various stages of the MapReduce job. In the past, I was lucky to get 5-10mb/second out of the edge writing stage (this had to do with how I was being dumb about how reduce worked in Hadoop — wasn’t considering the copy/shuffle aspect of the stage).

At this rate, this means we can do billion edges graphs in a little over 1 hour. I bet though I can now speed this up more with some parameter tweaking as I was noticing that Cassandra was RED HOT and locking up a few times on transaction commits. Anywho, Faunus 0.4.1 is going to be gangbusters!

Approximately one billion edges an hour?

It’s not > /dev/null speed but still quite respectable. 😉

Faunus 0.4.1 wikidoc.

Download Faunus 0.4.1.

Titan 4.1 Release

Monday, November 25th, 2013

Titan 4.1 Release

From the release notes:

Tested Compatibility:

  • Cassandra 1.2.2
  • HBase 0.94.12
  • BerkeleyJE 5.0.73
  • Elasticsearch 0.90.5
  • Lucene 4.4.0
  • Persistit 3.3.0
  • Java 1.7+ (partially compatible with Java 1.6)


  • Property pre-fetching to speed up multiple property lookups per vertex. Configurable through fast-property option.
  • Shortened HBase column-family names to reduce the HBase storage footprint. This feature is disabled by default for backwards-compatibility. Enable it via storage.short-cf-names
  • Metrics per Transaction: Gathering measurements on the transaction level and group them by transaction template name configurable through graph.buildTransaction().setMetricsPrefix(String)
  • Metrics Ganglia and Graphite support
  • Improvements to the internal memory structures and algorithms of a Titan transaction which lead to much improved traversal times (a lot of credit goes to Pavel for these optimizations!!)
  • Added database level cache for lower latency query answering against warm data. Enable via cache.db-cache. Learn more about Database Cache.
  • Better caching implementation for relations (RelationCache) to provide faster de-serialization performance
  • Addition of a new query optimizer that can significantly speed up a subset of traversals
  • Support for reverse ordering in vertex centric queries by defining: makeLabel(..).sortKey(..).sortOrder(Order.DESC).make()
  • Support for index configuration parameters passed into KeyMaker.indexed(String,Class,Parameter…) to change the default indexing behavior of an indexing backend.
  • Support for TEXT and STRING mapping of strings in both Lucene and ElasticSearch configurable as a parameter. Learn more about Full Text and String Search
  • Refactored Text.REGEX/PREFIX to Text.CONTAINS_REGEX/CONTAINS_PREFIX to accurately reflect their semantics. Added Text.REGEX/PREFIX for full string matching. See Indexing Backend Overview
  • Added support for scaling the id allocation to hundreds of parallel Titan instances through additional configuration options. See Bulk Loading.


  • Fixed multiQuery() for specific has() conditions. Added support for multiQuery(Collection).
  • Fixed limit adjustment issue for unconstraint IN/OUT queries
  • Fixed packaging issues
  • Fixed cache misses due to wrong limit interpretation

Looks like it is time for an upgrade!

BTW, this is an experimental version so NSFP (Not Safe For Production).

Titan 4.1 wikidoc.

Download Titan 4.1.

Fast Search and Analytics on Hadoop with Elasticsearch

Monday, November 25th, 2013

Fast Search and Analytics on Hadoop with Elasticsearch by Lisa Sensmeier.

From the post:

Hortonworks customers can now enhance their Hadoop applications with Elasticsearch real-time data exploration, analytics, logging and search features, all designed to help businesses ask better questions, get clearer answers and better analyze their business metrics in real-time.

Hortonworks Data Platform and Elasticsearch make for a powerful combination of technologies that are extremely useful to anyone handling large volumes of data on a day-to-day basis. With the ability of YARN to support multiple workloads, customers with current investments in flexible batch processing can also add real-time search applications from Elasticsearch.

Not much in the way of substantive content but it does have links to good resources on Hadoop and Elasticsearch.

Yelp Dataset Challenge

Monday, November 25th, 2013

Yelp Dataset Challenge

Deadline: Monday, February 10, 2014.

From the webpage:

Yelp is proud to introduce a deep dataset for research-minded academics from our wealth of data. If you’ve used our Academic Dataset and want something richer to train your models on and use in publications, this is it. Tired of using the same standard datasets? Want some real-world relevance in your research project? This data is for you!

Yelp is bringing you a generous sample of our data from the greater Phoenix, AZ metropolitan area including:

  • 11,537 businesses
  • 8,282 checkin sets
  • 43,873 users
  • 229,907 reviews


If you are a student and come up with an appealing project, you’ll have the opportunity to win one of ten Yelp Dataset Challenge awards for $5,000. Yes, that’s $5,000 for showing us how you use our data in insightful, unique, and compelling ways.

Additionally, if you publish a research paper about your winning research in a peer-reviewed academic journal, then you’ll be awarded an additional $1,000 as recognition of your publication. If you are published, Yelp will also contribute up to $500 to travel expenses to present your research using our data at an academic or industry conference.

If you are a student, see the Yelp webpage for more details. If you are not a student, pass this along to someone who is.

Yes, this is dataset mentioned in How-to: Index and Search Data with Hue’s Search App.

How-to: Index and Search Data with Hue’s Search App

Monday, November 25th, 2013

How-to: Index and Search Data with Hue’s Search App

From the post:

You can use Hue and Cloudera Search to build your own integrated Big Data search app.

In a previous post, you learned how to analyze data using Apache Hive via Hue’s Beeswax and Catalog apps. This time, you’ll see how to make Yelp Dataset Challenge data searchable by indexing it and building a customizable UI with the Hue Search app.

Don’t be discouraged by the speed of the presenter in the video.

I suspect he is more than “familiar” with the Hue, Solr and the Yelp dataset. 😉

Like all great “how-to” guides you get a very positive outcome.

A positive outcome with minimal effort may be essential reinforcement for new technologies.

Not all citations are equal:… [Semantic Triage?]

Sunday, November 24th, 2013

Not all citations are equal: identifying key citations automatically by Daniel Lemire.

From the post:

Suppose that you are researching a given issue. Maybe you have a medical condition or you are looking for the best algorithm to solve your current problem.

A good heuristic is to enter reasonable keywords in Google Scholar. This will return a list of related research papers. If you are lucky, you may even have access to the full text of these research papers.

Is that good enough? No.

Scholarship, on the whole, tends to improve with time. More recent papers incorporate the best ideas from past work and correct mistakes. So, if you have found a given research paper, you’d really want to also get a list of all papers building on it…

Thankfully, a tool like Google Scholar allows you to quickly access a list of papers citing a given paper.

Great, right? So you just pick your research paper and review the papers citing them.

If you have ever done this work, you know that most of your effort will be wasted. Why? Because most citations are shallow. Almost none of the citing papers will build on the paper you picked. In fact, many researchers barely even read the papers that they cite.

Ideally, you’d want Google Scholar to automatically tell apart the shallow citations from the real ones.

The paper of the same title is due to appear in JASIST.

The abstract:

The importance of a research article is routinely measured by counting how many times it has been cited. However, treating all citations with equal weight ignores the wide variety of functions that citations perform. We want to automatically identify the subset of references in a bibliography that have a central academic influence on the citing paper. For this purpose, we examine the effectiveness of a variety of features for determining the academic influence of a citation.

By asking authors to identify the key references in their own work, we created a dataset in which citations were labeled according to their academic influence. Using automatic feature selection with supervised machine learning, we found a model for predicting academic influence that achieves good performance on this dataset using only four features.

The best features, among those we evaluated, were features based on the number of times a reference is mentioned in the body of a citing paper. The performance of these features inspired us to design an influence-primed h-index (the hip-index). Unlike the conventional h-index, it weights citations by how many times a reference is mentioned. According to our experiments, the hip-index is a better indicator of researcher performance than the conventional h-index.

What I find intriguing is the potential for this type of research to enable a type of semantic triage when creating topic maps or other semantic resources.

If only three out of thirty citations in a paper are determined to be “influential,” why should I use scarce resources to capture them as completely as the influential resources?

The corollary to Daniel’s “not all citations are equal,” is that “not all content is equal.”

We already make those sort of choices when we select some citations from the larger pool of possible citations.

I’m just suggesting that we make that decision explicit when creating semantic resources.

PS: I wonder how Daniel’s approach would work with opinions rendered in legal cases. Court’s often cite an entire block of prior decisions but no particular rule or fact from any of them. Could reduce the overhead of tracking influential prior case decisions.


Sunday, November 24th, 2013

GDELT: The Global Database of Events, Language, and Tone

From the about page:

The Global Database of Events, Language, and Tone (GDELT) is an initiative to construct a catalog of human societal-scale behavior and beliefs across all countries of the world over the last two centuries down to the city level globally, to make all of this data freely available for open research, and to provide daily updates to create the first “realtime social sciences earth observatory.” Nearly a quarter-billion georeferenced events capture global behavior in more than 300 categories covering 1979 to present with daily updates.

GDELT is designed to help support new theories and descriptive understandings of the behaviors and driving forces of global-scale social systems from the micro-level of the individual through the macro-level of the entire planet by offering realtime synthesis of global societal-scale behavior into a rich quantitative database allowing realtime monitoring and analytical exploration of those trends.

GDELT’s evolving ability to capture ethnic, religious, and other social and cultural group relationships will offer profoundly new insights into the interplay of those groups over time, offering a rich new platform for understanding patterns of social evolution, while the data’s realtime nature will expand current understanding of social systems beyond static snapshots towards theories that incorporate the nonlinear behavior and feedback effects that define human interaction and greatly enrich fragility indexes, early warning systems, and forecasting efforts.

GDELT’s goal is to help uncover previously-obscured spatial, temporal, and perceptual evolutionary trends through new forms of analysis of the vast textual repositories that capture global societal activity, from news and social media archives to knowledge repositories.

Key Features

  • Covers all countries globally
  • Covers a quarter-century: 1979 to present
  • Daily updates every day, 365 days a year
  • Based on cross-section of all major international, national, regional, local, and hyper-local news sources, both print and broadcast, from nearly every corner of the globe, in both English and vernacular
  • 58 fields capture all available detail about event and actors
  • Ten fields capture significant detail about each actor, including role and type
  • All records georeferenced to the city or landmark as recorded in the article
  • Sophisticated geographic pipeline disambiguates and affiliates geography with actors
  • Separate geographic information for location of event and for both actors, including GNS and GNIS identifiers
  • All records include ethnic and religious affiliation of both actors as provided in the text
  • Even captures ambiguous events in conflict zones (“unidentified gunmen stormed the mosque and killed 20 civilians”)
  • Specialized filtering and linguistic rewriting filters considerably enhance TABARI’s accuracy
  • Wide array of media and emotion-based “importance” indicators for each event
  • Nearly a quarter-billion event records
  • 100% open, unclassified, and available for unlimited use and redistribution

The download page lists various data sets, including the GDELT Global Knowledge Graph and daily downloads of intake data.

If you are looking for data to challenge your graph, topic map or data mining skills, GDELT is the right spot.


Sunday, November 24th, 2013

R-Fiddle: An online playground for R code

From the post: is an early stage beta that provides you with a free and powerful environment to write, run and share R-code right inside your browser. It even offers the option to include packages. Since a couple of days it’s gaining more and more traction, and was mentioned on the frontpage of Hacker News.

We designed it for those situations where you have code that you need to prototype quickly and then possibly share it with others for feedback. All this without needing a user account, or any scrap projects or files! We even included a very-easy-to-use ‘embed’ function for blogs and website, so your visitors can edit and run R code on your own website or blog. This is the first version of R-fiddle, so do not hesitate to give us feedback.

Working together with the help of R-fiddle

You can use R-fiddle to share code snippets with colleagues when tossing around ideas, in order to find that annoying bug, or by making your own variations on others people code. It’s easy: Just go to, type your code, and get your public URL by pressing ‘share’. This is a lot easier for your potential troubleshooter/colleague/.. since (s)he can immediate run and check the code, save it once finished and share it again. So by sharing your R-code through R-fiddle, you can not only help others to better understand your code, but they can also help you!

Using R-Fiddle in a distance education context, especially on data mining with R, will benefit both instructors and students.

Under the Hood: [of RocksDB]

Sunday, November 24th, 2013

Under the Hood: Building and open-sourcing RocksDB by Dhruba Borthakur.

From the post:

Every time one of the 1.2 billion people who use Facebook visits the site, they see a completely unique, dynamically generated home page. There are several different applications powering this experience–and others across the site–that require global, real-time data fetching.

Storing and accessing hundreds of petabytes of data is a huge challenge, and we’re constantly improving and overhauling our tools to make this as fast and efficient as possible. Today, we are open-sourcing RocksDB, an embeddable, persistent key-value store for fast storage that we built and use here at Facebook.

Why build an embedded database?

Applications traditionally access their data via remote procedure calls over a network connection, but that can be slow–especially when we need to power user-facing products in real time. With the advent of flash storage, we are starting to see newer applications that can access data quickly by managing their own dataset on flash instead of accessing data over a network. These new applications are using what we call an embedded database.

There are several reasons for choosing an embedded database. When database requests are frequently served from memory or from very fast flash storage, network latency can slow the query response time. Accessing the network within a data center can take about 50 microseconds, as can fast-flash access latency. This means that accessing data over a network could potentially be twice as slow as an application accessing data locally.

Secondly, we are starting to see servers with an increasing number of cores and with storage-IOPS reaching millions of requests per second. Lock contention and a high number of context switches in traditional database software prevents it from being able to saturate the storage-IOPS. We’re finding we need new database software that is flexible enough to be customized for many of these emerging hardware trends.

Like most of you, I don’t have 1.2 billion people visiting my site. 😉

However, understanding today’s “high-end” solutions will prepare you for tomorrow’s “middle-tier” solution and day after tomorrow’s desktop solution.

A high level overview of RocksDB.

Other resources to consider:

RocksDB Facebook page.

RocksDB on Github.

Update: Igor Canadi has posted to the Facebook page a proposal to add the concept of ColumnFamilies to RocksDB. Comments? (Direct comments on that proposal to the Facebook page for RocksDB.)

Multi-term Synonym Mapping in Solr

Sunday, November 24th, 2013

Why is Multi-term synonym mapping so hard in Solr? by John Berryman.

From the post:

There is a very common need for multi-term synonyms. We’ve actually run across several use cases among our recent clients. Consider the following examples:

  • Ecommerce: If a customer searches for “weed whacker”, but the more canonical name is “string trimmer”, then you need synonyms, otherwise you’re going to lose a sale.
  • Law: Consider a layperson attempting to find a section of legal code pertaining to their “truck”. If the law only talks about “motor vehicles”, then, without synonyms, this individual will go away uninformed.
  • Medicine: When a doctor is looking up recent publications on “heart attack”, synonyms make sure that he also finds documents that happen to only mention “myocardial infarction”.

One would hope that working with synonyms should be as simple as tossing a set of synonyms into the synonyms.txt file and just having Solr “do the right thing.”™ And when we’re talking about simple, single-term synonyms (e.g. TV = televisions), synonyms really are just that straight forward. Unfortunately, especially as you get into more complex uses of synonyms, such as multi-term synonyms, there are several gotchas. Sometimes, there are workarounds. And sometimes, for now at least, you’ll just have to make do what you can currently achieve using Solr! In this post we’ll provide a quick intro to synonyms in Solr, we’ll walk through some of the pain points, and then we’ll propose possible resolutions.

John does a great review of basic synonym mapping in Solr as a prelude to illustrating the difficulty with multi-term synonyms.

His example case is the mapping:

spider man ==> spiderman

“Obvious” solutions fail but John does conclude with a pointer to one solution to the issue.

Recommended for a deeper understanding of Solr’s handling of synonymy.

While reading John’s post it occurred to me to check with Wikipedia on disambiguation of the term “spider.”

  • Comics – 17
  • Other publications – 5
  • Culinary – 3
  • Film and television – 10
  • Games and sports – 10
  • Land vehicles – 4
  • Mathematics – 1
  • Music – 16
  • People – 7
  • Technology – 14
  • Other uses – 7

I count eighty-eight (88) distinct “spiders” (counting spider as “an air-breathing eight-legged animal“, of which there are 44032 species identified as of June 23, 2013).

John suggests a parsing solution for the multi-term synonym problem in Solr, but however “spider” is parsed, there remains ambiguity.

An 88-fold ambiguity (at minimum).

At least for Solr and other search engines.

Not so much for us as human readers.

A human reader is not limited to “spider” in deciding which of 88 possible spiders is the correct one and/or the appropriate synonyms to use.

Each “spider” is seen in a “context” and a human reader will attribute (perhaps not consciously) characteristics to a particular “spider” in order to identify it.

If we record characteristics for each “spider,” then distinguishing and matching spiders to synonyms (also with characteristics) becomes a task of:

  1. Deciding which characteristic(s) to require for identification/synonymy.
  2. Fashioning rules for identification/synonymy.

Much can be said about those two tasks but for now, I will leave you with a practical example of their application.

Assume that you are indexing some portion of web space and you encounter The World Spider Catalog, Version 14.0.

We know for every instance of “spider” (136) at that site has the characteristics of order Araneae. How you wish to associate that with every instance of “spider” or other names from the spider database is an implementation issue.

However, knowing “order Araneae” allows us to reliably distinguish all the instances of “spider” at this resource from other instances of “spider” that lack that characteristic.

Just as importantly, we only have to perform that task once. Not rely upon our users to perform that task over and over again.

The weakness of current indexing is that it harvests only the surface text and not the rich semantic soil in which it grows.

Advanced R programming

Saturday, November 23rd, 2013

Advanced R programming by Hadley Wickham.

From the webpage:

This is the in-progress book site for “Advanced R development“. The book is designed primarily for R users who want to improve their programming skills and understanding of the language. It should also be useful for programmers coming to R from other languages, as it explains some of R’s quirks and shows how some parts that seem horrible do have a positive side.

It will eventually be published as real books in Chapman and Hall’s R series. The final version of the book is due in December 2013, so it should be available in early 2014. Thanks to the publisher, the wiki will continue to be freely available after the book is published.

This is an early holiday present!

And/or news of a volume that should be on your wish list for next year.

I first saw this in a tweet by Michael Hausenblas.


Saturday, November 23rd, 2013

Introducing SAMOA, an open source platform for mining big data streams by Gianmarco De Francisci Morales and Albert Bifet.

From the post:

Machine learning and data mining are well established techniques in the world of IT and especially among web companies and startups. Spam detection, personalization and recommendations are just a few of the applications made possible by mining the huge quantity of data available nowadays. However, “big data” is not only about Volume, but also about Velocity (and Variety, 3V of big data).

The usual pipeline for modeling data (what “data scientists” do) involves taking a sample from production data, cleaning and preprocessing it to make it usable, training a model for the task at hand and finally deploying it to production. The final output of this process is a pipeline that needs to run periodically (and be maintained) in order to keep the model up to date. Hadoop and its ecosystem (e.g., Mahout) have proven to be an extremely successful platform to support this process at web scale.

However, no solution is perfect and big data is “data whose characteristics forces us to look beyond the traditional methods that are prevalent at the time”. The current challenge is to move towards analyzing data as soon as it arrives into the system, nearly in real-time.

For example, models for mail spam detection get outdated with time and need to be retrained with new data. New data (i.e., spam reports) comes in continuously and the model starts being outdated the moment it is deployed: all the new data is sitting without creating any value until the next model update. On the contrary, incorporating new data as soon as it arrives is what the “Velocity” in big data is about. In this case, Hadoop is not the ideal tool to cope with streams of fast changing data.

Distributed stream processing engines are emerging as the platform of choice to handle this use case. Examples of these platforms are Storm, S4, and recently Samza. These platforms join the scalability of distributed processing with the fast response of stream processing. Yahoo has already adopted Storm as a key technology for low-latency big data processing.

Alas, currently there is no common solution for mining big data streams, that is, for doing machine learning on streams on a distributed environment.


SAMOA (Scalable Advanced Massive Online Analysis) is a framework for mining big data streams. As most of the big data ecosystem, it is written in Java. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Storm and S4. SAMOA includes distributed algorithms for the most common machine learning tasks such as classification and clustering. For a simple analogy, you can think of SAMOA as Mahout for streaming.

After you get SAMOA installed, you may want to read: Distributed Decision Tree Learning for Mining Big Data Streams by Arinto Murdopo (thesis).

The nature of streaming data prevents SAMOA from offering the range of machine learning algorithms common in machine learning packages.

But if the SAMOA algorithms fit your use cases, what other test would you apply?

Approaches to Backup and Disaster Recovery in HBase

Saturday, November 23rd, 2013

Approaches to Backup and Disaster Recovery in HBase by Clint Heath.

From the post:

With increased adoption and integration of HBase into critical business systems, many enterprises need to protect this important business asset by building out robust backup and disaster recovery (BDR) strategies for their HBase clusters. As daunting as it may sound to quickly and easily backup and restore potentially petabytes of data, HBase and the Apache Hadoop ecosystem provide many built-in mechanisms to accomplish just that.

In this post, you will get a high-level overview of the available mechanisms for backing up data stored in HBase, and how to restore that data in the event of various data recovery/failover scenarios. After reading this post, you should be able to make an educated decision on which BDR strategy is best for your business needs. You should also understand the pros, cons, and performance implications of each mechanism. (The details herein apply to CDH 4.3.0/HBase 0.94.6 and later.)

Note: At the time of this writing, Cloudera Enterprise 4 offers production-ready backup and disaster recovery functionality for HDFS and the Hive Metastore via Cloudera BDR 1.0 as an individually licensed feature. HBase is not included in that GA release; therefore, the various mechanisms described in this blog are required. (Cloudera Enterprise 5, currently in beta, offers HBase snapshot management via Cloudera BDR.)

The critical line in this post reads:

As daunting as it may sound to quickly and easily backup and restore potentially petabytes of data, HBase and the Apache Hadoop ecosystem provide many built-in mechanisms to accomplish just that.

Note the emphasis on provide.

Great backup mechanisms don’t help much unless someone is making, testing and logging the backups.

Ask in writing about backups before performing any changes to a client’s system or data. Make the answer part of your documentation.


Saturday, November 23rd, 2013


From the webpage:

Frameless is an XSLT 2 processor running in the browser, directly written in JavaScript. It includes an XPath 2 query engine for simple, powerful querying. It works cross-browser, we have even reached compatibility with IE6 and Firefox 1.

With Frameless you’ll be able to do things the browsers won’t let you, such as using $variables and adding custom functions to XPath. What’s more, XPath 2 introduces if/else and for-loops. We’ll even let you use some XPath 3 functionality! Combine data into a string using the brand new string concatenation operator.

Use way overdue math functions such as sin() and cos(), essential when generating data-powered SVG graphics. And use to overcome the boundaries between XSLT and JavaScript.

When to use Frameless?

Frameless is created to simplify application development and is, due to its API, great for writing readable code.

It will make application development a lot easier and it’s a good fit for all CRUD applications and applications with tricky DOM manipulation.

Who will benefit by using it?

  • Designers and managers will be able to read the code and even fix some bugs.
  • Junior developers will get up to speed in no time and write code with a high level of abstraction, and they will be able to create prototypes that’ll be shippable.
  • Senior developers will be able to create complicated webapplications for all browsers and write them declaratively

What it’s not

Frameless doesn’t intend to fully replace functional DOM manipulation libraries like jQuery. If you like you can use such libraries and Frameless at the same time.

Frameless doesn’t provide a solution for cross-browser differences in external CSS stylesheets. We add prefixes to some inline style attributes, but you should not write your styles inline only for this purpose. We do not intend to replace any CSS extension language, such as for example Sass.

Frameless is very sparse on documentation but clearly the potential for browser-based applications is growing.

I first saw this in a tweet by Michael Kay.

Using AWS to Build a Graph-based…

Friday, November 22nd, 2013

Using AWS to Build a Graph-based Product Recommendation System by Andre Fatala and Renato Pedigoni.

From the description:

Magazine Luiza, one of the largest retail chains in Brazil, developed an in-house product recommendation system, built on top of a large knowledge Graph. AWS resources like Amazon EC2, Amazon SQS, Amazon ElastiCache and others made it possible for them to scale from a very small dataset to a huge Cassandra cluster. By improving their big data processing algorithms on their in-house solution built on AWS, they improved their conversion rates on revenue by more than 25 percent compared to market solutions they had used in the past.

Not a lot of technical details but a good success story to repeat if you are pushing graph-based services.

I first saw this in a tweet by Marko A. Rodriguez.

Data Scientist Foundations:…

Friday, November 22nd, 2013

Data Scientist Foundations: The Hard and Human Skills You Need

I didn’t see anything that wasn’t obvious if you stopped to think about it.

On the other hand, having a list will make it easier to identify gaps in your skills and/or training.

Something to keep in mind for the new year.

Make resolutions off of this list be the ones you really keep.

A names backbone:…

Friday, November 22nd, 2013

A names backbone: a graph of taxonomy by Nicky Nicolson.

At first glance a taxonomy paper but as you look deeper:

Slide 34: Concepts layer: taxonomy as a graph

  • Names are nodes
  • Typed, directed relationships represent synonymy and taxonomic placement
  • Evidence for taxonomic assertions provided as references
  • …and again, standards bases import / export using TCS

Slide 35 shows a synonym_of relationship between two name nodes.

Slide 36 shows evidence attached to placement at one node and for the synonym_of link.

Slide 37 shows reuse of nodes to support “different taxonomic opinions.”

Slide 39 Persistent identification of concepts

We can re-create a sub-graph representing a concept at a particular point in time using:

  1. Name ID
  2. Classification
  3. State

Users can link to a stable state of a concept

We can provide a feed of what has changed since

I mention this item in part because Peter Neubauer (Neo4j) suggested in an email that rather than “merging” nodes that subject sameness (my term, not his) could be represented as a relationship between nodes.

Much in the same way that synonym_of was represented in these slides.

And I like the documentation of the reason for synonymy.

The internal data format of Neo4j makes “merging” in the sense of creating one node to replace two or more other nodes impractical.

Perhaps replacing nodes with other nodes has practical limits?

Is “virtual merging” in your topic map future?

BinaryPig: Scalable Static Binary Analysis Over Hadoop

Friday, November 22nd, 2013

BinaryPig: Scalable Static Binary Analysis Over Hadoop (Guest post at Cloudera: Telvis Calhoun, Zach Hanif, and Jason Trost of Endgame)

From the post:

Over the past three years, Endgame received 40 million samples of malware equating to roughly 19TB of binary data. In this, we’re not alone. McAfee reports that it currently receives roughly 100,000 malware samples per day and received roughly 10 million samples in the last quarter of 2012. Its total corpus is estimated to be about 100 million samples. VirusTotal receives between 300,000 and 600,000 unique files per day, and of those roughly one-third to half are positively identified as malware (as of April 9, 2013).

This huge volume of malware offers both challenges and opportunities for security research, especially applied machine learning. Endgame performs static analysis on malware in order to extract feature sets used for performing large-scale machine learning. Since malware research has traditionally been the domain of reverse engineers, most existing malware analysis tools were designed to process single binaries or multiple binaries on a single computer and are unprepared to confront terabytes of malware simultaneously. There is no easy way for security researchers to apply static analysis techniques at scale; companies and individuals that want to pursue this path are forced to create their own solutions.

Our early attempts to process this data did not scale well with the increasing flood of samples. As the size of our malware collection increased, the system became unwieldy and hard to manage, especially in the face of hardware failures. Over the past two years we refined this system into a dedicated framework based on Hadoop so that our large-scale studies are easier to perform and are more repeatable over an expanding dataset.

To address this problem, we created an open source framework, BinaryPig, built on Hadoop and Apache Pig (utilizing CDH, Cloudera’s distribution of Hadoop and related projects) and Python. It addresses many issues of scalable malware processing, including dealing with increasingly large data sizes, improving workflow development speed, and enabling parallel processing of binary files with most pre-existing tools. It is also modular and extensible, in the hope that it will aid security researchers and academics in handling ever-larger amounts of malware.

For more details about BinaryPig’s architecture and design, read our paper from Black Hat USA 2013 or check out our presentation slides. BinaryPig is an open source project under the Apache 2.0 License, and all code is available on Github.

You may have heard the rumor that storing more than seven (7) days of food marks you as a terrorist in the United States.

Be forewarned: Doing Massive Malware Analsysis May Make You A Terrorist Suspect.

The “storing more than seven (7) days of food” rumor originated with Rand Paul R-Kentucky.

The Community Against Terrorism FBI flyer, assuming the pointers I found are accurate, says nothing about how many days of food you have on hand.

Rather it says:

Make bulk purchases of items to include:

Meals Ready to Eat

That’s an example of using small data analysis to disprove a rumor.

Unless you are an anthropologist, I would not rely on data from CSpan2.

Spark and Elasticsearch

Friday, November 22nd, 2013

Spark and Elasticsearch by Barnaby Gray.

From the post:

If you work in the Hadoop world and have not yet heard of Spark, drop everything and go check it out. It’s a really powerful, intuitive and fast map/reduce system (and some).

Where it beats Hadoop/Pig/Hive hands down is it’s not a massive stack of quirky DSLs built on top of layers of clunky Java abstractions – it’s a simple, pure Scala functional DSL with all the flexibility and succinctness of Scala. And it’s fast, and properly interactive – query, bam response snappiness – not query, twiddle fingers, wait a bit.. response.

And if you’re into search, you’ll no doubt have heard of Elasticsearch – a distributed restful search engine built upon Lucene.

They’re perfect bedfellows – crunch your raw data and spit it out into a search index ready for serving to your frontend. At the company I work for we’ve built the google-analytics-esque part of our product around this combination.

It so fast, it flies – we can process raw event logs at 250,000 events/s without breaking a sweat on a meagre EC2 m1.large instance. (bold emphasis added)

Don’t you just hate it when bloggers hold back? 😉

I’m not endorsing this solution but I do appreciate a post with attitude and useful information.


DataCleaner 3.5.7 released

Friday, November 22nd, 2013

DataCleaner 3.5.7 released

A point release but I haven’t mentioned DataCleaner since before version 2.4. Sorry.

True, DataCleaner doesn’t treat all information structures as subjects, etc., but then you don’t need a topic map for every data handling job.

Opps! I don’t think I was supposed to say that. 😉

Seriously, you need to evaluate every data technology and/or tool on the basis of your requirements.

Topic maps included.

Harry Potter (Neo4j GraphGist)

Friday, November 22nd, 2013

Harry Potter (Neo4j GraphGist)

From the webpage:

v0 of this graph models some of Harrys friends, enemies and their parents. Also have some pets and a few killings. The obvious relation missing is the one between Harry Potter and Voldemort- it took us 7 books to figure that one out, so you’ll have to wait till I add more data 🙂

Great start on a graph representation of Harry Potter!

But the graph model has a different perspective than Harry or others the book series had.

Harry Potter model

I’m a Harry Potter fan. When Harry Potter and the Philosopher’s Stone starts, Harry doesn’t know Ron Weasley, Hermione Granger, Voldemort, or Hedwig.

The graph presents the vantage point of an omniscience observer, who knows facts the rest of us waited seven (7) volumes to discover.

A useful point of view, but it doesn’t show how knowledge and events unfolded to the characters in the story.

We loose any tension over whether Harry will choose Cho Chang or Ginny Weasley

And certainly the outcomes for Albus Dumbledore and Serverus Snape lose their rich texture.

If you object that I am confusing a novel with a graph, are you saying a graph cannot represent the development of information over time?*

That’s a fairly serious shortcoming for any information representation technique.

In stock trading, for example, when I “knew” your shaving lotion causes “purple pustules spelling PIMP” to break out on an user’s face would be critically important.

Did I know before or after I unloaded my shares in your company? 😉

A silly example but illustrates that “when” we know information can be very important.

Not to mention that “static” data is only an illusion of our information systems. Or rather information systems that don’t allow for tracking changing information.

Is your information system one of those?

* I’m in the camp that thinks graphs can represent the development of information over time. Depends on your use case whether you need the extra machinery that enables time-based views.

The granularity of time requirements vary when you are talking about Harry Potter versus the Divine Comedy versus leaks from the current White House.

In topic maps, the range of validity for an association was called its “scope.” Scope and time needs more than one or two other posts.

Mathematica on every Raspberry Pi?

Thursday, November 21st, 2013

Putting the Wolfram Language (and Mathematica) on Every Raspberry Pi by Stephen Wolfram.

From the post:

Last week I wrote about our large-scale plan to use new technology we’re building to inject sophisticated computation and knowledge into everything. Today I’m pleased to announce a step in that direction: working with the Raspberry Pi Foundation, effective immediately there’s a pilot release of the Wolfram Language—as well as Mathematica—that will soon be bundled as part of the standard system software for every Raspberry Pi computer.

Wolfram Language and Mathematica now free on Raspberry Pi
In effect, this is a technology preview: it’s an early, unfinished, glimpse of the Wolfram Language. Quite soon the Wolfram Language is going to start showing up in lots of places, notably on the web and in the cloud. But I’m excited that the timing has worked out so that we’re able to give the Raspberry Pi community—with its emphasis on education and invention—the very first chance to put the Wolfram Language into action.

I’m a great believer in the importance of programming as a central component of education. And I’m excited that with the Wolfram Language I think we finally have a powerful programming language worthy of the next generation. We’ve got a language that’s not mostly concerned with the details of computers, but is instead about being able to understand and create things on the basis of huge amounts of built-in computational ability and knowledge.

It’s tremendously satisfying—and educational. Writing a tiny program, perhaps not even a line long, and already having something really interesting happen. And then being able to scale up larger and larger. Making use of all the powerful programming paradigms that are built into the Wolfram Language.


From the Raspberry PI blog:

Future Raspbian images will ship with the Wolfram Language and Mathematica by default; existing users with at least 600MB of free space on their SD card can install them today by typing:

sudo apt-get update && sudo apt-get install wolfram-engine

You’ll find Mathematica in the app launcher under the Education menu.


Wolfram Language Documentation.

Preliminary version is online only. Will take a bit of navigating to get a sense of the language as a whole.

Do you think Raspberry Pi sales are about to go through the roof? ;-)

…US spy R&D isn’t keeping pace

Thursday, November 21st, 2013

Panel warns that US spy R&D isn’t keeping pace

From the post:

U.S. technological superiority in areas like cyber security and cryptography are eroding along with other segments of that nation’s R&D base, a new report warns.

In its study released on Nov. 5, the National Commission for the Review of the Research and Development of program of the U.S. Intelligence Community found that unnamed adversaries are leveraging their science and technology R&D in areas critical to intelligence gathering and cyber security. Among the areas identified in an unclassified report were cryptography, cyber attack and defense along with “all-source data analytics,” or data mining.

“Exacerbating these challenges are U.S. policies that weaken the U.S. R&D talent base,” the report warned. “As scientific and technical knowledge and the resulting economic growth spread around the world, the competition for R&D talent is increasingly global.”
Among the panel’s recommendations were shoring up the U.S. R&D base through focused investment while improving coordination of R&D activities within the U.S. intelligence community.

The panel also found that U.S. spy agencies are failing to exploit non-military R&D that could be used to discern enemy capabilities and intentions as well as to counter the theft of American intellectual property.


National Commission for the Review of the Research and Development of program of the U.S. Intelligence Community

is the name on the cover of the report.

The entire file is forty-five (45) pages but only fifteen (15) have any new and/or substantive content.

Use this report for buzz words or phrases to use in proposals or comments.

The oddest point made by the commission reads:

The global spread of scientific and technical knowledge challenges U.S. national security. It threatens to erode essential capabilities of the U.S. Intelligence Community (IC) and the strength of the U.S. R&D base. (emphasis added)

A very sad view of the world.

Country N can’t be secure if scientific and technical knowledge spreads.

Greater equality in knowledge and technology could slow some of the adventurism that threatens the world as a whole.