Archive for the ‘Data Reduction’ Category

Mathematicians Reduce Big Data Using Ideas from Quantum Theory

Friday, April 24th, 2015

Mathematicians Reduce Big Data Using Ideas from Quantum Theory by M. De Domenico, V. Nicosia, A. Arenas, V. Latora.

From the post:

A new technique of visualizing the complicated relationships between anything from Facebook users to proteins in a cell provides a simpler and cheaper method of making sense of large volumes of data.

Analyzing the large volumes of data gathered by modern businesses and public services is problematic. Traditionally, relationships between the different parts of a network have been represented as simple links, regardless of how many ways they can actually interact, potentially losing precious information. Only recently a more general framework has been proposed to represent social, technological and biological systems as multilayer networks, piles of ‘layers’ with each one representing a different type of interaction. This approach allows a more comprehensive description of different real-world systems, from transportation networks to societies, but has the drawback of requiring more complex techniques for data analysis and representation.

A new method, developed by mathematicians at Queen Mary University of London (QMUL), and researchers at Universitat Rovira i Virgili in Tarragona (Spain), borrows from quantum mechanics’ well-tested techniques for understanding the difference between two quantum states, and applies them to understanding which relationships in a system are similar enough to be considered redundant. This can drastically reduce the amount of information that has to be displayed and analyzed separately and make it easier to understand.

The new method also reduces computing power needed to process large amounts of multidimensional relational data by providing a simple technique of cutting down redundant layers of information, reducing the amount of data to be processed.

The researchers applied their method to several large publicly available data sets about the genetic interactions in a variety of animals, a terrorist network, scientific collaboration systems, worldwide food import-export networks, continental airline networks and the London Underground. It could also be used by businesses trying to more readily understand the interactions between their different locations or departments, by policymakers understanding how citizens use services or anywhere that there are large numbers of different interactions between things.

You can hop over to Nature, Structural reducibility of multilayer networks, where if you don’t have an institutional subscription:

ReadCube: $4.99 Rent, $9.99 to buy, or Purchase a PDF for $32.00.

Let me save you some money and suggest you look at:

Layer aggregation and reducibility of multilayer interconnected networks


Many complex systems can be represented as networks composed by distinct layers, interacting and depending on each others. For example, in biology, a good description of the full protein-protein interactome requires, for some organisms, up to seven distinct network layers, with thousands of protein-protein interactions each. A fundamental open question is then how much information is really necessary to accurately represent the structure of a multilayer complex system, and if and when some of the layers can indeed be aggregated. Here we introduce a method, based on information theory, to reduce the number of layers in multilayer networks, while minimizing information loss. We validate our approach on a set of synthetic benchmarks, and prove its applicability to an extended data set of protein-genetic interactions, showing cases where a strong reduction is possible and cases where it is not. Using this method we can describe complex systems with an optimal trade–off between accuracy and complexity.

Both articles have four (4) illustrations. Same four (4) authors. The difference is that the second one is free for downloading.

I remain concerned by the focus on reducing the complexity of data to fit current algorithms and processing models. That said, there is no denying that such reduction methods have proven to be useful.

The authors neatly summarize my concerns with this outline of their procedure:

The whole procedure proposed here is sketched in Fig. 1 and can be summarised as follows: i) compute the quantum Jensen-Shannon distance matrix between all pairs of layers; ii) perform hierarchical clustering of layers using such a distance matrix and use the relative change of Von Neumann entropy as the quality function for the resulting partition; iii) finally, choose the partition which maximises the relative information gain.

With my corresponding concerns:

i) The quantum Jensen-Shannon distance matrix presumes a metric distance for its operations, which may or may not reflect the semantics of the layers (other than by simplifying assumption).

ii) The relative change of Von Neumann entropy is a difference measurement based upon an assumed metric, which may or may not represent the underlying semantics of the relationships between layers.

iii) The process concludes by maximizing a difference measurement based upon an assigned metric, which has been assigned to the different layers.

Maximizing a difference, based on an entropy calculation, which is itself based on an assigned metric, doesn’t fill me with confidence.

I don’t doubt that the technique “works,” but doesn’t that depend upon what you think is being measured?
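To make concrete what steps i) and ii) actually measure, here is a minimal Python sketch. It is my own illustration, not the authors' code. It assumes each layer's "density matrix" is its graph Laplacian rescaled to unit trace, which is my reading of the paper, and it requires NumPy. The function names and the two toy layers are made up for the example. The clustering and partition selection of steps ii) and iii) are not shown.

```python
import numpy as np

def density_matrix(adj):
    """Laplacian (degree matrix minus adjacency) rescaled to trace 1 (assumed definition)."""
    adj = np.asarray(adj, dtype=float)
    lap = np.diag(adj.sum(axis=1)) - adj
    return lap / np.trace(lap)

def von_neumann_entropy(rho):
    """-sum(lambda * log2(lambda)) over the nonzero eigenvalues of rho."""
    eig = np.linalg.eigvalsh(rho)
    eig = eig[eig > 1e-12]  # drop zeros, since 0 * log(0) is taken as 0
    return float(-np.sum(eig * np.log2(eig)))

def js_distance(adj_a, adj_b):
    """Square root of the Jensen-Shannon divergence between two layers."""
    rho, sigma = density_matrix(adj_a), density_matrix(adj_b)
    mix = 0.5 * (rho + sigma)
    div = von_neumann_entropy(mix) - 0.5 * (
        von_neumann_entropy(rho) + von_neumann_entropy(sigma)
    )
    return float(np.sqrt(max(div, 0.0)))  # clamp tiny negative rounding errors

# Two toy layers on the same four nodes: a path and a star.
path = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
star = [[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]

d_same = js_distance(path, path)  # identical layers
d_diff = js_distance(path, star)  # structurally different layers
print(d_same, d_diff)
```

Note that nothing but the adjacency matrices goes in. Whatever the links in a layer mean never enters the calculation, which is the point of my concerns above.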

A question for the weekend: Do you think this is similar to the questions about dividing continuous variables into discrete quantities?

Data distillation with Hadoop and R

Tuesday, June 12th, 2012

Data distillation with Hadoop and R by David Smith.

From the post:

We’re definitely in the age of Big Data: today, there are many more sources of data readily available to us to analyze than there were even a couple of years ago. But what about extracting useful information from novel data streams that are often noisy and minutely transactional … aye, there’s the rub.

One of the great things about Hadoop is that it offers a reliable, inexpensive and relatively simple framework for capturing and storing data streams that just a few years ago we would have let slip through our grasp. It doesn’t matter what format the data comes in: without having to worry about schemas or tables, you can just dump unformatted text (chat logs, tweets, email), device “exhaust” (binary, text or XML packets), flat data files, network traffic packets … all can be stored in HDFS pretty easily. The tricky bit is making sense of all this unstructured data: the downside to not having a schema is that you can’t simply make an SQL-style query to extract a ready-to-analyze table. That’s where Map-Reduce comes in.

Think of unstructured data in Hadoop as being a bit like crude oil: it’s a valuable raw material, but before you can extract useful gasoline from Brent Sweet Light Crude or Dubai Sour Crude you have to put it through a distillation process in a refinery to remove impurities, and extract the useful hydrocarbons.

I may find this a useful metaphor because I grew up in Louisiana, where land-based oil wells were abundant and there was an oil refinery only a couple of miles from my home.

Not a metaphor that will work for everyone but one you should keep in mind.

Lots of Copies – Keeps Stuff Safe / Is Insecure – LOCKSS / LOCII

Saturday, January 1st, 2011

Is your enterprise accidentally practicing Lots of Copies Keeps Stuff Safe – LOCKSS?

Gartner analyst Drue Reeves says:

Use document management to make sure you don’t have copies everywhere, and purge nonrelevant material.

If you fall into the lots of copies category your slogan should be: Lots of Copies Is Insecure or LOCII (pronounced “lossee”).

Not all document preservation solutions depend upon being insecure.

Topic maps can help develop strategies to make your document management solution less LOCII.

One way they can help is by mapping out all the duplicate copies. Are they really necessary?

Another way they can help is by showing who has access to each of those copies.

If you trust someone with access, that means you trust everyone they trust.

Check their Facebook or LinkedIn pages to see how many other people you are trusting, just by trusting the first person.

Ask yourself: How bad would a WikiLeaks-like disclosure be?
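The "you trust everyone they trust" point is a reachability question, and a few lines of Python make it concrete. This is only a sketch: the names and the trust links are invented, and `effective_audience` is a hypothetical helper, not part of any topic map tool.

```python
from collections import deque

def effective_audience(trusts, start):
    """Return everyone reachable from `start` by following trust links."""
    seen = {start}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for other in trusts.get(person, ()):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    seen.discard(start)  # the starting person is not part of their own audience
    return seen

# Hypothetical trust links: each person lists who they share documents with.
trusts = {
    "alice": ["bob"],
    "bob": ["carol", "dave"],
    "carol": ["erin"],
}

audience = effective_audience(trusts, "alice")
print(sorted(audience))
```

Alice only chose to trust Bob, but the walk also reaches Carol, Dave and Erin.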

Then get serious about information security and topic maps.

Data Reduction Technologies: What’s the Difference? – Podcast

Saturday, January 1st, 2011

Data Reduction Technologies: What’s the Difference?

A podcast I found on data reduction, otherwise known as data deduplication.

You wouldn’t think that incremental backups would be news but apparently they are. The notion of making a backup incremental after it was a full backup (post-processing in HP speak) was new to me but backup technology isn’t one of my primary concerns.

Or hasn’t been, I suppose I should say.

Incremental and de-duping for backups is well explored and probably needs no help from topic maps. At least per se.

The thought occurs to me that topic maps could assist in mapping advanced data reduction that doesn’t simply eliminate duplicate data within a single backup, but eliminates duplicate data across an enterprise.

Now that would be something different and perhaps a competitive advantage for any vendor wanting to pursue it.

Just in sketch form, create a topic map of the various backups and map the duplicate data across the backups. Then a pointer is created to the master backup that has the duplicate data. It could present to a client as though the data exists as part of the backup.
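To show the shape of that idea, here is a minimal Python sketch. It is not a topic map implementation. It only shows the mapping underneath: hash every file, keep the first copy seen as the master, and record a pointer for each later duplicate. The backup names, file contents and function names are all made up, and real backup products work on blocks or chunks rather than whole files.

```python
import hashlib

def deduplicate(backups):
    """backups: {backup_name: {path: bytes}} -> (store, pointers)."""
    store = {}     # (backup, path) -> bytes, master copies only
    master = {}    # content digest -> (backup, path) of the master copy
    pointers = {}  # duplicate (backup, path) -> master (backup, path)
    for name in sorted(backups):
        for path in sorted(backups[name]):
            data = backups[name][path]
            digest = hashlib.sha256(data).hexdigest()
            location = (name, path)
            if digest in master:
                pointers[location] = master[digest]  # duplicate: point at master
            else:
                master[digest] = location
                store[location] = data
    return store, pointers

def read(store, pointers, name, path):
    """Present the file as if it were stored in the requested backup."""
    return store[pointers.get((name, path), (name, path))]

backups = {
    "monday": {"report.doc": b"Q1 numbers", "notes.txt": b"todo"},
    "tuesday": {"report-copy.doc": b"Q1 numbers", "notes.txt": b"todo, updated"},
}

store, pointers = deduplicate(backups)
print(len(store), pointers)
print(read(store, pointers, "tuesday", "report-copy.doc"))
```

The client asking for Tuesday's copy of the report gets the bytes from Monday's master without knowing the difference.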

Using topic maps to reduce duplicate data in backups could also strengthen encryption by reducing the footprint of data that has been encrypted. The less an encryption is used, the less encrypted data there is for an attacker to analyze.

Those are nice ways to introduce enterprises to the advantages that topic maps can bring to information driven enterprises.


  1. Search for data reduction and choose one of the products to review.
  2. Would the suggested use of topic maps to de-dupe backups work for that product? Why/why not? (3-5 pages, citations)
  3. How would you use topic maps to provide a better (in your opinion) interface to standard backups? (3-5 pages, no citations)

PS: Yes, I know, I didn’t define standard backups in #3. Surprise me. Grab some free or demo backup software and report back with screen shots.