New Survey Technique! Ask Village Idiots

April 30th, 2015

I was deeply disappointed to see Scientific Computing with the headline: ‘Avengers’ Stars Wary of Artificial Intelligence by Ryan Pearson.

The respondents are all talented movie stars but acting talent and even celebrity don’t give them insight into issues such as artificial intelligence. You might as well ask football coaches about the radiation hazards of a possible mission to Mars. Football coaches, the winning ones anyway, are bright and intelligent folks but, as a class, aren’t the usual suspects to ask about inter-planetary radiation hazards.

President Reagan was known to confuse movies with reality but that was under extenuating circumstances. Confusing people acting in movies with people who are actually informed on a subject doesn’t make for useful news reporting.

Asking Chris Hemsworth who plays Thor in Avengers: Age of Ultron what the residents of Asgard think about relief efforts for victims of the recent earthquake in Nepal would be as meaningful.

They still publish the National Enquirer. A much better venue for “surveys” of the uninformed.

Pwning a thin client in less than two minutes

April 30th, 2015

Pwning a thin client in less than two minutes by Roberto Suggi Liverani

From the post:

Have you ever encountered a zero client or a thin client? It looks something like this…


If yes, keep reading below, if not, then if you encounter one, you know what you can do if you read below…

The model above is a T520, produced by HP – this model and other similar models are typically employed to support a medium/large VDI (Virtual Desktop Infrastructure) enterprise.

These clients run a Linux-based HP ThinPro OS by default and I had a chance to play with image version T6X44017 in particular, which is fun to play with it, since you can get a root shell in a very short time without knowing any password…

Normally, HP ThinPro OS interface is configured in a kiosk mode, as the concept of a thin/zero client is based on using a thick client to connect to another resource. For this purpose, a standard user does not need to authenticate to the thin client per se and would just need to perform a connection – e.g. VMware Horizon View. The user will eventually authenticate through the connection.

The point of this blog post is to demonstrate that a malicious actor can compromise such thin clients in a trivial and quick way provided physical access, a standard prerequisite in an attack against a kiosk.

During my testing, I have tried to harden as much as possible the thin client, with the following options:

Physical security is a commonly overlooked aspect of network security. That was true almost twenty (20) years ago when I was a Novell CNE and that hasn’t changed since. (Physical & Network Security: Better Together In 2014)

You don’t have to take my word for it. Take a walk around your office and see what network equipment or cables could be physically accessed for five minutes or less by any casual visitor. (Don’t forget unattended workstations.)

Don’t spend time and resources on popular “threats” such as China and North Korea when the pizza delivery guy can plug a wireless hub into an open Ethernet port inside your firewall. Yes?

For PR purposes the FBI would describe such a scheme as evidence of advanced networking and computer protocol knowledge. It may be from their perspective. ;-) It shouldn’t be from yours.

Some notes on why crypto backdoors are unreasonable

April 29th, 2015

Some notes on why crypto backdoors are unreasonable by Robert Graham.

Robert gives a good summary of the usual arguments against crypto backdoors and then makes a new-to-me case against the FBI lobbying for such backdoors.

From the post:

Today’s testimony by the FBI and the DoJ discussed the tradeoffs between privacy and protection. Victims of crimes, those who get raped and murdered, deserve to have their killers brought to justice. That criminals get caught dissuades crime. Crypto makes prosecuting criminals harder.

That’s all true, and that’s certainly the argument victim rights groups should make when lobbying government. But here’s the thing: it’s not the FBI’s job to care. We, the people, make the decision about these tradeoffs. It’s solely we, the people, who are the constituents lobbying congress. The FBI’s job is to do what we tell them. They aren’t an interested party. Sure, it’s their job to stop crime, but it’s also their job to uphold rights. They don’t have an opinion, by definition, which one takes precedence over the other — congress makes that decision.

Yet, in this case, they do have an opinion. The only reason the subcommittee held hearings today is in response to the FBI lobbying for backdoors. Even if this issue were reasonable, it’s not reasonable that the FBI should lobby for it.

Where I depart from Robert is in his concessions that there is a tradeoff between privacy and protection, that getting caught dissuades crime and that crypto makes prosecuting criminals more difficult.

Amy Hess, the executive assistant director of the FBI’s science and technology branch, testified:

It’s critical for police to “have the ability to accept or to receive the information that we might need in order to hold those accountable who conduct heinous crimes or conduct terrorist attacks,” (Government ‘backdoors’ to bypass encryption will make them vulnerable to attacks – industry experts)

The victims of crimes and prosecution arguments are entirely speculative. If there were a single concrete case where crypto allowed the guilty to escape, those would be the first words out of Hess’ mouth. Law enforcement types would trot it out every day. Not even a single case has come to light. There isn’t any balancing to do with the needs of law enforcement. They should come back when they can show real harm.

The other thing that prompted me to write was Robert saying that getting caught “dissuades crime.” Hardly, that’s the old canard about the death penalty being a deterrent to crime. Never has been the case that punishment deters crime. Even when hands were removed for theft.

The FBI has an interest to advance, for the same reason that it sets up emotionally disturbed young men to be busted for terrorist offenses. It has a budget and staff to maintain and you can’t do that without keeping yourself in the public eye. It also captures real criminals from time to time, but that is more of a sideline than its main purpose. Like all agencies and businesses, its main objective is its own preservation.

My disagreement with the FBI is over its use of fictional threats to deceive the public and its representatives for purposes that have nothing to do with the public good.

800,000 NPR Audio Files!

April 29th, 2015

There Are Now 800,000 Reasons To Share NPR Audio On Your Site by Patrick Cooper.

From the post:

From NPR stories to shows to songs, today we’re making more than 800,000 pieces of our audio available for you to share around the Web. We’re throwing open the doors to embedding, putting our audio on your site.

Complete with simple instructions for embedding!

I often think of topic maps when listening to NPR so don’t be surprised if you start seeing embedded NPR audio in the very near future!


[U.S.] House Member Data in XML

April 29th, 2015

User Guide and Data Dictionary. (In PDF)

From the Introduction:

The Office of the Clerk makes available membership lists for the U.S. House of Representatives. These lists are available in PDF format, and starting with the 114th Congress, the data is available in XML format. The document and data are available at

For unknown reasons, the link does not appear as a hyperlink in the guide.

Just as well because the link to the XML isn’t on that page anyway. Try: instead.

Looking forward to the day when all information generated by Congress is available in daily XML dumps.

MapR on Open Data Platform: Why we declined

April 29th, 2015

MapR on Open Data Platform: Why we declined by John Schroeder.

From the post:

Open Data Platform is “solving” problems that don’t need solving

Companies implementing Hadoop applications do not need to be concerned about vendor lock-in or interoperability issues. Gartner analysts Merv Adrian and Nick Heudecker disclosed in a recent blog that less than 1% of companies surveyed thought that vendor lock-in or interoperability was an issue—dead last on the list of customer concerns. Project and sub-project interoperability are very good and guaranteed by both free and paid-for distributions. Applications built on one distribution can be migrated with virtually zero switching costs to the other distributions.

Open Data Platform participation lacks participation by the Hadoop leaders

~75% of Hadoop implementations run on MapR and Cloudera. MapR and Cloudera have both chosen not to participate. The Open Data Platform without MapR and Cloudera is a bit like one of the Big Three automakers pushing for a standards initiative without the involvement of the other two.

I mention this post because it touches on two issues that should concern all users of Hadoop applications.

On “vendor lock-in” you will find that the question asked was “…how many attendees considered vendor lock-in a barrier to investment in Hadoop. It came in dead last. With around 1% selecting it.” (Who Asked for an Open Data Platform?) Considering that the question was asked in the context of a Gartner webinar, that 1% could have been a single person. Not what I would call a representative sample.

Still, I think John is right in saying that vendor lock-in isn’t a real issue with Hadoop. Hadoop applications aren’t off the shelf items and are custom constructs for your needs and data. Not much opportunity for vendor lock-in. You’re in greater danger of IT lock-in due to poor or non-existent documentation for your Hadoop application. If anyone tells you a Hadoop application doesn’t need documentation because you can “…read the code…,” they are building up job security, quite possibly at your future expense.

John is spot on about the Open Data Platform not including all of the Hadoop market leaders. As John says, Open Data Platform does not include those responsible for 75% of the existing Hadoop implementations.

I have seen that situation before in standards work and it never leads to a happy conclusion, for the participants, non-participants and especially the consumers, who are supposed to benefit from the creation of standards. Non-standards for a minority of the market only serve to confuse not overly clever consumers. To say nothing of the popular IT press.

The Open Data Platform also raises questions about how one goes about creating a standard. One approach is to create a standard based on your projection of market needs and to campaign for its adoption. Another is to create a definition of an “ODP Core” and see if it is used by customers in development contracts and purchase orders. If consumers find it useful, they will no doubt adopt it as a de facto standard. Formalization can follow in due course.

So long as we are talking about possible future standards, a practice of documentation more advanced than C style comments for Hadoop ecosystems would be a useful Hadoop standard in the future.

No Incentives = No Improvement in Cybersecurity

April 29th, 2015

The State of Cybersecurity: Implications for 2015 (An ISACA and RSA Conference Survey) is now available.

It won’t take you long to conclude that the state of cybersecurity for 2015, and any year thereafter, is going to be about the same.

I say that because out of twenty-five (25) questions, only two (2) dealt with motivations and those were questions about motives for attacks (questions 9 and 10).

Changing the cybersecurity landscape in favor of becoming more, not less, secure will require:

  1. Discussion of positive incentives for greater security, more secure code, etc.
  2. Creation of positive incentives by government and industry for greater security, etc.
  3. Increases in security driven by sufficient incentives to produce greater security.

Think of security as a requirement. If you aren’t willing to pay for a requirement, why should anyone write software that meets that requirement?

Or to put it differently, you don’t have a right to be secure, but you should have the opportunity.

Present and Future of Big Data

April 28th, 2015

I thought you might find this amusing as a poster for your office.

Someday your grandchildren will find it similar to “The World of Tomorrow” at the 1939 World’s Fair.

Infographic: Big Data, present and future

Crowdsourcing Courses

April 28th, 2015

Kurt Luther is teaching a crowdsourcing course this Fall and has a partial list of crowdsourcing courses.

Any more to suggest?

Kurt tweets about crowdsourcing and history so you may want to follow him on Twitter.

Markov Composer

April 28th, 2015

Markov Composer – Using machine learning and a Markov chain to compose music by Andrej Budinčević.

From the post:

In the following article, I’ll present some of the research I’ve been working on lately. Algorithms, or algorithmic composition, have been used to compose music for centuries. For example, Western punctus contra punctum can be sometimes reduced to algorithmic determinacy. Then, why not use fast-learning computers capable of billions of calculations per second to do what they do best, to follow algorithms? In this article, I’m going to do just that, using machine learning and a second order Markov chain.

If you like exploring music with a computer, Andrej’s post will be a real treat!
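If "second order Markov chain" is unfamiliar, here is a minimal sketch of the idea in Python. It is not Andrej's code: the training melody, the function names and the note-names-only representation are all my own assumptions for illustration. "Second order" simply means the next note is chosen based on the previous two notes rather than the previous one.

```python
import random
from collections import defaultdict


def build_chain(notes):
    """Map each pair of consecutive notes to the notes that followed that pair."""
    chain = defaultdict(list)
    for a, b, c in zip(notes, notes[1:], notes[2:]):
        chain[(a, b)].append(c)
    return chain


def compose(chain, start, length, rng=random):
    """Generate up to `length` notes, starting from a two-note seed."""
    melody = list(start)
    while len(melody) < length:
        followers = chain.get((melody[-2], melody[-1]))
        if not followers:  # dead end: this pair never had a successor in training
            break
        melody.append(rng.choice(followers))
    return melody


# Hypothetical training melody (note names only, no durations or velocities).
training = ["C", "D", "E", "C", "D", "G", "E", "C", "D", "E", "G", "C"]
chain = build_chain(training)
melody = compose(chain, start=("C", "D"), length=16, rng=random.Random(42))
print(" ".join(melody))
```

Storing repeated followers in a list (rather than counting them) means `rng.choice` picks each follower in proportion to how often it appeared in the training melody, which is all the "learning" this sketch does.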


MarkovComposer (GitHub)

I first saw this in a tweet by Debasish Ghosh.

One Word Twitter Search Advice

April 28th, 2015

The one word journalists should add to Twitter searches that you probably haven’t considered by Daniel Victor.

Daniel takes you through five results without revealing how he obtained them. A bit long but you will be impressed when he reveals the answer.

He also has some great tips for other Twitter searching. Tips that you won’t see from any SEO.

Definitely something to file with your Twitter search tips.

Hacking telesurgery robots, a concrete risk

April 27th, 2015

Hacking telesurgery robots, a concrete risk by Pierluigi Paganini.

From the post:

Technology will help humans to overwhelm any obstacle, one of them is the concept of space that for some activities could represent a serious problem. Let’s think for example to a life-saving surgery that could be performed by surgeons that are physically located everywhere in the world.

Telesurgery is a reality that could allow experts in one place controlling a robot in another that physically performs the surgical operation. The advantages are enormous in term of cost saving, and timely intervention of the medical staff, but which are the possible risks.

Telesurgery relies on sophisticated technology for computing, robotics and communications, and it’s easy to imagine the problem that could be caused by a threat actor.

The expert Tamara Bonaci and other colleagues at the University of Washington in Seattle have analyzed possible threats to the telesurgery, being focused on the possible cyber attacks that modify the behavior of a telerobot during surgery.

One more cyberinsecurity to add to the list!

Professional hand wringers can keep hand wringing, conference speakers can intone about the absolute necessity of better security, governments can keep buying surveillance as though it were security (yes, they both start with “s” but are not the same thing), corporations can keep evaluating cost versus the benefit of security and absent any effective incentives for cyber security, we will remain insecure.

Let me put it more bluntly: So long as cyber insecurity pays better than cyber security, cyber insecurity will continue to have the lead. Cyber security, for all of the talk and noise, is a boutique business compared to the business of cyber insecurity. How else would you explain the reported ten (10) year gap between defenders and hackers?

Government and corporate buyers could start us down the road to cyber security by refusing to purchase software that isn’t warranted to be free from buffer overflow conditions from outside input. (Not the only buffer overflow situation but an obvious one.) With warranties that have teeth in the event that such buffer overflow bugs are found.

The alternative is to have more pronouncements on the need for security, lots of papers on security, etc., and in 2016 and every year thereafter, there will be more vulnerabilities and less security than the year before. Your call.

DIY Security Fix For Android [43 year old vulnerability]

April 27th, 2015

Wi-Fi security software chokes on network names, opens potential hole for hackers by Paul Ducklin.

Paul details a bug that has been found in wpa_supplicant. The bug arises only when using Wi-Fi Direct, which is supported by Android. :(

The bug? Failure to check for buffer overflow. This must be what Dave Merkel, chief technology officer at IT security vendor FireEye, means by:

testing [software] for all things it shouldn’t do is an infinite, impossible challenge.

According to the Wikipedia article Buffer Overflow, buffer overflows were understood as early as 1972 and the first hostile use was in 1988. Those dates translate into forty-three (43) and twenty-seven (27) years ago.

Is it unreasonable to expect vulnerabilities known for forty-three (43) years and first exploited twenty-seven (27) years ago to be avoided in current programming practice?

This is the sort of issue where programming standards, along with legal liability as an incentive, could make a real difference.
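To show how small the missing check is, here is a minimal sketch in Python. It is not the wpa_supplicant code (which is C) and the function name is hypothetical. The only fact it leans on is that an 802.11 network name (SSID) is limited to 32 bytes. Python is memory safe, so an oversized copy here would fail or grow the buffer rather than corrupt memory; in C the same copy without the length test writes past the end of the buffer.

```python
MAX_SSID_LEN = 32  # 802.11 limits a network name (SSID) to 32 bytes


def store_ssid(buffer: bytearray, ssid: bytes) -> int:
    """Copy an untrusted SSID into a fixed-size buffer, rejecting oversized input.

    Hypothetical helper for illustration only.
    """
    # The one-line check whose absence is the classic buffer overflow.
    if len(ssid) > MAX_SSID_LEN or len(ssid) > len(buffer):
        raise ValueError(f"SSID of {len(ssid)} bytes exceeds the limit")
    buffer[:len(ssid)] = ssid
    return len(ssid)


buf = bytearray(MAX_SSID_LEN)
copied = store_ssid(buf, b"cafe-wifi")

try:
    store_ssid(buf, b"A" * 200)  # attacker-supplied, far too long
    rejected = False
except ValueError:
    rejected = True
```

The point is not the language but the habit: length arrives from outside, so length gets validated before anything is copied.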

If you are interested in knowing more about buffer overflows, see: Writing buffer overflow exploits – a tutorial for beginners.

Hijacking a Plane with Excel

April 26th, 2015

Wait! That’s not the right title! Hacking Airplanes by Bruce Schneier.

I was thinking about the Dilbert cartoon where the pointy-haired boss tries to land a plane using Excel. ;-)

There are two points where I disagree with Bruce’s post, at least a little.

From the post:

Governments only have a fleeting advantage over everyone else, though. Today’s top-secret National Security Agency programs become tomorrow’s Ph.D. theses and the next day’s hacker’s tools. So while remotely hacking the 787 Dreamliner’s avionics might be well beyond the capabilities of anyone except Boeing engineers today, that’s not going to be true forever.

What this all means is that we have to start thinking about the security of the Internet of Things–whether the issue in question is today’s airplanes or tomorrow’s smart clothing. We can’t repeat the mistakes of the early days of the PC and then the Internet, where we initially ignored security and then spent years playing catch-up. We have to build security into everything that is going to be connected to the Internet.

First, I’m not so sure that only current Boeing engineers would be capable of hacking a 787 Dreamliner’s avionics. I don’t have a copy of the avionics code but I assume there are plenty of ex-Boeing engineers who may have one, and other people who could obtain one. A lack of interest, more than a lack of access to the avionics code, probably explains why it hasn’t been hacked so far. If you want to crash an airliner there are many easier methods than hacking its avionics code.

Second, I am far from convinced by Bruce’s argument:

We can’t repeat the mistakes of the early days of the PC and then the Internet, where we initially ignored security and then spent years playing catch-up.

Unless a rule against human stupidity was passed quite recently I don’t know of any reason why we won’t duplicate the mistakes of the early days of the PC and then of the Internet. Credit cards have been around far longer than both the PC and the Internet, yet fraud abounds in the credit card industry.

Do you remember: The reason companies don’t fix cybersecurity?

The reason why credit card companies don’t stop credit card fraud is that stopping it would cost more than the fraud. It isn’t a moral issue for them, it is a question of profit and loss. There is a point at which fraud becomes too costly and the higher cost of security is worth the cost.

For example, did you know that at some banks no check under $5,000.00 is ever inspected by anyone? Not even for signatures. It isn’t worth the cost of checking every item.

Security, at least for vendors, in the Internet of Things will be the same way. Vendors will provide security if and only if the cost of not having it outweighs the cost of providing it on their bottom lines.

That plus human stupidity makes me think that cyber insecurity is here to stay.

PS: You should not attempt to hijack a plane with Excel. I don’t think your chances are all that good and the FBI and TSA (never having caught a hijacker yet), are warning airlines to be looking out for you. The FBI and TSA should be focusing on more likely threats, like hijacking a plane using telepathy.

New York Times Gets Stellarwind IG Report Under FOIA

April 26th, 2015

New York Times Gets Stellarwind IG Report Under FOIA by Benjamin Wittes.

A big thank you! to Benjamin Wittes and the New York Times.

They are the only two (2) stories on the Stellarwind IG report, released Friday evening, that give a link to the document!

The NYT story with the document: Government Releases Once-Secret Report on Post-9/11 Surveillance by Charlie Savage.

The document does not appear at:

Office of the Director of National Intelligence (as of Sunday, 26 April 2015, 17:45 EST).

US unveils 6-year-old report on NSA surveillance by Nedra Pickler (Associated Press or any news feed that parrots the Associated Press).

Suggestion: Don’t patronize news feeds that refer to documents but don’t include links to them.

NOAA weather data – Valuing Open Data – Guessing – History Repeats

April 26th, 2015

Tech titans ready their clouds for NOAA weather data by Greg Otto.

From the post:

It’s fitting that the 20 terabytes of data the National Oceanic and Atmospheric Administration produces every day will now live in the cloud.

The Commerce Department took a step Tuesday to make NOAA data more accessible as Commerce Secretary Penny Pritzker announced a collaboration among some of the country’s top tech companies to give the public a range of environmental, weather and climate data to access and explore.

Amazon Web Services, Google, IBM, Microsoft and the Open Cloud Consortium have entered into a cooperative research and development agreement with the Commerce Department that will push NOAA data into the companies’ respective cloud platforms to increase the quantity of and speed at which the data becomes publicly available.

“The Commerce Department’s data collection literally reaches from the depths of the ocean to the surface of the sun,” Pritzker said during a Monday keynote address at the American Meteorological Society’s Washington Forum. “This announcement is another example of our ongoing commitment to providing a broad foundation for economic growth and opportunity to America’s businesses by transforming the department’s data capabilities and supporting a data-enabled economy.”

According to Commerce, the data used could come from a variety of sources: Doppler radar, weather satellites, buoy networks, tide gauges, and ships and aircraft. Commerce expects this data to launch new products and services that could benefit consumer goods, transportation, health care and energy utilities.

The original press release has this cheery note on the likely economic impact of this data:

So what does this mean to the economy? According to a 2013 McKinsey Global Institute Report, open data could add more than $3 trillion in total value annually to the education, transportation, consumer products, electricity, oil and gas, healthcare, and consumer finance sectors worldwide. If more of this data could be efficiently released, organizations will be able to develop new and innovative products and services to help us better understand our planet and keep communities resilient from extreme events.

Ah, yes, that would be the Open data: Unlocking innovation and performance with liquid information, on which the summary page says:

Open data can help unlock $3 trillion to $5 trillion in economic value annually across seven sectors.

But you need to read the full report (PDF) in order to find footnote 3 on “economic value:”

3. Throughout this report we express value in terms of annual economic surplus in 2013 US dollars, not the discounted value of future cash flows; this valuation represents estimates based on initiatives where open data are necessary but not sufficient for realizing value. Often, value is achieved by combining analysis of open and proprietary information to identify ways to improve business or government practices. Given the interdependence of these factors, we did not attempt to estimate open data’s relative contribution; rather, our estimates represent the total value created.

That is a disclosure that the estimate of $3 to $5 trillion is a guess and/or speculation.

Odd how the guess/speculation disclosure drops out of the Commerce Department press release and when it gets to Greg’s story it reads:

open data could add more than $3 trillion in total value annually to the education, transportation, consumer products, electricity, oil and gas, healthcare, and consumer finance sectors worldwide.

From guess/speculation to no mention to fact, all in the short space of three publications.

Does the valuing of open data remind you of:


(Image from:

The date of 1609 is important. Wikipedia has an article on Virginia, 1609-1610, titled, Starving Time. That year, only sixty (60) out of five hundred (500) colonists survived.

Does “Excellent Fruites by Planting” sound a lot like “new and innovative products and services?”

It does to me.

I first saw this in a tweet by Kirk Borne.

Getting Started with Spark (in Python)

April 26th, 2015

Getting Started with Spark (in Python) by Benjamin Bengfort.

From the post:

Hadoop is the standard tool for distributed computing across really large data sets and is the reason why you see "Big Data" on advertisements as you walk through the airport. It has become an operating system for Big Data, providing a rich ecosystem of tools and techniques that allow you to use a large cluster of relatively cheap commodity hardware to do computing at supercomputer scale. Two ideas from Google in 2003 and 2004 made Hadoop possible: a framework for distributed storage (The Google File System), which is implemented as HDFS in Hadoop, and a framework for distributed computing (MapReduce).

These two ideas have been the prime drivers for the advent of scaling analytics, large scale machine learning, and other big data appliances for the last ten years! However, in technology terms, ten years is an incredibly long time, and there are some well-known limitations that exist, with MapReduce in particular. Notably, programming MapReduce is difficult. You have to chain Map and Reduce tasks together in multiple steps for most analytics. This has resulted in specialized systems for performing SQL-like computations or machine learning. Worse, MapReduce requires data to be serialized to disk between each step, which means that the I/O cost of a MapReduce job is high, making interactive analysis and iterative algorithms very expensive; and the thing is, almost all optimization and machine learning is iterative.

To address these problems, Hadoop has been moving to a more general resource management framework for computation, YARN (Yet Another Resource Negotiator). YARN implements the next generation of MapReduce, but also allows applications to leverage distributed resources without having to compute with MapReduce. By generalizing the management of the cluster, research has moved toward generalizations of distributed computation, expanding the ideas first imagined in MapReduce.

Spark is the first fast, general purpose distributed computing paradigm resulting from this shift and is gaining popularity rapidly. Spark extends the MapReduce model to support more types of computations using a functional programming paradigm, and it can cover a wide range of workflows that previously were implemented as specialized systems built on top of Hadoop. Spark uses in-memory caching to improve performance and, therefore, is fast enough to allow for interactive analysis (as though you were sitting on the Python interpreter, interacting with the cluster). Caching also improves the performance of iterative algorithms, which makes it great for data theoretic tasks, especially machine learning.

In this post we will first discuss how to set up Spark to start easily performing analytics, either simply on your local machine or in a cluster on EC2. We then will explore Spark at an introductory level, moving towards an understanding of what Spark is and how it works (hopefully motivating further exploration). In the last two sections we will start to interact with Spark on the command line and then demo how to write a Spark application in Python and submit it to the cluster as a Spark job.

Be forewarned, this post uses the “F” word (functional) to describe the programming paradigm of Spark. Just so you know. ;-)

If you aren’t already using Spark, this is about as easy a learning curve as can be expected.


I first saw this in a tweet by DataMining.

pandas: powerful Python data analysis toolkit Release 0.16

April 25th, 2015

pandas: powerful Python data analysis toolkit Release 0.16 by Wes McKinney and PyData Development Team.

I mentioned Wes’ 2011 paper on pandas in 2011 and a lot has changed since then.

From the homepage:

pandas: powerful Python data analysis toolkit

PDF Version

Zipped HTML

Date: March 24, 2015 Version: 0.16.0

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

pandas is well suited for many different kinds of data:

  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Ordered and unordered (not necessarily fixed-frequency) time series data.
  • Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
  • Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, DataFrame provides everything that R’s data.frame provides and much more. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.

Here are just a few of the things that pandas does well:

  • Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
  • Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining data sets
  • Flexible reshaping and pivoting of data sets
  • Hierarchical labeling of axes (possible to have multiple labels per tick)
  • Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format
  • Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.
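
A few of the items above (missing data, automatic alignment, group by) in a minimal sketch. The sketch is mine, not taken from the pandas documentation, and the labels, column names and numbers are made up for illustration:

```python
import pandas as pd

# Two Series with partially overlapping labels (made-up data).
january = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
february = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Automatic alignment: values are matched by label, not by position.
# Labels present in only one Series produce NaN (missing data).
total = january + february

# Explicit handling of missing values: treat absent labels as 0.
filled = january.add(february, fill_value=0.0)

# Split-apply-combine: group rows by a key and aggregate each group.
scores = pd.DataFrame(
    {"team": ["x", "y", "x", "y"], "score": [1, 2, 3, 4]}
)
by_team = scores.groupby("team")["score"].sum()

print(total)
print(filled)
print(by_team)
```

Labels "a" and "d" appear in only one of the two Series, so they come out as NaN in the plain sum and keep their single value in the filled version.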

Many of these principles are here to address the shortcomings frequently experienced using other languages / scientific research environments. For data scientists, working with data is typically divided into multiple stages: munging and cleaning data, analyzing / modeling it, then organizing the results of the analysis into a form suitable for plotting or tabular display. pandas is the ideal tool for all of these tasks.

Some other notes

  • pandas is fast. Many of the low-level algorithmic bits have been extensively tweaked in Cython code. However, as with anything else generalization usually sacrifices performance. So if you focus on one feature for your application you may be able to create a faster specialized tool.
  • pandas is a dependency of statsmodels, making it an important part of the statistical computing ecosystem in Python.
  • pandas has been used extensively in production in financial applications.


This documentation assumes general familiarity with NumPy. If you haven’t used NumPy much or at all, do invest some time in learning about NumPy first.

Not that I’m one to make editorial suggestions, ;-), but with almost 200 pages of What’s New entries going back to September of 2011 and topping out at over 1600 pages, I would move all but the latest What’s New to the end. Yes?

BTW, at 1600 pages, you may already be behind in your reading. Are you sure you want to get further behind?

Not only will the reading be entertaining, it will have the side benefit of improving your data analysis skills as well.


I first saw this mentioned in a tweet by Kirk Borne.

Mathematicians Reduce Big Data Using Ideas from Quantum Theory

April 24th, 2015

Mathematicians Reduce Big Data Using Ideas from Quantum Theory by M. De Domenico, V. Nicosia, A. Arenas, V. Latora.

From the post:

A new technique of visualizing the complicated relationships between anything from Facebook users to proteins in a cell provides a simpler and cheaper method of making sense of large volumes of data.

Analyzing the large volumes of data gathered by modern businesses and public services is problematic. Traditionally, relationships between the different parts of a network have been represented as simple links, regardless of how many ways they can actually interact, potentially losing precious information. Only recently a more general framework has been proposed to represent social, technological and biological systems as multilayer networks, piles of ‘layers’ with each one representing a different type of interaction. This approach allows a more comprehensive description of different real-world systems, from transportation networks to societies, but has the drawback of requiring more complex techniques for data analysis and representation.

A new method, developed by mathematicians at Queen Mary University of London (QMUL), and researchers at Universitat Rovira i Virgili in Tarragona (Spain), borrows from quantum mechanics’ well tested techniques for understanding the difference between two quantum states, and applies them to understanding which relationships in a system are similar enough to be considered redundant. This can drastically reduce the amount of information that has to be displayed and analyzed separately and make it easier to understand.

The new method also reduces computing power needed to process large amounts of multidimensional relational data by providing a simple technique of cutting down redundant layers of information, reducing the amount of data to be processed.

The researchers applied their method to several large publicly available data sets about the genetic interactions in a variety of animals, a terrorist network, scientific collaboration systems, worldwide food import-export networks, continental airline networks and the London Underground. It could also be used by businesses trying to more readily understand the interactions between their different locations or departments, by policymakers understanding how citizens use services or anywhere that there are large numbers of different interactions between things.

You can hop over to Nature, Structural reducibility of multilayer networks, where if you don’t have an institutional subscription:

ReadCube: $4.99 Rent, $9.99 to buy, or Purchase a PDF for $32.00.

Let me save you some money and suggest you look at:

Layer aggregation and reducibility of multilayer interconnected networks


Many complex systems can be represented as networks composed by distinct layers, interacting and depending on each others. For example, in biology, a good description of the full protein-protein interactome requires, for some organisms, up to seven distinct network layers, with thousands of protein-protein interactions each. A fundamental open question is then how much information is really necessary to accurately represent the structure of a multilayer complex system, and if and when some of the layers can indeed be aggregated. Here we introduce a method, based on information theory, to reduce the number of layers in multilayer networks, while minimizing information loss. We validate our approach on a set of synthetic benchmarks, and prove its applicability to an extended data set of protein-genetic interactions, showing cases where a strong reduction is possible and cases where it is not. Using this method we can describe complex systems with an optimal trade–off between accuracy and complexity.

Both articles have four (4) illustrations. Same four (4) authors. The difference is where the second one is hosted. Oh, and it is free for downloading.

I remain concerned by the focus on reducing the complexity of data to fit current algorithms and processing models. That said, there is no denying that such reduction methods have proven to be useful.

The authors neatly summarize my concerns with this outline of their procedure:

The whole procedure proposed here is sketched in Fig. 1 and can be summarised as follows: i) compute the quantum Jensen-Shannon distance matrix between all pairs of layers; ii) perform hierarchical clustering of layers using such a distance matrix and use the relative change of Von Neumann entropy as the quality function for the resulting partition; iii) finally, choose the partition which maximises the relative information gain.

With my corresponding concerns:

i) The quantum Jensen-Shannon distance matrix presumes a metric distance for its operations, which may or may not reflect the semantics of the layers (other than by simplifying assumption).

ii) The relative change of Von Neumann entropy is a difference measurement based upon an assumed metric, which may or may not represent the underlying semantics of the relationships between layers.

iii) The process concludes by maximizing a difference measurement based upon an assigned metric, which has been assigned to the different layers.

Maximizing a difference, based on an entropy calculation, which is itself based on an assigned metric doesn’t fill me with confidence.
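
To make "what is being measured" concrete, here is a rough sketch of step i) for a single pair of layers. It is mine, not the authors' code. It assumes, as I read the paper, that each layer's density matrix is its graph Laplacian rescaled to unit trace, and that the distance is the square root of the quantum Jensen-Shannon divergence. All function names are made up, and the small Jacobi eigenvalue routine is only a stand-in for a proper linear algebra library.

```python
import math


def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-20):
    """Eigenvalues of a small symmetric matrix via cyclic Jacobi rotations."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return [a[i][i] for i in range(n)]


def density_matrix(n_nodes, edges):
    """Graph Laplacian of one layer, rescaled to unit trace (assumed definition)."""
    lap = [[0.0] * n_nodes for _ in range(n_nodes)]
    for u, v in edges:
        lap[u][u] += 1.0
        lap[v][v] += 1.0
        lap[u][v] -= 1.0
        lap[v][u] -= 1.0
    trace = sum(lap[i][i] for i in range(n_nodes))
    return [[x / trace for x in row] for row in lap]


def von_neumann_entropy(rho):
    """Minus the sum of lambda * log2(lambda) over the positive eigenvalues."""
    eigenvalues = symmetric_eigenvalues(rho)
    return -sum(lam * math.log2(lam) for lam in eigenvalues if lam > 1e-12)


def quantum_js_distance(rho, sigma):
    """Square root of the quantum Jensen-Shannon divergence of two layers."""
    n = len(rho)
    mix = [[(rho[i][j] + sigma[i][j]) / 2.0 for j in range(n)] for i in range(n)]
    divergence = von_neumann_entropy(mix) - (
        von_neumann_entropy(rho) + von_neumann_entropy(sigma)
    ) / 2.0
    return math.sqrt(max(divergence, 0.0))


# Hypothetical toy layers on four nodes, one edge each.
layer_a = density_matrix(4, [(0, 1)])
layer_b = density_matrix(4, [(2, 3)])
print(quantum_js_distance(layer_a, layer_a))
print(quantum_js_distance(layer_a, layer_b))
```

Note that nothing in the calculation looks at what an edge means. Two layers are "close" when their rescaled Laplacians are close, whatever semantics the edges carry.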

I don’t doubt that the technique “works,” but doesn’t that depend upon what you think is being measured?

A question for the weekend: Do you think this is similar to the questions about dividing continuous variables into discrete quantities?

How to secure your baby monitor [Keep Out Creeps, FBI, NSA, etc.]

April 24th, 2015

How to secure your baby monitor by Lisa Vaas.

From the post:

Two more nurseries have been invaded, with strangers apparently spying on parents and their babies via their baby monitors.

This is nuts. We’re hearing more and more about these kinds of crimes, but there’s nothing commonplace about the level of fear they’re causing as families’ privacy is invaded. It’s time we put some tools into parents’ hands to help.

First, the latest creep-out cyber nursery tales. Read on to the bottom for ways to help keep strangers out of your family’s business.

I don’t know for a fact the FBI or NSA have tapped into baby monitors. But anyone who engages in an orchestrated campaign of false testimony in court spanning decades, lies to Congress (and the public), as well as kidnaps, tortures and executes people, well, my expectations aren’t all that high.

You really are entitled to privacy in your own home, especially with such a joyous occasion as the birth of a child. But that isn’t going to happen by default. Nor is the government going to guarantee that privacy. Sorry.

You would not bathe or dress your child in the front yard so don’t allow their room to become the front yard.

Teach your children good security habits along with looking both ways and holding hands to cross the street.

Almost all digitally recorded data is or can be compromised. That won’t change in the short run but we can create islands of privacy for our day to day lives. Starting with every child’s bedroom.

>30 Days From Patch – No Hacker Liability – Civil or Criminal

April 24th, 2015

Potent, in-the-wild exploits imperil customers of 100,000 e-commerce sites by Dan Goodin.

From the post:

Criminals are exploiting an extremely critical vulnerability found on almost 100,000 e-commerce websites in a wave of attacks that puts the personal information for millions of people at risk of theft.

The remote code-execution hole resides in the community and enterprise editions of Magento, the Internet’s No. 1 content management system for e-commerce sites. Engineers from eBay, which owns the e-commerce platform, released a patch in February that closes the vulnerability, but as of earlier this week, more than 98,000 online merchants still hadn’t installed it, according to researchers with Byte, a Netherlands-based company that hosts Magento-using websites. Now, the consequences of that inaction are beginning to be felt, as attackers from Russia and China launch exploits that allow them to gain complete control over vulnerable sites.

“The vulnerability is actually comprised of a chain of several vulnerabilities that ultimately allow an unauthenticated attacker to execute PHP code on the Web server,” Netanel Rubin, a malware and vulnerability researcher with security firm Checkpoint, wrote in a recent blog post. “The attacker bypasses all security mechanisms and gains control of the store and its complete database, allowing credit card theft or any other administrative access into the system.”

This flaw has been fixed but:

Engineers from eBay, which owns the e-commerce platform, released a patch in February that closes the vulnerability, but as of earlier this week, more than 98,000 online merchants still hadn’t installed it,…

The House of Representatives (U.S.) recently passed a cybersecurity bill to give companies liability protection while sharing threat data, as a step towards more sharing of cyberthreat information.

OK, but so far, have you heard of any incentives to encourage better security practices? Better security practices such as installing patches for known vulnerabilities.

Here’s an incentive idea for patch installation:

Exempt hackers from criminal and civil liability for vulnerabilities with patches more than thirty (30) days old.

Why not?

It will create a small army of hackers who pounce on every announced patch in hopes of catching someone over the thirty day deadline. It neatly solves the problem of how to monitor the installation of patches. (I am assuming the threat of being looted provides some incentive for patch maintenance.)

The second part should be a provision that insurance cannot be sold to cover losses due to hacks more than thirty days after patch release. As we have seen before, users rely on insurance to avoid spending money on cybersecurity. For hacks more than thirty days after a patch, users have to eat the losses.

Let me know if you are interested in the >30-Day-From-Patch idea. I am willing to help draft the legislation.

For further information on this vulnerability:

Wikipedia on Magento, which has about 30% of the ecommerce market.

Magento homepage, etc.

Analyzing the Magento Vulnerability (Updated) by Netanel Rubin.

From Rubin’s post:

Check Point researchers recently discovered a critical RCE (remote code execution) vulnerability in the Magento web e-commerce platform that can lead to the complete compromise of any Magento-based store, including credit card information as well as other financial and personal data, affecting nearly two hundred thousand online shops.

Check Point privately disclosed the vulnerabilities together with a list of suggested fixes to eBay prior to public disclosure. A patch to address the flaws was released on February 9, 2015 (SUPEE-5344 available here). Store owners and administrators are urged to apply the patch immediately if they haven’t done so already.
For a visual demonstration of one way the vulnerability can be exploited, please see our video here.

What kind of attack is it?

The vulnerability is actually comprised of a chain of several vulnerabilities that ultimately allow an unauthenticated attacker to execute PHP code on the web server. The attacker bypasses all security mechanisms and gains control of the store and its complete database, allowing credit card theft or any other administrative access into the system.

This attack is not limited to any particular plugin or theme. All the vulnerabilities are present in the Magento core, and affect any default installation of both Community and Enterprise Editions. Check Point customers are already protected from exploitation attempts of this vulnerability through the IPS software blade.

Rubin’s post has lots of very nice PHP code.

I first saw this in a tweet by Ciuffy.

Ordinary Least Squares Regression: Explained Visually

April 24th, 2015

Ordinary Least Squares Regression: Explained Visually by Victor Powell and Lewis Lehe.

From the post:

Statistical regression is basically a way to predict unknown quantities from a batch of existing data. For example, suppose we start out knowing the height and hand size of a bunch of individuals in a “sample population,” and that we want to figure out a way to predict hand size from height for individuals not in the sample. By applying OLS, we’ll get an equation that takes hand size—the ‘independent’ variable—as an input, and gives height—the ‘dependent’ variable—as an output.

Below, OLS is done behind-the-scenes to produce the regression equation. The constants in the regression—called ‘betas’—are what OLS spits out. Here, beta_1 is an intercept; it tells what height would be even for a hand size of zero. And beta_2 is the coefficient on hand size; it tells how much taller we should expect someone to be for a given increment in their hand size. Drag the sample data to see the betas change.

[interactive graphic omitted]

At some point, you probably asked your parents, “Where do betas come from?” Let’s raise the curtain on how OLS finds its betas.

Error is the difference between prediction and reality: the vertical distance between a real data point and the regression line. OLS is concerned with the squares of the errors. It tries to find the line going through the sample data that minimizes the sum of the squared errors. Below, the squared errors are represented as squares, and your job is to choose betas (the slope and intercept of the regression line) so that the total area of all the squares (the sum of the squared errors) is as small as possible. That’s OLS!
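
For a single independent variable the betas that minimize the sum of squared errors have a well-known closed form, so the "drag until the squares are smallest" exercise can be checked in a few lines. This is a minimal sketch of mine, not code from the post; the sample numbers and function names are made up.

```python
def ols_fit(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_errors(xs, ys, intercept, slope):
    """Total area of the 'squares': squared vertical distances to the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Made-up sample: hand size (cm) as input, height (cm) as output.
hand_sizes = [17.0, 18.0, 19.0, 20.0, 21.0]
heights = [160.0, 168.0, 171.0, 179.0, 182.0]

beta_1, beta_2 = ols_fit(hand_sizes, heights)
best = sum_squared_errors(hand_sizes, heights, beta_1, beta_2)

# Nudging either beta away from the fitted value makes the total larger.
worse = sum_squared_errors(hand_sizes, heights, beta_1 + 1.0, beta_2)

print(beta_1, beta_2, best, worse)
```

For this made-up sample the fitted line is height = 67.5 + 5.5 × hand size, with a sum of squared errors of 7.5. Any other choice of betas gives a larger total.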

The post includes a visual explanation of ordinary least squares regression up to 2 independent variables (3-D).

Height wasn’t the correlation I heard with hand size but Visually Explained is a family friendly blog. And to be honest, I got my information from another teenager (at the time), so my information source is suspect.

jQAssistant 1.0.0 released

April 24th, 2015

jQAssistant 1.0.0 released by Dirk Mahler.

From the webpage:

We’re proud to announce the availability of jQAssistant 1.0.0 – lots of thanks go to all the people who made this possible with their ideas, criticism and code contributions!

Feature Overview

  • Static code analysis tool using the graph database Neo4j
  • Scanning of software related structures, e.g. Java artifacts (JAR, WAR, EAR files), Maven descriptors, XML files, relational database schemas, etc.
  • Allows definition of rules and automated verification during a build process
  • Rules are expressed as Cypher queries or scripts (e.g. JavaScript, Groovy or JRuby)
  • Available as Maven plugin or CLI (command line interface)
  • Highly extensible by plugins for scanners, rules and reports
  • Integration with SonarQube
  • It’s free and Open Source

Example Use Cases

  • Analysis of existing code structures and matching with proposed architecture and design concepts
  • Impact analysis, e.g. which test is affected by potential code changes
  • Visualization of architectural concepts, e.g. modules, layers and their dependencies
  • Continuous verification and reporting of constraint violations to provide fast feedback to developers
  • Individual gathering and filtering of metrics, e.g. complexity per component
  • Post-Processing of reports of other QA tools to enable refactorings in brown field projects
  • and much more…

Get it!

jQAssistant is available as a command line client from the downloadable distribution:

scan -f my-application.war
analyze
server

or as Maven plugin:


For a list of latest changes refer to the release notes, the documentation provides usage information.

Those who are impatient should go for the Get Started page which provides information about the first steps about scanning applications and running analysis.

Your Feedback Matters

Every kind of feedback helps to improve jQAssistant: feature requests, bug reports and even questions about how to solve specific problems. You can choose between several channels – just pick your preferred one: the discussion group, stackoverflow, a Gitter channel, the issue tracker, e-mail or just leave a comment below.


You want to get started quickly for an inventory of an existing Java application architecture? Or you’re interested in setting up a continuous QA process that verifies your architectural concepts and provides graphical reports?
The team of buschmais GbR offers individual workshops for you! For getting more information and setting up an agenda refer to (German) or just contact us via e-mail!

Short of wide spread censorship, in order for security breaches to fade from the news spotlight, software quality/security must improve.

jQAssistant 1.0.0 is one example of the type of tool required for software quality/security to improve.

Of particular interest is its use of Neo4j, which enables named relationships between materials and your code.

I don’t mean to foster “…everything is a graph…” any more than I would foster “…everything is a set of relational tables…” or “…everything is a key/value pair…,” etc. The question is: “What is the best way, given my requirements and constraints, to achieve objective X?” Whether relationships are explicit (and if so, what I can say about them) or implicit depends on my requirements, not those of a vendor.

In the case of recording who wrote the most buffer overflows and where, plus other flaws, tracking named relationships and similar information should be part of your requirements and graphs are a good way to meet that requirement.

Animation of Gerrymandering?

April 24th, 2015

United States Congressional District Shapefiles by Jeffrey B. Lewis, Brandon DeVine, and Lincoln Pitcher with Kenneth C. Martis.

From the description:

This site provides digital boundary definitions for every U.S. Congressional District in use between 1789 and 2012. These were produced as part of NSF grant SBE-SES-0241647 between 2009 and 2013.

The current release of these data is experimental. We have done a good deal of work to validate all of the shapes. However, it is quite likely that some irregularities remain. Please email with questions or suggestions for improvement. We hope to have a ticketing system for bugs and a versioning system up soon. The district definitions currently available should be considered an initial-release version.

Many districts were formed by aggregating complete county shapes obtained from the National Historical Geographic Information System (NHGIS) project and the Newberry Library’s Atlas of Historical County Boundaries. Where Congressional district boundaries did not coincide with county boundaries, district shapes were constructed district-by-district using a wide variety of legal and cartographic resources. Detailed descriptions of how particular districts were constructed and the authorities upon which we relied are available (at the moment) by request and described below.

Every state districting plan can be viewed quickly at (clicking on any of the listed file names will create a map window that can be panned and zoomed). GeoJSON definitions of the districts can also be downloaded from the same URL. Congress-by-Congress district maps in ESRI Shapefile format can be downloaded below. Though providing somewhat lower resolution than the shapefiles, the GeoJSON files contain additional information about the members who served in each district that the shapefiles do not (Congress member information may be useful for creating web applications with, for example, Google Maps or Leaflet).

Project Team

The Principal Investigator on the project was Jeffrey B. Lewis. Brandon DeVine and Lincoln Pitcher researched district definitions and produced thousands of digital district boundaries. The project relied heavily on Kenneth C. Martis’ The Historical Atlas of United States Congressional Districts: 1789-1983. (New York: The Free Press, 1982). Martis also provided guidance, advice, and source materials used in the project.

How to cite

Jeffrey B. Lewis, Brandon DeVine, Lincoln Pitcher, and Kenneth C. Martis. (2013) Digital Boundary Definitions of United States Congressional Districts, 1789-2012. [Data file and code book]. Retrieved from on [date of

An impressive resource for anyone interested in the history of United States Congressional Districts and their development. An animation of gerrymandering of congressional districts was the first use case that jumped to mind. ;-)
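
A first step towards such an animation might be grouping district shapes into one frame per Congress. What follows is only a sketch under loud assumptions: the inline GeoJSON is invented, the property names "congress" and "district" are hypothetical (I have not checked what the real files call them), only plain Polygon geometries are handled, and the shoelace area is a planar figure that ignores map projection.

```python
import json
from collections import defaultdict

# Invented, heavily simplified GeoJSON. The property names "congress"
# and "district" are assumptions, not the real files' schema.
SAMPLE = """
{"type": "FeatureCollection", "features": [
  {"type": "Feature",
   "properties": {"congress": 1, "district": "1"},
   "geometry": {"type": "Polygon",
                "coordinates": [[[0, 0], [4, 0], [4, 3], [0, 3], [0, 0]]]}},
  {"type": "Feature",
   "properties": {"congress": 2, "district": "1"},
   "geometry": {"type": "Polygon",
                "coordinates": [[[0, 0], [2, 0], [2, 3], [0, 3], [0, 0]]]}}
]}
"""


def ring_area(ring):
    """Planar shoelace area of a closed ring of [x, y] pairs."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def frames_by_congress(collection):
    """Map each Congress number to a list of (district, outer-ring area)."""
    frames = defaultdict(list)
    for feature in collection["features"]:
        props = feature["properties"]
        outer = feature["geometry"]["coordinates"][0]  # Polygon only
        frames[props["congress"]].append((props["district"], ring_area(outer)))
    return dict(sorted(frames.items()))


frames = frames_by_congress(json.loads(SAMPLE))
for congress, districts in frames.items():
    print(congress, districts)
```

Each key of frames would become one frame of the animation. Drawing the shapes is left to whatever mapping library you prefer.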


I first saw this in a tweet by Larry Mullen.

Are Government Agencies Trustworthy? FBI? No!

April 23rd, 2015

Pseudoscience in the Witness Box: The FBI faked an entire field of forensic science by Dahlia Lithwick.

From the post:

The Washington Post published a story so horrifying this weekend that it would stop your breath: “The Justice Department and FBI have formally acknowledged that nearly every examiner in an elite FBI forensic unit gave flawed testimony in almost all trials in which they offered evidence against criminal defendants over more than a two-decade period before 2000.”

What went wrong? The Post continues: “Of 28 examiners with the FBI Laboratory’s microscopic hair comparison unit, 26 overstated forensic matches in ways that favored prosecutors in more than 95 percent of the 268 trials reviewed so far.” The shameful, horrifying errors were uncovered in a massive, three-year review by the National Association of Criminal Defense Lawyers and the Innocence Project. Following revelations published in recent years, the two groups are helping the government with the country’s largest ever post-conviction review of questioned forensic evidence.

Chillingly, as the Post continues, “the cases include those of 32 defendants sentenced to death.” Of these defendants, 14 have already been executed or died in prison.

You should read Dahlia’s post carefully and then write “untrustworthy” next to any reference to or material from the FBI.

This particular issue involved identifying hair samples to be the same, which went beyond any known science.

But if 26 out of 28 experts were willing to go there, how far do you think the average agent on the street goes towards favoring the prosecution?

True, the FBI is working to find all the cases where this has happened, but questions about this type of evidence were raised long before now. But questioning the prosecution’s evidence doesn’t work in favor of the FBI.

Defense teams need to start requesting judicial notice of the propensity of executive branch department employees to give false testimony and a cautionary instruction to jurors in cases where they appear in trials.

Unker Non-Linear Writing System

April 23rd, 2015

Unker Non-Linear Writing System by Alex Fink & Sai.

From the webpage:


“I understood from my parents, as they did from their parents, etc., that they became happier as they more fully grokked and were grokked by their cat.”[3]

Here is another snippet from the text:

Binding points, lines and relations

Every glyph includes a number of binding points, one for each of its arguments, the semantic roles involved in its meaning. For instance, the glyph glossed as eat has two binding points—one for the thing consumed and one for the consumer. The glyph glossed as (be) fish has only one, the fish. Often we give glosses more like “X eat Y”, so as to give names for the binding points (X is eater, Y is eaten).

A basic utterance in UNLWS is put together by writing out a number of glyphs (without overlaps) and joining up their binding points with lines. When two binding points are connected, this means the entities filling those semantic roles of the glyphs involved coincide. Thus when the ‘consumed’ binding point of eat is connected to the only binding point of fish, the connection refers to an eaten fish.
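
The binding mechanism just described looks a lot like a small graph structure, so here is a sketch of how it could be modelled. It is my reading of the passage, not anything from the UNLWS authors. The class and method names are made up; the role names come from the "eat" and "fish" example in the text.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Glyph:
    gloss: str
    binding_points: tuple  # names of the semantic roles


@dataclass
class Utterance:
    glyphs: list = field(default_factory=list)
    lines: list = field(default_factory=list)  # pairs of (glyph index, role)

    def add(self, glyph):
        """Write out a glyph and return its position in the utterance."""
        self.glyphs.append(glyph)
        return len(self.glyphs) - 1

    def connect(self, a, role_a, b, role_b):
        """Join two binding points with a line."""
        for idx, role in ((a, role_a), (b, role_b)):
            if role not in self.glyphs[idx].binding_points:
                raise ValueError(
                    f"{self.glyphs[idx].gloss} has no binding point {role!r}"
                )
        self.lines.append(((a, role_a), (b, role_b)))

    def entities(self):
        """Group binding points that coincide because lines join them."""
        parent = {
            (i, role): (i, role)
            for i, glyph in enumerate(self.glyphs)
            for role in glyph.binding_points
        }

        def find(point):
            while parent[point] != point:
                point = parent[point]
            return point

        for left, right in self.lines:
            parent[find(left)] = find(right)
        groups = {}
        for point in parent:
            groups.setdefault(find(point), set()).add(point)
        return list(groups.values())


utterance = Utterance()
eat = utterance.add(Glyph("eat", ("consumer", "consumed")))
fish = utterance.add(Glyph("(be) fish", ("fish",)))
utterance.connect(eat, "consumed", fish, "fish")
print(utterance.entities())
```

The connected pair, the "consumed" role of eat and the only role of fish, ends up in one group: the eaten fish. The unconnected "consumer" role stays on its own.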

This is the main mechanism by which UNLWS clauses are assembled. To take a worked example, here are four glyphs:


If you are interested in graphical representations for design or presentation, this may be of interest.

Sam Hunting forwarded this while we were exploring TeX graphics.

PS: The “cat” people on Twitter may appreciate the first graphic. ;-)

Protecting Your Privacy From The NSA?

April 23rd, 2015

House passes cybersecurity bill by Cory Bennett and Cristina Marcos.

From the post:

The House on Wednesday passed the first major cybersecurity bill since the calamitous hacks on Sony Entertainment, Home Depot and JPMorgan Chase.

Passed 307-116, the Protecting Cyber Networks Act (PCNA), backed by House Intelligence Committee leaders, would give companies liability protections when sharing cyber threat data with government civilian agencies, such as the Treasury or Commerce Departments.

“This bill will strengthen our digital defenses so that American consumers and businesses will not be put at the mercy of cyber criminals,” said House Intelligence Committee Chairman Devin Nunes (R-Calif.).

Lawmakers, government officials and most industry groups argue more data will help both sides better understand their attackers and bolster network defenses that have been repeatedly compromised over the last year.

Privacy advocates and a group of mostly Democratic lawmakers worry the bill will simply shuttle more sensitive information to the National Security Agency (NSA), further empowering its surveillance authority. Many security experts agree, adding that they already have the data needed to study hackers’ tactics.

The connection between sharing threat data and loss of privacy to the NSA escapes me.

At present, the NSA can or is:

  • Monitoring all Web traffic
  • Monitoring all Email traffic
  • Collecting all Phone metadata
  • Collecting all Credit Card information
  • Collecting all Social Media data
  • Collecting all Travel data
  • Collecting all Banking data
  • Has spied on Congress and other agencies
  • Can demand production of other information and records from anyone
  • Probably has a copy of your income tax and social security info

You are concerned private information about you might be leaked to the NSA in the form of threat data?


Anything is possible so something the NSA doesn’t already know could possibly come to light, but I would not waste my energy opposing a bill that is virtually no additional threat to privacy.

The NSA is the issue that needs to be addressed. Its very existence is incompatible with any notion of privacy.

NPR and The “American Bias”

April 23rd, 2015

Can you spot the “American bias” both in this story and the reporting by NPR?

U.S. Operations Killed Two Hostages Held By Al-Qaida, Including An American by Krishnadev Calamur:

President Obama offered his “grief and condolences” to the families of the American and Italian aid workers killed in a U.S. counterterrorism operation in January. Both men were held hostage by al-Qaida.

“I take full responsibility for a U.S. government counterterrorism operation that killed two innocent hostages held by al-Qaida,” Obama said.

The president said both Warren Weinstein, an American held by the group since 2011, and Giovanni Lo Porto, an Italian national held since 2012, were “devoted to improving the lives of the Pakistani people.”

Earlier Thursday, the White House in a statement announced the two deaths, along with the killings of two American al-Qaida members.

“Analysis of all available information has led the Intelligence Community to judge with high confidence that the operation accidentally killed both hostages,” the White House statement said. “The operation targeted an al-Qa’ida-associated compound, where we had no reason to believe either hostage was present, located in the border region of Afghanistan and Pakistan. No words can fully express our regret over this terrible tragedy.”

Exact numbers of casualties from American drone strikes are hard to come by but current estimates suggest that more people have died from drone attacks than in 9/11. A large number of those people were not the intended targets but civilians, including hundreds of children. A Bureau of Investigative Journalism report has spreadsheets you can download to find the specifics about drone strikes in particular countries.

Let’s pause to hear the Obama Administration’s “grief and condolences” over the deaths of civilians and children in each of those strikes:


That’s right, the Obama Administration has trouble admitting any civilians or children have died as a result of its drone war. Perhaps trying to avoid criminal responsibility for their actions. But it certainly has not expressed any “grief and condolences” over those deaths.

Jeff Bachman, of American University, estimates that between twenty-eight (28) and thirty-five (35) civilians die for every one (1) person killed on the Obama “kill” list in Pakistan alone. Drone Strikes: Are They Obama’s Enhanced Interrogation Techniques?

You will notice that NPR reporting does not contrast Obama’s “grief and condolences” for the deaths of two hostages (one of whom was American) with his lack of any remorse over the deaths of civilians and children in other drone attacks.

Obama’s lack of remorse over the deaths of innocents in other drone attacks reportedly isn’t unusual for war criminals. War criminals see their crimes as justified by the pursuit of a goal worth more than innocent human lives. Or in this case, more valuable than non-American innocent lives.

A Scary Earthquake Map – Oklahoma

April 22nd, 2015

Earthquakes in Oklahoma – Earthquake Map


Great example of how visualization can make the case that “standard” industry practices are in fact damaging the public.

The map is interactive and the screen shot above is only one example.

The main site is located at:

From the homepage:

Oklahoma experienced 585 magnitude 3+ earthquakes in 2014 compared to 109 events recorded in 2013. This rise in seismic events has the attention of scientists, citizens, policymakers, media and industry. See what information and research state officials and regulators are relying on as the situation progresses.

The next stage of data mapping should be identifying the owners or those who profited from the waste water disposal wells and their relationships to existing oil and gas interests, as well as their connections to members of the Oklahoma legislature.

What is it that Republicans call it? Ah, accountability, as in holding teachers and public agencies “accountable.” Looks to me like it is time to hold some oil and gas interests and their owners “accountable.”

PS: Said to not be a “direct” result of fracking but of the disposal of water used for fracking. Close enough for my money. You?

Gathering, Extracting, Analyzing Chemistry Datasets

April 22nd, 2015

Activities at the Royal Society of Chemistry to gather, extract and analyze big datasets in chemistry by Antony Williams.

If you are looking for a quick summary of efforts to combine existing knowledge resources in chemistry, you can do far worse than Antony’s 118 slides on the subject (2015).

I want to call special attention to Slide 107 in his slide deck:


True enough, extraction is problematic, expensive, inaccurate, etc., all the things Antony describes. And I would strongly second all of what he implies is the better practice.

However, extraction isn’t just a necessity for today or for a few years, extraction is going to be necessary so long as we keep records about chemistry or any other subject.

Think about all the legacy materials on chemistry that exist in hard copy format just for the past two centuries. To say nothing of all the still older materials. It is more than unfortunate to abandon all that information simply because “modern” digital formats are easier to manipulate.

That wasn’t what Antony meant to imply, but even after all materials have been extracted and exist in some form of digital format, that doesn’t mean the era of “extraction” will have ended.

You may not remember when atomic chemistry used “punch cards” to record isotopes:


An isotope file on punched cards. George M. Murphy J. Chem. Educ., 1947, 24 (11), p 556 DOI: 10.1021/ed024p556 Publication Date: November 1947.

Today we would represent that record in…NoSQL?

Are you confident that in another sixty-eight (68) years we will still be using NoSQL?

We have to choose from the choices available to us today, but we should not deceive ourselves into thinking our solution will be seen as the “best” solution in the future. New data will be discovered, new processes invented, new requirements will emerge, all of which will be clamoring for a “new” solution.

Extraction will persist as long as we keep recording information in the face of changing formats and requirements. We can improve that process but I don’t think we will ever completely avoid it.
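To make the point concrete, here is a minimal sketch of what “extraction” from a punched card into a present-day document format might look like. The column layout, field names and sample card are all hypothetical, invented for illustration; they are not taken from Murphy’s 1947 article, and a real card file would need its own documented layout.

```python
import json

# Hypothetical fixed-width layout for an 80-column card.
# The column positions below are illustrative assumptions only.
CARD_LAYOUT = {
    "symbol": (0, 2),
    "atomic_number": (2, 5),
    "mass_number": (5, 8),
    "half_life": (8, 20),
}


def extract_card(card: str) -> dict:
    """Slice one fixed-width card into named fields."""
    record = {
        field: card[start:end].strip()
        for field, (start, end) in CARD_LAYOUT.items()
    }
    # Numeric columns are stored zero-padded on the (hypothetical) card.
    record["atomic_number"] = int(record["atomic_number"])
    record["mass_number"] = int(record["mass_number"])
    return record


def to_document(record: dict) -> str:
    """Serialize the record as JSON, a stand-in for a NoSQL document."""
    return json.dumps(record, sort_keys=True)


# Illustrative sample card (carbon-14), padded to 80 columns.
sample_card = "C 006014 5730 y".ljust(80)

if __name__ == "__main__":
    print(to_document(extract_card(sample_card)))
```

Note that the layout table is the part most likely to be lost over time. Without it, the card (or the JSON, sixty-eight years on) is just a string of characters, and someone has to do the extraction all over again.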