Archive for February, 2015

Start of a new era: Apache HBase™ 1.0

Wednesday, February 25th, 2015

Start of a new era: Apache HBase™ 1.0

From the post:

The Apache HBase community has released Apache HBase 1.0.0. Seven years in the making, it marks a major milestone in the Apache HBase project’s development, offers some exciting features and new API’s without sacrificing stability, and is both on-wire and on-disk compatible with HBase 0.98.x.

In this blog, we look at the past, present and future of the Apache HBase project. 

The 1.0.0 release has three goals:

1) to lay a stable foundation for future 1.x releases;

2) to stabilize running HBase clusters and their clients; and

3) to make versioning and compatibility dimensions explicit 

Seven (7) years is a long time so kudos to everyone who contributed to getting Apache HBase to this point!

For those of you who like documentation, see the Apache HBase™ Reference Guide.

Black Site in USA – Location and Details

Tuesday, February 24th, 2015

The disappeared: Chicago police detain Americans at abuse-laden ‘black site’ by Spencer Ackerman.

From the post:

The Chicago police department operates an off-the-books interrogation compound, rendering Americans unable to be found by family or attorneys while locked inside what lawyers say is the domestic equivalent of a CIA black site.

The facility, a nondescript warehouse on Chicago’s west side known as Homan Square, has long been the scene of secretive work by special police units. Interviews with local attorneys and one protester who spent the better part of a day shackled in Homan Square describe operations that deny access to basic constitutional rights.

Alleged police practices at Homan Square, according to those familiar with the facility who spoke out to the Guardian after its investigation into Chicago police abuse, include:

  • Keeping arrestees out of official booking databases.
  • Beating by police, resulting in head wounds.
  • Shackling for prolonged periods.
  • Denying attorneys access to the “secure” facility.
  • Holding people without legal counsel for between 12 and 24 hours, including people as young as 15.

At least one man was found unresponsive in a Homan Square “interview room” and later pronounced dead.

And it gets worse, far worse.

It is a detailed post but merits a slow read, particularly the statement by Jim Trainum, a former DC homicide detective:

“I’ve never known any kind of organized, secret place where they go and just hold somebody before booking for hours and hours and hours. That scares the hell out of me that that even exists or might exist,” said Trainum, who now studies national policing issues, to include interrogations, for the Innocence Project and the Constitution Project.

If a detective who lived with death and violence on a day to day basis is frightened of police black sites, what should our reaction be?

MILJS : Brand New JavaScript Libraries for Matrix Calculation and Machine Learning

Tuesday, February 24th, 2015

MILJS : Brand New JavaScript Libraries for Matrix Calculation and Machine Learning by Ken Miura, et al.


MILJS is a collection of state-of-the-art, platform-independent, scalable, fast JavaScript libraries for matrix calculation and machine learning. Our core library offering matrix calculation is called Sushi, which exhibits far better performance than any other leading machine learning library written in JavaScript. In particular, our matrix multiplication is 177 times faster than the fastest JavaScript benchmark. Based on Sushi, a machine learning library called Tempura is provided, which supports various algorithms widely used in machine learning research. We also provide Soba as a visualization library. The implementations of our libraries are clearly written, properly documented and thus easy to get started with, as long as there is a web browser. These libraries are available from this http URL under the MIT license.

(“This http URL” is the literal surface text of a hyperlink in the original, so I left it unchanged.)

The paper is a brief introduction to the JavaScript Libraries and ends with several short demos.
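For reference, the baseline that claims like “177 times faster” are measured against is the naive O(n³) triple loop. A sketch in Python rather than JavaScript, just to pin down the operation Sushi optimizes:

```python
def matmul(a, b):
    # Naive O(n^3) matrix product: the baseline that fast libraries
    # beat with typed arrays, blocking, and SIMD/WebGL tricks.
    n, k, m = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a)
    return [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]
```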

On this one, yes, run and get the code:

Happy coding!

Using NLP to measure democracy

Tuesday, February 24th, 2015

Using NLP to measure democracy by Thiago Marzagão.


This paper uses natural language processing to create the first machine-coded democracy index, which I call Automated Democracy Scores (ADS). The ADS are based on 42 million news articles from 6,043 different sources and cover all independent countries in the 1993-2012 period. Unlike the democracy indices we have today the ADS are replicable and have standard errors small enough to actually distinguish between cases.

The ADS are produced with supervised learning. Three approaches are tried: a) a combination of Latent Semantic Analysis and tree-based regression methods; b) a combination of Latent Dirichlet Allocation and tree-based regression methods; and c) the Wordscores algorithm. The Wordscores algorithm outperforms the alternatives, so it is the one on which the ADS are based.
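The winning Wordscores step is compact enough to sketch. The algorithm (due to Laver, Benoit and Garry) scores each word as a weighted average of the scores of the reference texts it appears in, then scores a new text as the frequency-weighted average of its word scores. A minimal pure-Python version (my sketch, not Thiago’s code):

```python
from collections import Counter

def wordscores(reference_texts, virgin_tokens):
    """Minimal Wordscores. reference_texts: list of (tokens, known_score)
    pairs. Returns the estimated score of virgin_tokens."""
    # Relative frequency of each word within each reference text.
    freqs = []
    for tokens, score in reference_texts:
        counts = Counter(tokens)
        total = sum(counts.values())
        freqs.append(({w: c / total for w, c in counts.items()}, score))

    # A word's score is the mean of the reference scores, weighted by
    # how concentrated the word is in each reference text.
    word_scores = {}
    for w in set().union(*(f for f, _ in freqs)):
        pairs = [(f.get(w, 0.0), score) for f, score in freqs]
        norm = sum(p for p, _ in pairs)
        if norm > 0:
            word_scores[w] = sum(p * s for p, s in pairs) / norm

    # Virgin text score: frequency-weighted mean over its scored words.
    counts = Counter(t for t in virgin_tokens if t in word_scores)
    total = sum(counts.values())
    return sum(word_scores[w] * c for w, c in counts.items()) / total
```

Feed it two toy reference texts scored 10 (democratic) and 0 (autocratic) and a mixed virgin text, and you get a score between the two. The real pipeline does exactly that, at the scale of 42 million articles.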

There is a web application where anyone can change the training set and see how the results change:

Automated Democracy Scores. Part of the PhD work of Thiago Marzagão, it is an online interface that lets you change democracy scores by year and country and re-run the analysis against 200 billion data points on an Amazon cluster.

Quite remarkable although I suspect this level of PhD work and public access to it will grow rapidly in the near future.

Do read the paper and don’t jump straight to the data. 😉 Take a minute to see what results Thiago has reached thus far.

Personally I was expecting the United States and China to be running neck and neck. Mostly because the wealthy choose candidates for public office in the United States and in China the Party chooses them. Not all that different, perhaps a bit more formalized and less chaotic in China. Certainly less in the way of campaign costs. (humor)

I was seriously surprised to find that democracy was lowest in Africa and the Middle East. Evaluated on a national basis that may be correct, but Western definitions aren’t easy to apply to Africa and the Middle East. See Nation, Tribe and Ethnic Group in Africa and Democracy and Consensus in African Traditional Politics for one tip of the iceberg on decision making in Africa.

SocioViz (Danger?)

Tuesday, February 24th, 2015

SocioViz (Danger?)

From the website:

SocioViz is a social media analytics platform powered by Social Network Analysis metrics

Are you a Social Media Marketer, Digital Journalist or Social Researcher? Have a try and jump on board!

After you login, you give SocioViz access to your Twitter account and it generates a visual graph of your connections.

But there is no “about us” link. The tos (terms of service) and privacy link just reloads the login page. Only other links are to share SocioViz on a variety of social media sites. Quick search did not find any other significant information.


Sort of like Luke in the trash compactor, I have a very bad feeling about this. 😉

Anyone know more about this site?

I don’t like opaque social sites seeking access to my accounts. Maybe it is nothing but poor design, but it is so far beyond the pale that I suspect a less generous explanation.

If you are feeling really risky, search for SocioViz; the site will turn up in the first few hits. I am reluctant to even repeat its address online.

TextBlob: Simplified Text Processing

Tuesday, February 24th, 2015

TextBlob: Simplified Text Processing

From the webpage:

TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

TextBlob stands on the giant shoulders of NLTK and pattern, and plays nicely with both.


  • Noun phrase extraction
  • Part-of-speech tagging
  • Sentiment analysis
  • Classification (Naive Bayes, Decision Tree)
  • Language translation and detection powered by Google Translate
  • Tokenization (splitting text into words and sentences)
  • Word and phrase frequencies
  • Parsing
  • n-grams
  • Word inflection (pluralization and singularization) and lemmatization
  • Spelling correction
  • Add new models or languages through extensions
  • WordNet integration
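Two of those features, word frequencies and n-grams, are simple enough to imitate in plain Python, which gives a feel for the API. TextBlob exposes them as `blob.word_counts` and `blob.ngrams(n)`; the stdlib-only sketch below mimics that behavior (my sketch, not TextBlob’s implementation):

```python
import re
from collections import Counter

def word_counts(text):
    # Lowercased token frequencies, roughly what TextBlob's
    # blob.word_counts exposes as a dict-like object.
    return Counter(re.findall(r"[a-z']+", text.lower()))

def ngrams(text, n=3):
    # Successive n-token windows over the text, like blob.ngrams(n).
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]
```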

Has anyone compared this head to head with NLTK?

NkBASE distributed database (Erlang)

Tuesday, February 24th, 2015

NkBASE distributed database (Erlang)

From the webpage:

NkBASE is a distributed, highly available key-value database designed to be integrated into Erlang applications based on riak_core. It is one of the core pieces of the upcoming Nekso’s Software Defined Data Center Platform, NetComposer.

NkBASE uses a no-master, share-nothing architecture, where no node has any special role. It is able to store multiple copies of each object to achieve high availability and to distribute the load evenly among the cluster. Nodes can be added and removed on the fly. It shows low latency, and it is very easy to use.

NkBASE has some special features, like being able to work simultaneously as an eventually consistent database using Dotted Version Vectors, a strongly consistent database, and an eventually consistent, self-convergent database using CRDTs called dmaps. It also has a flexible and easy to use query language that (under some circumstances) can be very efficient, and has powerful support for auto-expiration of objects.

The minimum recommended cluster size for NkBASE is three nodes, but it can work from a single node to hundreds of them. However, NkBASE is not designed for very high load or huge data (you really should use the excellent Riak and Riak Enterprise for that), but as an in-system, flexible and easy to use database, useful in multiple scenarios like configuration, sessions, cluster coordination, catalogue search, temporary data, cache, field completions, etc. In the future, NetComposer will be able to start and manage multiple kinds of services, including databases like a full-blown Riak.

NkBASE has a clean code base, and can be used as a starting point to learn how to build a distributed Erlang system on top of riak_core, and to test new backends or replication mechanisms. NkBASE would have been impossible without the incredible work from Basho, the makers of Riak: riak_core, riak_dt and riak_ensemble.
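The dmaps mentioned above are built on riak_dt’s CRDTs. The convergence trick is easiest to see with the simplest state-based CRDT, a grow-only counter, where merging replicas in any order (even repeatedly) yields the same value. A toy sketch in Python, not NkBASE’s Erlang implementation:

```python
class GCounter:
    """State-based grow-only counter: each node increments its own
    slot; merge takes the per-node maximum, so replicas converge
    regardless of message order or duplication."""

    def __init__(self):
        self.slots = {}

    def incr(self, node, n=1):
        self.slots[node] = self.slots.get(node, 0) + n

    def value(self):
        return sum(self.slots.values())

    def merge(self, other):
        merged = GCounter()
        for node in self.slots.keys() | other.slots.keys():
            merged.slots[node] = max(self.slots.get(node, 0),
                                     other.slots.get(node, 0))
        return merged
```

Merging is commutative and idempotent, which is exactly why a dmap can be “self-convergent” without coordination.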

Several things caught my attention about NkBASE.

That it is written in Erlang was the first thing.

That it is based on riak_core was the second thing.

But the thing that sealed its appearance here was:

NkBASE is not designed for very high load or huge data (you really should use the excellent Riak and Riak Enterprise for that)


A software description that doesn’t read like Topper in Dilbert?


See the GitHub page for all the details but this looks promising, for the right range of applications.

Imperiling Investigations With Secrecy

Tuesday, February 24th, 2015

Spy Cables expose ‘desperate’ US approach to Hamas by Will Jordan and Rahul Radhakrishnan.

From the post:

A CIA agent “desperate” to make contact with Hamas in Gaza pleaded for help from a South African spy in the summer of 2012, according to intelligence files leaked to Al Jazeera’s Investigative Unit. The US lists Hamas as a terrorist organisation and, officially at least, has no contact with the group.

That was just one of the revelations of extensive back-channel politicking involving the US, Israel and the Palestinian Authority as they navigate the Israeli-Palestinian conflict amid a stalled peace process.

Classified South African documents obtained by Al Jazeera also reveal an approach by Israel’s then-secret service chief, Meir Dagan, seeking Pretoria’s help in its efforts to scupper a landmark UN-authorised probe into alleged war crimes in Gaza, which was headed by South African judge Richard Goldstone.

Dagan explained that his effort to squelch the Goldstone Report had strong support from Palestinian Authority (PA) president Mahmoud Abbas.

The Mossad director told the South Africans that Abbas privately backed the Israeli position, saying he wanted the report rejected because he feared it would “play into the hands” of Hamas, his key domestic political rival.

The Spy Cables also reveal that US President Barack Obama made a direct threat to Abbas in hope of dissuading him from pursuing United Nations recognition for a Palestinian state.

In case you don’t know the “back story,” the Goldstone report was in its own words:

On 3 April 2009, the President of the Human Rights Council established the United Nations Fact Finding Mission on the Gaza Conflict with the mandate “to investigate all violations of international human rights law and international humanitarian law that might have been committed at any time in the context of the military operations that were conducted in Gaza during the period from 27 December 2008 and 18 January 2009, whether before, during or after.”

When produced, the report found there was evidence that both Hamas and Israel had committed war crimes. The chief judge, Richard Goldstone, has subsequently stated the report would have been substantially different had information in the possession of Israel been shared with the investigation. Specifically, Judge Goldstone is satisfied that Israel did not target civilians as a matter of policy.

Israel has only itself to blame for the initial report reaching erroneous conclusions due to its failure to cooperate at all with the investigation. Secrecy and non-cooperation being their own reward in that case.

Even worse, however, is the revelation that the United States and others had no interest in whether Hamas or Israel had in fact committed war crimes but in how the politics of the report would impact their allies.

I can only imagine what the election results in the United States had Obama’s acceptance speech read in part:

I will build new partnerships to defeat the threats of the 21st century: terrorism and nuclear proliferation; poverty and genocide; climate change and disease.” I will stop any reports critical of Israel and scuttle any investigation into Israel’s conduct in Gaza or against the Palestinians. Helping our allies, Israel and at times the Palestinian Liberation Authority, will require that we turn a blind eye to potential war crimes and those who have committed them. I will engage in secret negotiations to protect anyone, no matter how foul, if it furthers the interest of the United States or one of its allies. In all those ways, “I will restore our moral standing, so that America is once again that last, best hope for all who are called to the cause of freedom, who long for lives of peace, and who yearn for a better future.” (The non-bolded text was added.)

Interesting to see how additional information shapes your reading of the speech isn’t it?

Transparent government isn’t a technical issue but a political one. Although I hasten to add that topic maps can assist with transparency, for governments so minded.

PS: One hopes that in any future investigations Israel cooperates and the facts can be developed more quickly and openly.

Wiki New Zealand

Tuesday, February 24th, 2015

Wiki New Zealand

From the about page:

It’s time to democratise data. Data is a language in which few are literate, and the resulting constraints at an individual and societal level are similar to those experienced when the proportion of the population able to read was small. When people require intermediaries before digesting information, the capacity for generating insights is reduced.

To democratise data we need to put users at the centre of our models, we need to design our systems and processes for users of data, and we need to realise that everyone can be a user. We will all win when everyone can make evidence-based decisions.

Wiki New Zealand is a charity devoted to getting people to use data about New Zealand.

We do this by pulling together New Zealand’s public sector, private sector and academic data in one place and making it easy for people to use in simple graphical form for free through this website.

We believe that informed decisions are better decisions. There is a lot of data about New Zealand available online today, but it is too difficult to access and too hard to use. We think that providing usable, clear, digestible and unbiased information will help you make better decisions, and will lead to better outcomes for you, for your community and for New Zealand.

We also believe that by working together we can build the most comprehensive, useful and accurate representation of New Zealand’s situation and performance: the “wiki” part of the name means “collaborative website”. Our site is open and free to use for everyone. Soon, anyone will be able to upload data and make graphs and submit them through our auditing process. We are really passionate about engaging with domain and data experts on their speciality areas.

We will not tell you what to think. We present topics from multiple angles, in wider contexts and over time. All our data is presented in charts that are designed to be compared easily with each other and constructed with as little bias as possible. Our job is to present data on a wide range of subjects relevant to you. Your job is to draw your own conclusions, develop your own opinions and make your decisions.

Whether you want to make a business decision based on how big your market is, fact-check a newspaper story, put together a school project, resolve an argument, build an app based on clean public licensed data, or just get to know this country better, we have made this for you.

Isn’t New Zealand a post-apocalypse destination? However great it may be now, the neighborhood is going down when all the post-apocalypse folks arrive. Something on the order of Mister Rogers’ Neighborhood to Mad Max Beyond Thunderdome. 😉

Hopefully, if there is an apocalypse, it will happen quickly enough to prevent a large influx of undesirables into New Zealand.

I first saw this in a tweet by Neil Saunders.

Big data: too much information

Tuesday, February 24th, 2015

Big data: too much information by Joanna Goodman.

Joanna’s post was the source I used for part of the post Enhanced Access to UK Legislation. I wanted to call attention to her post because it covered more than just the site and offered several insights into the role of big data in law.

Consider Joanna’s list of ways big data can help with litigation:

Big data analysis – nine ways it can help

1 Big data analytics use algorithms to interrogate large volumes of unstructured, anonymised data to identify correlations, patterns and trends.

2 Has the potential to uncover patterns – and opportunities – that are not immediately obvious.

3 Graphics are key – visual representation is the only clear and comprehensive way to present the outcomes of big data analysis.

4 E-discovery is an obvious practical application of big data to legal practice, reducing the time and cost of trawling through massive volumes of structured and unstructured data held in different places.

5 Can identify patterns and trends, using client and case data, in dispute resolution to predict the probability of case outcomes. This facilitates decision-making – for example whether a claimant should pursue a case or to settle.

6 In the UK, the Big Data for Law project is digitising the entire statute book so that all UK legislation can be analysed, together with publicly available data from legal publishers. This will create the most comprehensive record of all UK legislation ever created together with analytical tools.

7 A law firm can use big data analytics to offer its insurance clients a service that identifies potentially fraudulent claims.

8 Big data will be usable as a design tool, to identify design patterns within statutes – combinations of rules that are used repeatedly to meet policy goals.

9 Can include transactional data and data from external sources, which can be cut in different ways.

Just as a teaser because the rest of her post is as interesting as what I quoted above, how would you use big data to shape debt collection practices?

See Joanna’s post to find out!

Enhanced Access to UK Legislation

Tuesday, February 24th, 2015

The site appears to be a standard legal access site, albeit from 1267 CE to present.

Its Changes made by legislation enacted from 2002 – present is useful but suffers from presenting changes to texts with tables.

Next year will be the 30th anniversary of the publication of ISO 8879, the SGML standard, and texts are still aligned with tables to show changes. There have been better ways to present changes: HyTime, XML with XLink/XPointer, and more recent additions to the XML family such as XPath and XQuery.

Inline presentation of changes with a navigation aid to choose the source of changes would be far more intuitive. Perhaps in a future version of this resource.
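For a sense of what inline presentation could look like, word-level changes between two versions of a provision can be rendered as <del>/<ins> spans with a stock diff algorithm. A sketch using Python’s difflib (the markup and function are hypothetical illustrations, not the legislation.gov.uk format):

```python
import difflib

def inline_changes(old, new):
    """Render word-level changes between two versions of a text as
    inline <del>/<ins> markup."""
    a, b = old.split(), new.split()
    sm = difflib.SequenceMatcher(a=a, b=b)
    out = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op in ("replace", "delete") and i1 < i2:
            out.append("<del>" + " ".join(a[i1:i2]) + "</del>")
        if op in ("replace", "insert") and j1 < j2:
            out.append("<ins>" + " ".join(b[j1:j2]) + "</ins>")
        if op == "equal":
            out.append(" ".join(a[i1:i2]))
    return " ".join(out)
```

A reader sees the amendment in place, in context, instead of cross-referencing a table of changes.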

I say in a future version of this site because of the following description of the hopes for this site:

In 2014 big data moved centre stage, with the Arts and Humanities Research Council-funded Data for Law project, which was launched to facilitate socio-economic research and identify patterns in the way we legislate. This involves digitising the entire statute book and presenting the data so that all UK legislation can be analysed, together with publicly available data from legal publishers such as LexisNexis and Westlaw.

‘It’s better to count than to guess,’ observes David Howarth, a legal academic and former MP who co-leads the project. ‘Big Data for Law will provide the most comprehensive record of all UK legislation ever created, together with analytical tools. Although these will be most useful for the public sector, government, researchers and policy-makers, it is also useful to law firms, particularly those with public policy practices who will be able to gain insights into the changing regulatory environment.’

Firms and lawyers will be able to use the Big Data for Law resources to identify patterns and trends in particular areas of legislation. An important consideration is to publish the data in a form that will enable researchers to work with it effectively – and present their findings clearly.

Another part of the project uses big data as a design tool, to identify patterns within statutes – combinations of rules that are used repeatedly to meet policy goals. This is effectively extracting the structure of legislation and thinking about how it can be reused. Howarth suggests that law firms could also apply this approach to legal documents.

John Sheridan, head of legislation services for the National Archives and senior investigator for the Big Data for Law project, highlights the importance of creating tools and methods that are accessible to those without deep statistical knowledge as well as developing pre-packaged analysis that researchers and others can use and cite in documents: ‘For example, we can plot fluctuations in the number of laws made year on year, and in the length of the text.

‘We can discover how modular legislation is in particular areas and frequency of legislative change. We can also examine the language of law which uncovers the topics of the day and reflects major global events and political and economic trends.’

Mapping the statute book identifies commonly recurring themes and language patterns. Sheridan and his team have identified patterns for licensing, prohibition, regulators, tax and so on, with the purpose of enabling interested parties to quickly gain an understanding of a particular legislative development and identify commonly occurring solutions to particular issues.

The ability to search the entirety of UK legislation using time-based parameters and particular words and phrases makes it straightforward to find relevant laws pertaining to a specific issue, which is particularly useful to parties involved in litigation. Sheridan underlines the importance of visual presentation to plot patterns and trends, and the ability to drill down into the detail.

Isn’t that an extraordinary description of the potential for a public access to legislation site?

Patterns in legislation? Policy goals linked to external events? Language analysis?

I didn’t see any of those features, yet, but assume that they will be forthcoming.

Apache Solr 5.0 Highlights

Monday, February 23rd, 2015

Apache Solr 5.0 Highlights by Anshum Gupta.

From the post:

Usability improvements

A lot of effort has gone into making Solr more usable, mostly along the lines of introducing APIs and hiding implementation details for users who don’t need to know them. Solr 4.10 was released with scripts to start, stop and restart a Solr instance; 5.0 takes it further in terms of what can be done with those. The scripts now, for instance, copy a configset on collection creation so that the original isn’t changed. There’s also a script to index documents as well as the ability to delete collections in Solr. As an example, this is all you need to do to start SolrCloud, index, browse through what’s been indexed, and clean up the collection.

bin/solr start -e cloud -noprompt
bin/post -c gettingstarted
open http://localhost:8983/solr/gettingstarted/browse
bin/solr delete -c gettingstarted

Another important thing to note for new users is that Solr no longer has the default collection1 and instead comes with multiple example config-sets and data.

That is just a tiny part of Gupta’s post on just highlights of Apache Solr 5.0.

Download Apache Solr 5.0 to experience the improvements for yourself!

Download the improved Solr 5.0 Reference Manual as well!

Announcing Pulsar: Real-time Analytics at Scale

Monday, February 23rd, 2015

Announcing Pulsar: Real-time Analytics at Scale by Sharad Murthy and Tony Ng.

From the post:

We are happy to announce Pulsar – an open-source, real-time analytics platform and stream processing framework. Pulsar can be used to collect and process user and business events in real time, providing key insights and enabling systems to react to user activities within seconds. In addition to real-time sessionization and multi-dimensional metrics aggregation over time windows, Pulsar uses a SQL-like event processing language to offer custom stream creation through data enrichment, mutation, and filtering. Pulsar scales to a million events per second with high availability. It can be easily integrated with metrics stores like Cassandra and Druid.


Why Pulsar

eBay provides a platform that enables millions of buyers and sellers to conduct commerce transactions. To help optimize eBay end users’ experience, we perform analysis of user interactions and behaviors. Over the past years, batch-oriented data platforms like Hadoop have been used successfully for user behavior analytics. More recently, we have newer use cases that demand collection and processing of vast numbers of events in near real time (within seconds), in order to derive actionable insights and generate signals for immediate action. Here are examples of such use cases:

  • Real-time reporting and dashboards
  • Business activity monitoring
  • Personalization
  • Marketing and advertising
  • Fraud and bot detection

We identified a set of systemic qualities that are important to support these large-scale, real-time analytics use cases:

  • Scalability – Scaling to millions of events per second
  • Latency – Sub-second event processing and delivery
  • Availability – No cluster downtime during software upgrade, stream processing rule updates, and topology changes
  • Flexibility – Ease in defining and changing processing logic, event routing, and pipeline topology
  • Productivity – Support for complex event processing (CEP) and a 4GL language for data filtering, mutation, aggregation, and stateful processing
  • Data accuracy – 99.9% data delivery
  • Cloud deployability – Node distribution across data centers using standard cloud infrastructure

Given our unique set of requirements, we decided to develop our own distributed CEP framework. Pulsar CEP provides a Java-based framework as well as tooling to build, deploy, and manage CEP applications in a cloud environment. Pulsar CEP includes the following capabilities:

  • Declarative definition of processing logic in SQL
  • Hot deployment of SQL without restarting applications
  • Annotation plugin framework to extend SQL functionality
  • Pipeline flow routing using SQL
  • Dynamic creation of stream affinity using SQL
  • Declarative pipeline stitching using Spring IOC, thereby enabling dynamic topology changes at runtime
  • Clustering with elastic scaling
  • Cloud deployment
  • Publish-subscribe messaging with both push and pull models
  • Additional CEP capabilities through Esper integration

On top of this CEP framework, we implemented a real-time analytics data pipeline.
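To make “multi-dimensional metrics aggregation over time windows” concrete, here is a toy single-process sliding-window counter in Python. It is only an illustration of the idea, not Pulsar’s distributed implementation:

```python
from collections import deque, defaultdict

class WindowedCounter:
    """Counts events per key over a sliding time window, the kind of
    aggregation a CEP pipeline performs continuously."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()          # (timestamp, key), oldest first
        self.counts = defaultdict(int)

    def add(self, timestamp, key):
        self.events.append((timestamp, key))
        self.counts[key] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] <= now - self.window:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1

    def count(self, key):
        return self.counts[key]
```

Pulsar’s job is to do this sort of thing declaratively in SQL, across a cluster, at a million events per second.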

That should be enough to capture your interest!

I saw it coming off of a two and one-half hour conference call. Nice way to decompress.

Other places to look:

If you don’t know Docker already, you will. Courtesy of the Pulsar Get Started page.

Nice to have yet another high performance data tool.

Google open sources a MapReduce framework for C/C++

Monday, February 23rd, 2015

Google open sources a MapReduce framework for C/C++ by Derrick Harris.

From the post:

Google announced on Wednesday that the company is open sourcing a MapReduce framework that will let users run native C and C++ code in their Hadoop environments. Depending on how much traction MapReduce for C, or MR4C, gets and by whom, it could turn out to be a pretty big deal.

Hadoop is famously, or infamously, written in Java and as such can suffer from performance issues compared with native C++ code. That’s why Google’s original MapReduce system was written in C++, as is the Quantcast File System, that company’s homegrown alternative for the Hadoop Distributed File System. And, as the blog post announcing MR4C notes, “many software companies that deal with large datasets have built proprietary systems to execute native code in MapReduce frameworks.”
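If the MapReduce model itself is fuzzy, the whole contract fits in a few lines: map emits key/value pairs, the framework shuffles them into per-key groups, reduce aggregates each group. A word-count sketch in Python (the model MR4C implements for native code, not MR4C’s C/C++ API):

```python
from collections import defaultdict
from itertools import chain

def map_phase(records, mapper):
    # Run the mapper over every input record, yielding (key, value) pairs.
    return chain.from_iterable(mapper(r) for r in records)

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups, reducer):
    # Aggregate each key's values with the user-supplied reducer.
    return {k: reducer(k, vs) for k, vs in groups.items()}

# Word count, the canonical example.
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(word, ones):
    return sum(ones)
```

The point of MR4C is that the `mapper` and `reducer` slots can be filled by native C/C++ code instead of Java.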

Great news, but be aware that “performance” is a tricky issue. If “performance” had a single meaning, the TIOBE Index for February 2015 (a rough gauge of programming language popularity) would look quite different over the years.

I remember a conference story where a programmer had written an application using Python, reasoning that resource limitations would compel the client to return for a fuller, enterprise solution. To their chagrin, the customer never exhausted the potential of the first solution. 😉

The Spy Cables: A glimpse into the world of espionage

Monday, February 23rd, 2015

The Spy Cables: A glimpse into the world of espionage by Al Jazeera Investigative Unit.

From the post:

A digital leak to Al Jazeera of hundreds of secret intelligence documents from the world’s spy agencies has offered an unprecedented insight into operational dealings of the shadowy and highly politicised realm of global espionage.

Over the coming days, Al Jazeera’s Investigative Unit is publishing The Spy Cables, in collaboration with The Guardian newspaper.

Spanning a period from 2006 until December 2014, they include detailed briefings and internal analyses written by operatives of South Africa’s State Security Agency (SSA). They also reveal the South Africans’ secret correspondence with the US intelligence agency, the CIA, Britain’s MI6, Israel’s Mossad, Russia’s FSB and Iran’s operatives, as well as dozens of other services from Asia to the Middle East and Africa.

You need to start hitting the Al Jazeera site on a regular basis.

Kudos to Al Jazeera for the ongoing release of these documents!

On the other hand, however, I am deeply disappointed by the editing of the documents to be released:

It has not been easy to decide which Spy Cables to publish, and hundreds will not be revealed.

After verifying the cables, we had to consider whether the publication of each document served the public interest, in consultation with industry experts, lawyers, and our partners at The Guardian. Regardless of any advice received, the decision to publish has been Al Jazeera’s alone.

We believe it is important to achieve greater transparency in the field of intelligence. The events of the last decade have shown that there has been inadequate scrutiny on the activities of agencies around the world. That has allowed some to act outside their own laws and, in some cases international law.

Publishing these documents, including operational and tradecraft details, is a necessary contribution to a greater public scrutiny of their activities.

The Spy Cables also reveal that in many cases, intelligence agencies are over-classifying information and hiding behind an unnecessary veil of secrecy. This harms the ability of a democratic society to either consent to the activities of their intelligence agencies or provide adequate checks and balances to their powers.

The Spy Cables are filled with the names, personal details, and pseudonyms of active foreign intelligence operatives who work undercover for the dozens of global spy agencies referenced in the files.

We confronted the possibility that publishing identities revealed in the cables could result in harm to potentially innocent people. We agreed that publishing the names of undercover agents would pose a substantial risk to potentially unwitting individuals from around the world who had associated with these agents.

We believe we can most responsibly accomplish our goal of achieving greater transparency without revealing the identities of undercover operatives.

For these reasons, we have redacted their names. We have also redacted sections that could pose a threat to the public, such as specific chemical formulae to build explosive devices.

Finally, some of the Spy Cables have been saved for future broadcast – ones that needed further contextualisation. Regardless of when we publish, the same considerations will inform our decisions over what to redact.

The line: “…we had to consider whether the publication of each document served the public interest…” captures the source of my disappointment.

The governments who sent the cables in question could and do argue in good faith that they “consider …. the public interest” in deciding which documents should be public and which should be private.

As the cables and prior leaks make clear, the judgement of governments about “the public interest” is deeply suspect and, in the aftermath of major leaks, has been shown to be completely false. The world of diplomacy has not reached a fiery end nor have nations entered wars against every other nation. Everyone blushes for a bit and then moves on.

Although I like Al Jazeera and The Guardian better than most governments, why should I trust their judgement about what secrets the public is entitled to know more than the government’s? At least some governments are in theory answerable to their populations, whereas news organizations are entirely self-anointed.

Having said that, I am sure that Al Jazeera and The Guardian will do the best they can, but why not trust the public with the information that, after all, affects them? I don’t think of the public as ill-mannered children who need to be protected from ugly truths. As for “innocent lives,” I find it contradictory to speak of intelligence operatives and “innocent lives” in the same conversation.

Having chosen to betray others in the service of goals of individuals in various governments, innocence isn’t a claim intelligence operatives can make.

Release the cables as obtained by Al Jazeera. Give the public an opportunity to make its own judgements based on all the evidence.

The Value of Leaks

Monday, February 23rd, 2015

The value of leaks of “secret” information cannot be overestimated.

The leaks by Edward Snowden haven’t changed the current practices of the U.S. government but they have sparked a lively debate over issues only a few suspected existed.

One specific advantage of the Snowden leaks is that IT companies hopefully now realize that the government will betray them at a moment’s notice, should it be advantageous to do so. IT companies are far better off being loyal to their customer bases, as are their customers.

Another advantage to the Snowden leaks is an increased impetus for open source software. Not necessarily free software but open source so that a buyer can inspect the software for backdoors and other malware.

The most recent batch of leaks, the “Spy Cables,” appear to be of similar importance. Consider this current headline:

Mossad contradicted Netanyahu on Iran nuclear programme by Will Jordan, Rahul Radhakrishnan.

From the report:

Less than a month after Prime Minister Benjamin Netanyahu’s 2012 warning to the UN General Assembly that Iran was 70 per cent of the way to completing its “plans to build a nuclear weapon”, Israel’s intelligence service believed that Iran was “not performing the activity necessary to produce weapons”.

A secret cable obtained by Al Jazeera’s Investigative Unit reveals that Mossad sent a top-secret cable to South Africa on October 22, 2012 that laid out a “bottom line” assessment of Iran’s nuclear work.

It appears to contradict the picture painted by Netanyahu of Tehran racing towards acquisition of a nuclear bomb.

Writing that Iran had not begun the work needed to build any kind of nuclear weapon, the Mossad cable said the Islamic Republic’s scientists are “working to close gaps in areas that appear legitimate such as enrichment reactors”.

Such activities, however, “will reduce the time required to produce weapons from the time the instruction is actually given”.

The leaked information should (no guarantees) make it harder for Netanyahu to sell the U.S. Congress on something very foolish with regard to Iran and its nuclear energy program.

Just imagine how all the “scary” news would read if the public had full and free access to all the secret information circulated by governments and distorted for public consumption.

If you want a saner, better informed and safer world, leaking secret corporate and/or government documents is a step in that direction.

PS: Have you seen Snowden’s A Manifesto for the Truth?

A Gentle Introduction to Algorithm Complexity Analysis

Monday, February 23rd, 2015

A Gentle Introduction to Algorithm Complexity Analysis by Dionysis “dionyziz” Zindros.

From the post:

A lot of programmers that make some of the coolest and most useful software today, such as many of the stuff we see on the Internet or use daily, don’t have a theoretical computer science background. They’re still pretty awesome and creative programmers and we thank them for what they build.

However, theoretical computer science has its uses and applications and can turn out to be quite practical. In this article, targeted at programmers who know their art but who don’t have any theoretical computer science background, I will present one of the most pragmatic tools of computer science: Big O notation and algorithm complexity analysis. As someone who has worked both in a computer science academic setting and in building production-level software in the industry, this is the tool I have found to be one of the truly useful ones in practice, so I hope after reading this article you can apply it in your own code to make it better. After reading this post, you should be able to understand all the common terms computer scientists use such as “big O”, “asymptotic behavior” and “worst-case analysis”.

Do you nod when encountering “big O,” “asymptotic behavior” and “worst-case analysis” in CS articles?

Or do you understand what is meant when you encounter “big O,” “asymptotic behavior” and “worst-case analysis” in CS articles?

You are the only one who can say for sure. If it has been a while or you aren’t sure, this should act as a great refresher.
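If you want a two-minute preview before reading, here is a small Python sketch (mine, not from the article) of what worst-case analysis measures: count the comparisons each algorithm makes on the same input.

```python
def linear_search(items, target):
    """O(n) worst case: may have to examine every element."""
    steps = 0
    for x in items:
        steps += 1
        if x == target:
            break
    return steps

def binary_search(items, target):
    """O(log n) worst case on sorted input: halves the search range each step."""
    steps, lo, hi = 0, 0, len(items) - 1
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if items[mid] == target:
            break
        elif items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps

data = list(range(1_000_000))
print(linear_search(data, 999_999))  # 1000000 comparisons
print(binary_search(data, 999_999))  # 20 comparisons
```

The asymptotic behavior (the “big O”) is what survives when constant factors and hardware are stripped away: a million comparisons versus twenty.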

As an incentive, you can intimidate co-workers with descriptions of your code. 😉

I first saw this in a tweet by Markus Sagebiel.

Definition for “Violent Extremism?”

Monday, February 23rd, 2015

Two headlines skittered across my monitor today that illustrate the problem of defining “violent extremism.”

First, US-led air strikes on Syria ISIL targets ‘kill 1,600’

From the post:

US-led air strikes against the Islamic State of Iraq and the Levant (ISIL) group in Syria have killed more than 1,600 people since they began five months ago, a monitor said.

The Syrian Observatory for Human Rights said on Monday that almost all of those killed were fighters from ISIL and al-Qaeda’s Syrian affiliate al-Nusra Front, though it also documented the deaths of 62 civilians.

As I understand it, ISIL (IS) is trying to overthrow the government of Syria, led by the son of Hafez al-Assad, Bashar al-Assad, who together have ruled Syria as dictators since 1970. That’s forty-five (45) years for those of you who are counting.

The abuses of the combined regimes are legendary. For example, Bashar al-Assad, if ever captured, will be charged with war crimes for events arising out of the Syrian Civil War. Syria also joined the United States in the unjust war against Iraq.

If I am reading the article correctly, the United States and its allies have killed at least 1,600 people who are trying to overthrow a dictator who is a known war criminal.


The other headline that caught my eye was: The White House Summit on Countering Violent Extremism, the press release reads in part:

This [week of 16 February 2015] week, the White House is convening a three-day summit on Countering Violent Extremism (CVE) to bring together local, federal, and international leaders – including President Obama and foreign ministers – to discuss concrete steps the United States and its partners can take to develop community-oriented approaches to counter hateful extremist ideologies that radicalize, recruit or incite to violence. Violent extremist threats can come from a range of groups and individuals, including domestic terrorists and homegrown violent extremists in the United States, as well as terrorist groups like al-Qaeda and ISIL.

A summit to “counter hateful extremist ideologies that radicalize, recruit or incite to violence.”

It isn’t clear to me how the White House jumps from a group that is recruiting fighters to overthrow a dictator/known war criminal to “…hateful extremist ideologies that radicalize, recruit or incite to violence.”

By way of only three precedents (there are many others), the United States used violence to overthrow the governments of Afghanistan, Iraq and Libya. What makes the violence of ISIL, which is at least opposing a known war criminal and not a victim picked out of hat (Iraq), different from that of the United States?

Moreover, the violence by ISIL is at least in their own country, not half a world away, meddling in affairs that are really none of their business.

The fact that ISIL owns its violence, such as beheading people, may be physically repugnant, but it is morally superior to the arcade-style killing directed by faceless planners.

I think “violent extremism,” like “terrorist,” means: Someone the speaker dislikes, for any number of reasons.

Other definitions?

Category theory for beginners

Monday, February 23rd, 2015

Category theory for beginners by Ken Scrambler

From the post:

Explains the basic concepts of Category Theory, useful terminology to help understand the literature, and why it’s so relevant to software engineering.

Some two hundred and nine (209) slides, ending with pointers to other resources.

I would have dearly loved to see the presentation live!

This slide deck comes as close as any I have seen to teaching category theory as you would a natural language. Not too close but closer than others.

Think about it. When you entered school did the teacher begin with the terminology of grammar and how rules of grammar fit together?

Or, did the teacher start you off with “See Jack run.” or its equivalent in your language?

You were well on your way to being a competent language user before you were tasked with learning the rules for that language.

Interesting that the exact opposite approach is taken with category theory and so many topics related to computer science.

Pointers to anyone using a natural language teaching approach for category theory or CS material?
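In that spirit, a “See Jack run” first sentence for category theory might be plain function composition in Python, where types play the objects and functions the morphisms (my illustration, not from the slides):

```python
def compose(g, f):
    """Morphism composition: (g after f)(x) = g(f(x))."""
    return lambda x: g(f(x))

identity = lambda x: x  # the identity morphism on any object

double = lambda n: n * 2       # a morphism int -> int
stringify = lambda n: str(n)   # a morphism int -> str
exclaim = lambda s: s + "!"    # a morphism str -> str

# Associativity: (h . g) . f  ==  h . (g . f)
left = compose(compose(exclaim, stringify), double)
right = compose(exclaim, compose(stringify, double))
assert left(21) == right(21) == "42!"

# Identity laws: id . f == f and f . id == f
assert compose(identity, double)(5) == compose(double, identity)(5) == double(5) == 10
```

Only after playing with examples like this would the formal vocabulary (objects, morphisms, composition, identity) be introduced, the way grammar follows “See Jack run.”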

Integer sequence discovery from small graphs

Monday, February 23rd, 2015

Integer sequence discovery from small graphs by Travis Hoppe and Anna Petrone.


We have exhaustively enumerated all simple, connected graphs of a finite order and have computed a selection of invariants over this set. Integer sequences were constructed from these invariants and checked against the Online Encyclopedia of Integer Sequences (OEIS). 141 new sequences were added and 6 sequences were appended or corrected. From the graph database, we were able to programmatically suggest relationships among the invariants. It will be shown that we can readily visualize any sequence of graphs with a given criteria. The code has been released as an open-source framework for further analysis and the database was constructed to be extensible to invariants not considered in this work.

See also:

Encyclopedia of Finite Graphs “Set of tools and data to compute all known invariants for simple connected graphs”

Simple-connected-graph-invariant-database “The database file for the Encyclopedia of Finite Graphs (simple connected graphs up to order 10)”

From the paper:

A graph invariant is any property that is preserved under isomorphism. Invariants can be simple binary properties (planarity), integers (automorphism group size), polynomials (chromatic polynomials), rationals (fractional chromatic numbers), complex numbers (adjacency spectra), sets (dominion sets) or even graphs themselves (subgraph and minor matching).

Hmmm, perhaps an illustration, also from the paper, might explain better:


Figure 1: An example query to the Encyclopedia using specific invariant conditions. The command python 10 -i is_bipartite 1 -i is_integral 1 -i is_eulerian 1 displays the three graphs that are simultaneously bipartite, integral and Eulerian with ten vertices.

For analysis of “cells” of your favorite evil doers you can draw higgledy-piggledy graphs with nodes and arcs on a whiteboard, or you can take advantage of formal analysis of small graphs, research on the organization of small groups, and the history of that particular group. With the higgledy-piggledy approach you risk missing connections that aren’t possible to represent from your starting point.
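To make “invariant” concrete, here is a small pure-Python sketch (mine, not the paper’s tooling) computing two invariants for a pair of four-vertex graphs: the degree sequence and the Eulerian property.

```python
def _adjacency(n, edges):
    """Build an adjacency map for an undirected graph on vertices 0..n-1."""
    adj = {i: set() for i in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def is_connected(n, edges):
    """Depth-first search from vertex 0; connected iff every vertex is reached."""
    adj = _adjacency(n, edges)
    seen, stack = {0}, [0]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n

def is_eulerian(n, edges):
    """Euler's theorem: Eulerian iff connected and every vertex has even degree."""
    adj = _adjacency(n, edges)
    return is_connected(n, edges) and all(len(adj[v]) % 2 == 0 for v in adj)

def degree_sequence(n, edges):
    """Sorted degree sequence: preserved under isomorphism, so an invariant."""
    adj = _adjacency(n, edges)
    return sorted(len(adj[v]) for v in adj)

cycle4 = [(0, 1), (1, 2), (2, 3), (3, 0)]  # the 4-cycle C4
path4 = [(0, 1), (1, 2), (2, 3)]           # the path P4

assert is_eulerian(4, cycle4)        # all degrees even, connected
assert not is_eulerian(4, path4)     # the endpoints have odd degree
assert degree_sequence(4, cycle4) == [2, 2, 2, 2]
assert degree_sequence(4, path4) == [1, 1, 2, 2]
```

C4 happens to be bipartite, Eulerian and integral (its adjacency spectrum is {2, 0, 0, -2}), so it is exactly the kind of graph a query like the one in Figure 1 would retrieve at a smaller order.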

How gzip uses Huffman coding

Monday, February 23rd, 2015

How gzip uses Huffman coding by Julia Evans.

From the post:

I wrote a blog post quite a while ago called gzip + poetry = awesome where I talked about how the gzip compression program uses the LZ77 algorithm to identify repetitions in a piece of text.

In case you don’t know what LZ77 is (I sure didn’t), here’s the video from that post that gives you an example of gzip identifying repetitions in a poem!

Julia goes beyond the video to illustrate how Huffman encoding is used by gzip to compress a text.

She includes code, pointers to other resources, basically all you need to join her in exploring the topic at hand. An education style that many manuals and posts would do well to adopt.
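As a companion to her walkthrough, here is a compact pure-Python sketch of the core Huffman idea (my own illustration; gzip itself builds canonical, length-limited codes, which this does not): repeatedly merge the two least-frequent subtrees until one tree remains, so frequent symbols receive short codes.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Map each symbol to a prefix-free bit string; frequent symbols get short codes."""
    freq = Counter(text)
    if len(freq) == 1:  # degenerate case: only one distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (subtree frequency, tiebreaker, {symbol: code-so-far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # the two least-frequent subtrees...
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}   # ...merged under a new root
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes("abracadabra")
# 'a' occurs 5 times out of 11, so it gets the shortest code:
assert len(codes["a"]) < min(len(codes[s]) for s in "brcd")
# No code is a prefix of another (required for unambiguous decoding):
assert all(not y.startswith(x) for x in codes.values() for y in codes.values() if x != y)
```

The prefix-free property is what lets gzip concatenate the codes into a bit stream and still decode it unambiguously.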

Apache Lucene 5.0.0

Sunday, February 22nd, 2015

Apache Lucene 5.0.0

For the impatient:

Lucene CHANGES.txt

From the post:

Highlights of the Lucene release include:

Stronger index safety

  • All file access now uses Java’s NIO.2 APIs which give Lucene stronger index safety in terms of better error handling and safer commits.
  • Every Lucene segment now stores a unique id per-segment and per-commit to aid in accurate replication of index files.
  • During merging, IndexWriter now always checks the incoming segments for corruption before merging. This can mean, on upgrading to 5.0.0, that merging may uncover long-standing latent corruption in an older 4.x index.

Reduced heap usage

  • Lucene now supports random-writable and advance-able sparse bitsets (RoaringDocIdSet and SparseFixedBitSet), so the heap required is in proportion to how many bits are set, not how many total documents exist in the index.
  • Heap usage during IndexWriter merging is also much lower with the new Lucene50Codec, since doc values and norms for the segments being merged are no longer fully loaded into heap for all fields; now they are loaded for the one field currently being merged, and then dropped.
  • The default norms format now uses sparse encoding when appropriate, so indices that enable norms for many sparse fields will see a large reduction in required heap at search time.
  • 5.0 has a new API to print a tree structure showing a recursive breakdown of which parts are using how much heap.

Other features

  • FieldCache is gone (moved to a dedicated UninvertingReader in the misc module). This means when you intend to sort on a field, you should index that field using doc values, which is much faster and less heap consuming than FieldCache.
  • Tokenizers and Analyzers no longer require Reader on init.
  • NormsFormat now gets its own dedicated NormsConsumer/Producer
  • SortedSetSortField, used to sort on a multi-valued field, is promoted from sandbox to Lucene’s core.
  • PostingsFormat now uses a “pull” API when writing postings, just like doc values. This is powerful because you can do things in your postings format that require making more than one pass through the postings such as iterating over all postings for each term to decide which compression format it should use.
  • New DateRangeField type enables Indexing and searching of date ranges, particularly multi-valued ones.
  • A new ExitableDirectoryReader extends FilterDirectoryReader and enables exiting requests that take too long to enumerate over terms.
  • Suggesters from multi-valued field can now be built as DocumentDictionary now enumerates each value separately in a multi-valued field.
  • ConcurrentMergeScheduler detects whether the index is on SSD or not and does a better job defaulting its settings. This only works on Linux for now; other OS’s will continue to use the previous defaults (tuned for spinning disks).
  • Auto-IO-throttling has been added to ConcurrentMergeScheduler, to rate limit IO writes for each merge depending on incoming merge rate.
  • CustomAnalyzer has been added that allows to configure analyzers like you do in Solr’s index schema. This class has a builder API to configure Tokenizers, TokenFilters, and CharFilters based on their SPI names and parameters as documented by the corresponding factories.
  • Memory index now supports payloads.
  • Added a filter cache with a usage tracking policy that caches filters based on frequency of use.
  • The default codec has an option to control BEST_SPEED or BEST_COMPRESSION for stored fields.
  • Stored fields are merged more efficiently, especially when upgrading from previous versions or using SortingMergePolicy

More goodness to start your week!

Apache Solr 5.0.0 and Reference Guide for 5.0 available

Sunday, February 22nd, 2015

Apache Solr 5.0.0 and Reference Guide for 5.0 available

For the impatient:


From the post:

Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world’s largest internet sites.

Solr 5.0 is available for immediate download at:

See the CHANGES.txt file included with the release for a full list of details.

Solr 5.0 Release Highlights:

  • Usability improvements that include improved bin scripts and new and restructured examples.
  • Scripts to support installing and running Solr as a service on Linux.
  • Distributed IDF is now supported and can be enabled via the config. Currently, there are four supported implementations for the same:

    • LocalStatsCache: Local document stats.
    • ExactStatsCache: One time use aggregation
    • ExactSharedStatsCache: Stats shared across requests
    • LRUStatsCache: Stats shared in an LRU cache across requests
  • Solr will no longer ship a war file and instead be a downloadable application.
  • SolrJ now has first class support for Collections API.
  • Implicit registration of replication, get and admin handlers.
  • Config API that supports paramsets for easily configuring solr parameters and configuring fields. This API also supports managing of pre-existing request handlers and editing common solrconfig.xml via overlay.
  • API for managing blobs allows uploading request handler jars and registering them via config API.
  • BALANCESHARDUNIQUE Collection API that allows for even distribution of custom replica properties.
  • There’s now an option to not shuffle the nodeSet provided during collection creation.
  • Option to configure bandwidth usage by Replication handler to prevent it from using up all the bandwidth.
  • Splitting of clusterstate to per-collection enables scalability improvement in SolrCloud. This is also the default format for new Collections that would be created going forward.
  • timeAllowed is now used to prematurely terminate requests during query expansion and SolrClient request retry.
  • pivot.facet results can now include nested stats.field results constrained by those pivots.
  • stats.field can be used to generate stats over the results of arbitrary numeric functions.
    It also allows for requesting for statistics for pivot facets using tags.
  • A new DateRangeField has been added for indexing date ranges, especially multi-valued ones.
  • Spatial fields that used to require units=degrees now take distanceUnits=degrees/kilometers/miles instead.
  • MoreLikeThis query parser allows requesting for documents similar to an existing document and also works in SolrCloud mode.
  • Logging improvements:

    • Transaction log replay status is now logged
    • Optional logging of slow requests.

Solr 5.0 also includes many other new features as well as numerous optimizations and bugfixes of the corresponding Apache Lucene release. Also available is the Solr Reference Guide for Solr 5.0. This 535 page PDF serves as the definitive user’s manual for Solr 5.0. It can be downloaded from the Apache mirror network:

This is the beginning of a great week!


Neo4j: Building a topic graph with Prismatic Interest Graph API

Sunday, February 22nd, 2015

Neo4j: Building a topic graph with Prismatic Interest Graph API by Mark Needham.

From the post:

Over the last few weeks I’ve been using various NLP libraries to derive topics for my corpus of How I met your mother episodes without success and was therefore enthused to see the release of Prismatic’s Interest Graph API.

The Interest Graph API exposes a web service to which you feed a block of text and get back a set of topics and associated score.

It has been trained over the last few years with millions of articles that people share on their social media accounts and in my experience using Prismatic the topics have been very useful for finding new material to read.

A great walk through from accessing the Interest Graph API to loading the data into Neo4j and querying it with Cypher.

I can’t profess a lot of interest in How I Met Your Mother episodes but the techniques can be applied to other content. 😉

The Complexity of Sequences Generated by the Arc-Fractal System

Sunday, February 22nd, 2015

The Complexity of Sequences Generated by the Arc-Fractal System by Hoai Nguyen Huynh, Andri Pradana, Lock Yue Chew.


We study properties of the symbolic sequences extracted from the fractals generated by the arc-fractal system introduced earlier by Huynh and Chew. The sequences consist of only a few symbols yet possess several nontrivial properties. First using an operator approach, we show that the sequences are not periodic, even though they are constructed from very simple rules. Second by employing the ϵ-machine approach developed by Crutchfield and Young, we measure the complexity and randomness of the sequences and show that they are indeed complex, i.e. neither periodic nor random, with the value of complexity measure being significant as compared to the known example of logistic map at the edge of chaos. The complexity and randomness of the sequences are then discussed in relation with the properties of associated fractal objects, such as their fractal dimension, symmetry and orientations of the arcs.

Very heavy sledding but I suspect worth the effort. Recalling the unexpected influence of fractals on computer science.

In any event, the mental exercise will do you good.
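The arc-fractal construction itself is in the paper, but the “simple rules, non-periodic output” phenomenon is easy to taste with a different, classic substitution system, the Thue-Morse rules (my stand-in example, not the authors’ system):

```python
def substitute(seed, rules, steps):
    """Grow a symbolic sequence by repeatedly applying substitution rules."""
    seq = seed
    for _ in range(steps):
        seq = "".join(rules[s] for s in seq)
    return seq

def has_period(seq, p):
    """True if seq repeats with period p, i.e. seq[i] == seq[i-p] for all i >= p."""
    return all(seq[i] == seq[i - p] for i in range(p, len(seq)))

# Thue-Morse: two symbols, two rules, provably non-periodic output.
tm = substitute("0", {"0": "01", "1": "10"}, 8)  # 256 symbols
assert tm.startswith("0110100110010110")
# No period up to half the sequence length, despite the trivial rules:
assert not any(has_period(tm, p) for p in range(1, len(tm) // 2 + 1))
```

Measuring where such sequences sit between periodic and random is exactly what the ϵ-machine complexity measures in the paper quantify.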

I first saw this in a tweet by Stefano Bertolo.

Losing Your Right To Decide, Needlessly

Sunday, February 22nd, 2015

France asks US internet giants to ‘help fight terror’

From the post:

Twitter and Facebook spokespeople said they do everything they can to stop material that incites violence but didn’t say whether they would heed the minister’s request for direct cooperation with French authorities.

“We regularly host ministers and other governmental officials from across the world at Facebook, and were happy to welcome Mr Cazeneuve today,” a Facebook spokesperson said.

“We work aggressively to ensure that we do not have terrorists or terror groups using the site, and we also remove any content that praises or supports terrorism.”

Cazeneuve [interior minister, France] said he called on the tech companies to join in the fight against extremist propaganda disseminated on the internet and to block extremists’ ability to use websites and videos to recruit and indoctrinate new followers.

The pace of foreign fighters joining the Islamic State of Iraq and the Levant and other armed groups has not slowed and at least 3,400 come from Western nations among 20,000 from around the world, US intelligence officials say.

As regular readers, you have already spotted what is missing in the social media = terrorist recruitment narrative.

One obvious missing part is the lack of evidence even of correlation between social media and terrorist recruitment. None, nada, nil, zip.

There are statements about social media by Brookings Institution expert J.M. Berger, who used his testimony before Congress to flog his forthcoming book with Jessica Stern, “ISIS: The State of Terror,” which he also promotes in a Brookings report to be released in March, 2015. His testimony is reported in: The Evolution of Terrorist Propaganda: The Paris Attack and Social Media, where he claims IS propaganda is present on Twitter, but fails to claim any correlation, much less causation, for IS recruitment. It is just assumed.

Your right to hear IS “propaganda,” if indeed it is “propaganda,” is being curtailed by the U.S. government, France, Twitter, Facebook, etc. Shouldn’t you be the one who gets to use the “off” switch, as it is known, to decide what you will or won’t read? As an informed citizen, shouldn’t you make your own judgements about the threat, if any, that IS poses to your country?

The other, perhaps not so obvious missing point is the significance of people traveling to support IS. Taking the reported numbers at face value:

at least 3,400 come from Western nations among 20,000 from around the world

Let’s put that into perspective. As of late Sunday afternoon on the East Coast of the United States, the world population stood at 7,226,147,500.

That’s seven billion (with a “b”), two hundred and twenty-six million, one hundred and forty-seven thousand, five hundred people.

Subtract from that the alleged 20,000 who have joined IS and you get:

Seven billion (with a “b”), two hundred and twenty-six million, one hundred and twenty-seven thousand, five hundred people (7,226,127,500.)

Really? Twitter, Facebook, the United States, France and others are going to take freedom of speech and to be informed away from Seven billion (with a “b”), two hundred and twenty-six million, one hundred and twenty-seven thousand, five hundred people (7,226,127,500) because of the potential that social media may have affected some, but we don’t know how many, of 20,000 people?
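For the skeptical, the arithmetic above in a few lines of Python:

```python
world_population = 7_226_147_500  # figure quoted above
is_recruits = 20_000              # alleged IS recruits worldwide

# The subtraction in the post checks out:
remainder = world_population - is_recruits
print(f"{remainder:,}")  # 7,226,127,500

# And the recruits amount to a vanishingly small share of humanity:
print(f"{is_recruits / world_population:.7%}")  # 0.0002768%
```

Roughly three ten-thousandths of one percent: that is the fraction of people the proposed restrictions are aimed at.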

Sometimes when you run the numbers, absurd policy choices show themselves to be just that: absurd.

PS: A more disturbing aspect of this story is that I have seen none of the major news outlets, The New York Times, CNN, the Wall Street Journal, or even the Guardian, question the causal connection between social media and recruitment for IS. If that connection were real, shouldn’t there be evidence to support such a claim?


Kathy Gilsinan’s Is ISIS’s Social-Media Power Exaggerated? (The Atlantic) confirms the social media impact of ISIS is on the minds of Western decision makers.

Exploiting the Superfish certificate

Sunday, February 22nd, 2015

Exploiting the Superfish certificate by Robert Graham.

All the major buzz feeds have been alive with Superfish gossip, excuses, accusations, etc.

Robert’s posts are a refreshing break from all the hand wringing on one side or the other.

Robert uses a $35 Raspberry Pi 2 in this post to set up the exploit. In a prior post, Extracting the SuperFish certificate, Robert details a useful exercise in cracking the encryption of the SuperFish certificate.

If you want to avoid being deceived by government, industry or other sources on cybersecurity, you need to learn about cybersecurity.

Following Robert Graham is one way to start that education.

Companion to “Functional Programming in Scala”

Sunday, February 22nd, 2015

A companion booklet to “Functional Programming in Scala” by Rúnar Óli Bjarnason.

From the webpage:

This full colour syntax-highlighted booklet comprises all the chapter notes, hints, solutions to exercises, addenda, and errata for the book “Functional Programming in Scala” by Paul Chiusano and Runar Bjarnason. This material is freely available online, but is compiled here as a convenient companion to the book itself.

If you talk about supporting alternative forms of publishing, here is your chance to support an alternative form of publishing, financially.

Authors are going to gravitate to models that sustain their ability to write.

It is up to you what model that will be.

BigQuery [first 1 TB of data processed each month is free]

Sunday, February 22nd, 2015

BigQuery [first 1 TB of data processed each month is free]

Apologies if this is old news to you but I saw a tweet by GoogleCloudPlatform advertising the “first 1 TB of data processed each month is free” and felt compelled to pass it on.

Like so much news on the Internet, if it is “new” to us, we assume it must be “new” to everyone else. (That is how the warnings of malware that will alter your DNA spread.)

It is a very tempting offer.

Tempting enough that I am going to spend some serious time looking at BigQuery.

What’s your query for BigQuery?

Unleashing the Power of Data to Serve the American People

Sunday, February 22nd, 2015

Unleashing the Power of Data to Serve the American People by Dr. DJ Patil.

You can read (and listen) to Patil’s high level goals as the first ever U.S. Chief Data Scientist at his post.

His goals are too abstract and general to attract meaningful disagreement and that isn’t the purpose of this post.

I posted the link to his comments to urge you to contact Patil (or rather his office) with concrete plans for how his office can assist you in finding and using data. The sooner the better.

No doubt some areas are already off-limits for improved data access and some priorities are already set.

That said, contacting Patil before he and his new office have solidified in place can play an important role in establishing the scope of his office. On a lesser scale, this is the same situation that confronted George Washington as the first U.S. President: nothing was set in stone and every act established a precedent for those who came after him.

Now is the time to press for an expansive and far reaching role for the U.S. Chief Data Scientist within the federal bureaucracy.