Archive for May, 2014

Introducing Hadoop FlipBooks

Saturday, May 31st, 2014

Introducing Hadoop FlipBooks

From the post:

In line with the learning theme that HadoopSphere has been evangelizing, we are pleased to introduce a new feature named FlipBooks. A Hadoop flipbook is a quick reference guide for any topic giving a short summary of key concepts in form of Q&A. Typically with a set of 4 questions, it tries to test your knowledge on the concept.

Curious what you think of this concept?

I looked at a couple of them but four (4) questions seems a bit short.

With the caution that it was probably twenty (20) years ago, I remember the drill software for the Novell Netware CNE program. Organized by subject/class as I recall and certainly a lot more than four (4) questions.

What software would you suggest for authoring similar drill material now?

Stop blaming spreadsheets…

Saturday, May 31st, 2014

Stop blaming spreadsheets (and take a good look in the mirror) by Felienne Hermans.

From the post:

This week, spreadsheets hit the news again, when data for a book written by economist Piketty turned out to contain spreadsheet errors. On this, Daniel Lemire wrote a blog post warning people not to use spreadsheets for serious work. This is useless advice, let me explain why.

See Felienne’s post for the three reasons. She writes very well and I might mangle it trying to summarize.

I see Lemire’s complaint as similar to exhortations that users should be using Oxygen to create structured XML documents.

As opposed to using Open Office and styles to author complex documents in XML (unseen by the user).

You can guess which one authors more XML every day.

Users want technologies that help them accomplish day to day tasks. Successful software, like spreadsheets, takes that into account.

Open government:….

Saturday, May 31st, 2014

Open government: getting beyond impenetrable online data by Jed Miller.

From the post:

Mathematician Blaise Pascal famously closed a long letter by apologising that he hadn’t had time to make it shorter. Unfortunately, his pithy point about “download time” is regularly attributed to Mark Twain and Henry David Thoreau, probably because the public loves writers more than it loves statisticians. Scientists may make things provable, but writers make them memorable.

The World Bank confronted a similar reality of data journalism earlier this month when it revealed that, of the 1,600 bank reports posted online from 2008 to 2012, 32% had never been downloaded at all and another 40% were downloaded under 100 times each.

Taken together, these cobwebbed documents represent millions of dollars in World Bank funds and hundreds of thousands of person-hours, spent by professionals who themselves represent millions of dollars in university degrees. It’s difficult to see the return on investment in producing expert research and organising it into searchable web libraries when almost three quarters of the output goes largely unseen.

You won’t find any ways to make documents less impenetrable in Jed’s post but it is a source for quotes on the issue.

For example:

For nonprofits and governments that still publish 100-page pdfs on their websites and do not optimise the content to share in other channels such as social: it is a huge waste of time and ineffective. Stop it now.

OK, so that’s easy: “Stop it now.”

The harder question: “What should we put in its place?”

Shouting “stop it” without offering examples of better documents or approaches, is like a car horn in New York City. It’s just noise pollution.

Do you have any examples of documents, standards, etc. that are “good” and not impenetrable?

Let’s make this more concrete: Suggest an “impenetrable” document*, hopefully not a one hundred (100) page one, and I will take a shot at revising it to make it less “impenetrable.” I will post a revised version here with notes as to why revisions were made. We won’t all agree but it might result in an example document that isn’t “impenetrable.”

*Please omit tax statutes or regulations, laws, etc. I could probably make them less impenetrable but only with a great deal of effort. That sort of text is “impenetrable” by design.

Powers of Ten – Part I

Saturday, May 31st, 2014

Powers of Ten – Part I by Stephen Mallette.

From the post:

“No, no! The adventures first,’ said the Gryphon in an impatient tone: ‘explanations take such a dreadful time.”
    — Lewis Carroll, Alice’s Adventures in Wonderland

It is often quite simple to envision the benefits of using Titan. Developing complex graph analytics over a multi-billion edge distributed graph represent the adventures that await. Like the Gryphon from Lewis Carroll’s tale, the desire to immediately dive into the adventures can be quite strong. Unfortunately and quite obviously, the benefits of Titan cannot be realized until there is some data present within it. Consider the explanations that follow; they are the strategies by which data is bulk loaded to Titan enabling the adventures to ensue.

There are a number of different variables that might influence the approach to loading data into a graph, but the attribute that provides the best guidance in making a decision is size. For purposes of this article, “size” refers to the estimated number of edges to be loaded into the graph. The strategy used for loading data tends to change in powers of ten, where the strategy for loading 1 million edges is different than the approach for 10 million edges.

Given this neat and memorable way to categorize batch loading strategies, this two-part article outlines each strategy starting with the smallest at 1 million edges or less and continuing in powers of ten up to 1 billion and more. This first part will focus on 1 million and 10 million edges, which generally involves common Gremlin operations. The second part will focus on 100 million and 1 billion edges, which generally involves the use of Faunus.

Great guidance on loading relatively small data sets using Gremlin. Looking forward to seeing the harder tests with 100 million and 1 billion edge sets.
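To make the powers-of-ten idea concrete, here is a minimal sketch that maps an estimated edge count to the tool family the post names for that range. The function name and the exact cut-off between ranges are my assumptions; the post only says that 1 million and 10 million edges generally involve Gremlin and that 100 million and 1 billion generally involve Faunus.

```python
def loading_strategy(edge_count):
    """Return (power_of_ten, tool) for an estimated number of edges.

    Hypothetical helper: the bucket boundaries are an assumption for
    illustration, not something the Powers of Ten post specifies.
    """
    if edge_count < 1:
        raise ValueError("edge_count must be a positive integer")
    # Find the smallest power of ten that covers the estimate,
    # starting from 10**6 (the smallest size the post discusses).
    power = 6
    while 10 ** power < edge_count:
        power += 1
    # Part I (1 and 10 million edges) uses Gremlin; Part II
    # (100 million and up) uses Faunus.
    tool = "Gremlin" if power <= 7 else "Faunus"
    return power, tool


for n in (250_000, 10_000_000, 100_000_000, 3_000_000_000):
    print(n, loading_strategy(n))
```

Anything between two powers of ten is rounded up to the larger bucket here, which is only one of several reasonable choices.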

North American Slave Narratives

Saturday, May 31st, 2014

North American Slave Narratives

A listing of autobiographies in chronological order, starting from 1740 to 1999.

There are two hundred and four (204) biographies in total, and a large number of them are available online.

A class project to weave these together with court records, journals, newspapers and the like would be a good use case for topic maps.

Erin McKean, founder, Reverb

Saturday, May 31st, 2014

10 Questions: Erin McKean, founder, Reverb by Chanelle Bessette.

From the introduction to the interview:

At OUP, McKean began to question how effective paper dictionaries were for the English language. Every word is a data point that has no meaning unless it is put in context, she believed, and a digital medium was the only way to link them together. If the printed dictionary is an atlas, she reasoned, the digital dictionary is a GPS device.

McKean’s idea was to create an online dictionary, dubbed Wordnik, that not only defined words but also showed how words related to each other, thereby increasing the chance that people can find the exact word that they are looking for. Today, the technology behind Wordnik is used to power McKean’s latest company, Reverb. Reverb’s namesake product is a news discovery mobile application that recommends stories based on contextual clues in the words of the article. (Even if that word is “lexicography.”)

Another case where I need a mobile phone to view a new technology. 🙁

I ran across DARLING, which promises it isn’t ready to emulate an iPhone on Ubuntu.

Do you know of another iPhone emulator for Ubuntu?


Subtleties of Color

Saturday, May 31st, 2014

Simmon is an expert on Earth visualizations for NASA, although he starts off with a great story about an early Mariner image of Mars.

Great quote:

Color has an objective reality, but the colors we see are tricks of the imagination, and there is no perfectly objective view of color.

Interesting comments on use of the rainbow palette.

While searching for an identifier for the “rainbow palette,” I found a blog entry to accompany this video: Subtleties of Color: The “Perfect” Palette.

In the video, pay particular attention to the impact of surrounding color on our perception of color.

Great introduction to the nuances of color! And its impact on the representation of your data.

Very useful if you have “details” you want to elide or “details” that you want to highlight.

I first saw this in a tweet by James Lane Conkling.

Conference on Weblogs and Social Media (Proceedings)

Saturday, May 31st, 2014

Proceedings of the Eighth International Conference on Weblogs and Social Media

A great collection of fifty-eight papers and thirty-one posters on weblogs and social media.

Not directly applicable to topic maps but social media messages are as confused, ambiguous, etc., as any area could be. Perhaps more so but there isn’t a reliable measure for semantic confusion that I am aware of to compare different media.

These papers may give you some insight into social media and useful ways for processing its messages.

I first saw this in a tweet by Ben Hachey.

[O]ne Billion Tweets

Saturday, May 31st, 2014

Streaming Similarity Search over one Billion Tweets using Parallel Locality-Sensitive Hashing by Narayanan Sundaram, et al.


Finding nearest neighbors has become an important operation on databases, with applications to text search, multimedia indexing, and many other areas. One popular algorithm for similarity search, especially for high dimensional data (where spatial indexes like kd-trees do not perform well) is Locality Sensitive Hashing (LSH), an approximation algorithm for finding similar objects.

In this paper, we describe a new variant of LSH, called Parallel LSH (PLSH) designed to be extremely efficient, capable of scaling out on multiple nodes and multiple cores, and which supports high-throughput streaming of new data. Our approach employs several novel ideas, including: cache-conscious hash table layout, using a 2-level merge algorithm for hash table construction; an efficient algorithm for duplicate elimination during hash-table querying; an insert-optimized hash table structure and efficient data expiration algorithm for streaming data; and a performance model that accurately estimates performance of the algorithm and can be used to optimize parameter settings. We show that on a workload where we perform similarity search on a dataset of > 1 Billion tweets, with hundreds of millions of new tweets per day, we can achieve query times of 1–2.5 ms. We show that this is an order of magnitude faster than existing indexing schemes, such as inverted indexes. To the best of our knowledge, this is the fastest implementation of LSH, with table construction times up to 3.7x faster and query times that are 8.3x faster than a basic implementation.

In the introduction, the authors report “…typical queries taking 1-2.5ms. In comparison to other text search schemes, such as inverted indexes, our approach is an order of magnitude faster.”

I looked but did not find any open-source code for PLSH.
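In the absence of the authors’ code, here is a minimal sketch of plain locality-sensitive hashing, just to show the basic idea the paper builds on. To be clear, this is not PLSH: it uses random hyperplanes for cosine similarity, has none of the cache-conscious layout, parallelism or streaming support described in the abstract, and the class name, parameters and defaults are all my own assumptions.

```python
import random
from collections import defaultdict


class HyperplaneLSH:
    """Basic random-hyperplane LSH for cosine similarity (not PLSH)."""

    def __init__(self, dim, bits=8, tables=4, seed=0):
        rng = random.Random(seed)
        # One independent set of `bits` random hyperplanes per hash table.
        self.planes = [
            [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]
            for _ in range(tables)
        ]
        self.tables = [defaultdict(set) for _ in range(tables)]

    def _key(self, planes, vec):
        # Each bit records which side of a hyperplane the vector falls on.
        return tuple(
            sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in planes
        )

    def insert(self, item_id, vec):
        for planes, table in zip(self.planes, self.tables):
            table[self._key(planes, vec)].add(item_id)

    def candidates(self, vec):
        # Union of the matching bucket from every table; using a set
        # removes duplicates that turn up in more than one table.
        found = set()
        for planes, table in zip(self.planes, self.tables):
            found |= table.get(self._key(planes, vec), set())
        return found


index = HyperplaneLSH(dim=3)
index.insert("a", [1.0, 0.0, 0.0])
index.insert("b", [2.0, 0.0, 0.0])   # same direction as "a"
index.insert("c", [-1.0, 0.0, 0.0])  # opposite direction
print(sorted(index.candidates([3.0, 0.0, 0.0])))  # ['a', 'b']
```

Vectors pointing the same way always share a bucket, and the exact opposite vector never does, which is why the query returns "a" and "b" but not "c". Candidates would still need an exact similarity check before being reported as neighbors.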

Caution: If you search for other research, the string “PLSH” is unlikely to be helpful. On my first search I found:

  • PL/sh is a procedural language handler for PostgreSQL
  • Partia Liberale Shqiptare (Albanian Liberal Party, Kosovo)
  • Pet Loss Support Hotline
  • Promised Land Spanish Horses
  • Polish courses (Abbreviation at Brown University)
  • Point Loma High School


Saturday, May 31st, 2014

Nicole White has authored an R driver for Neo4j known as Rneo4j.

To tempt one or more people into trying Rneo4j, two posts have appeared:

Demo of Rneo4j Part 1: Building a Database

Covers installation of the necessary R packages and the creation of a Twitter database for tweets containing “neo4j.”

Demo of Rneo4j Part 2: Plotting and Analysis

Uses Cypher results as an R data frame, which opens the data up to the full range of R analysis and display capabilities.

R users will find this a useful introduction to Neo4j and Neo4j users will be introduced to a new level of post-graph creation possibilities.

Functional Geekery

Friday, May 30th, 2014

Functional Geekery by Steve Proctor.

I stumbled across episode 9 of Functional Geekery (a podcast) in Clojure Weekly, May 29th, 2014 and was interested to hear the earlier podcasts.

It’s only nine other episodes and not a deep blog history but still, I thought it would be nice to have a single listing of all the episodes.

Do be aware that each episode has a rich set of links to materials mentioned/discussed in each podcast.

If you enjoy these podcasts, do be sure to encourage others to listen to them and encourage Steve to continue with his excellent work.

  • Episode 1 – Robert C. Martin

    In this episode I talk with Robert C. Martin, better known as Uncle Bob. We run the gamut from Structure and Interpretation of Computer Programs, introducing children to programming, TDD and the REPL, compatibility of Functional Programming and Object Oriented Programming

  • Episode 2 – Craig Andera

    In this episode I talk with fellow podcaster Craig Andera. We talk about working in Clojure, ClojureScript and Datomic, as well as making the transition to functional programming from C#, and working in Clojure on Windows. I also get him to give some recommendations on things he learned from guests on his podcast, The Cognicast.

  • Episode 3 – Fogus

    In this episode I talk with Fogus, author of The Joy of Clojure and Functional JavaScript. We cover his history with functional languages, working with JavaScript in a functional style, and digging into the history of software development.

  • Episode 4 – Zach Kessin

    In this episode I talk with fellow podcaster Zach Kessin. We cover his background in software development and podcasting, the background of Erlang, process recovery, testing tools, as well as profiling live running systems in Erlang.

  • Episode 5 – Colin Jones

    In this episode I talk with Colin Jones, software craftsman at 8th Light. We cover Colin’s work on the Clojure Koans, making the transition from Ruby to Clojure, how functional programming affects the way he does object oriented design now, and his venture into learning Haskell.

  • Episode 6 – Reid Draper

    In this episode I talk with Reid Draper. We cover Reid’s intro to functional programming through Haskell, working in Erlang, distributed systems, and property testing; including his property testing tool simple-check, which has since made it into a Clojure contrib project as test.check.

  • Episode 7 – Angela Harms and Jason Felice on avi

    In this episode I talk with Angela Harms and Jason Felice about avi. We talk about the motivation of a vi implementation written in Clojure, the road map of where avi might be used, and expressivity of code.

  • Functional Geekery Episode 08 – Jessica Kerr

    In this episode I talk with Jessica Kerr. We talk about bringing functional programming concepts to object oriented languages; her experience in Scala, using the actor model, and property testing; and much more!

  • Functional Geekery Episode 9 – William E. Byrd

    In this episode I talk with William Byrd. We talk about miniKanren and the differences between functional, logic and relational programming. We also cover the idea of thinking at higher levels of abstractions, and comparisons of relational programming to topics such as SQL, property testing, and code contracts.

  • Functional Geekery Episode 10 – Paul Holser

    In this episode I talk with Paul Holser. We start out by talking about his junit-quickcheck project, being a life long learner and exploring ideas about computation from other languages, and what Java 8 is looking like with the support of closures and lambdas.


Hello Again

Friday, May 30th, 2014

We Are Now In Command of the ISEE-3 Spacecraft by Keith Cowing.

From the post:

The ISEE-3 Reboot Project is pleased to announce that our team has established two-way communication with the ISEE-3 spacecraft and has begun commanding it to perform specific functions. Over the coming days and weeks our team will make an assessment of the spacecraft’s overall health and refine the techniques required to fire its engines and bring it back to an orbit near Earth.

First Contact with ISEE-3 was achieved at the Arecibo Radio Observatory in Puerto Rico. We would not have been able to achieve this effort without the gracious assistance provided by the entire staff at Arecibo. In addition to the staff at Arecibo, our team included simultaneous listening and analysis support by AMSAT-DL at the Bochum Observatory in Germany, the Space Science Center at Morehead State University in Kentucky, and the SETI Institute’s Allen Telescope Array in California.

How’s that for engineering and documentation?

So, maybe good documentation isn’t such a weird thing after all. 😉

Anaconda 2.0

Friday, May 30th, 2014

Anaconda 2.0 by Corinna Bahr.

From the post:

We are pleased to announce Anaconda 2.0, the newest version of our enterprise-ready Python distribution. Available for free on Windows, Mac OS X and Linux, Anaconda includes almost 200 of the most popular numerical and scientific Python libraries used by scientists, engineers and data analysts, with an integrated and flexible installer.

From the Anaconda page:

Completely free enterprise-ready Python distribution for large-scale data processing, predictive analytics, and scientific computing

  • 195+ of the most popular Python packages for science, math, engineering, data analysis
  • Completely free – including for commercial use and even redistribution
  • Cross platform on Linux, Windows, Mac
  • Installs into a single directory and doesn’t affect other Python installations on your system. Doesn’t require root or local administrator privileges
  • Stay up-to-date by easily updating packages from our free, online repository
  • Easily switch between Python 2.6, 2.7, 3.3, 3.4, and experiment with multiple versions of libraries, using our conda package manager and its great support for virtual environments
  • Comes with tools to connect and integrate with Excel


I first saw this in a tweet by Scientific Python.

Realtime Personalization/Recommendation

Friday, May 30th, 2014

Realtime personalization and recommendation with stream mining by Mikio L. Braun.

From the post:

Last Tuesday, I gave a talk at this year’s Berlin Buzzword conference on using stream mining algorithms to efficiently store information extracted from user behavior to perform personalization and recommendation effectively already using a single computer, which is of course key behind streamdrill.

If you’ve been following my talks, you’ll probably recognize a lot of stuff I’ve talked about before, but what is new in this talk is that I tried to take the next step from simply talking about Heavy Hitters and Count-Min Sketches to using these data structures as an approximate storage for all kinds of analytics related data like counts, profiles, or even sparse matrices, as they occur in recommendation algorithms.

I think reformulating our approach as basically an efficient approximate data structure also helped to steer the discussion away from comparing streamdrill to other big data frameworks (“Can’t you just do that in Storm?” — “define ‘just’”). As I said in the talk, the question is not whether you can do it in Big Data Framework X, because you probably could. I have started to look at it from the other direction: we did not use any Big Data framework and were still able to achieve some serious performance numbers.

Slides and video are available at this page.
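If the Count-Min Sketch is new to you, here is a minimal sketch of the data structure as “approximate storage” for counts. It is not streamdrill’s implementation, which I have not seen; the class name, the default width and depth, and the use of SHA-256 for the row hashes are all my assumptions, chosen to keep the example short.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: estimates never undercount, may overcount."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        # A fixed-size grid of counters, regardless of how many
        # distinct items are seen.
        self.rows = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One column per row, derived from a hash salted with the row number.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._columns(item):
            self.rows[row][col] += count

    def estimate(self, item):
        # The smallest counter is the one least inflated by collisions.
        return min(self.rows[row][col] for row, col in self._columns(item))


sketch = CountMinSketch()
events = [("ann", "home"), ("ann", "home"), ("bob", "home"), ("ann", "cart")]
for user, page in events:
    sketch.add((user, page))
print(sketch.estimate(("ann", "home")))
```

Memory use depends only on width and depth, which is the appeal for a single machine: you trade a bounded chance of overcounting for storage that does not grow with the number of users or items.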

Tablib: Pythonic Tabular Datasets

Friday, May 30th, 2014

Tablib: Pythonic Tabular Datasets by Kenneth Reitz and Bessie Monke.

From the post:

Tablib is an MIT Licensed format-agnostic tabular dataset library, written in Python. It allows you to import, export, and manipulate tabular data sets. Advanced features include, segregation, dynamic columns, tags & filtering, and seamless format import & export.

Definitely an add to your Python keychain USB drive.

I first saw this in a tweet by Gregory Piatetsky.

…Setting Up an R-Hadoop System

Friday, May 30th, 2014

Step-by-Step Guide to Setting Up an R-Hadoop System by Yanchang Zhao.

From the post:

This is a step-by-step guide to setting up an R-Hadoop system. I have tested it both on a single computer and on a cluster of computers. Note that this process is for Mac OS X and some steps or settings might be different for Windows or Ubuntu.

What looks like an excellent post on installing R-Hadoop. It is written for Mac OS X and I have yet to confirm its installation on either Windows or Ubuntu.

I won’t be installing this on Windows so if you can confirm any needed changes and post them I would appreciate it.

I first saw this in a tweet by Gregory Piatetsky.

BBC Radio Explorer:…

Friday, May 30th, 2014

BBC Radio Explorer: a new way to listen to radio by James Cridland.

From the post:

The BBC has quietly released a prototype service called BBC Radio Explorer.

The service is the result of “10% time”, a loose concept that allows the BBC’s software engineers time to develop and play about with things. Unusually, this one is visible to the public, if you know where to look. But, with a quiet announcement on Twitter and no press release, you’ll be forgiven to not know it exists. That’s by design: since it’s not finished: every page tells us it’s “work-in-progress”.

BBC Radio Explorer is a relatively simple idea. Type something that you’re interested in, and the service plays you clips and programmes that it thinks you’ll like: one after the other. It’s a different way to listen to the BBC’s speech radio output, and it should unearth a lot of interesting programming from the BBC.

Technically, it’s nicely done: type a topic, and it instantly starts playing some audio. The BBC’s invested some time in clipping some of their programmes into small chunks, and typically you’ll get a little bit of the Today programme, or BBC Radio 5 live’s breakfast show, as well as longer-form programmes. You can skip forward and back to different clips, and a quite clever progress bar shows you images of what’s coming up, while the current programme slowly disappears. It’s a responsive site, and apparently works well on iOS devices too, though Android support is lacking.

James compares similar services and discusses a number of shortcomings of the service.

An old and familiar one is the inadequacy of BBC Radio Explorer search capabilities. Not unique to the BBC but common across search engines everywhere.

But on the whole, James takes this to be a worthwhile venture and I would have to agree.

Unless and until users become more vocal about what is lacking in current search capabilities, business as usual will prevail as search engines tweak their results to sell more ads.

Apache™ Spark™ v1.0

Friday, May 30th, 2014

Apache™ Spark™ v1.0

From the post:

The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 170 Open Source projects and initiatives, announced today the availability of Apache Spark v1.0, the super-fast, Open Source large-scale data processing and advanced analytics engine.

Apache Spark has been dubbed a “Hadoop Swiss Army knife” for its remarkable speed and ease of use, allowing developers to quickly write applications in Java, Scala, or Python, using its built-in set of over 80 high-level operators. With Spark, programs can run up to 100x faster than Apache Hadoop MapReduce in memory.

“1.0 is a huge milestone for the fast-growing Spark community. Every contributor and user who’s helped bring Spark to this point should feel proud of this release,” said Matei Zaharia, Vice President of Apache Spark.

Apache Spark is well-suited for machine learning, interactive queries, and stream processing. It is 100% compatible with Hadoop’s Distributed File System (HDFS), HBase, Cassandra, as well as any Hadoop storage system, making existing data immediately usable in Spark. In addition, Spark supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms out-of-the-box.

New in v1.0, Apache Spark offers strong API stability guarantees (backward-compatibility throughout the 1.X series), a new Spark SQL component for accessing structured data, as well as richer integration with other Apache projects (Hadoop YARN, Hive, and Mesos).

Spark Homepage.

A somewhat more technical note on the release from the project:

Spark 1.0.0 is a major release marking the start of the 1.X line. This release brings both a variety of new features and strong API compatibility guarantees throughout the 1.X line. Spark 1.0 adds a new major component, Spark SQL, for loading and manipulating structured data in Spark. It includes major extensions to all of Spark’s existing standard libraries (ML, Streaming, and GraphX) while also enhancing language support in Java and Python. Finally, Spark 1.0 brings operational improvements including full support for the Hadoop/YARN security model and a unified submission process for all supported cluster managers.

You can download Spark 1.0.0 as either a source package (5 MB tgz) or a prebuilt package for Hadoop 1 / CDH3, CDH4, or Hadoop 2 / CDH5 / HDP2 (160 MB tgz). Release signatures and checksums are available at the official Apache download site.

What a nice way to start the weekend!

I first saw this in a tweet by Sean Owen.

Debunking Linus’s Law with Science

Friday, May 30th, 2014

Putting the science in computer science by Felienne Hermans.

From the description:

Programmers love science! At least, so they say. Because when it comes to the ‘science’ of developing code, the most used tool is brutal debate. Vim versus emacs, static versus dynamic typing, Java versus C#, this can go on for hours at end. In this session, software engineering professor Felienne Hermans will present the latest research in software engineering that tries to understand and explain what programming methods, languages and tools are best suited for different types of development.

Felienne dispels the notion that a discipline is scientific because it claims “science” as part of its name.

To inject some “science” into “computer science,” she reports tests of several propositions, widely held in CS circles, that don’t bear up when “facts” are taken into account.

For example, Linus’s Law: “Given enough eyeballs, all bugs are shallow.”

“Debunking” may not be strong enough because as Felienne shows, the exact opposite of Linus’s Law is true: The more people who touch code, the more bugs are introduced.

If some proprietary software house rejoices over that fact, you can point out that complexity of the originating organization also has a direct relationship to bugs. As in more, not fewer, bugs.

That’s what happens when you go looking for facts. Old sayings turn out to be not true and people you already viewed with suspicion turn out to be more incompetent than you thought.

That’s science.

Balisage – Late Breaking News

Friday, May 30th, 2014

Balisage 2014 Call for Late-breaking News

Proposals due: June 13, 2014.

You have been drooling over the Preliminary Program for several days and wishing you had submitted a paper to Balisage.

Unlike some mistakes, you have a second chance to appear in the company of the markup stars listed in the program. Second chances in life are rare and I suggest you take advantage of this one.

From the announcement:

The peer-reviewed part of the Balisage 2014 program has been scheduled. A few slots on the Balisage program have been reserved for presentation of “Late-breaking” material.

Proposals for late-breaking slots must be received by June 13, 2014. Selection of late-breaking proposals will be made by the Balisage conference committee, instead of being made in the course of the regular peer-review process.

If you have a presentation that should be part of Balisage, and it isn’t already on the Preliminary Program, please send a proposal message as plain-text email to

In order to be considered for inclusion in the final program, your proposal message must supply the following information:

  • The name(s) and affiliations of all author(s)/speaker(s)
  • The email address of the presenter
  • The title of the presentation
  • An abstract of 100-150 words, suitable for immediate distribution
  • Disclosure of when and where, if some part of this material has already been presented or published
  • An indication as to whether the presenter is comfortable giving a conference presentation and answering questions in English about the material to be presented
  • Your assurance that all authors are willing and able to sign the Balisage Non-exclusive Publication Agreement with respect to the proposed presentation

In order to be in serious contention for inclusion in the final program, your proposal should probably be either a) really late-breaking (it happened in the last month or two) or b) a paper, an extended paper proposal, or a very long abstract with references. Late-breaking slots are few and the competition is fiercer than for peer-reviewed papers. The more we know about your proposal, the better we can appreciate the quality of your submission.

Please feel encouraged to provide any other information that could aid the conference committee as it considers your proposal, such as a detailed outline, samples, code, and/or graphics. We expect to receive far more proposals than we can accept, so it’s important that you send enough information to make your proposal convincing and exciting. (This material may be attached to the email message, if appropriate.)

The conference committee reserves the right to make editorial changes in your abstract and/or title for the conference program and publicity.

Balisage will be held in North Bethesda, Maryland (a suburb of Washington, DC).

So, no St. Catherine’s street. Sorry.

On the other hand, the yellow pages currently list thirty-four (34) “escort” services in the Washington, D.C. area. I don’t know of any price/service comparison listing for those services.

Quantum Computing Playground

Friday, May 30th, 2014

Google’s “Quantum Computing Playground” lets you fiddle with quantum algorithms by Dario Borghino.

From the post:

Google has just launched a new web-based integrated development environment (IDE) that allows users to write, run and debug software that makes use of quantum algorithms. The tool could allow computer scientists to stay ahead of the game and get acquainted with the many quirks of quantum algorithms even before the first practical quantum computer is built.

Homepage: Quantum Computing Playground.

From the homepage:

We strongly recommend to run Quantum Playground in Google Chrome.

I accessed the homepage using, gasp, another browser. 😉 Just happenstance.

The page doesn’t say anything about use of another browser leaving your computer out of phase but probably best not to take chances.


Open-Source Intelligence

Thursday, May 29th, 2014

Big data brings new power to open-source intelligence by Matthew Moran.

From the post:

In November 2013, the New Yorker published a profile of Eliot Higgins – or Brown Moses as he is known to almost 17,000 Twitter followers. An unemployed finance and admin worker at the time, Higgins was held up as an example of what can happen when we take advantage of the enormous amount of information being spread across the internet every day. The New Yorker’s eight-page spread described Higgins as “perhaps the foremost expert on the munitions used in the [Syrian] war”, a remarkable description for someone with no formal training in munitions or intelligence.

Higgins does not speak Arabic and has never been to the Middle East. He operates from his home in Leicester and, until recently, conducted his online investigations as an unpaid hobby. Yet the description was well-founded. Since starting his blog in 2012, Higgins has uncovered evidence of the Syrian army’s use of cluster bombs and exposed the transfer of weapons from Iran to Syria. And he has done it armed with nothing more than a laptop and an eye for detail.

This type of work is a form of open-source intelligence. Higgins exploits publicly accessible material such as online photos, video and social media updates to piece together information about the Syrian conflict. His analyses have formed the basis of reports in The Guardian and a blog for The New York Times, while his research has been cited by Human Rights Watch.

Matthew makes a compelling case for open-source intelligence, using Eliot Higgins as an example.

No guarantees of riches or fame but data is out there to be mined and curated.

All that is required is for you to find it, package it and find the right audience and/or buyer.

No small order but what else are you doing this coming weekend? 😉

PS: Where would you place requests for intelligence or offer intelligence for sale? Just curious.

Global Data of Events, Languages, and Tones

Thursday, May 29th, 2014

More than 250 million global events are now in the cloud for anyone to analyze by Derrick Harris.

From the post:

Georgetown University researcher Kalev Leetaru has spent years building the Global Database of Events, Languages, and Tones. It now contains data on more than 250 million events dating back to 1979 and updated daily, with 58 different fields apiece, across 300 categories. Leetaru uses it to produce a daily report analyzing global stability. He and others have used it to figure out whether the kidnapping of 200 Nigerian girls was a predictable event and watch Crimea turn into a hotspot of activity leading up to ex-Ukrainian Viktor Yanukovych’s ouster and Russia’s subsequent invasion.

“The idea of GDELT is how do we create a catalog, essentially, of everything that’s going on across the planet, each day,” Leetaru explained in a recent interview.

And now all of it is available in the cloud, for free, for anybody to analyze as they desire. Leetaru has partnered with Google, where he has been hosting GDELT for the past year, to make it available (here) as a public dataset that users can analyze directly with Google BigQuery. Previously, anyone interested in the data had to download the 100-gigabyte dataset and analyze it on their own machines. They still can, of course, and Leetaru recently built a catalog of recipes for various analyses and a BigQuery-based method for slicing off specific parts of the data.

See Derrick’s post for additional details.

When I previously wrote about GDELT it wasn’t available for querying with Google’s BigQuery. That should certainly improve access to this remarkable resource.

Perhaps intelligence gathering/analysis will become a cottage industry.

That’s a promising idea.

See also: Google BigQuery homepage.

Neo4j 2.1 – Graph ETL for Everyone

Thursday, May 29th, 2014

Neo4j 2.1 – Graph ETL for Everyone

From the post:

It’s an exciting time for Neo4j users and, of course, the Neo4j team as we’re releasing the 2.1 version of Neo4j! You’ve probably already seen the amazing strides we’ve taken when releasing our 2.0 version at the start of the year, and Neo4j 2.1 continues to improve the user experience while delivering some impressive under-the-hood improvements, and some interesting work on boosting Cypher too.

Easy import with ETL features directly in Cypher

Graphs are everywhere, but sometimes they’re buried in other systems and legacy databases. You need to extract the data then bring it into Neo4j to experience its true graph form. To help you do this, we’ve brought bulk load functionality directly into Cypher. The new LOAD CSV clause makes that a pleasant and simple task, optimized for graphs around millions scale – the kind of size that folks typically encounter when getting started with Neo4j.

Err, but the line:

You need to extract the data then bring it into Neo4j to experience its true graph form.

isn’t really true is it?

In other words, to process a graph with Neo4j, you have to extract, transform and load the data into Neo4j. Yes?

That is, if I could address the data in situ (in its original place) and add the properties I need to process it as a graph, no extraction, transformation and loading would be necessary.
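To make the in situ point concrete, here is a minimal Python sketch of my own (it is not from the Neo4j post and says nothing about how Neo4j works). It assumes a hypothetical table, `REPORTS_TO`, that stays where it is (an in-memory list stands in for rows in a legacy system) and a hypothetical `neighbors` function that reads edges straight from those rows on demand. The traversal runs without copying the rows into a separate graph store.

```python
from collections import deque

# Hypothetical rows left in place in their "legacy" table: (employee, manager).
# Nothing is extracted or loaded; the rows are only read.
REPORTS_TO = [
    ("alice", "carol"),
    ("bob", "carol"),
    ("carol", "dave"),
    ("erin", "dave"),
]


def neighbors(rows, node):
    """Yield nodes adjacent to `node`, reading the rows as undirected edges."""
    for src, dst in rows:
        if src == node:
            yield dst
        elif dst == node:
            yield src


def shortest_path_length(rows, start, goal):
    """Breadth-first search over the rows; returns the hop count or None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbors(rows, node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


print(shortest_path_length(REPORTS_TO, "alice", "erin"))
```

Scanning every row on each lookup is slow, and an index would be the obvious next step. The point is only that the graph here is a view over the data, not a copy of it.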


Not to downplay the usefulness of better importing, if your software requires it, but we do need to be precise about what is being described.

There are other new features and improvements so download a copy of Neo4j 2.1 today!

100+ Interesting Data Sets for Statistics

Thursday, May 29th, 2014

100+ Interesting Data Sets for Statistics by Robert Seaton.


Summary: Looking for interesting data sets? Here’s a list of more than 100 of the best stuff, from dolphin relationships to political campaign donations to death row prisoners.

If we have data, let’s look at data. If all we have are opinions, let’s go with mine.

—Jim Barksdale

Compiled using Robert’s definition of “interesting” but I will be surprised if you don’t agree in most cases.

Curated collections of pointers to data sets come to mind as a possible information product.


I first saw this in a tweet by Aatish Bhatia.

Categorical Databases

Thursday, May 29th, 2014

Categorical Databases by David I. Spivak.

From Slide 2 of 58:

There is a fundamental connection between databases and categories.

  • Category theory can simplify how we think about and use databases.
  • We can clearly see all the working parts and how they fit together.
  • Powerful theorems can be brought to bear on classical DB problems.

The slides are “text heavy” but I think you will find that helpful rather than a hindrance in this case. 😉

From David Spivak’s homepage:

Purpose: I study information and communication, working towards a mathematical foundation for interoperability.

If you are looking for more motivation to get into category theory, this could be the place to start.

I first saw this in a tweet by Jim Duey.

Discovering Literature: Romantics and Victorians

Thursday, May 29th, 2014

Discovering Literature: Romantics and Victorians (British Library)

From “About this project:”

Exploring the Romantic and Victorian periods, Discovering Literature brings together, for the first time, a wealth of the British Library’s greatest literary treasures, including numerous original manuscripts, first editions and rare illustrations.

A rich variety of contextual material – newspapers, photographs, advertisements and maps – is presented alongside personal letters and diaries from iconic authors. Together they bring to life the historical, political and cultural contexts in which major works were written: works that have shaped our literary heritage.

William Blake’s notebook, childhood writings of the Brontë sisters, the manuscript of the Preface to Charles Dickens’s Oliver Twist, and an early draft of Oscar Wilde’s The Importance of Being Earnest are just some of the unique collections available on the site.

Discovering Literature features over 8000 pages of collection items and explores more than 20 authors through 165 newly-commissioned articles, 25 short documentary films, and 30 lesson plans. More than 60 experts have contributed interpretation, enriching the website with contemporary research. Designed to enhance the study and enjoyment of English literature, the site contains a dedicated Teachers’ Area supporting the curriculum for GCSE and A Level students.

These great works from the Romantic and Victorian periods form the first phase of a wider project to digitise other literary eras, including the 20th century.

On a whim I searched for Bleak House only to find: Bleak House first edition with illustrations, which includes images of the illustrations and the text. Moreover, it has related links, one of which is a review of Jude the Obscure that appeared in the Morning Post.

From the review:

To write a story of over five hundred pages, and longer by far than the majority of three-volume novels, without allowing one single ray of humour, or even cheerfulness, to dispel for a moment the gloomy atmosphere of hopeless pessimism was no ordinary task, and might have taxed the powers of the most relentless observers of life. Even Euripides, had he been given to the writing of novels, might well have faltered before such a tremendous undertaking.

Can you imagine finding such a review on

Mapping Bleak House into then-current legal practice or Jude the Obscure into the social customs and records of the time would be fascinating summer projects.

Emacs Settings for Clojure

Thursday, May 29th, 2014

My Optimal GNU Emacs Settings for Developing Clojure (so far) by Frédérick Giasson.

From the post:

In the coming months, I will start to publish a series of blog posts that will explain how RDF data can be serialized in Clojure code and more importantly what are the benefits of doing this. At Structured Dynamics, we started to invest resources into this research project and we believe that it will become a game changer regarding how people will consume, use and produce RDF data.

But I want to take a humble first step into this journey just by explaining how I ended up configuring Emacs for working with Clojure. I want to take the time to do this since this is a trials and errors process, and that it may be somewhat time-consuming for the new comers.

In an interesting twist for an article on Emacs, Frédérick recommends strongly that the reader consider Light Table as an IDE for Clojure over Emacs, especially if they are not already Emacs users.

What follows is a detailed description of changes for your .emacs file should you want to follow the Emacs route, including a LightTable theme for Emacs.

A very useful post and I am looking forward to the Clojure/RDF posts to follow.

Structured Programming with go to Statements

Wednesday, May 28th, 2014

Structured Programming with go to Statements by Donald E. Knuth.


A consideration of several different examples sheds new light on the problem of creating reliable, well-structured programs that behave efficiently. This study focuses largely on two issues: (a) improved syntax for iterations and error exits, making it possible to write a larger class of programs clearly and efficiently without go to statements; (b) a methodology of program design, beginning with readable and correct, but possibly inefficient programs that are systematically transformed if necessary into efficient and correct, but possibly less readable code. The discussion brings out opposing points of view about whether or not go to statements should be abolished; some merit is found on both sides of this question. Finally, an attempt is made to define the true nature of structured programming, and to recommend fruitful directions for further study.

As I learned today in Christophe Lalanne’s A bag of tweets / May 2014, this is the origin of the oft-quoted Knuth caution about “premature optimization.”

It’s one thing to know that Knuth once said something about “premature optimization” but quite another to see it in a larger context.

Be forewarned, the article has a table of contents and runs forty (40) pages.

However, it is delightful Knuth-style writing from 1974.

If nothing else, you may be less dogmatic about the future of programming after reading Knuth’s projections from forty years ago.


The Deep Web you don’t know about

Wednesday, May 28th, 2014

The Deep Web you don’t know about by Jose Pagliery.

From the post:

Then there’s Tor, the darkest corner of the Internet. It’s a collection of secret websites (ending in .onion) that require special software to access them. People use Tor so that their Web activity can’t be traced — it runs on a relay system that bounces signals among different Tor-enabled computers around the world.

(video omitted)

It first debuted as The Onion Routing project in 2002, made by the U.S. Naval Research Laboratory as a method for communicating online anonymously. Some use it for sensitive communications, including political dissent. But in the last decade, it’s also become a hub for black markets that sell or distribute drugs (think Silk Road), stolen credit cards, illegal pornography, pirated media and more. You can even hire assassins.

If you take the figures of 54% of the deep web being databases, plus the 13% said to be on intranets, that leaves 33% of the deep web unaccounted for. How much of that is covered by Tor is hard to say.

But, we can intelligently guess that search doesn’t work any better in Tor than other segments of the Web, deep or not.

Given the risk of using even the Tor network (see Online privacy is dead by Jose Pagliery, on NSA vs. Silk Road), finding what you want efficiently could be worth a premium price.

Is guarding online privacy the tipping point for paid collocation services?