Archive for November, 2014

The Big Book of PostgreSQL

Sunday, November 30th, 2014

The Big Book of PostgreSQL by Thom Brown.

From the post:

Documentation is crucial to the success of any software program, particularly open source software (OSS), where new features and functionality are added by many contributors. Like any OSS, PostgreSQL needs to produce accurate, consistent and reliable documentation to guide contributors’ work and reflect the functionality of every new contribution. Documentation is also an important source of information for developers, administrators and other end users as they will take actions or base their work on the functionality described in the documentation. Typically, the author of a new feature provides the relevant documentation changes to the project, and that person can be anyone in any role in IT. So it can really come from anywhere.

Postgres documentation is extensive (you can check out the latest 9.4 documentation here). In fact, the U.S. community PDF document is 2,700 pages long. It would be a mighty volume and pretty unwieldy if published as a physical book. The Postgres community is keenly aware that the quality of documentation can make or break an open source project, and thus regularly updates and improves our documentation, a process I’ve appreciated being able to take part in.

A recent podcast, Solr Usability with Steve Rowe & Tim Potter, goes to some lengths to describe efforts to improve Solr documentation.

If you know anyone in the Solr community, consider this a shout out that PostgreSQL documentation isn’t a bad example to follow.

Jeopardy! clues data

Sunday, November 30th, 2014

Jeopardy! clues data Nathan Yau writes:

Here’s some weekend project data for you. Reddit user trexmatt dumped a dataset for 216,930 Jeopardy! questions and answers in JSON and CSV formats, a scrape from the J! Archive. Each clue is represented by category, money value, the clue itself, the answer, round, show number, and air date.

Nathan suggests hunting for Daily Doubles but then discovers someone has done that. (See Nathan’s post for the details.)
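If you want to poke at the data anyway, here is a minimal Python sketch of a first pass. The field names (`category`, `value`, `round`), the round labels, and the board values are my assumptions, not something taken from the dataset's documentation. The heuristic (flag any clue whose value is not on the assumed board) is only a guess at how a Daily Double might show up, and it will misfire on shows that used different board values, so check all of it against the actual dump.

```python
from collections import Counter

# Assumed board values per round (illustrative guesses, not from the dataset).
STANDARD_VALUES = {
    "Jeopardy!": {200, 400, 600, 800, 1000},
    "Double Jeopardy!": {400, 800, 1200, 1600, 2000},
}


def parse_value(raw):
    """Turn a string such as '$1,200' into 1200; return None if unusable."""
    if not raw:
        return None
    digits = str(raw).replace("$", "").replace(",", "").strip()
    return int(digits) if digits.isdigit() else None


def clues_per_round(clues):
    """Count clues by their (assumed) 'round' field."""
    return Counter(clue.get("round") for clue in clues)


def daily_double_candidates(clues):
    """Return clues whose value is not on the assumed board for their round."""
    hits = []
    for clue in clues:
        board = STANDARD_VALUES.get(clue.get("round"))
        value = parse_value(clue.get("value"))
        if board is not None and value is not None and value not in board:
            hits.append(clue)
    return hits


# Tiny hand-made sample standing in for the real file.
sample = [
    {"category": "HISTORY", "value": "$200", "round": "Jeopardy!"},
    {"category": "SCIENCE", "value": "$1,500", "round": "Jeopardy!"},
    {"category": "OPERA", "value": "$800", "round": "Double Jeopardy!"},
    {"category": "FINAL", "value": None, "round": "Final Jeopardy!"},
]

print(clues_per_round(sample))
print(len(daily_double_candidates(sample)))
```

With the real download you would replace `sample` with the list returned by `json.load` on the JSON file, after confirming the field names.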


The Week When Attackers Started Winning The War On Trust

Sunday, November 30th, 2014

The Week When Attackers Started Winning The War On Trust by Kevin Bocek.

Kevin details four news stories:

And concludes:

This is important because …
All of these news stories should be a serious wake-up call for the infosec industry. The threatscape has changed. Attackers need trusted status, and they know they can get it by misusing keys and certificates. What else does this mean? Unfortunately, it means almost every single security control that you’ve spent millions on to protect your network, apps, and data can be undermined and circumvented.

Kevin has a good argument. The compromise of identity (identity being a favorite theme of topic maps) strikes deep into the first assumption of any security system: that an identified user has a right to be on the system. Once an intruder gets past that hurdle, damage will follow.

Kevin advises us to stop blindly trusting certificates and keys. OK, then what?

In a separate post from April of this year, Kevin advises:

  • Know where all keys and certificates are located
  • Revoke, replace, install, and verify keys and certificates with new ones

These steps are not without difficulty, particularly if you don’t know where all the keys and certificates are located, but they are necessary nonetheless.

The admonition to stop blindly trusting certificates sounds great, but in practice it will come down to the potential losses from blind trust. In some cases the risk may be low enough that blind trust is a reasonable choice. In others, like traveling executives, there will be a need for hardware-based encryption by default with no user intervention.

Old World Language Families

Sunday, November 30th, 2014

language tree

By design (limitation of space) not all languages were included.

Despite that, the original post has gotten seven hundred and twenty-two (722) comments as of today, a large number of which mention wanting a poster of this visualization.

I could assemble the same information, sans the interesting graphic, and get no comments and no requests for a poster version.


What makes this presentation (map) compelling? Could you transfer it to another body of information with the same impact?

What do you make of: “The approximate sizes of our known living language populations, compared to year 0.”

Suggested reading on what makes some graphics compelling and others not?

Originally from: Stand Still Stay Silent Comic, although I first saw it at: Old World Language Families by Randy Krum.

PS: For extra credit, how many languages can you name that don’t appear on this map?

Introducing Cloudera Labs: An Open Look into Cloudera Engineering R&D

Sunday, November 30th, 2014

Introducing Cloudera Labs: An Open Look into Cloudera Engineering R&D by Justin Kestelyn.

From the announcement of Cloudera Labs, a list of existing projects and a call for your suggestions of others:

Apache Kafka is among the “charter members” of this program. Since its origin as proprietary LinkedIn infrastructure just a couple years ago for highly scalable and resilient real-time data transport, it’s now one of the hottest projects associated with Hadoop. To stimulate feedback about Kafka’s role in enterprise data hubs, today we are making a Kafka-Cloudera Labs parcel (unsupported) available for installation.

Other initial Labs projects include:

  • Exhibit
    Exhibit is a library of Apache Hive UDFs that usefully let you treat array fields within a Hive row as if they were “mini-tables” and then execute SQL statements against them for deeper analysis.
  • Hive-on-Spark Integration
    A broad community effort is underway to bring Apache Spark-based data processing to Apache Hive, reducing query latency considerably and allowing IT to further standardize on Spark for data processing.
  • Impyla
    Impyla is a Python (2.6 and 2.7) client for Impala, the open source MPP query engine for Hadoop. It communicates with Impala using the same standard protocol as ODBC/JDBC drivers.
  • Oryx
    Oryx, a project jointly spearheaded by Cloudera Engineering and Intel, provides simple, real-time infrastructure for large-scale machine learning/predictive analytics applications.
  • RecordBreaker
    RecordBreaker, a project jointly developed by Hadoop co-founder Mike Cafarella and Cloudera, automatically turns your text-formatted data into structured Avro data–dramatically reducing data prep time.

As time goes on, and some of the projects potentially graduate into CDH components (or otherwise remain as Labs projects), more names will join the list. And of course, we’re always interested in hearing your suggestions for new Labs projects.

Do you take the rapid development of the Hadoop ecosystem as a lesson about investment in R&D by companies both large and small?

Is one of your first questions to a startup: What are your plans for investing in open source R&D?

Other R&D labs that I should call out for special mention?

BigBench: Toward An Industry-Standard Benchmark for Big Data Analytics

Sunday, November 30th, 2014

BigBench: Toward An Industry-Standard Benchmark for Big Data Analytics by Bhaskar D Gowda and Nishkam Ravi.

From the post:

Benchmarking Big Data systems is an open problem. To address this concern, numerous hardware and software vendors are working together to create a comprehensive end-to-end big data benchmark suite called BigBench. BigBench builds upon and borrows elements from existing benchmarking efforts in the Big Data space (such as YCSB, TPC-xHS, GridMix, PigMix, HiBench, Big Data Benchmark, and TPC-DS). Intel and Cloudera, along with other industry partners, are working to define and implement extensions to BigBench 1.0. (A TPC proposal for BigBench 2.0 is in the works.)

BigBench Overview

BigBench is a specification-based benchmark with an open-source reference implementation kit, which sets it apart from its predecessors. As a specification-based benchmark, it would be technology-agnostic and provide the necessary formalism and flexibility to support multiple implementations. As a “kit”, it would lower the barrier of entry to benchmarking by providing a readily available reference implementation as a starting point. As open source, it would allow multiple implementations to co-exist in one place and be reused by different vendors, while providing consistency where expected for the ability to provide meaningful comparisons.

The BigBench specification comprises two key components: a data model specification, and a workload/query specification. The structured part of the BigBench data model is adopted from the TPC-DS data model depicting a product retailer, which sells products to customers via physical and online stores. BigBench’s schema uses the data of the store and web sales distribution channel and augments it with semi-structured and unstructured data as shown in Figure 1.

big bench figure 1

Figure 1: BigBench data model specification

The data model specification is implemented by a data generator, which is based on an extension of PDGF. Plugins for PDGF enable data generation for an arbitrary schema. Using the BigBench plugin, data can be generated for all three parts of the schema: structured, semi-structured and unstructured.

BigBench 1.0 workload specification consists of 30 queries/workloads. Ten of these queries have been taken from the TPC-DS workload and run against the structured part of the schema. The remaining 20 were adapted from a McKinsey report on Big Data use cases and opportunities. Seven of these run against the semi-structured portion and five run against the unstructured portion of the schema. The reference implementation of the workload specification is available here.

BigBench 1.0 specification includes a set of metrics (focused around execution time calculation) and multiple execution modes. The metrics can be reported for the end-to-end execution pipeline as well as each individual workload/query. The benchmark also defines a model for submitting concurrent workload streams in parallel, which can be extended to simulate the multi-user scenario.

The post continues with plans for BigBench 2.0 and Intel tests using BigBench 1.0 against various hardware configurations.

An important effort and very much worth your time to monitor.

None other than the Open Data Institute and Thomson Reuters have found that identifiers are critical to bringing value to data. With that realization and the need to map between different identifiers, there is an opportunity for identifier benchmarks in BigData, that is, benchmarks for identifiers that have documented semantics and the ability to merge with other identifiers.

A benchmark for BigData identifiers would achieve two very important goals:

First, it would give potential users a rough gauge of the amount of effort required to reach some X goal of identifiers. The cost of identifiers will vary from data set to data set, but having no cost information at all leaves potential users to expect the worst.

Second, as with the BigBench benchmark, potential users could compare apples to apples in judging the performance and characteristics of identifier schemes (such as topic map merging).

Both of those goals seem like worthy ones to me.


What is Walmart Doing Right and Topic Maps Doing Wrong?

Sunday, November 30th, 2014

Sentences to ponder by Chris Blattman.

From the post:

Walmart reported brisk traffic overnight. The retailer, based in Bentonville, Ark., said that 22 million shoppers streamed through stores across the country on Thanksgiving Day. That is more than the number of people who visit Disney’s Magic Kingdom in an entire year.

A blog at the Wall Street Journal suggests the numbers are even better than those reported by Chris:

Wal-Mart said it had more than 22 million customers at its stores between 6 p.m. and 10 p.m. Thursday, similar to its numbers a year ago.

In four (4) hours Walmart had more customers than visit Disney’s Magic Kingdom in a year.

Granting as of October 31, 2014, WalMart has forty-nine hundred and eighty-seven (4987) locations in the United States, that remains an impressive number.
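To put that in per-store terms, here is a back-of-the-envelope calculation in Python. It assumes, purely for illustration, that the 22 million customers were spread evenly across all 4987 locations and across the four hours, which is certainly not how shoppers behave.

```python
customers = 22_000_000   # reported customers between 6 p.m. and 10 p.m.
locations = 4987         # U.S. locations as of October 31, 2014
hours = 4                # length of the reported window

# Assumption: an even spread over stores and hours (illustrative only).
per_store = customers / locations
per_store_per_hour = per_store / hours

print(round(per_store))           # about 4411 customers per store
print(round(per_store_per_hour))  # about 1103 customers per store per hour
```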

Suffice it to say the number of people actively using topic maps is substantially less than the Thanksgiving customer numbers for Walmart.

I don’t have the answer to the title question.

Asking you to ponder it as you do holiday shopping.

What is different about your experience in online or offline shopping that makes it different from your experience with topic maps? Or pre- or post-shopping experience that is different?

I will take this question up again after the first of 2015 so be working on your thoughts and suggestions over the holiday season.


Gates Foundation champions open access

Sunday, November 30th, 2014

Gates Foundation champions open access by Rebecca Trager.

From the post:

The Bill & Melinda Gates Foundation, based in Washington, US, has adopted a new policy that requires free, unrestricted access and reuse of all peer-reviewed published research that the foundation funds, including any underlying data sets.

The policy, announced last week, applies to all of the research that the Gates Foundation funds entirely or partly, and will come into effect on 1 January, 2015. Specifically, the new rule dictates that published research be made available under a ‘Creative Commons’ generic license, which means that it can be copied, redistributed, amended and commercialised. During a two-year transition period, the foundation will allow publishers a 12 month embargo period on access to their research papers and data sets.

If other science and humanities sponsors follow Gates, nearly universal open access will be an accomplished fact by the end of the decade.

There will be wailing and gnashing of teeth by those who expected protectionism to further their careers at the expense of the public. I can bear their discomfort with a great deal of equanimity. Can’t you?

BRAIN WORKSHOP [Dec. 3-5, 2014]

Sunday, November 30th, 2014

BRAIN WORKSHOP: Workshop on the Research Interfaces between Brain Science and Computer Science

From the post:

brain conference logo

Computer science and brain science share deep intellectual roots – after all, computer science sprang out of Alan Turing’s musings about the brain in the spring of 1936. Today, understanding the structure and function of the human brain is one of the greatest scientific challenges of our generation. Decades of study and continued progress in our knowledge of neural function and brain architecture have led to important advances in brain science, but a comprehensive understanding of the brain still lies well beyond the horizon. How might computer science and brain science benefit from one another? Computer science, in addition to staggering advances in its core mission, has been instrumental in scientific progress in physical and social sciences. Yet among all scientific objects of study, the brain seems by far the most blatantly computational in nature, and thus presumably most conducive to algorithmic insights, and more apt to inspire computational research. Models of the brain are naturally thought of as graphs and networks; machine learning seeks inspiration in human learning; neuromorphic computing models attempt to use biological insight to solve complex problems. Conversely, the study of the brain depends crucially on interpretation of data: imaging data that reveals structure, activity data that relates to the function of individual or groups of neurons, and behavioral data that embodies the complex interaction of all of these elements.


This two-day workshop, sponsored by the Computing Community Consortium (CCC) and National Science Foundation (NSF), brings together brain researchers and computer scientists for a scientific dialogue aimed at exposing new opportunities for joint research in the many exciting facets, established and new, of the interface between the two fields.   The workshop will be aimed at questions such as these:

  • What are the current barriers to mapping the architecture of the brain, and how can they be overcome?
  • What scale of data suffices for the discovery of “neural motifs,” and what might they look like?
  • What would be required to truly have a “neuron in-silico,” and how far are we from that?
  • How can we connect models across the various scales (biophysics – neural function – cortical functional units – cognition)?
  • Which computational principles of brain function can be employed to solve computational problems? What sort of platforms would support such work?
  • What advances are needed in hardware and software to enable true brain-computer interfaces? What is the right “neural language” for communicating with the brain?
  • How would one be able to test equivalence between a computational model and the modeled brain subsystem?
  • Suppose we could map the network of billions of nodes and trillions of connections that is the brain, how would we infer structure?
  • Can we create open-science platforms enabling computational science on enormous amounts of heterogeneous brain data (as it has happened in genomics)?
  • Is there a productive algorithmic theory of the brain, which can inform our search for answers to such questions?

Plenary addresses to be live-streamed at:

December 4, 2014 (EST):

8:40 AM Plenary: Jack Gallant, UC Berkeley, A Big Data Approach to Functional Characterization of the Mammalian Brain

2:00 PM Plenary: Aude Oliva, MIT Time, Space and Computation: Converging Human Neuroscience and Computer Science

7:30 PM Plenary: Leslie Valiant, Harvard, Can Models of Computation in Neuroscience be Experimentally Validated?

December 5, 2014 (EST)

10:05 AM Plenary: Terrence Sejnowski, Salk Institute, Theory, Computation, Modeling and Statistics: Connecting the Dots from the BRAIN Initiative

Mark your calendars today!

Types and Functions

Saturday, November 29th, 2014

Types and Functions by Bartosz Milewski.

From the post:

The category of types and functions plays an important role in programming, so let’s talk about what types are and why we need them.

Who Needs Types?

There seems to be some controversy about the advantages of static vs. dynamic and strong vs. weak typing. Let me illustrate these choices with a thought experiment. Imagine millions of monkeys at computer keyboards happily hitting random keys, producing programs, compiling, and running them.

monkey with keyboard

With machine language, any combination of bytes produced by monkeys would be accepted and run. But with higher level languages, we do appreciate the fact that a compiler is able to detect lexical and grammatical errors. Lots of monkeys will go without bananas, but the remaining programs will have a better chance of being useful. Type checking provides yet another barrier against nonsensical programs. Moreover, whereas in a dynamically typed language, type mismatches would be discovered at runtime, in strongly typed statically checked languages type mismatches are discovered at compile time, eliminating lots of incorrect programs before they have a chance to run.

So the question is, do we want to make monkeys happy, or do we want to produce correct programs?

That is a sample of the direct, literate prose that awaits you if you follow this series on category theory.
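To make the quoted contrast concrete, here is a small sketch in Python (not one of the languages Bartosz uses, so treat it as an illustration only). Python is dynamically typed, so the mismatch below is discovered only when the offending call actually runs. The names `add_bananas` and `try_call` are made up for the example.

```python
def add_bananas(count: int, extra: int) -> int:
    """Type hints document intent, but Python does not enforce them at run time."""
    return count + extra


def try_call(*args):
    """Call add_bananas and report a type mismatch instead of crashing."""
    try:
        return add_bananas(*args)
    except TypeError as err:
        return "runtime error: " + type(err).__name__


print(try_call(2, 3))    # well-typed call
print(try_call(2, "3"))  # int + str fails only when this line executes
```

A static checker such as mypy, run over a direct call like `add_bananas(2, "3")`, would report the mismatch before the program ever starts, which is the compile-time barrier the quote describes.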

Game Of Live in Clojure with Quil

Saturday, November 29th, 2014

Game Of Live in Clojure with Quil by Nazarii Bardiuk.

You already know Conway’s Game of Life. You may not know Quil, a Clojure wrapper for Processing (version 2.0 is out). Look on this as a learning opportunity.

It doesn’t take long for the Game of Life to turn into serious research so advance this work with caution. 😉
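If you need a refresher on the rules before reading the Clojure, here is a minimal sketch of a single generation step. It is written in Python rather than Clojure, it is not Nazarii's code, and it leaves out all of the Quil drawing. The board is assumed to be unbounded and is represented as a set of live `(x, y)` cells.

```python
from collections import Counter


def step(live):
    """Return the next generation for a set of live (x, y) cells."""
    # Count, for every cell adjacent to a live cell, how many live neighbors it has.
    neighbor_counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # Birth on exactly 3 neighbors; survival on 2 or 3.
    return {
        cell
        for cell, n in neighbor_counts.items()
        if n == 3 or (n == 2 and cell in live)
    }


blinker = {(0, 1), (1, 1), (2, 1)}   # horizontal bar of three cells
print(sorted(step(blinker)))         # the bar flips to vertical
print(step(step(blinker)) == blinker)
```

Using a set of coordinates instead of a fixed grid keeps the step function short and avoids edge handling, at the cost of not matching a fixed-size display.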


I first saw this in a tweet by Anna Pawlicka.

PS: I don’t know if “Live” in the title is a typo or intentional so I left it.


Saturday, November 29th, 2014

Eurotechnopanic by Jeff Jarvis.

From the post:

I worry about Germany and technology. I fear that protectionism from institutions that have been threatened by the internet — mainly media giants and government — and the perception of a rising tide of technopanic in the culture will lead to bad law, unnecessary regulation, dangerous precedents, and a hostile environment that will make technologists, investors, and partners wary of investing and working in Germany.

I worry, too, about Europe and technology. Germany’s antiprogress movement is spreading to the EU — see its court’s decision creating a so-called right to be forgotten — as well as to members of the EU — see Spain’s link tax.

I worry mostly about damage to the internet, its freedoms and its future, limiting the opportunities an open net presents to anyone anywhere. Three forces are at work endangering the net: control, protectionism, and technopanic.

Jeff pens a wonderful essay and lingers the longest on protectionism and eurotechnopanic. His essay is filled with examples both contemporary and from history. Except for EU officials who are too deep into full panic to listen, it is very persuasive.

Jeff proposes a four-part plan for Google to overcome eurotechnopanic:

  • Address eurotechnopanic at a cultural and political level
  • Invest in innovation in German startups
  • Teach and explain the benefits of sharing information
  • Share the excitement of the net and technology

I would like to second all of those points, but Jeff forgets that German economic and social stability are the antithesis of the genetic makeup of Google.

Take one of Jeff’s recommendations: Invest in innovation in German startups.

Really? Show of hands. How many people have known German startups with incompetent staff who could not be fired?

Doubtful on that score?

Terminating Employees in Germany is extremely complicated and subject to a myriad of formal and substantive requirements. Except for small businesses, employers are generally required to show cause and are not free to select whom to terminate. Social criteria such as seniority, age, and number of dependants must be considered. Termination of employees belonging to certain classes such as pregnant women and people with disabilities requires prior involvement of and approval from government agencies. If a workers’ council has been established, it must be heard prior to terminating an employee in most instances. It is good practice to involve counsel already in the preparation of any termination or layoff. Any termination or layoff will most likely trigger a lawsuit. Most judges are employee friendly and most employees have insurance to cover their attorney fees. SIEGWART GERMAN AMERICAN LAW advises employers on all issues regarding termination of employment and alternative buyout strategies in Germany.

Notice Requirements: Even if the employer can show cause, the employee must be given notice. Notice periods under German law, which can be found in statutes and collective bargaining agreements, vary depending on seniority and can be more than six months long.

Firing Employees in Germany: Employees can be fired for good cause under extraordinary circumstances. Counsel should be involved immediately since the right to fire an employee is waived if the employer does not act within two weeks. SIEGWART GERMAN AMERICAN LAW has the experience and expertise to evaluate the circumstances of your case, develop a strategy, and make sure all formal requirements are timely met. [None of this is legal advice. Drawn from:]

That’s just not the world Google lives in. Not to say innovative work doesn’t happen in Germany and the EU because it does. But that innovative work is despite the government and not fostered by it.

Google should address eurotechnopanic by relocating bright Europeans to other countries that are startup and innovation friendly. Not necessarily to the United States. The future economic stars are said to be India, China, Korea, all places with innovative spirits and good work ethics.

Eventually citizens of the EU and the German people in particular will realize they have been betrayed by people seeking to further their own careers at the expense of their citizens.

PS: I wonder how long German banking would survive if the ISPs and Telcos decided enough was enough? Parochialism isn’t something that should be long tolerated.

Cynomix Automatic Analysis, Clustering, and Indexing of Malware

Saturday, November 29th, 2014

From the description:

Malware analysts in the public and private sectors need to make sense of an ever-growing stream of malware on an ongoing basis yet the common modus operandi is to analyze each file individually, if at all.

In the current paradigm, it is difficult to quickly understand the attributes of a particular set of malware binaries and how they differ from or are similar to others in a large database, to re-use previous analyses performed on similar samples, and to collaborate with other analysts. Thus, work is carried out inefficiently and a valuable intelligence signal may be squandered.

In this webinar, you will learn about Cynomix, a web-based community malware triage tool that:

  • Creates a paradigm shift in scalable malware analysis by providing capabilities for automatic analysis, clustering, and indexing of malware
  • Uses novel machine learning and scalable search technologies
  • Provides several interactive views for exploring large data sets of malware binaries.

Visualization/analysis tool for malware. Creating a global database of malware data.

No anonymous submission of malware at present but “not keeping a lot of data” on submissions. No one asked what “not keeping a lot of data” meant exactly. There may be a gap between what is meant and what is heard by “a lot.” Currently, 35,000 instances of malware in the system. There have been as many as a million samples in the system.

Very good visualization techniques. Changes to data requests produced changes in the display of “similar” malware.

Take special note that networks/clusters change based on selection of facets. Imagine a topic map that could do the same with merging.

If you are interested in public (as opposed to secret) collecting of malware, this is an effort to support.

You can sign up for a limited beta here:

I first saw this in a tweet by Rui SFDA.

PS: You do realize that contemporary governments, like other franchises, are responsible for your cyber-insecurity. Yes?


Saturday, November 29th, 2014

VSEARCH: Open and free 64-bit multithreaded tool for processing metagenomic sequences, including searching, clustering, chimera detection, dereplication, sorting, masking and shuffling

From the webpage:

The aim of this project is to create an alternative to the USEARCH tool developed by Robert C. Edgar (2010). The new tool should:

  • have open source code with an appropriate open source license
  • be free of charge, gratis
  • have a 64-bit design that handles very large databases and much more than 4GB of memory
  • be as accurate or more accurate than usearch
  • be as fast or faster than usearch

We have implemented a tool called VSEARCH which supports searching, clustering, chimera detection, dereplication, sorting and masking (commands --usearch_global, --cluster_smallmem, --cluster_fast, --uchime_ref, --uchime_denovo, --derep_fulllength, --sortbysize, --sortbylength and --maskfasta, as well as almost all their options).

VSEARCH stands for vectorized search, as the tool takes advantage of parallelism in the form of SIMD vectorization as well as multiple threads to perform accurate alignments at high speed. VSEARCH uses an optimal global aligner (full dynamic programming Needleman-Wunsch), in contrast to USEARCH which by default uses a heuristic seed and extend aligner. This results in more accurate alignments and overall improved sensitivity (recall) with VSEARCH, especially for alignments with gaps.

The same option names as in USEARCH version 7 has been used in order to make VSEARCH an almost drop-in replacement.
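For readers who have not met Needleman-Wunsch, here is a minimal Python sketch of the dynamic programming behind an optimal global alignment score. It is not VSEARCH's implementation (which is vectorized): the scoring values (`match=1`, `mismatch=-1`) and the simple linear gap penalty are my assumptions, chosen to keep the example short.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of a and b (Needleman-Wunsch).

    Uses a linear gap penalty and keeps only two rows of the DP table,
    so it returns the score but not the alignment itself.
    """
    # Row 0: aligning the empty prefix of a against prefixes of b costs only gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(nw_score("ACGT", "ACGT"))  # four matches
print(nw_score("ACGT", "AGT"))   # three matches and one gap
```

Filling the whole table is what makes the result optimal; a seed-and-extend heuristic skips most of that work, which is the accuracy trade-off the VSEARCH page describes.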

The reconciliation of characteristics that are different is the only way that merging in topic maps varies from the clustering found in bioinformatics programs like VSEARCH. The results are a cluster of items deemed “similar” on some basis and, with topic maps, subject to further processing.

Scaling isn’t easy in bioinformatics but it hasn’t been found daunting either.

There is much to be learned from projects such as VSEARCH to inform the processing of topic maps.

I first saw this in a tweet by Torbjørn Rognes.

Python 3 Text Processing with NLTK 3 Cookbook

Friday, November 28th, 2014

Python 3 Text Processing with NLTK 3 Cookbook by Jacob Perkins.

From the post:

After many weekend writing sessions, the 2nd edition of the NLTK Cookbook, updated for NLTK 3 and Python 3, is available at Amazon and Packt. Code for the book is on github at nltk3-cookbook. Here’s some details on the changes & updates in the 2nd edition:

First off, all the code in the book is for Python 3 and NLTK 3. Most of it should work for Python 2, but not all of it. And NLTK 3 has made many backwards incompatible changes since version 2.0.4. One of the nice things about Python 3 is that it’s unicode all the way. No more issues with ASCII versus unicode strings. However, you do have to deal with byte strings in a few cases. Another interesting change is that hash randomization is on by default, which means that if you don’t set the PYTHONHASHSEED environment variable, training accuracy can change slightly on each run, because the iteration order of dictionaries is no longer consistent by default.
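The hash randomization point is easy to check for yourself. The sketch below (the helper name `string_hash_in_child` is mine) starts a fresh Python interpreter with a chosen `PYTHONHASHSEED` and reports the hash of a string. With a fixed seed the value repeats from run to run; with the variable unset it will normally differ. Note that since Python 3.7 dictionaries keep insertion order, so on current versions the effect shows up in things like set iteration order rather than dictionary order.

```python
import os
import subprocess
import sys


def string_hash_in_child(seed):
    """Return hash('nltk') as computed by a fresh interpreter.

    seed=None leaves PYTHONHASHSEED unset (randomized hashing);
    any other value is passed through as the seed.
    """
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)
    if seed is not None:
        env["PYTHONHASHSEED"] = str(seed)
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('nltk'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# Same seed, same hash, even across separate processes.
print(string_hash_in_child(0) == string_hash_in_child(0))
```

Setting the seed in the environment before training is the simplest way to get the repeatable runs the post mentions.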

It’s never too late to update your wish list! 😉


Open Access and the Humanities…

Friday, November 28th, 2014

Open Access and the Humanities: Contexts, Controversies and the Future by Martin Paul Eve.

From the description:

If you work in a university, you are almost certain to have heard the term ‘open access’ in the past couple of years. You may also have heard either that it is the utopian answer to all the problems of research dissemination or perhaps that it marks the beginning of an apocalyptic new era of ‘pay-to-say’ publishing. In this book, Martin Paul Eve sets out the histories, contexts and controversies for open access, specifically in the humanities. Broaching practical elements alongside economic histories, open licensing, monographs and funder policies, this book is a must-read for both those new to ideas about open-access scholarly communications and those with an already keen interest in the latest developments for the humanities.

Open access to a book on open access!

I was very amused by Gary F. Daught’s comment on the title:

“Open access for scholarly communication in the Humanities faces some longstanding cultural/social and economic challenges. Deep traditions of scholarly authority, reputation and vetting, relationships with publishers, etc. coupled with relatively shallow pockets in terms of funding (at least compared to the Sciences) and perceptions that the costs associated with traditional modes of scholarly communication are reasonable (at least compared to the Sciences) can make open access a hard sell. Still, there are new opportunities and definite signs of change. Among those at the forefront confronting these challenges while exploring open access opportunities for the Humanities is Martin Paul Eve.”

In part because Gary worded his description of the humanities as: “Deep traditions of scholarly authority, reputation and vetting, relationships with publishers,…” which is true, but is a nice way of saying:

Controlling access to the Dead Sea Scrolls was a great way to attract graduate students to professors and certain universities.

Controlling access to the Dead Sea Scrolls was a great way to avoid criticism of work by denying others access to the primary materials.

Substitute current access issues to data, in both the humanities and the sciences, for “Dead Sea Scrolls” and you have a similar situation.

I mention the Dead Sea Scrolls case because, after decades of restricted access retarding scholarship, the materials are more or less accessible now. The sky hasn’t fallen, newspapers aren’t filled with bad translations, and salvation hasn’t been denied (so far as we know) to anyone holding incorrect theological positions due to bad work on the Dead Sea Scrolls.

A good read but I have to differ with Martin on his proposed solution to the objection that open access has no peer review.

Unfortunately, Martin treats concerns about peer review as though they were rooted in empirical experience, such that contrary experimental results will lead to a different conclusion.

I fear that Martin overlooks that peer review is a religious belief and can no more be diminished by contrary evidence than transubstantiation. Consider all the peer review scandals you have read or heard about in the past year. Has that diminished anyone’s faith in peer review? What about the fact that in the humanities, up to 98% of all monographs remain uncited after a decade?

Assuming peer review is supposed to assure the quality of publishing, a reasonable person would conclude that the 98% of published monographs that remain uncited either weren’t worth writing or that peer review was no guarantor of their quality.

The key to open access is for publishing and funding organizations to mandate open access to data used in research and/or publication. No exceptions and no “on request” access, only deposits in open access archives.

Scholars who have assessed themselves as needing the advantages of non-open access data will be unhappy, but I can’t say that matters all that much to me.


I first saw this in a tweet by Martin Haspelmath.

505 Million Internet Censors and Growing (EU) – Web Magna Carta

Friday, November 28th, 2014

EU demands ‘right to be forgotten’ be applied globally by James Temperton.

From the post:

Google will be told to extend the “right to be forgotten” outside the EU in a move that is likely to cause a furrowing of brows at the search giant. EU regulators are looking to close a loophole in the controversial online privacy legislation that effectively renders it useless.

Currently people only have the right to be forgotten on versions of Google within the EU. Anyone wanting to see uncensored Google searches can simply use the US version of Google instead. The EU has taken a tough line against Google, expressing annoyance at its approach to removing search results.

The right to be forgotten allows people to remove outdated, irrelevant or misleading web pages from search results relating to their name. The EU will now ask Google to apply the ruling to the US version of its site, sources told Bloomberg Businessweek.

The latest demographic report shows the EU with five hundred and five million potential censors of search results with more on the way.

Not content to be an island of ignorance, the EU now wants to censor search results on behalf of the entire world.

In other words, 7.3% of the world’s population will decide what search results can be seen by the other 92.7%.
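The percentage above depends on which world population figure is used. A minimal sketch of the arithmetic follows; the 505 million figure comes from the report cited above, while the 7.2 billion world population is my assumption (a rough 2014 estimate), not a number from the post.

```python
# EU population as cited in the post (505 million).
EU_POPULATION = 505_000_000

# ASSUMPTION: a rough 2014 world population estimate, not taken from the post.
ASSUMED_WORLD_POPULATION = 7_200_000_000


def eu_share(world_population: int) -> float:
    """Return the EU's share of the world population, as a percentage."""
    return 100 * EU_POPULATION / world_population


def implied_world_population(share_percent: float) -> float:
    """Return the world population implied by a given EU percentage share."""
    return EU_POPULATION / (share_percent / 100)


if __name__ == "__main__":
    # World population that would make the EU exactly 7.3% of the total.
    print(f"{implied_world_population(7.3) / 1e9:.2f} billion")
    # EU share under the assumed world population.
    print(f"{eu_share(ASSUMED_WORLD_POPULATION):.1f}%")
```

Either way, the point stands: a small fraction of the world's population would be deciding what the rest can find.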

Tim Berners-Lee’s call for a new Magna Carta won’t help with this problem:

Berners-Lee’s Magna Carta plan is to be taken up as part of an initiative called “the web we want”, which calls on people to generate a digital bill of rights in each country – a statement of principles he hopes will be supported by public institutions, government officials and corporations.

A statement of principles? Really?

As I recall the original Magna Carta, a group of feudal lords forced King John to accept the agreement. Had the original Magna Carta been only a statement of principles, it would not be remembered today. It was the enforcement of those principles that purchased its hold on our imaginations.

So long as the Web is subject to the arbitrary and capricious demands of geographically bounded government entities, it will remain a hostage of those governments.

We have a new class of feudal lords, the international ISPs. The real question is whether they will take up a Magna Carta for the Web and enforce its terms on geographically bounded government entities.

The first provision of such a Magna Carta should not be:

1. Freedom of expression online and offline

The first provision should be:

1. No government shall interfere with the delivery of content of any nature to any location connected to the Web. (Governments are free to penalize receipt or possession of information but in no way shall hinder its transfer and delivery on the Web.)

Such that when some country decides Google or others must censor information, which is clearly interference with delivery of content, the feudal lords, in this case, ISPs, will terminate all Internet access for that country.

It will be amusing to see how long spy agencies, telephone services, banks, etc., can survive without free and unfettered access to the global network.

The global network deserves and needs a global network governance structure separate and apart from existing government infrastructures. Complete with a court system with a single set of laws and regulations, an assembly to pass laws and other structures as are needful.

Don’t look so surprised. It is a natural progression from small hamlets to larger regional governance and to the geographically bounded governments of today, which have proven themselves to be best at caring for their own rather than their citizens. A global service like the Net needs global governance and there are no existing bodies competent to take up the mantle.*

ISPs need to act as feudal lords to free themselves and by implication us, from existing government parochialism. Only then will we realize the full potential of the Web.

* You may be tempted to say the United Nations could govern the Web but consider that the five (5) permanent members of the Security Council can veto any resolution they care to block from the other one hundred and ninety three (193) members. Having the ISPs govern the Web would be about as democratic, if not more so.

The Three Breakthroughs That Have Finally Unleashed AI on the World

Thursday, November 27th, 2014

The Three Breakthroughs That Have Finally Unleashed AI on the World by Kevin Kelly.

I was attracted to this post by a tweet from Diana Zeaiter Joumblat which read:

How parallel computing, big data & deep learning algos have put an end to the #AI winter

Almost a decade ago, while riding to lunch with a doctoral student in computer science, I heard how their department was known as “human-centered computing” because AI had gotten such a bad name. In their view, the AI winter was about to end.

I was quite surprised as I remembered the AI winter of the 1970’s. 😉

The purely factual observations by Kevin in this article are all true, but I would not fret too much about:

As it does, this cloud-based AI will become an increasingly ingrained part of our everyday life. But it will come at a price. Cloud computing obeys the law of increasing returns, sometimes called the network effect, which holds that the value of a network increases much faster as it grows bigger. The bigger the network, the more attractive it is to new users, which makes it even bigger, and thus more attractive, and so on. A cloud that serves AI will obey the same law. The more people who use an AI, the smarter it gets. The smarter it gets, the more people use it. The more people that use it, the smarter it gets. Once a company enters this virtuous cycle, it tends to grow so big, so fast, that it overwhelms any upstart competitors. As a result, our AI future is likely to be ruled by an oligarchy of two or three large, general-purpose cloud-based commercial intelligences.

I am very doubtful of: “The more people who use an AI, the smarter it gets.”

As we have seen from the Michael Brown case, the more people who comment on a subject, the less is known about it. Or at least what is known gets lost in a tide of information that is non-factual but stated as factual.

The assumption that the current AI boom will crash upon is the assumption that accurate knowledge can be obtained in all areas. Some, like chess, sure, that can happen. Do we know all the factors at play between the police and the communities they serve?

AIs can help with medicine, but considering what we don’t know about the human body and medicine, taking a statistical guess at the best treatment isn’t reasoning, it is a better betting window.

I am all for pushing AIs where they are useful, while being ever mindful that an AI has no more operations than my father’s mechanical pocket calculator, which I remember from childhood. Impressive, but that’s not the equivalent of intelligence.

A Docker Image for Graph Analytics on Neo4j with Apache Spark GraphX

Thursday, November 27th, 2014

A Docker Image for Graph Analytics on Neo4j with Apache Spark GraphX by Kenny Bastani.

From the post:

I’ve just released a useful new Docker image for graph analytics on a Neo4j graph database with Apache Spark GraphX. This image deploys a container with Apache Spark and uses GraphX to perform ETL graph analysis on subgraphs exported from Neo4j. This docker image is a great addition to Neo4j if you’re looking to do easy PageRank or community detection on your graph data. Additionally, the results of the graph analysis are applied back to Neo4j.

This gives you the ability to optimize your recommendation-based Cypher queries by filtering and sorting on the results of the analysis.

This rocks!

If you were looking for an excuse to investigate Docker or Spark or GraphX or Neo4j, it has arrived!
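For anyone unsure what the PageRank step in the quoted post computes, here is a minimal sketch in plain Python. It is a textbook power iteration, not the GraphX implementation and not Kenny's code; the tiny edge list, the damping factor of 0.85 and the iteration count are all my own assumptions for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over (source, target) edge pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1 - damping) / count + damping * dangling / count
        new_rank = {node: base for node in nodes}
        # Each node passes an equal share of its rank along every outgoing edge.
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical four-node graph, standing in for a subgraph exported from Neo4j.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(edges)

# Sorting on the score is the kind of ordering the post describes doing in Cypher.
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 4))
```

Writing scores like these back to the nodes is what makes it possible to filter and sort on them in later queries.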


Advertising 101 – Spyware

Thursday, November 27th, 2014

Lisa Vaas has some basic advertising advice for spyware manufacturers/vendors:

To adhere to the legal side of the line, monitoring apps have to be marketed at employers who want to keep an eye on their workers, or guardians who want to watch over their kids.

From: Spyware app StealthGenie’s CEO fined $500K, forfeits source code

$500K is a pretty good pop at the start of the holiday season.

For further background on the story, see Lisa’s other story on this: Head of ‘StealthGenie’ mobile stalking app indicted for selling spyware and the Federal proceedings proper.

Be careful how you advertise!

Neo4j 2.1.6 (release)

Wednesday, November 26th, 2014

Neo4j 2.1.6 (release)

From the post:

Neo4j 2.1.6 is a maintenance release, with critical improvements.

Notably, this release:

  • Resolves a critical shutdown issue, whereby IO errors were not always handled correctly and could result in inconsistencies in the database due to failure to flush outstanding changes.
  • Significantly reduces the file handle requirements for the lucene based indexes.
  • Resolves an issue in consistency checking, which could falsely report store inconsistencies.
  • Extends the Java API to allow the degree of a node to be easily obtained (the count of relationships, by type and direction).
  • Resolves a significant performance degradation that affected the loading of relationships for a node during traversals.
  • Resolves a backup issue, which could result in a backup store that would not load correctly into a clustered environment (Neo4j Enterprise).
  • Corrects a clustering issue that could result in the master failing to resume its role after an outage of a majority of slaves (Neo4j Enterprise).

All Neo4j 2.x users are recommended to upgrade to this release. Upgrading to Neo4j 2.1, from Neo4j 1.9.x or Neo4j 2.0.x, requires a migration to the on-disk store and can not be reversed. Please ensure you have a valid backup before proceeding, then use on a test or staging server to understand any changed behaviors before going into production.

Neo4j 1.9 users may upgrade directly to this release, and are recommended to do so carefully. We strongly encourage verifying the syntax and validating all responses from your Cypher scripts, REST calls, and Java code before upgrading any production system. For information about upgrading from Neo4j 1.9, please see our Upgrading to Neo4j 2 FAQ.

For a full summary of changes in this release, please review the CHANGES.TXT file contained within the distribution.


As with all software upgrades, do not delay until the day before you are leaving on holiday!

Spark, D3, data visualization and Super Cow Powers

Wednesday, November 26th, 2014

Spark, D3, data visualization and Super Cow Powers by Mateusz Fedoryszak.

From the post:

Did you know that the amount of milk given by a cow depends on the number of days since its last calving? A plot of this correlation is called a lactation curve. Read on to find out how do we use Apache Spark and D3 to find out how much milk we can expect on a particular day.

[image omitted: lactation chart]

There are things that, except at a client’s request, I have never been curious about. 😉

How are you using Spark?

I first saw this in a tweet by Anna Pawlicka.

NSA partners with Apache to release open-source data traffic program

Tuesday, November 25th, 2014

NSA partners with Apache to release open-source data traffic program by Steven J. Vaughan-Nichols.

From the post:

Many of you probably think that the National Security Agency (NSA) and open-source software get along like a house on fire. That's to say, flaming destruction. You would be wrong.

[image and link omitted]

In partnership with the Apache Software Foundation, the NSA announced on Tuesday that it is releasing the source code for Niagarafiles (Nifi). The spy agency said that Nifi "automates data flows among multiple computer networks, even when data formats and protocols differ".

Details on how Nifi does this are scant at this point, while the ASF continues to set up the site where Nifi's code will reside.

In a statement, Nifi's lead developer Joseph L Witt said the software "provides a way to prioritize data flows more effectively and get rid of artificial delays in identifying and transmitting critical information".

I don’t doubt the NSA efforts at open source software. That isn’t saying anything about how closely the code would need to be proofed.

Perhaps encouraging more open source projects from the NSA will eat into the time they have to spend writing malware. 😉

Something to look forward to!

Falsehoods Programmers Believe About Names

Tuesday, November 25th, 2014

Falsehoods Programmers Believe About Names by Patrick McKenzie.

From the post:

John Graham-Cumming wrote an article today complaining about how a computer system he was working with described his last name as having invalid characters. It of course does not, because anything someone tells you is their name is — by definition — an appropriate identifier for them. John was understandably vexed about this situation, and he has every right to be, because names are central to our identities, virtually by definition.

I have lived in Japan for several years, programming in a professional capacity, and I have broken many systems by the simple expedient of being introduced into them. (Most people call me Patrick McKenzie, but I’ll acknowledge as correct any of six different “full” names, and many systems I deal with will accept precisely none of them.) Similarly, I’ve worked with Big Freaking Enterprises which, by dint of doing business globally, have theoretically designed their systems to allow all names to work in them. I have never seen a computer system which handles names properly and doubt one exists, anywhere.

So, as a public service, I’m going to list assumptions your systems probably make about names. All of these assumptions are wrong. Try to make less of them next time you write a system which touches names.

McKenzie has an admittedly incomplete list of forty (40) myths for people’s names.

If there are that many for people’s names, I wonder what the count is for all other subjects?

Including things on the Internet of Things?
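To make the complaint in the quoted post concrete, here is a minimal sketch of the kind of check that rejects a name like Graham-Cumming, next to a more permissive one. The pattern and both function names are hypothetical, and apart from the two surnames taken from the post, the sample names are my own illustrations. Even the permissive version makes an assumption (that a name is non-blank) that McKenzie's list would likely challenge.

```python
import re

# A naive pattern of the kind the post warns about: ASCII letters only.
NAIVE_NAME = re.compile(r"^[A-Za-z]+$")


def naive_is_valid(name: str) -> bool:
    """Reject anything that is not a run of unaccented ASCII letters."""
    return bool(NAIVE_NAME.match(name))


def permissive_is_valid(name: str) -> bool:
    """Accept whatever the person supplies, rejecting only blank input.

    ASSUMPTION: requiring a non-blank value is itself a guess about names.
    """
    return bool(name.strip())


# "McKenzie" and "Graham-Cumming" come from the post; the rest are illustrative.
samples = ["McKenzie", "Graham-Cumming", "O'Brien", "山田"]
for name in samples:
    print(name, naive_is_valid(name), permissive_is_valid(name))
```

The naive check passes only the first sample, which is the post's point: every rule added to "validate" a name is another assumption waiting to be wrong.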

I first saw this in a tweet by OnePaperPerDay.

Announcing Apache Pig 0.14.0

Tuesday, November 25th, 2014

Announcing Apache Pig 0.14.0 by Daniel Dai.

From the post:

With YARN as its architectural center, Apache Hadoop continues to attract new engines to run within the data platform, as organizations want to efficiently store their data in a single repository and interact with it simultaneously in different ways. Apache Tez supports YARN-based, high performance batch and interactive data processing applications in Hadoop that need to handle datasets scaling to terabytes or petabytes.

The Apache community just released Apache Pig 0.14.0, and the main feature is Pig on Tez. In this release, we closed 334 Jira tickets from 35 Pig contributors. Specific credit goes to the virtual team consisting of Cheolsoo Park, Rohini Palaniswamy, Olga Natkovich, Mark Wagner and Alex Bain who were instrumental in getting Pig on Tez working!

This blog gives a brief overview of Pig on Tez and other new features included in the release.

Pig on Tez

Apache Tez is an alternative execution engine focusing on performance. It offers a more flexible interface so Pig can compile into a better execution plan than is possible with MapReduce. The result is consistent performance improvements in both large and small queries.

Since it is the Thanksgiving holiday this week in the United States, this release reminds me to ask why is turkey the traditional Thanksgiving meal? Everyone likes bacon better. 😉

Bye-bye Giraph-Gremlin, Hello Hadoop-Gremlin with GiraphGraphComputer Support

Tuesday, November 25th, 2014

Bye-bye Giraph-Gremlin, Hello Hadoop-Gremlin with GiraphGraphComputer Support by Marko A. Rodriguez.

There are days when I wonder if Marko ever sleeps or if the problem of human cloning has already been solved.

This is one of those days:

The other day Dan LaRocque and I were working on a Hadoop-based GraphComputer for Titan so we could do bulk loading into Titan. First we wrote the BulkLoading VertexProgram: bulkloader/
…and then realized, “huh, we can just execute this with GiraphGraph. Huh! We can just execute this with TinkerGraph!” In fact, as a side note, the BulkLoaderVertexProgram is general enough to work for any TinkerPop Graph.

So great, we can just use GiraphGraph (or any other TinkerPop implementation that has a GraphComputer (e.g. TinkerGraph)). However, Titan is all about scale and when the size of your graph is larger than the total RAM in your cluster, we will still need a MapReduce-based GraphComputer. Thinking over this, it was realized: Giraph-Gremlin is very little Giraph and mostly just Hadoop — InputFormats, HDFS interactions, MapReduce wrappers, Configuration manipulations, etc. Why not make GiraphGraphComputer just a particular GraphComputer supported by Gremlin-Hadoop (a new package).

With that, Giraph-Gremlin no longer exists. Hadoop-Gremlin now exists. Hadoop-Gremlin behaves the exact same way as Giraph-Gremlin, save that we will be adding a MapReduceGraphComputer to Hadoop-Gremlin. In this way, Hadoop-Gremlin will support two GraphComputer: GiraphGraphComputer and MapReduceGraphComputer.

The master/ branch is updated and the docs for Giraph have been re-written, though I suspect there will be some dangling references in the docs here and there for a while.

Up next, Matthias and I will create MapReduceGraphComputer that is smart about “partitioned vertices” — so you don’t get the Faunus scene where if a vertex doesn’t fit in memory, an exception. This will allow vertices with as many edges as you want (though your data model is probably shotty if you have 100s of millions of edges on one vertex 😉 ……………….. Matthias will be driving that effort and I’m excited to learn about the theory of vertex partitioning (i.e. splitting a single vertex across machines).


Ferguson Municipal Public Library

Tuesday, November 25th, 2014

Ferguson Municipal Public Library

Ashley Ford tweeted that donations should be made to the Ferguson Municipal Public Library.

While schools are closed in Ferguson, the library has stayed open and has been a safe refuge.

Support the Ferguson Municipal Public Library as well as your own.

Libraries are where our tragedies, triumphs, and history live on for future generations.

Treasury Island: the film

Tuesday, November 25th, 2014

Treasury Island: the film by Lauren Willmott, Boyce Keay, and Beth Morrison.

From the post:

We are always looking to make the records we hold as accessible as possible, particularly those which you cannot search for by keyword in our catalogue, Discovery. And we are experimenting with new ways to do it.

The Treasury series, T1, is a great example of a series which holds a rich source of information but is complicated to search. T1 covers a wealth of subjects (from epidemics to horses) but people may overlook it as most of it is only described in Discovery as a range of numbers, meaning it can be difficult to search if you don’t know how to look. There are different processes for different periods dating back to 1557 so we chose to focus on records after 1852. Accessing these records requires various finding aids and multiple stages to access the papers. It’s a tricky process to explain in words so we thought we’d try demonstrating it.

We wanted to show people how to access these hidden treasures, by providing a visual aid that would work in conjunction with our written research guide. Armed with a tablet and a script, we got to work creating a video.

Our remit was:

  • to produce a video guide no more than four minutes long
  • to improve accessibility to these records through a simple, step-by–step process
  • to highlight what the finding aids and documents actually look like

These records can be useful to a whole range of researchers, from local historians to military historians to social historians, given that virtually every area of government action involved the Treasury at some stage. We hope this new video, which we intend to be watched in conjunction with the written research guide, will also be of use to any researchers who are new to the Treasury records.

Adding video guides to our written research guides are a new venture for us and so we are very keen to hear your feedback. Did you find it useful? Do you like the film format? Do you have any suggestions or improvements? Let us know by leaving a comment below!

This is a great illustration that data management isn’t something new. The Treasury Board has kept records since 1557 and has accumulated a rather extensive set of materials.

The written research guide looks interesting but since I am very unlikely to ever research Treasury Board records, I am unlikely to need it.

However, the authors have anticipated that someone might be interested in the process of record keeping itself and so provided this additional reference:

Thomas L Heath, The Treasury (The Whitehall Series, 1927, GP Putnam’s Sons Ltd, London and New York)

That would be an interesting find!

I first saw this in a tweet by Andrew Janes.

Datomic 0.9.5078 now available

Tuesday, November 25th, 2014

Datomic 0.9.5078 now available by Ben Kamphaus.

From the post:

This message covers changes in this release. For a summary of critical release notices, see

The Datomic Team recommends that you always take a backup before adopting a new release.

## Changed in 0.9.5078

  • New CloudWatch metrics: `WriterMemcachedPutMusec`, `WriterMemcachedPutFailedMusec` `ReaderMemcachedPutMusec` and `ReaderMemcachedPutFailedMusec` track writes to memcache. See
  • Improvement: Better startup performance for databases using fulltext.
  • Improvement: Enhanced the Getting Started examples to include the Pull API and find specifications.
  • Improvement: Better scheduling of indexing jobs during bursty transaction volumes
  • Fixed bug where Pull API could incorrectly return renamed attributes.
  • Fixed bug that caused `db.fn/cas` to throw an exception when `false` was passed as the new value.
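The last fix in the list is a reminder that `false` is a perfectly legitimate value to swap in. The release notes do not say what caused the bug, so the following is only a generic sketch of compare-and-swap semantics in Python, not Datomic's implementation; the `cas` function, the `_MISSING` sentinel and the dictionary store are all hypothetical. It shows one general way to keep falsy values legal: test for "no value supplied" with a sentinel rather than with truthiness.

```python
# Sentinel meaning "no new value was supplied" (hypothetical helper).
_MISSING = object()


def cas(store: dict, key: str, expected, new=_MISSING) -> dict:
    """Set store[key] to new only if its current value equals expected."""
    # An identity check against the sentinel keeps False, None and 0 legal.
    # A truthiness test such as `if not new:` would wrongly reject them.
    if new is _MISSING:
        raise ValueError("a new value is required")
    current = store.get(key)
    if current != expected:
        raise ValueError(f"expected {expected!r}, found {current!r}")
    store[key] = new
    return store


if __name__ == "__main__":
    flags = {"active": True}
    cas(flags, "active", True, False)  # swapping in False must succeed
    print(flags)
```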

In case you haven’t walked through Datomic, you really should.

Here is one example why:

Next download the subset of the mbrainz database covering the period 1968-1973 (which the Datomic team has scientifically determined as being the most important period in the history of recorded music): [From:]

Truer words were never spoken! 😉


New York Times API extractor and Google Maps visualization (Wandora Tutorial)

Tuesday, November 25th, 2014

New York Times API extractor and Google Maps visualization (Wandora Tutorial)

From the description:

Video reviews the New York Times API extractor, the Google Maps visualization, and the graph visualization of Wandora application. The extractor is used to collect event data which is then visualized on a map and as a graph. Wandora is an open source tool for people who collect and process information, especially networked knowledge and knowledge about WWW resources. For more information see

This is impressive, although the UI may have more options than MS Word. 😉 (It may not, I haven’t counted every way to access every option.)

Here is the result that was obtained by use of drop down menus and selecting:

[image omitted: Wandora event map]

The Times logo marks events extracted from the New York Times and merged for display with Google Maps.

Not technically difficult but it is good to see a function of interest to ordinary users in a topic map application.

I have the latest release of Wandora. Need to start walking through the features.