## TextBlob: Simplified Text Processing

December 5th, 2013

TextBlob: Simplified Text Processing

From the webpage:

TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

….

TextBlob stands on the giant shoulders of NLTK and pattern, and plays nicely with both.

Features

• Noun phrase extraction
• Part-of-speech tagging
• Sentiment analysis
• Classification (Naive Bayes, Decision Tree)
• Tokenization (splitting text into words and sentences)
• Word and phrase frequencies
• Parsing
• n-grams
• Word inflection (pluralization and singularization) and lemmatization
• Spelling correction
• JSON serialization
• Add new models or languages through extensions
• WordNet integration

Knowing that TextBlob plays well with NLTK is a big plus!
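The page doesn't show TextBlob's API, but the flavor of features like tokenization, word frequencies, and n-grams can be sketched in plain Python. This is a stdlib-only approximation of that style of API, not TextBlob's actual implementation:

```python
import re
from collections import Counter

def words(text):
    """Tokenize: split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_counts(text):
    """Word frequencies over the token stream."""
    return Counter(words(text))

def ngrams(text, n=2):
    """Sliding n-grams over the token stream."""
    w = words(text)
    return [tuple(w[i:i + n]) for i in range(len(w) - n + 1)]

text = "TextBlob stands on the giant shoulders of NLTK and pattern."
print(word_counts(text).most_common(2))
print(ngrams(text, 2)[:2])
```

TextBlob wraps exactly this kind of operation behind a single `TextBlob("...")` object, with NLTK doing the heavy lifting underneath.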

## InfluxDB

December 5th, 2013

InfluxDB

From the webpage:

An open-source, distributed, time series, events, and metrics database with no external dependencies.

Time Series

Everything in InfluxDB is a time series that you can perform standard functions on like min, max, sum, count, mean, median, percentiles, and more.

Metrics

Scalable metrics that you can collect on any interval, computing rollups on the fly later. Track 100 metrics or 1 million, InfluxDB scales horizontally.

Events

InfluxDB’s data model supports arbitrary event data. Just write in a hash of associated data and count events, uniques, or grouped columns on the fly later.

The overview page gives some greater detail:

When we built Errplane, we wanted the data model to be flexible enough to store events like exceptions along with more traditional metrics like response times and server stats. At the same time we noticed that other companies were also building custom time series APIs on top of a database for analytics and metrics. Depending on the requirements these APIs would be built on top of a regular SQL database, Redis, HBase, or Cassandra.

We thought the community might benefit from the work we’d already done with our scalable backend. We wanted something that had the HTTP API built in that would scale out to billions of metrics or events. We also wanted something that would make it simple to query for downsampled data, percentiles, and other aggregates at scale. Our hope is that once there’s a standard API, the community will be able to build useful tooling around it for data collection, visualization, and analysis.

While phrased as tracking server stats and events, I suspect InfluxDB would be just as happy tracking other types of stats or events.

Say, for instance, the “I’m alive” messages your cellphone sends to the local towers.
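The “standard functions” the page mentions (min, max, sum, count, mean, median, percentiles) are ordinary aggregates over timestamped values. A rollup over any such stream can be sketched in a few lines of Python; this is purely illustrative and has nothing to do with InfluxDB's query language:

```python
import statistics

def percentile(values, p):
    """Nearest-rank percentile -- one common definition among several."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

def rollup(points):
    """Aggregate (timestamp, value) samples the way a time series
    query would: min, max, sum, count, mean, median, percentile."""
    values = [v for _, v in points]
    return {
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "p90": percentile(values, 90),
    }

# e.g. response times in ms, sampled once per second
print(rollup([(0, 120), (1, 80), (2, 200), (3, 100)]))
```

What a time series database adds over this sketch is doing such rollups continuously, over billions of points, at arbitrary intervals.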

I first saw this in Nat Torkington’s Four short links: 5 November 2013.

## SICP in Clojure – Update

December 5th, 2013

In my post, SICP in Clojure, I incorrectly identified Steve Deobald as the maintainer of this project.

The original maintainer of the project placed a link on the site saying that Steve is the maintainer.

That is not correct.

Apologies to Steve and apologies to my readers who were hopeful this project would be going forward.

Any thoughts on moving this project forward?

I think the idea is a very sound one.

PS: Unlike many media outlets, I think corrections should be as prominent as the original mistakes.

## On Self-Licking Ice Cream Cones

December 5th, 2013

On Self-Licking Ice Cream Cones by Pete Worden (1992).

Ben Brody in The definitive glossary of modern US military slang quotes the following definition for a Self-Licking Ice Cream Cone:

A military doctrine or political process that appears to exist in order to justify its own existence, often producing irrelevant indicators of its own success. For example, continually releasing figures on the amount of Taliban weapons seized, as if there were a finite supply of such weapons. While seizing the weapons, soldiers raid Afghan villages, enraging the residents and legitimizing the Taliban’s cause.

The Wikipedia article Self-licking ice cream cone reports that the phrase was first used by Pete Worden in “On Self-Licking Ice Cream Cones” in 1992 to describe the NASA bureaucracy.

The keywords for the document are: Ice Cream Cones; Pork; NASA; Mafia; Congress.

Birds of a feather I would say.

Worden isolates several problems:

Problems, National, The Budget Process

This unfortunate train of events has resulted in a NASA which, more than any other agency, believes it works only for the appropriations committees. The senior staff of those committees, who have little interest in science or space, effectively run NASA. NASA senior officials’ noses are usually found at waist level near those committee staffers.

Problems, Closer to Home, NASA

“The Self-Licking Ice Cream Cone”

Since NASA effectively works for the most porkish part of Congress, it is not surprising that their programs are designed to maximize and perpetuate jobs programs in key Congressional districts. The Space Shuttle-Space Station is an outrageous example. Almost two-thirds of NASA’s budget is tied up in this self-licking program. The Shuttle is an unbelievably costly way to get to space at \$1 billion a pop. The Space Station is a silly design. Yet, this Station is designed so it can only be built by the Shuttle and the Shuttle is the only way to construct the Station….

“Inmates Running the Asylum”

NASA’s vaunted “peer review” process is not a positive factor, but an example of the “pork” mentality within the scientific community. It results in needlessly complex programs whose primary objective is not putting instruments in orbit, but maximizing the number of constituencies and investigators, thereby maximizing the political invulnerability of the program….

“Mafia Tactics”

…The EOS is a case in point. About a year ago, encouraged by criticism from some quarters of Congress and in the press, some scientists and satellite contractors began proposing small, cheap, near-term alternatives to the EOS “battlestars.” Senior NASA officials conducted, with impunity, an unbelievable campaign of threats against these critics. Members of the White House advisory committees were told they would not get NASA funding if they continued to probe the program….

“Shoot the Sick Horses, and their Trainers”

It is outrageous that the Hubble disaster resulted in no repercussions. All we hear is that some un-named technician, no longer working for the contractor, made a mistake in the early 1980s. Even in the Defense Department, current officials would lose their jobs over allowing such an untested and expensive system to be launched.

Compare Worden’s complaints to the security apparatus represented by the NSA and its kin.

Have you heard of any repercussions for any of the security failures and/or outrages?

Is there any doubt that the security apparatus exists solely to perpetuate the security apparatus?

By definition the NSA is a Self-Licking Ice Cream Cone.

Time to find a trash can.

Hubble: The Hubble Space Telescope Optical Systems Failure Report (pdf). Long before all the dazzling images from Hubble, it spent several years as virtually orbiting space junk.

## U.S. Military Slang

December 5th, 2013

The definitive glossary of modern US military slang by Ben Brody.

From the post:

It’s painful for US soldiers to hear discussions and watch movies about modern wars when the dialogue is full of obsolete slang, like “chopper” and “GI.”

Slang changes with the times, and the military’s is no different. Soldiers fighting the wars in Iraq and Afghanistan have developed an expansive new military vocabulary, taking elements from popular culture as well as the doublespeak of the military industrial complex.

The US military drawdown in Afghanistan — which is underway but still awaiting the outcome of a proposed bilateral security agreement — is often referred to by soldiers as “the retrograde,” which is an old military euphemism for retreat. Of course the US military never “retreats” — rather it conducts a “tactical retrograde.”

This list is by no means exhaustive, and some of the terms originated prior to the wars in Afghanistan and Iraq. But these terms are critical to speaking the current language of soldiers, and understanding it when they speak to others. Please leave anything you think should be included in the comments.

Useful for documents that contain U.S. military slang, such as the Afghanistan War Diary.

As Ben notes at the outset, language changes over time so validate any vocabulary against your document/data set.
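That closing advice — validate any vocabulary against your document set — is mechanical to act on. A hedged sketch, using hypothetical glossary entries rather than Ben Brody's actual list:

```python
def validate_vocabulary(glossary, documents):
    """Split a glossary into terms that actually occur in the corpus
    (with counts) and terms that never do, so stale entries stand out."""
    # crude substring matching; real use wants proper tokenization
    corpus = " ".join(doc.lower() for doc in documents)
    counts = {term: corpus.count(term.lower()) for term in glossary}
    present = {t: n for t, n in counts.items() if n > 0}
    absent = [t for t, n in counts.items() if n == 0]
    return present, absent

# hypothetical entries and documents, for illustration only
glossary = ["retrograde", "chopper", "fobbit"]
docs = ["The retrograde is underway.", "Soldiers left the outpost."]
present, absent = validate_vocabulary(glossary, docs)
print(present)  # terms confirmed in the corpus, with counts
print(absent)   # candidates for retirement from the glossary
```

Run against something like the Afghanistan War Diary, the absent list would tell you which slang has already gone the way of “chopper” and “GI.”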

## Geoff (update)

December 5th, 2013

Geoff

My prior post on Geoff pointed to a page about Geoff that no longer exists. I have updated that post to point to the new location.

Geoff is a text-based interchange format for Neo4j graph data that should be instantly readable to anyone familiar with Cypher, on which its syntax is based.

## N*SQL Matters @Barcelona, Spain Slides!

December 5th, 2013

N*SQL Matters @Barcelona, Spain Slides!

Slides are up today, and videos are said to be coming soon!

By Title:

• API Analytics with Redis and Bigquery, Javier Ramirez view the slides
• ArangoDB – a different approach to NoSQL, Lucas Dohmen view the slides
• Big Memory Scale-in vs. Scale-out, Niklas Bjorkman view the slides
• Bringing NoSQL to your mobile!, Patrick Heneise view the slides
• Building information systems using rapid application development methods, Michel Müller view the slides
• A call for sanity in NoSQL, Nathan Marz view the slides
• Cicerone: A Real-Time social venue recommender, Daniel Villatoro view the slides
• Database History from Codd to Brewer and Beyond, Doug Turnbull view the slides
• DynamoDB – on-demand NoSQL scaling as a service, Steffen Krause view the slides
• Getting down and dirty with Elasticsearch, Clinton Gormley view the slides
• Harnessing the Internet of Things with NoSQL, Michael Hausenblas view the slides
• How to survive in a BASE world, Uwe Friedrichsen view the slides
• Introduction to Graph Databases, Stefan Armbruster view the slides
• A Journey through the MongoDB Internals, Christian Kvalheim view the slides
• Killing pigs and saving Danish bacon with Riak, Joel Jacobsen view the slides
• Lambdoop, a framework for easy development of Big Data applications, Rubén Casado view the slides
• NoSQL Infrastructure, David Mytton view the slides
• Realtime visitor analysis with Couchbase and Elasticsearch, Jeroen Reijn view the slides
• SAMOA: A Platform for Mining Big Data Streams, Gianmarco De Francisci Morales view the slides
• Splout SQL: Web-latency SQL View for Hadoop, Iván de Prado view the slides
• Sprayer: low latency, reliable multichannel messaging for Telefonica Digital, Pablo Enfedaque and Javier Arias
By Presenter:

• Armbruster, Stefan – Introduction to Graph Databases view the slides
• Bjorkman, Niklas – Big Memory – Scale-in vs. Scale-out view the slides
• Casado, Rubén – Lambdoop, a framework for easy development of Big Data applications view the slides
• Dohmen, Lucas – ArangoDB – a different approach to NoSQL view the slides
• Enfedaque, Pablo and Javier Arias – Sprayer: low latency, reliable multichannel messaging for Telefonica Digital view the slides
• Friedrichsen, Uwe – How to survive in a BASE world view the slides
• Gormley, Clinton – Getting down and dirty with Elasticsearch view the slides
• Hausenblas, Michael – Harnessing the Internet of Things with NoSQL view the slides
• Heneise, Patrick – Bringing NoSQL to your mobile! view the slides
• Jacobsen, Joel – Killing pigs and saving Danish bacon with Riak view the slides
• Krause, Steffen – DynamoDB – on-demand NoSQL scaling as a service view the slides
• Kvalheim, Christian – A Journey through the MongoDB Internals view the slides
• Marz, Nathan – A call for sanity in NoSQL view the slides
• Morales, Gianmarco De Francisci – SAMOA: A Platform for Mining Big Data Streams view the slides
• Müller, Michel – Building information systems using rapid application development methods view the slides
• Mytton, David – NoSQL Infrastructure view the slides
• Prado, Iván de – Splout SQL: Web-latency SQL View for Hadoop view the slides
• Ramirez, Javier – API Analytics with Redis and Bigquery view the slides
• Reijn, Jeroen – Realtime visitor analysis with Couchbase and Elasticsearch view the slides
• Turnbull, Doug – Database History from Codd to Brewer and Beyond view the slides
• Villatoro, Daniel – Cicerone: A Real-Time social venue recommender view the slides

I will update these with the videos when they are posted.

Enjoy!

## Apache CouchDB Conf Vancouver Videos!

December 5th, 2013

Apache CouchDB Conf Vancouver Videos!

By Title:

By Presenter:

Enjoy!

## Latest NSA Fire Storm

December 5th, 2013

Among the many places you can read about the latest Edward Snowden disclosures, NSA tracking cellphone locations worldwide, Snowden documents show by Barton Gellman and Ashkan Soltani, Washington Post, December 4, 2013, reads in part:

The National Security Agency is gathering nearly 5 billion records a day on the whereabouts of cellphones around the world, according to top-secret documents and interviews with U.S. intelligence officials, enabling the agency to track the movements of individuals — and map their relationships — in ways that would have been previously unimaginable.

The records feed a vast database that stores information about the locations of at least hundreds of millions of devices, according to the officials and the documents, which were provided by former NSA contractor Edward Snowden. New projects created to analyze that data have provided the intelligence community with what amounts to a mass surveillance tool.

And among the many denunciations of NSA activities, the American Library Association:

Nation’s Libraries Warn of NSA’s ‘Ravenous Hunger’ for Data

“We don’t want [library patrons] being surveilled because that will inhibit learning, and reading, and creativity,” said Alan Inouye of the American Library Association

- Andrea Germanos, staff writer

A quick search on Twitter led to several hundred tweets, with updates in the double digits every 30 seconds or so.

The general tenor being surprise (which I don’t understand) and outrage (that I do understand).

What is missing from the discussion is what to do to correct the situation.

Quite recently we all learned that Minuteman missiles had their launch codes set to 00000000, despite direct presidential orders to the contrary.

I take that as evidence, along with the history of the NSA, that passing laws to regulate an agency that is without effective supervision is an exercise in futility.

Any assurance from the NSA that they are obeying U.S. laws is incapable of public verification and therefore should be presumed to be false.

The only effective means to limit NSA activities is to limit the NSA.

Let me repeat that: The only effective means to limit NSA activities is to limit the NSA.

We only have the NSA’s word that it has played an important role in protecting the U.S. from terrorists.

How can we test that tale?

My suggestion is that we defund the NSA for a period of not less than five years. No transfer of data, equipment or personnel. None.

If, during the next five years, U.S.-based terrorism increases and proponents have a plausible plan for a new NSA, then we can reconsider it.

If there is, as is likely, no increase in U.S.-based terrorism, we can avoid the expense of a rogue agency with its own agenda.

PS: I would not worry about the fates of NSA staff/contractors. There are a number of high tech surveillance opportunities in the People’s Republic of China. Plus they have a form of government more suited to current NSA staff.

## Ekisto

December 5th, 2013

Ekisto comes from ekistics, the science of human settlements.

Ekisto is an interactive visualization of three online communities: StackOverflow, Github and Friendfeed. Ekisto tries to imagine and map our online habitats using graph algorithms and the city as a metaphor.

A graph layout algorithm arranges users in 2D space based on their similarity. Cosine similarity is computed based on the users’ network (Friendfeed), collaborate, watch, fork and follow relationships (Github), or based on the tags of posts contributed by users (StackOverflow). The height of each user represents the normalized value of the user’s Pagerank (Github, Friendfeed) or their reputation points (StackOverflow).

A project by Alex Dragulescu.
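The cosine similarity driving the layout is a standard measure. Over tag counts (the StackOverflow case) it can be sketched as follows — an illustration of the measure, not Ekisto's actual code:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors given as
    {feature: weight} dicts; near 1.0 means nearly identical direction."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# users described by the tags of their posts (invented examples)
u1 = {"python": 3, "nlp": 1}
u2 = {"python": 3, "nlp": 1}
u3 = {"haskell": 2}
print(cosine_similarity(u1, u2))  # close to 1.0: identical tag profiles
print(cosine_similarity(u1, u3))  # 0.0: no tags in common
```

The same function works unchanged on follow/fork relationships: treat each relationship as a feature with weight 1.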

The three communities modeled are:

• stackoverflow.jul.2013
• github.mar.2012
• friendfeed.feb.2012

StackOverflow can be searched by name, but Github and FriendFeed only by userid, which makes following a particular user from one community to the next almost impossible.

I mention that because we all participate in many different communities and our roles and even status may vary widely from community to community.

Any one community view is an incomplete view of that person.

Beyond the need to map across communities, the other takeaway from Ekisto is the question of community formation.

That is, given the present snapshot of these communities, how did they evolve over time? Did particular people joining have a greater impact than others? Did some event trigger a rise in membership?

I first saw this in a tweet by Neil Saunders.

## German Digital Library releases API

December 5th, 2013

German Digital Library releases API by Lieke Ploeger.

From the post:

Last month the German Digital Library (Deutsche Digitale Bibliothek – DDB) made a promising step forward toward further opening up their data by releasing its API (Application Programming Interface) to the public. This API provides access to all the metadata of the DDB released under a CC0 license, which is the predominant share. The release of this API opens up a wide range of possibilities for users to build applications, create combinations with other data or include the German digitised cultural heritage on other platforms. In the future, the DDB also plans to organize a programming competition for API applications as well as a series of workshops for developers.

The official press release.

Technical documentation on the API (German).

A good excuse for you to brush up on your German. Besides, not all of it is in German.

## Free Language Lessons for Computers

December 4th, 2013

Free Language Lessons for Computers by Dave Orr.

From the post:

50,000 relations from Wikipedia. 100,000 feature vectors from YouTube videos. 1.8 million historical infoboxes. 40 million entities derived from webpages. 11 billion Freebase entities in 800 million web documents. 350 billion words’ worth from books analyzed for syntax.

These are all datasets that we’ve shared with researchers around the world over the last year from Google Research.

A great summary of the major data drops by Google Research over the past year. In many cases including pointers to additional information on the datasets.

One that I have seen before and that strikes me as particularly relevant to topic maps is:

Dictionaries for linking Text, Entities, and Ideas

What is it: We created a large database of pairs of 175 million strings associated with 7.5 million concepts, annotated with counts, which were mined from Wikipedia. The concepts in this case are Wikipedia articles, and the strings are anchor text spans that link to the concepts in question.

Where can I find it: http://nlp.stanford.edu/pubs/crosswikis-data.tar.bz2

I want to know more: A description of the data, several examples, and ideas for uses for it can be found in a blog post or in the associated paper.

For most purposes, you would need far less than the full set of 7.5 million concepts. Imagine the relevant concepts for a domain being automatically “tagged” as you composed prose about it.

Certainly less error-prone than marking concepts by hand!
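Tagging with such a string-to-concept dictionary is a short exercise. This sketch uses a two-entry toy dictionary, not the actual Stanford data, and the crudest possible disambiguation (most common sense wins):

```python
from collections import Counter

# toy excerpt: anchor text -> concepts it linked to, with counts
anchor_dict = {
    "jaguar": Counter({"Jaguar_Cars": 300, "Jaguar_(animal)": 120}),
    "python": Counter({"Python_(programming_language)": 900,
                       "Python_(snake)": 150}),
}

def tag(text):
    """Annotate each known anchor string in the text with its most
    frequently linked concept -- a crude most-common-sense tagger."""
    tags = {}
    for token in text.lower().split():
        token = token.strip(".,;!?")
        if token in anchor_dict:
            tags[token] = anchor_dict[token].most_common(1)[0][0]
    return tags

print(tag("I wrote it in Python."))
# {'python': 'Python_(programming_language)'}
```

A domain-restricted slice of the dictionary would let the tagger prefer, say, the animal senses when composing prose about wildlife.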

## MusicGraph

December 4th, 2013

From the post:

Music data company Senzari has launched MusicGraph, a new service for discovering music by searching through a graph of over a billion music-related data points.

MusicGraph includes a consumer-facing version and an API that can be used for commercial purposes. Senzari built the graph while working on the recommendation engine for its own streaming service, which has been rebranded as Wahwah.

Interestingly, MusicGraph is launching first on Firefox OS before coming to iOS, Android and Windows Phone in “the coming weeks.”

You know how much I try to avoid “practical” applications but when I saw aureliusgraphs tweet this as using the Titan database, I just had to mention it.

I think this announcement underlines something a commenter said recently about promoting topic maps for what they do, not because they are topic maps.

Here, graphs are being promoted as the source of a great user experience, not because they are fun, powerful, etc. (all of which is also true).

## Homotopy Type Theory

December 4th, 2013

Homotopy Type Theory by Robert Harper. (Course with video lectures, notes, etc.)

Synopsis:

This is a graduate research seminar on Homotopy Type Theory (HoTT), a recent enrichment of Intuitionistic Type Theory (ITT) to include "higher-dimensional" types. The dimensionality of a type refers to the structure of its paths, the constructive witnesses to the equality of pairs of elements of a type, which themselves form a type, the identity type. In general a type is infinite dimensional in the sense that it exhibits non-trivial structure at all dimensions: it has elements, paths between elements, paths between paths, and so on to all finite levels. Moreover, the paths at each level exhibit the algebraic structure of a (higher) groupoid, meaning that there is always the "null path" witnessing reflexivity, the "inverse" path witnessing symmetry, and the "concatenation" of paths witnessing transitivity such that group-like laws hold "up to higher homotopy". Specifically, there are higher-dimensional paths witnessing the associative, unital, and inverse laws for these operations. Altogether this means that a type is a weak ∞-groupoid.

The significance of the higher-dimensional structure of types lies in the concept of a type-indexed family of types. Such families exhibit the structure of a fibration, which means that a path between two indices "lifts" to a transport mapping between the corresponding instances of the family that is, in fact, an equivalence. Thinking of paths as constructive witnesses for equality, this amounts to saying that equal indices give rise to equivalent types, and hence, by univalence, equal elements of the universe in which the family is valued. Thus, for example, if we think of the interval I as a type with two endpoints connected by a path, then an I-indexed family of types must assign equivalent types to the endpoints. In contrast the type B of booleans consists of two disconnected points, so that a B-indexed family of types may assign unrelated types to the two points of B. Similarly, mappings from I into another type A must assign connected points in A to the endpoints of the interval, whereas mappings from B into A are free to assign arbitrary points of A to the two booleans. These preservation principles are central to the structure of HoTT.

In many cases the path structure of a type becomes trivial beyond a certain dimension, called the level of the type. By convention the levels start at -2 and continue through -1, 0, 1, 2, and so on indefinitely. At the lowest, -2, level, the path structure of a type is degenerate in that there is an element to which all other elements are equal; such a type is said to be contractible, and is essentially a singleton. At the next higher level, -1, the type of paths between any two elements is contractible (level -2), which means that any two elements are equal, if there are any elements at all; such a type is a sub-singleton or h-proposition. At the next level, 0, the type of paths between paths between elements is contractible, so that any two elements are equal "in at most one way"; such a type is a set whose types of paths (equality relations) are all h-prop’s. Continuing in this way, types of level 1 are groupoids, those of level 2 are 2-groupoids, and so on for all finite levels.

ITT is capable of expressing only sets, which are types of level 0. Such types may have elements, and two elements may be considered equal in at most one way. A large swath of (constructive) mathematics may be formulated using only sets, and hence is amenable to representation in ITT. Computing applications, among others, require more than just sets. For example, it is often necessary to suppress distinctions among elements of a type so as to avoid over-specification; this is called proof irrelevance. Traditionally ITT has been enriched with an ad hoc treatment of proof irrelevance by introducing a universe of "propositions" with no computational content. In HoTT such propositions are types of level -1, requiring no special treatment or distinction. Such types arise by propositional truncation of a type to render degenerate the path structure of a type above level -1, ensuring that any two elements are equal in the sense of having a path between them.

Propositional truncation is just one example of a higher inductive type, one that is defined by specifying generators not only for its elements, but also for its higher-dimensional paths. The propositional truncation of a type is one that includes all of the elements of the type, and, in addition, a path between any two elements, rendering them equal. It is a limiting case of a quotient type in which only certain paths between elements are introduced, according to whether they are deemed to be related. Higher inductive types also permit the representation of higher-dimensional objects, such as the spheres of arbitrary dimension, as types, simply by specifying their "connectivity" properties. For example, the topological circle consists of a base point and a path starting and ending at that point, and the topological disk may be thought of as two half circles that are connected by a higher path that "fills in" the interior of the circle. Because of their higher path structure, such types are not sets, and neither are constructions such as the product of two circles.

The univalence axiom implies that an equivalence between types (an "isomorphism up to isomorphism") determines a path in a universe containing such types. Since two types can be equivalent in many ways (for example, there can be distinct bijections between two sets), univalence gives rise to types that are not sets, but rather are of a higher level, or dimension. The univalence axiom is mathematically efficient because it allows us to treat equivalent types as equal, and hence interchangeable in all contexts. In informal settings such identifications are often made by convention; in formal homotopy type theory such identifications are true equations.
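The groupoid structure of paths described in the synopsis — reflexivity as the null path, symmetry as inverse, transitivity as concatenation — is exactly what identity types already provide in a proof assistant. A minimal sketch in Lean 4 syntax (plain intuitionistic type theory, so only the level-0 fragment of HoTT):

```lean
-- Paths between elements a and b are inhabitants of the identity type a = b.

-- The null path: every element is connected to itself.
theorem path_refl {A : Type} (a : A) : a = a := rfl

-- Inverse: a path from a to b yields a path from b to a.
theorem path_symm {A : Type} {a b : A} (p : a = b) : b = a := p.symm

-- Concatenation: paths compose transitively.
theorem path_trans {A : Type} {a b c : A} (p : a = b) (q : b = c) : a = c :=
  p.trans q
```

What HoTT adds is that these paths need not be trivially equal to one another, so the same three operations acquire genuine higher-dimensional content.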

If you think data types are semantic primitives with universal meaning/understanding, feel free to ignore this posting.

Data types can be usefully treated “as though” they are semantic primitives, but mistaking convenience for truth can be expensive.

The never-ending cycle of enterprise-level ETL, for example. Even when it ends well, it is expensive.

And there are all the cases where ETL or data integration don’t end well.

Homotopy Type Theory may not be the answer to those problems, but our current practices are known to not work.

Why not bet on an uncertain success versus the certainty of expense and near-certainty of failure?

## To fairly compare…

December 4th, 2013

LDBC D3.3.1 Use case analysis and choke point analysis Coordinators: Alex Averbuch and Norbert Martinez.

From the introduction:

Due largely to the Web, an exponentially increasing amount of data is generated each year. Moreover, a significant fraction of this data is unstructured, or semi-structured at best. This has meant that traditional data models are becoming increasingly restrictive and unsuitable for many application domains – the relational model in particular has been criticized for its lack of semantics. These trends have driven development of alternative database technologies, including graph databases.

The proliferation of applications dealing with complex networks has resulted in an increasing number of graph database deployments. This, in turn, has created demand for a means by which to compare the characteristics of different graph database technologies, such as: performance, data model, query expressiveness, as well as general functional and non-functional capabilities.

To fairly compare these technologies it is essential to first have a thorough understanding of graph data models, graph operations, graph datasets, graph workloads, and the interactions between all of these. (emphasis added)

In this rather brief report, the LDBC (Linked Data Benchmark Council) gives a thumbnail sketch of the varieties of graphs, graph databases, graph query languages, along with some summary use cases. To their credit, unlike some graph vendors, they do understand what is meant by a hyperedge. (see p.8)

On the other hand, they retreat from the full generality of graph models to “directed attributed multigraphs,” before evaluating any of the graph alternatives. (also at p.8)

It may be a personal prejudice, but I would prefer to see fuller development of use cases and requirements before restricting the solution space.

Particularly since new developments in graph theory and/or technology are a weekly if not daily occurrence.

Premature focus on “unsettled” technology could result in a benchmark for yesterday’s version of graph technology.

Interesting I suppose but not terribly useful.

## Benchmarking Honesty

December 3rd, 2013

Benchmarking Honesty by David Rosenthal.

From the post:

Recently, someone brought to my attention a blog post that benchmarks FoundationDB and another responding to the benchmark itself. I’ll weigh in: I think this benchmark is unfair because it gives people too good an impression of FoundationDB’s performance. In the benchmark, 100,000 items are loaded into each database/storage engine in both sequential and random patterns. In the case of FoundationDB and other sophisticated systems like SQL Server, you can see that the performance of random and sequential writes are virtually the same; this points to the problem. In the case of FoundationDB, an “absorption” mechanism is able to cope with bursts of writes (on the order of a minute or two, usually) without actually updating the real data structures holding the data (i.e. only persisting a log to disk, and making changes available to read from RAM). Hence, the published test results are giving FoundationDB an unfair advantage. I think that you will find that if you sustain this workload for a longer time, like in real-world usages, FoundationDB might be significantly slower.

If you don’t recognize the name, David Rosenthal is the co-founder and CEO of FoundationDB.

What?

A CEO saying a benchmark favorable to his product is “unfair?”

Odd as it may sound, I think there is an honest CEO on the loose.

Statistically speaking, it had to happen eventually.

Seriously, high marks to David Rosenthal. We need more CEOs, engineers and presenters with a sense of honesty.
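Rosenthal's point — a short burst measures the absorption buffer, a sustained run measures the backing store — can be illustrated with a toy model. All numbers here are invented for illustration; they describe no real database:

```python
RAM_BUFFER = 100   # writes the in-memory absorption buffer can hold
FLUSH_RATE = 10    # writes/sec the real data structures can persist
INGEST_RATE = 50   # writes/sec offered by the benchmark

def observed_throughput(duration_secs):
    """Average accepted writes/sec over a run of the given length."""
    buffered = accepted = 0
    for _ in range(duration_secs):
        buffered = max(0, buffered - FLUSH_RATE)          # background flush
        taken = min(INGEST_RATE, RAM_BUFFER - buffered)   # absorb what fits
        buffered += taken
        accepted += taken
    return accepted / duration_secs

print(observed_throughput(2))    # 50.0 -- the burst fits in RAM
print(observed_throughput(60))   # 11.5 -- converging on FLUSH_RATE
```

A 100,000-item load is the two-second case writ large: it ends before the buffer's advantage is exhausted, which is exactly why Rosenthal calls the benchmark unfair to sustained, real-world workloads.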

## Annual Christmas Tree Lecture (Knuth)

December 3rd, 2013

From the webpage:

Professor Knuth will present his 19th Annual Christmas Tree Lecture on Monday, December 9, 2013 at 7:00 pm in NVIDIA Auditorium in the new Huang Engineering Center, 475 Via Ortega, Stanford University (map). The topic will be Planar Graphs and Ternary Trees. There is no admission charge or registration required. For those unable to come to Stanford, register for the live webinar broadcast.

No doubt heavy sledding but what better way to prepare for the holiday season?

Date: Monday, December 9, 2013

Time:
7 p.m. – 8 p.m. Pacific
10 p.m. – 11 p.m. Eastern

## Scout [NLP, Move up from Twitter Feeds to Court Opinions]

December 3rd, 2013

Scout

Scout is a free service that provides daily insight into how our laws and regulations are shaped in Washington, DC and our state capitols.

These days, you can receive electronic alerts to know when a company is in the news, when a TV show is scheduled to air or when a sports team wins. Now, you can also be alerted when our elected officials take action on an issue you care about.

Scout allows anyone to subscribe to customized email or text alerts on what Congress is doing around an issue or a specific bill, as well as bills in the state legislature and federal regulations. You can also add external RSS feeds to complement a Scout subscription, such as press releases from a member of Congress or an issue-based blog.

Anyone can create a collection of Scout alerts around a topic, for personal organization or to make it easy for others to follow a whole topic at once.

Researchers can use Scout to see when Congress talks about an issue over time. Members of the media can use Scout to track when legislation important to their beat moves ahead in Congress or in state houses. Non-profits can use Scout as a tool to keep tabs on how federal and state lawmakers are making policy around a specific issue.

Early testing of Scout during its open beta phase alerted Sunlight and allies in time to successfully stop an overly broad exemption to the Freedom of Information Act from being applied to legislation that was moving quickly in Congress. Read more about that here.

Thank you to the Stanton Foundation, who contributed generous support to Scout’s development.

If your manager suggests a Twitter feed to test NLP, classification, sentiment, etc. code, ask to use a Federal Court (U.S.) opinion feed instead.

Not all data is written in one hundred and forty (140) character chunks.

PS: Be sure to support/promote the Sunlight Foundation for making this data available.

## Project Tycho:… [125 Years of Disease Records]

December 3rd, 2013

Project Tycho: Data for Health

From the webpage:

After four years of data digitization and processing, the Project Tycho™ Web site provides open access to newly digitized and integrated data from the entire 125-year history of United States weekly nationally notifiable disease surveillance data since 1888. These data can now be used by scientists, decision makers, investors, and the general public for any purpose. The Project Tycho™ aim is to advance the availability and use of public health data for science and decision making in public health, leading to better programs and more efficient control of diseases.

Three levels of data have been made available: Level 1 data include data that have been standardized for specific analyses, Level 2 data include standardized data that can be used immediately for analysis, and Level 3 data are raw data that cannot be used for analysis without extensive data management. See the video tutorial.

An interesting factoid concerning disease reporting in the United States, circa 1917: influenza was not a reportable disease. (See The Great Influenza by John Barry.)

I am curious about the Level 3 data.

Mostly in terms of how much “data management” would be needed to make it useful.

Could be a window into the data management required to unify medical records in the United States.

Or simply a way to practice your data management skills.
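For a flavor of what Level 3 “data management” might involve, here is a hypothetical sketch — invented rows and alias table, not Project Tycho’s actual layout — normalizing inconsistent place names and keeping missing counts explicit rather than silently zeroed:

```python
# Hypothetical raw Level 3-style rows: inconsistent place names, one
# missing count. Invented for illustration only.
RAW_ROWS = [
    ("NEW YORK, N.Y.", "1917-10-06", "INFLUENZA", "12"),
    ("New York NY",    "1917-10-13", "Influenza", ""),
    ("BOSTON, MASS.",  "1917-10-06", "INFLUENZA", "7"),
]

# Assumed alias table; real cleanup would need a far larger, curated one.
CITY_ALIASES = {
    "NEW YORK N.Y.": "New York, NY",
    "NEW YORK NY": "New York, NY",
    "BOSTON MASS.": "Boston, MA",
}

def normalize_city(name):
    # Uppercase, drop commas, collapse whitespace, then look up the alias.
    key = " ".join(name.upper().replace(",", " ").split())
    return CITY_ALIASES.get(key, name.title())

def clean(rows):
    cleaned = []
    for city, week, disease, count in rows:
        cleaned.append({
            "city": normalize_city(city),
            "week": week,
            "disease": disease.title(),
            "count": int(count) if count else None,  # keep missing explicit
        })
    return cleaned
```

Even this toy version has to make judgment calls (is a blank count zero or unknown?), which is exactly the knowledge that tends to evaporate once the cleaned file ships.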

## Using Hive to interact with HBase, Part 2

December 3rd, 2013

Using Hive to interact with HBase, Part 2 by Nick Dimiduk.

From the post:

This is the second of two posts examining the use of Hive for interaction with HBase tables. This is a hands-on exploration so the first post isn’t required reading for consuming this one. Still, it might be good context.

“Nick!” you exclaim, “that first post had too many words and I don’t care about JIRA tickets. Show me how I use this thing!”

This post is exactly that: a concrete, end-to-end example of consuming HBase over Hive. The whole mess was tested to work on a tiny little 5-node cluster running HDP-1.3.2, which means Hive 0.11.0 and HBase 0.94.6.1.

If you learn from concrete examples and then feel your way further out, you will love this post!

## ISWC, Sydney 2013 (videos)

December 3rd, 2013

12th International Semantic Web Conference (ISWC), Sydney 2013

From the webpage:

ISWC 2013 is the premier international forum for the Semantic Web / Linked Data community. Here, scientists, industry specialists, and practitioners meet to discuss the future of practical, scalable, user-friendly, and game-changing solutions.

Detailed information can be found at the ISWC 2013 website.

I count thirty-six (36) videos (including two tutorials).

Some of them are fairly short so suitable for watching while standing in checkout lines.

## Bokeh

December 3rd, 2013

Bokeh

From the webpage:

Bokeh is a Python interactive visualization library for large datasets that natively uses the latest web technologies. Its goal is to provide elegant, concise construction of novel graphics in the style of Protovis/D3, while delivering high-performance interactivity over large data to thin clients.

To get started quickly, follow the Quickstart.

Visit the source repository: https://github.com/ContinuumIO/bokeh

The technical vision makes the case for Bokeh quite well:

Photographers use the Japanese word “bokeh” to describe the blurring of the out-of-focus parts of an image. Its aesthetic quality can greatly enhance a photograph, and photographers artfully use focus to draw attention to subjects of interest. “Good bokeh” contributes visual interest to a photograph and places its subjects in context.

In this vein of focusing on high-impact subjects while always maintaining a relationship to the data background, the Bokeh project attempts to address fundamental challenges of large dataset visualization:

• How do we look at all the data?
• What are the best perceptual approaches to honestly and accurately represent the data to domain experts and SMEs so they can apply their intuition to the data?
• Are there automated approaches to accurately reduce large datasets so that outliers and anomalies are still visible, while we meaningfully represent baselines and backgrounds? How can we do this without “washing away” all the interesting bits during a naive downsampling?
• If we treat the pixels and topology of pixels on a screen as a bottleneck in the I/O channel between hard drives and an analyst’s visual cortex, what are the best compression techniques at all levels of the data transformation pipeline?
• How can scientists and data analysts be empowered to use visualization fluidly, not merely as an output facility or one stage of a pipeline, but as an entire mode of engagement with data and models?
• Are language-based approaches for expressing mathematical modeling and data transformations the best way to compose novel interactive graphics?
• What data-oriented interactions (besides mere linked brushing/selection) are useful for fluid, visually-enabled analysis?
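The downsampling question above is easy to illustrate. A minimal sketch (synthetic data, not Bokeh’s actual algorithm) of how a naive stride-based downsample washes away a narrow spike that a min/max-per-bucket reduction preserves:

```python
# Synthetic series: a flat baseline with one narrow spike (the "outlier").
series = [0.0] * 1000
series[501] = 100.0  # the anomaly we care about

def naive_downsample(data, factor):
    """Keep every factor-th point; narrow spikes can vanish entirely."""
    return data[::factor]

def minmax_downsample(data, factor):
    """Keep the min and max of each bucket, so extremes survive."""
    out = []
    for i in range(0, len(data), factor):
        bucket = data[i:i + factor]
        out.extend([min(bucket), max(bucket)])
    return out

naive = naive_downsample(series, 10)    # the spike is dropped
kept = minmax_downsample(series, 10)    # the spike survives as a bucket max
```

The naive version returns 100 points of pure baseline; the min/max version returns 200 points and still shows the anomaly, which is the “without washing away all the interesting bits” property the Bokeh team is after.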

Not likely any time soon but posting data for scientific research in ways that enable interactive analysis by readers (and snapshotting their results) could take debates over data and analysis to a whole new level.

As opposed to debating dots on a graph not of your own making and where alternative analyses are not available.

## Of Algebirds, Monoids, Monads, …

December 3rd, 2013

From the post:

Have you ever asked yourself what monoids and monads are, and particularly why they seem to be so attractive in the field of large-scale data processing? Twitter recently open-sourced Algebird, which provides you with a JVM library to work with such algebraic data structures. Algebird is already being used in Big Data tools such as Scalding and SummingBird, which means you can use Algebird as a mechanism to plug your own data structures – e.g. Bloom filters, HyperLogLog – directly into large-scale data processing platforms such as Hadoop and Storm. In this post I will show you how to get started with Algebird, introduce you to monoids and monads, and address the question why you get interested in those in the first place.

The main goal of this article is to spark your curiosity and motivation for Algebird and the concepts of monoids, monads, and category theory in general. In other words, I want to address the questions “What’s the big deal? Why should I care? And how can these theoretical concepts help me in my daily work?”

You can call this a “blog post” but I rarely see blog posts with a table of contents!

The post should come with a warning: May require substantial time to read, digest, understand.

Just so you know, I was hooked by this paragraph early on:

So let me use a different example because adding Int values is indeed trivial. Imagine that you are working on large-scale data analytics that make heavy use of Bloom filters. Your applications are based on highly-parallel tools such as Hadoop or Storm, and they create and work with many such Bloom filters in parallel. Now the money question is: How do you combine or add two Bloom filters in an easy way?

Are you motivated?
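The Bloom filter question has a neat answer: two filters built with identical size and hash functions combine by bitwise OR of their bit arrays, and that operation is associative with the empty filter as identity — a monoid. Algebird packages this for Scala; the toy Python sketch below (hypothetical sizing, not Algebird’s API) shows the idea:

```python
import hashlib

M = 256  # filter size in bits; toy sizing, not tuned for real workloads
K = 3    # number of hash functions

def _positions(item):
    # Derive K bit positions from one SHA-256 digest of the item.
    digest = hashlib.sha256(item.encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % M for i in range(K)]

class Bloom:
    def __init__(self, bits=0):
        self.bits = bits  # a Python int used as a bit array

    def add(self, item):
        for p in _positions(item):
            self.bits |= 1 << p
        return self

    def __contains__(self, item):
        return all(self.bits >> p & 1 for p in _positions(item))

    def __add__(self, other):
        # The monoid operation: bitwise OR of the two bit arrays.
        return Bloom(self.bits | other.bits)

ZERO = Bloom()  # identity element: ZERO + b leaves b's bits unchanged
```

Associativity is the payoff: partial filters built on different Hadoop or Storm workers can be merged in any grouping, in any order, and the result is the same — which is precisely why monoids plug so cleanly into large-scale data processing.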

I first saw this in a tweet by CompSciFact.

## Five Stages of Data Grief

December 3rd, 2013

Five Stages of Data Grief by Jeni Tennison.

From the post:

As organisations come to recognise how important and useful data could be, they start to think about using the data that they have been collecting in new ways. Often data has been collected over many years as a matter of routine, to drive specific processes or sometimes just for the sake of it. Suddenly that data is repurposed. It is probed, analysed and visualised in ways that haven’t been tried before.

Data analysts have a maxim:

If you don’t think you have a quality problem with your data, you haven’t looked at it yet.

Every dataset has its quirks, whether it’s data that has been wrongly entered in the first place, automated processing that has introduced errors, irregularities that come from combining datasets into a consistent structure or simply missing information. Anyone who works with data knows that far more time is needed to clean data into something that can be analysed, and to understand what to leave out, than in actually performing the analysis itself. They also know that analysis and visualisation of data will often reveal bugs that you simply can’t see by staring at a spreadsheet.

But for the people who have collected and maintained such data — or more frequently their managers, who don’t work with the data directly — this realisation can be a bit of a shock. In our last ODI Board meeting, Sir Tim Berners-Lee suggested that the data curators need to go through was something like the five stages of grief described by the Kübler-Ross model.

Jeni covers the five stages of grief from a data quality standpoint and offers a sixth stage. (No spoilers follow, read her post.)

Correcting input/transformation errors is one level of data cleaning.

But the near-collapse of HealthCare.gov shows how streams of “clean” data can combine into a large pool of “dirty” data.

Every contributor supplied ‘clean’ data but when combined with other “clean” data, confusion was the result.

Keeping data “clean” is an ongoing process at two separate levels:

Level 1: Traditional correction of input/transformation errors (as per Jeni).

Level 2: Preparation of data for transformation into “clean” data for new purposes.

The first level is familiar.

The second we all know as ad-hoc ETL.

Enough knowledge is gained to make a transformation work, but that knowledge isn’t passed on with the data or more generally.

Or as we all learned from television: “Lather, rinse, repeat.”

A good slogan if you are trying to maximize sales of shampoo, but a wasteful one when describing ETL for data.

What if data curators captured the knowledge required for ETL, making every subsequent ETL less resource intensive and less error prone?
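One way to capture that knowledge: make the mapping rules data that ships with the transformed output, instead of leaving them buried in a one-off script. A hypothetical sketch (invented field names and rule format):

```python
# Assumed, hand-built mapping knowledge for one source. In an ad-hoc ETL
# these rules live only in someone's head or a throwaway script; here they
# are data that travels with the result.
RULES = {
    "rename": {"dob": "date_of_birth", "zip": "postal_code"},
    "defaults": {"country": "US"},
}

def transform(record, rules):
    out = dict(rules["defaults"])
    for field, value in record.items():
        out[rules["rename"].get(field, field)] = value
    return out

def etl(records, rules):
    # Return the data *and* the rules, so the knowledge isn't lost.
    return {"rules": rules, "rows": [transform(r, rules) for r in records]}

result = etl([{"dob": "1950-01-02", "zip": "30301"}], RULES)
```

The next curator who receives `result` gets the mapping decisions along with the rows, so the next transformation starts from recorded knowledge rather than rediscovery.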

I think that would qualify as data cleaning.

You?

## Announcing Open LEIs:…

December 3rd, 2013

Announcing Open LEIs: a user-friendly interface to the Legal Entity Identifier system

From the post:

Today, OpenCorporates announces a new sister website, Open LEIs, a user-friendly interface on the emerging Global Legal Entity Identifier System.

At this point many, possibly most, of you will be wondering: what on earth is the Global Legal Entity Identifier System? And that’s one of the reasons why we built Open LEIs.

The Global Legal Entity Identifier System (aka the LEI system, or GLEIS) is a G20/Financial Stability Board-driven initiative to solve the issues of identifiers in the financial markets. As we’ve explained in the past, there are a number of identifiers out there, nearly all of them proprietary, and all of them with quality issues (specifically not mapping one-to-one with legal entities). Sometimes just company names are used, which are particularly bad identifiers, as not only can they be represented in many ways, they frequently change, and are even reused between different entities.

This problem is particularly acute in the financial markets, meaning that regulators, banks, and market participants often don’t know who they are dealing with, affecting everything from the ability to process trades automatically to performing credit calculations to understanding systemic risk.

The LEI system aims to solve this problem, by providing permanent, IP-free, unique identifiers for all entities participating in the financial markets (not just companies but also municipalities who issue bonds, for example, and mutual funds whose legal status is a little greyer than companies).

The post cites five key features for Open LEIs:

1. Search on names (despite slight misspellings) and addresses
2. Browse the entire (100,000 record) database and/or filter by country, legal form, or the registering body
3. A permanent URL for each LEI
5. Data is available as XML or JSON

As the post points out, the data isn’t complete but dragging legal entities out into the light is never easy.

Use this resource and support it if you are interested in more and not less financial transparency.

## Top search tips from Exeter and Bristol

December 2nd, 2013

Top search tips from Exeter and Bristol by Karen Blakeman.

From the post:

A couple of weeks ago I was in Exeter and Bristol leading workshops for NHS South West on “Google & Beyond”. We covered advanced Google commands, Google Scholar and alternatives to Google. Below are the combined top tips from the two sessions. I may have missed a couple from the list as I could not read my writing, so if you attended one of the workshops let me know if I’ve omitted your suggested tip.

All of these tips are no doubt old hat to readers of this blog but Karen gives a nice list of search tips you can forward to your users.

Enjoy!

## A language for search and discovery

December 2nd, 2013

A language for search and discovery by Tony Russell-Rose.

Abstract:

In order to design better search experiences, we need to understand the complexities of human information-seeking behaviour. In this paper, we propose a model of information behaviour based on the needs of users across a range of search and discovery scenarios. The model consists of a set of modes that users employ to satisfy their information goals.

We discuss how these modes relate to existing models of human information seeking behaviour, and identify areas where they differ. We then examine how they can be applied in the design of interactive systems, and present examples where individual modes have been implemented in interesting or novel ways. Finally, we consider the ways in which modes combine to form distinct chains or patterns of behaviour, and explore the use of such patterns both as an analytical tool for understanding information behaviour and as a generative tool for designing search and discovery experiences.

Tony’s post is also available as a pdf file.

A deeply interesting paper but consider the evidence that underlies it:

The scenarios were collected as part of a series of requirements workshops involving stakeholders and customer-facing staff from various client organisations. A proportion of these engagements focused on consumer-oriented site search applications (resulting in 277 scenarios) and the remainder on enterprise search applications (104 scenarios).

The scenarios were generated by participants in breakout sessions and subsequently moderated by the workshop facilitator in a group session to maximise consistency and minimise redundancy or ambiguity. They were also prioritised by the group to identify those that represented the highest value both to the end user and to the client organisation.

This data possesses a number of unique properties. In previous studies of information seeking behaviour (e.g. [5], [10]), the primary source of data has traditionally been interview transcripts that provide an indirect, verbal account of end user information behaviours. By contrast, the current data source represents a self-reported account of information needs, generated directly by end users (although a proportion were captured via proxy, e.g. through customer facing staff speaking on behalf of the end users). This change of perspective means that instead of using information behaviours to infer information needs and design insights, we can adopt the converse approach and use the stated needs to infer information behaviours and the interactions required to support them.

Moreover, the scope and focus of these scenarios represents a further point of differentiation. In previous studies, (e.g. [8]), measures have been taken to address the limitations of using interview data by combining it with direct observation of information seeking behaviour in naturalistic settings. However, the behaviours that this approach reveals are still bounded by the functionality currently offered by existing systems and working practices, and as such do not reflect the full range of aspirational or unmet user needs encompassed by the data in this study.

Finally, the data is unique in that is constitutes a genuine practitioner-oriented deliverable, generated expressly for the purpose of designing and delivering commercial search applications. As such, it reflects a degree of realism and authenticity that interview data or other research-based interventions might struggle to replicate.

It’s not a bad thing to use data from commercial engagements for research and is certainly better than usability studies based on 10 to 12 undergraduates, two of whom did not complete the study.

However, I would be very careful about trying to generalize from a self-selected group even for commercial search, much less the fuller diversity of other search scenarios.

On the other hand, the care with which the data was analyzed makes it an excellent data point against which to compare other data points, hopefully with more diverse populations.

## Modern Healthcare Architectures Built with Hadoop

December 2nd, 2013

Modern Healthcare Architectures Built with Hadoop by Justin Sears.

From the post:

We have heard plenty in the news lately about healthcare challenges and the difficult choices faced by hospital administrators, technology and pharmaceutical providers, researchers, and clinicians. At the same time, consumers are experiencing increased costs without a corresponding increase in health security or in the reliability of clinical outcomes.

One key obstacle in the healthcare market is data liquidity (for patients, practitioners and payers) and some are using Apache Hadoop to overcome this challenge, as part of a modern data architecture. This post describes some healthcare use cases, a healthcare reference architecture and how Hadoop can ease the pain caused by poor data liquidity.

As you would guess, I like the phrase data liquidity.

And Justin lays out the areas where we are going to find “poor data liquidity.”

Source data comes from:

• Legacy Electronic Medical Records (EMRs)
• Transcriptions
• PACS
• Medication Administration
• Financial
• Laboratory (e.g. SunQuest, Cerner)
• RTLS (for locating medical equipment & patient throughput)
• Bio Repository
• Device Integration (e.g. iSirona)
• Home Devices (e.g. scales and heart monitors)
• Clinical Trials
• Genomics (e.g. 23andMe, Cancer Genomics Hub)
• Radiology (e.g. RadNet)
• Quantified Self Sensors (e.g. Fitbit, SmartSleep)
• Social Media Streams (e.g. FourSquare, Twitter)

But then I don’t see what part of the Hadoop architecture addresses the problem of “poor data liquidity.”

Do you?

I thought I had found it when Charles Boicey (in the UCIH case study) says:

“Hadoop is the only technology that allows healthcare to store data in its native form. If Hadoop didn’t exist we would still have to make decisions about what can come into our data warehouse or the electronic medical record (and what cannot). Now we can bring everything into Hadoop, regardless of data format or speed of ingest. If I find a new data source, I can start storing it the day that I learn about it. We leave no data behind.”

But that’s not “data liquidity,” not in any meaningful sense of the word. Dumping your data to paper would be just as effective and probably less costly.

To be useful, “data liquidity” must have a sense of being integrated with data from diverse sources: presenting the clinician, researcher, health care facility, etc. with all the data about a patient, not just some of it.
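In that integration sense, a minimal sketch of what “data liquidity” actually demands — hypothetical records, and note it only works because both sources happen to share an MRN; establishing that shared identity is exactly the hard part that storing raw data in Hadoop doesn’t solve:

```python
# Hypothetical records from two sources that describe the same patient.
emr = [{"mrn": "A-100", "name": "J. Smith", "allergy": "penicillin"}]
lab = [{"mrn": "A-100", "ins_id": "X9", "glucose_mg_dl": 105}]

def integrate(emr_rows, lab_rows):
    """Join on the shared identifier so a clinician sees all of it."""
    by_mrn = {row["mrn"]: dict(row) for row in emr_rows}
    for row in lab_rows:
        merged = by_mrn.setdefault(row["mrn"], {})
        for key, value in row.items():
            merged.setdefault(key, value)  # don't clobber EMR values
    return by_mrn

patients = integrate(emr, lab)
```

The join key is doing all the work here. When the lab system keys on an insurance ID and the EMR keys on an MRN with no crosswalk between them, no amount of raw storage produces the integrated patient view.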

I also checked the McKinsey & Company report “The ‘Big Data’ Revolution in Healthcare.” I didn’t expect them to miss the data integration question and they didn’t.

The second exhibit in the McKinsey and Company report (the full report) is titled:

Integration of data pools required for major opportunities.

I take that to mean that in order to have meaningful healthcare reform, integration of health care data pools is the first step.

Do you disagree?

And if that’s true, that we need integration of health care data pools first, do you think Hadoop can accomplish that auto-magically?

I don’t either.

## NIH deposits first batch of genomic data for Alzheimer’s disease

December 2nd, 2013

NIH deposits first batch of genomic data for Alzheimer’s disease

From the post:

Researchers can now freely access the first batch of genome sequence data from the Alzheimer’s Disease Sequencing Project (ADSP), the National Institutes of Health (NIH) announced today. The ADSP is one of the first projects undertaken under an intensified national program of research to prevent or effectively treat Alzheimer’s disease.

The first data release includes data from 410 individuals in 89 families. Researchers deposited completed WGS data on 61 families and have deposited WGS data on parts of the remaining 28 families, which will be completed soon. WGS determines the order of all 3 billion letters in an individual’s genome. Researchers can access the sequence data at dbGaP or the National Institute on Aging Genetics of Alzheimer’s Disease Data Storage Site (NIAGADS), https://www.niagads.org.

“Providing raw DNA sequence data to a wide range of researchers proves a powerful crowd-sourced way to find genomic changes that put us at increased risk for this devastating disease,” said NIH Director, Francis S. Collins, M.D., Ph.D., who announced the start of the project in February 2012. “The ADSP is designed to identify genetic risks for late-onset of Alzheimer’s disease, but it could also discover versions of genes that protect us. These insights could lead to a new era in prevention and treatment.”

As many as 5 million Americans 65 and older are estimated to have Alzheimer’s disease, and that number is expected to grow significantly with the aging of the baby boom generation. The National Alzheimer’s Project Act became law in 2011 in recognition of the need to do more to combat the disease. The law called for upgrading research efforts by the public and private sectors, as well as expanding access to and improving clinical and long term care. One of the first actions taken by NIH under Alzheimer’s Act was the allocation of additional funding in fiscal 2012 for a series of studies, including this genome sequencing effort. Today’s announcement marks the first data release from that project.

You will need to join or enlist in an open project with bioinformatics and genomics expertise to make a contribution, but the data is “out there.”

Not to mention the need to integrate existing medical literature, legacy data from prior patients, drug trials, etc., despite the usual semantic confusion among them.

## Google’s R Style Guide [TM Guides?]

December 2nd, 2013

From the webpage:

R is a high-level programming language used primarily for statistical computing and graphics. The goal of the R Programming Style Guide is to make our R code easier to read, share, and verify. The rules below were designed in collaboration with the entire R user community at Google.

Useful if you are trying to develop good R coding habits from the start.

Makes me wonder whether topic map authors have a similar need, at least on a project-by-project basis.

If I am always representing marital status as an occurrence on a topic, that isn’t going to fit well with another author who always uses associations to represent marriages.

There could be compelling reasons in a project for choosing one or the other.

Similar questions will come up with other subjects and relationships as well.

It won’t be 100%, but it’s best to get everyone off on the same footing and to validate output against your local authoring guidelines.
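Such validation could be as simple as a lint pass over the topic map. A hypothetical sketch (topics as plain dicts, invented field names) checking the marital-status convention above — occurrence on the topic, never an association:

```python
# Hypothetical topic map fragments: "john" follows the project guideline,
# "jane" models marriage as an association and should be flagged.
topics = [
    {"id": "john",
     "occurrences": {"marital-status": "married"},
     "associations": []},
    {"id": "jane",
     "occurrences": {},
     "associations": [{"type": "marriage", "with": "john"}]},
]

def check_marital_status_rule(topic_list):
    """Return ids of topics that violate the occurrence-not-association rule."""
    violations = []
    for topic in topic_list:
        if any(a["type"] == "marriage" for a in topic["associations"]):
            violations.append(topic["id"])
    return violations

bad = check_marital_status_rule(topics)  # → ["jane"]
```

A real project would check many such conventions at once, but even this toy version turns an informal authoring guideline into something enforceable.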