## Data Carpentry (+ Sorted Nordic Scores)

August 21st, 2014

Data Carpentry by David Mimno.

From the post:

The New York Times has an article titled For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights. Mostly I really like it. The fact that raw data is rarely usable for analysis without significant work is a point I try hard to make with my students. I told them “do not underestimate the difficulty of data preparation”. When they turned in their projects, many of them reported that they had underestimated the difficulty of data preparation. Recognizing this as a hard problem is great.

What I’m less thrilled about is calling this “janitor work”. For one thing, it’s not particularly respectful of custodians, whose work I really appreciate. But it also mischaracterizes what this type of work is about. I’d like to propose a different analogy that I think fits a lot better: data carpentry.

Note: data carpentry seems to already be a thing

I’m not convinced that “carpentry” is the best prestige target.

The first mention of carpenters on a sorted version of the Nordic Scores (Colorado Adoption Project: Resources for Researchers. Institute for Behavioral Genetics, University of Colorado Boulder) is at 147.*

I would go for data scientist since mercenary isn’t listed as an occupation.

The usual cautions apply. Prestige is at least as difficult to measure as any other social construct, perhaps more so. The data is from 1989 and so may not reflect “current” prestige rankings.

*(I have removed the classes and sorted by prestige score, to create Sorted Nordic Scores.)

## …Loosely Consistent Distributed Programming

August 21st, 2014

Abstract:

Driven by the widespread adoption of both cloud computing and mobile devices, distributed computing is increasingly commonplace. As a result, a growing proportion of developers must tackle the complexity of distributed programming—that is, they must ensure correct application behavior in the face of asynchrony, concurrency, and partial failure.

To help address these difficulties, developers have traditionally relied upon system infrastructure that provides strong consistency guarantees (e.g., consensus protocols and distributed transactions). These mechanisms hide much of the complexity of distributed computing—for example, by allowing programmers to assume that all nodes observe the same set of events in the same order. Unfortunately, providing such strong guarantees becomes increasingly expensive as the scale of the system grows, resulting in availability and latency costs that are unacceptable for many modern applications.

Hence, many developers have explored building applications that only require loose consistency guarantees, for example storage systems that only guarantee that all replicas eventually converge to the same state, meaning that a replica might exhibit an arbitrary state at any particular time. Adopting loose consistency involves making a well-known tradeoff: developers can avoid paying the latency and availability costs incurred by mechanisms for achieving strong consistency, but in exchange they must deal with the full complexity of distributed computing. As a result, achieving correct application behavior in this environment is very difficult.

This thesis explores how to aid developers of loosely consistent applications by providing programming language support for the difficulties they face. The language level is a natural place to tackle this problem: because developers that use loose consistency have fewer system facilities that they can depend on, consistency concerns are naturally pushed into application logic. In part, our goal has been to recognize, formalize, and automate application-level consistency patterns.

We describe three language variants that each tackle a different challenge in distributed programming. Each variant is a modification of Bloom, a declarative language for distributed programming we have developed at UC Berkeley. The first variant of Bloom, BloomL, enables deterministic distributed programming without the need for distributed coordination. Second, Edelweiss allows distributed storage reclamation protocols to be generated in a safe and automatic fashion. Finally, BloomPO adds sophisticated ordering constraints that we use to develop a declarative, high-level implementation of concurrent editing, a particularly difficult class of loosely consistent programs.

Unless you think of topic maps as static files, recent developments in “loosely consistent distributed programming” should be high on your reading list.

It’s entirely possible to have a topic map that is a static file, even one that has been printed out to paper. But that seems like a poor target for development. Captured information begins progressing towards staleness from the moment of its capture.

I first saw this in a tweet by Peter Bailis.

## The Little Book of Semaphores

August 21st, 2014

The Little Book of Semaphores by Allen Downey.

From the webpage:

The Little Book of Semaphores is a free (in both senses of the word) textbook that introduces the principles of synchronization for concurrent programming.

In most computer science curricula, synchronization is a module in an Operating Systems class. OS textbooks present a standard set of problems with a standard set of solutions, but most students don’t get a good understanding of the material or the ability to solve similar problems.

The approach of this book is to identify patterns that are useful for a variety of synchronization problems and then show how they can be assembled into solutions. After each problem, the book offers a hint before showing a solution, giving students a better chance of discovering solutions on their own.

The book covers the classical problems, including “Readers-writers,” “Producer-consumer”, and “Dining Philosophers.” In addition, it collects a number of not-so-classical problems, some written by the author and some by other teachers and textbook writers. Readers are invited to create and submit new problems.

If you want a deep understanding of concurrency, this looks like a very good place to start!

Some of the more colorful problem names:

• The dining savages problem
• The Santa Claus problem
• The unisex bathroom problem
• The Senate Bus problem

There are problems (and patterns) for your discovery and enjoyment!
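To give a flavor of the classical problems, here is a minimal sketch of the producer-consumer pattern with a bounded buffer, written with Python's standard `threading.Semaphore`. The function name `run_producer_consumer` and the variable names `mutex`, `items`, and `spaces` are my own choices for illustration, not code taken from the book.

```python
import threading
from collections import deque


def run_producer_consumer(n_items=20, capacity=4):
    """Move n_items integers through a bounded buffer; return them in arrival order."""
    buffer = deque()
    mutex = threading.Semaphore(1)          # guards access to the buffer
    items = threading.Semaphore(0)          # counts filled slots
    spaces = threading.Semaphore(capacity)  # counts free slots
    consumed = []

    def producer():
        for i in range(n_items):
            spaces.acquire()   # wait until a slot is free
            mutex.acquire()
            buffer.append(i)
            mutex.release()
            items.release()    # signal that one item is available

    def consumer():
        for _ in range(n_items):
            items.acquire()    # wait until an item is available
            mutex.acquire()
            value = buffer.popleft()
            mutex.release()
            spaces.release()   # signal that one slot is free again
            consumed.append(value)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed


result = run_producer_consumer()
```

With one producer and one consumer the items come out in the order they went in. The two counting semaphores do the real work: the producer blocks when the buffer is full and the consumer blocks when it is empty.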

I first saw this in a tweet by Computer Science.

## CSV Fingerprints

August 21st, 2014

CSV Fingerprints by Victor Powell.

From the post:

CSV is a simple and common format for tabular data that uses commas to separate rows and columns. Nearly every spreadsheet and database program lets users import from and export to CSV. But until recently, these programs varied in how they treated special cases, like when the data itself has a comma in it.

It’s easy to make a mistake when you try to make a CSV file fit a particular format. To make it easier to spot mistakes, I’ve made a “CSV Fingerprint” viewer (named after the “Fashion Fingerprints” from The New York Times’s “Front Row to Fashion Week” interactive). The idea is to provide a birdseye view of the file without too much distracting detail. The idea is similar to Tufte’s Image Quilts…a qualitative view, as opposed to a rendering of the data in the file themselves. In this sense, the CSV Fingerprint is a sort of meta visualization.

This is very clever. Not only can you test a CSV snippet on the webpage, but the source code is on GitHub: https://github.com/setosa/csv-fingerprint

Of course, it does rely on the most powerful image processing system known to date. Err, that would be you.

Pass this along. I can imagine any number of data miners who will be glad you did.
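For the curious, the special case mentioned in the post (a comma inside the data itself) and the fingerprint idea can both be sketched in a few lines of Python with the standard `csv` module. This is not Victor Powell's code; the `classify` and `fingerprint` helpers, the three categories, and the sample data are assumptions of mine.

```python
import csv
import io


def classify(cell):
    """Return a coarse category for one cell: 'empty', 'number', or 'text'."""
    if cell.strip() == "":
        return "empty"
    try:
        float(cell)
        return "number"
    except ValueError:
        return "text"


def fingerprint(csv_text):
    """Parse CSV text and replace every cell with its category."""
    reader = csv.reader(io.StringIO(csv_text))
    return [[classify(cell) for cell in row] for row in reader]


# Hypothetical sample: the second line has a quoted field containing a comma.
sample = 'name,city,population\n"Smith, Jane",Ithaca,30014\nDoe,,unknown\n'

# Splitting on commas breaks the quoted field into two pieces.
naive_split = sample.splitlines()[1].split(",")

# A CSV-aware parser keeps "Smith, Jane" as a single field.
grid = fingerprint(sample)
```

The grid of categories is the text equivalent of the colored cells in the viewer: a row with too many columns, or a "text" cell in a numeric column, stands out without reading the data.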

## Math for machine learning

August 20th, 2014

Math for machine learning by Zygmunt Zając.

From the post:

Sometimes people ask what math they need for machine learning. The answer depends on what you want to do, but in short our opinion is that it is good to have some familiarity with linear algebra and multivariate differentiation.

Linear algebra is a cornerstone because everything in machine learning is a vector or a matrix. Dot products, distance, matrix factorization, eigenvalues etc. come up all the time.

Differentiation matters because of gradient descent. Again, gradient descent is almost everywhere*. It found its way even into the tree domain in the form of gradient boosting – a gradient descent in function space.

We file probability under statistics and that’s why we don’t mention it here.

Following this introduction you will find a series of books, MOOCs, etc. on linear algebra, calculus and other math resources.

Pass it along!
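If gradient descent is new to you, here is a minimal sketch in plain Python. The objective (squared distance to a fixed `target` vector), the learning rate, and the step count are assumptions picked so the result is easy to check by hand; nothing here comes from the post.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=200):
    """Repeatedly step against the gradient, starting from w0."""
    w = list(w0)
    for _ in range(steps):
        g = grad(w)
        w = [wi - learning_rate * gi for wi, gi in zip(w, g)]
    return w


# Hypothetical objective: f(w) = sum((w_i - target_i)^2), minimized at w = target.
target = [3.0, -1.0]


def grad_f(w):
    # Partial derivative of f with respect to w_i is 2 * (w_i - target_i).
    return [2.0 * (wi - ti) for wi, ti in zip(w, target)]


w_min = gradient_descent(grad_f, [0.0, 0.0])
```

Each step shrinks the distance to the minimum by a constant factor (0.8 with these settings), so `w_min` ends up very close to `target`. Real models swap in a more interesting `grad`, which is where the multivariate differentiation comes in.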

## Mapping Out Lambda Land:…

August 20th, 2014

From the post:

Anyone who has met me will probably know that I am wildly enthusiastic about functional programming (FP). I co-founded a group for women in FP, have presented a series of talks and workshops about functional concepts, and have even been known to create lambda-branded clothing and jewellery. In this blog post, I will try to give some insight into what the fuss is about. I will briefly explain what functional programming is, why you should care, and how you can use OpenShift to learn more about FP.

With the publicity around OpenShift and functional programming, it seems entirely reasonable to put them together.

Katie gives you a quick overview of functional programming along with resources and next steps for your OpenShift account.

I first saw this in a post by Jonathan Murray.

## Web Annotation Working Group (Preventing Semantic Rot)

August 20th, 2014

Web Annotation Working Group

From the post:

The W3C Web Annotation Working Group is chartered to develop a set of specifications for an interoperable, sharable, distributed Web annotation architecture. The chartered specs consist of:

1. Abstract Annotation Data Model
2. Data Model Vocabulary
3. Data Model Serializations
4. HTTP API
5. Client-side API

The working group intends to use the Open Annotation Data Model and Open Annotation Extension specifications, from the W3C Open Annotation Community Group, as a starting point for development of the data model specification.

The Robust Link Anchoring specification will be jointly developed with the WebApps WG, where many client-side experts and browser implementers participate.

Some good news for the middle of a week!

Shortcomings to watch for:

Can annotations be annotated?

Can non-Web addressing schemes be used by annotators?

Can the structure of files (visible or not) in addition to content be annotated?

If we don’t have all three of those capabilities, then the semantics of annotations will rot, just as semantics of earlier times have rotted away. The main distinction is that most of our ancestors didn’t choose to allow the rot to happen.

I first saw this in a tweet by Rob Sanderson.

## Not just the government’s playbook

August 20th, 2014

Not just the government’s playbook by Mike Loukides.

From the post:

Whenever I hear someone say that “government should be run like a business,” my first reaction is “do you know how badly most businesses are run?” Seriously. I do not want my government to run like a business — whether it’s like the local restaurants that pop up and die like wildflowers, or megacorporations that sell broken products, whether financial, automotive, or otherwise.

If you read some elements of the press, it’s easy to think that healthcare.gov is the first time that a website failed. And it’s easy to forget that a large non-government website was failing, in surprisingly similar ways, at roughly the same time. I’m talking about the Common App site, the site high school seniors use to apply to most colleges in the US. There were problems with pasting in essays, problems with accepting payments, problems with the app mysteriously hanging for hours, and more.

I don’t mean to pick on Common App; you’ve no doubt had your own experience with woefully bad online services: insurance companies, Internet providers, even online shopping. I’ve seen my doctor swear at the Epic electronic medical records application when it crashed repeatedly during an appointment. So, yes, the government builds bad software. So does private enterprise. All the time. According to TechRepublic, 68% of all software projects fail. We can debate why, and we can even debate the numbers, but there’s clearly a lot of software #fail out there — in industry, in non-profits, and yes, in government.

With that in mind, it’s worth looking at the U.S. CIO’s Digital Services Playbook. It’s not ideal, and in many respects, its flaws reveal its origins. But it’s pretty good, and should certainly serve as a model, not just for the government, but for any organization, small or large, that is building an online presence.

See Mike’s post for the extracted thirteen (13) principles (plays in Obama-speak) for software projects.

While everybody needs a reminder, what puzzles me is that none of the principles are new. That being the case, shouldn’t we be asking:

Why haven’t projects been following these rules?

If we (collectively) know what makes software projects succeed, what are the barriers to implementing those steps in all software projects?

Re-stating rules that we already know to be true, without more, isn’t very helpful. Projects that start tomorrow will have a fresh warning in their ears and still commit the same errors that doom 68% of all other projects.

My favorite suggestion and the one I have seen violated most often is:

Bring in experienced teams

I am told, “…our staff don’t know how to do X, Y or Z….” That sounds to me like a personnel problem. In an IT recession, a problem that isn’t hard to fix. But no, the project has to succeed with IT staff known to lack the project management or technical skills to succeed. You can guess the outcome of such projects in advance.

The restatement of project rules isn’t a bad thing to have but your real challenge is going to be following them. Suggestions for everyone’s benefit welcome!

## International Conference on Functional Programming 2014

August 20th, 2014

The 19th ACM SIGPLAN International Conference on Functional Programming: the complete proceedings of ICFP 2014 are available for free for one year.

I count thirty-one (31) papers that you can access for the next year.

Be aware the ACM has imposed a petty 10 second “wait” screen even though the articles are available without charge.

I first saw this in a tweet by David Van Horn.

## Exposing Resources in Datomic…

August 20th, 2014

Exposing Resources in Datomic Using Linked Data by Ratan Sebastian.

From the post:

Financial data feeds from various data providers tend to be closed off from most people due to high costs, licensing agreements, obscure documentation, and complicated business logic. The problem of understanding this data, and providing access to it for our application is something that we (and many others) have had to solve over and over again. Recently at Pellucid we were faced with three concrete problems

1. Adding a new data set to make data visualizations with. This one was a high-dimensional data set and we were certain that the queries that would be needed to make the charts had to be very parameterizable.

2. We were starting to come to terms with the difficulty of answering support questions about the data we use in our charts given that we were serving up the data using a Finagle service that spoke a binary protocol over TCP. Support staff should not have to learn Datomic’s highly expressive query language, Datalog or have to set up a Scala console to look at the raw data that was being served up.

3. Different data sets that we use had semantically equivalent data that was being accessed in ways specific to that data set.

And as a long-term goal we wanted to be able to query across data sets instead of doing multiple queries and joining in memory.

These are very orthogonal goals to be sure. We embarked on a project which we thought might move us in those three directions simultaneously. We’d already ingested the data set from the raw file format into Datomic, which we love. Goal 2 was easily addressable by conveying data over a more accessible protocol. And what’s more accessible than REST. Goal 1 meant that we’d have to expose quite a bit of Datalog expressivity to be able to write all the queries we needed. And Goal 3 hinted at the need for some way to talk about things in different data silos using a common vocabulary. Enter the Linked Data Platform. A W3C project, the need for which is brilliantly covered in this talk. What’s the connection? Wait for it…

The RDF Datomic Mapping

If you are happy with Datomic and RDF primitives, for semantic purposes, this may be all you need.

You have to appreciate Ratan’s closing sentiments:

We believe that a shared ontology of financial data could be very beneficial to many and open up the normally closeted world of handling financial data.

Even though we know as a practical matter that no “shared ontology of financial data” is likely to emerge.

In the absence of such a shared ontology, there are always topic maps.

## Deep Learning (MIT Press Book)

August 20th, 2014

Deep Learning (MIT Press Book) by Yoshua Bengio, Ian Goodfellow, and Aaron Courville.

From the webpage:

Draft chapters available for feedback – August 2014
Please help us make this a great book! This draft is still full of typos and can be improved in many ways. Your suggestions are more than welcome. Do not hesitate to contact any of the authors directly by e-mail or Google+ messages: Yoshua, Ian, Aaron.

Teaching a subject isn’t the only way to learn it cold. Proofing a book on a subject is another way to learn material cold.

I first saw this in a tweet by Gregory Piatetsky.

## Wandora 2014-08-20 Release

August 20th, 2014

Wandora 2014-08-20 Release

From the change log:

For a file with the distribution date, in case you have multiple versions, try http://sourceforge.net/projects/wandora/files/?source=navbar.

In the latest round of new features, Rekognition extractor and Alchemy API image keyword extractor are the two I am most likely to try first. Images are one of the weakest forms of evidence but they still carry the imprimatur of being “photographic.”

What photo collection will you be tagging first?

## High Performance With Apache Tez (webinar)

August 19th, 2014

From the post:

This week we continue our YARN webinar series with detailed introduction and a developer overview of Apache Tez. Designed to express fit-to-purpose data processing logic, Tez enables batch and interactive data processing applications spanning TB to PB scale datasets. Tez offers a customizable execution architecture that allows developers to express complex computations as dataflow graphs and allows for dynamic performance optimizations based on real information about the data and the resources required to process it.

Tez graduated to Apache top-level project in July 2014 and is now the workhorse of Apache Hive. With Tez, Hive 0.13 is of a magnitude faster than its previous generation. To learn more on Tez, join us on Thursday August 21st at 9 AM Pacific Time. We will review

• Tez Architecture
• Developer APIs
• Sample code

Discover and Learn

Something to get you in shape for the Fall!

## CRDTs: Consistency without consensus

August 19th, 2014

CRDTs: Consistency without consensus by Peter Bourgon.

Abstract:

When you think of distributed systems, you probably think in terms of consistency via consensus. That is, enabling a heterogeneous group of systems to agree on facts, while remaining robust in the face of failure. But, as any distributed systems developer can attest, it’s harder than it sounds. Failure happens in myriad, byzantine ways, and failure modes interact unpredictably. Reliable distributed systems need more than competent engineering: they need a robust theoretical foundation. CRDTs, or Convergent Replicated Data Types, are a set of properties or behaviors, discovered more than invented, which enable a distributed system to achieve consistency without consensus, and sidestep entire classes of problems altogether. This talk provides a practical introduction to CRDTs, and describes a production CRDT system built at SoundCloud to serve high-volume time-series data.

Slides: bbuzz14-peter_bourgon_0.pdf

This is very much worth your time!

Great discussion of data models after time mark 23:00 (approximately).

BTW, the system discussed is open source and in production: http://github.com/soundcloud/roshi
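To make “consistency without consensus” concrete, here is a minimal sketch of one of the simplest CRDTs, a grow-only counter. It is a textbook example chosen for brevity and makes no claim about how the SoundCloud system is built; the class name `GCounter` and its methods are my own.

```python
class GCounter:
    """Grow-only counter: one slot per replica, merged by element-wise max."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}

    def increment(self, amount=1):
        # A replica only ever writes to its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge no matter how merges are ordered or repeated.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self):
        return sum(self.counts.values())


a, b = GCounter("a"), GCounter("b")
a.increment()
a.increment()
b.increment(3)

# Exchange state in both directions; the repeated merge changes nothing.
a.merge(b)
b.merge(a)
a.merge(b)
```

After the exchange both replicas report the same total, and neither had to ask the other for permission before incrementing.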

## Solr-Wikipedia

August 19th, 2014

Solr-Wikipedia

From the webpage:

A collection of utilities for parsing WikiMedia XML dumps with the intent of indexing the content in Solr.

I haven’t tried this, yet, but utilities for major data sources are always welcome!

## Getting started in Clojure…

August 19th, 2014

Getting started in Clojure with IntelliJ, Cursive, and Gorilla

part 1: setup

part 2: workflow

From Part 1:

This video goes through, step-by-step, how to setup a productive Clojure development environment from scratch. This part looks at getting the software installed and running. The second part to this video (vimeo.com/103812557) then looks at the sort of workflow you could use with this environment.

If you follow through both videos you’ll end up with Leiningen, IntelliJ, Cursive Clojure and Gorilla REPL all configured to work together

Nothing surprising, but useful if you are just starting out.

## Seeing Things Art Historians Don’t

August 19th, 2014

When A Machine Learning Algorithm Studied Fine Art Paintings, It Saw Things Art Historians Had Never Noticed: Artificial intelligence reveals previously unrecognised influences between great artists

From the post:

The task of classifying pieces of fine art is hugely complex. When examining a painting, an art expert can usually determine its style, its genre, the artist and the period to which it belongs. Art historians often go further by looking for the influences and connections between artists, a task that is even trickier.

So the possibility that a computer might be able to classify paintings and find connections between them at first glance seems laughable. And yet, that is exactly what Babak Saleh and pals have done at Rutgers University in New Jersey.

These guys have used some of the latest image processing and classifying techniques to automate the process of discovering how great artists have influenced each other. They have even been able to uncover influences between artists that art historians have never recognised until now.

At first I thought the claim was that a computer saw something art historians did not. That’s not hard. The question is whether you can convince anyone else to see what you saw.

I stumbled a bit on figure 1 both in the post and in the paper. The caption for figure 1 in the article says:

Figure 1: An example of an often cited comparison in the context of influence. Left: Diego Velázquez’s Portrait of Pope Innocent X (1650), and, Right: Francis Bacon’s Study After Velázquez’s Portrait of Pope Innocent X (1953). Similar composition, pose, and subject matter but a different view of the work.

Well, not exactly. Bacon never saw the original Portrait of Pope Innocent X but produced over forty-five variants of it. It wasn’t a question of “influence” but of subsequent interpretations of the portrait. Not really the same thing as influence. See: Study after Velázquez’s Portrait of Pope Innocent X

I feel certain this will be a useful technique for exploration, but naming objects in a painting would result in a large number of paintings of popes sitting in chairs, some of which may or may not have been “influences” on subsequent artists.

Or to put it another way, concluding influence, based on when artists lived, is a post hoc ergo propter hoc fallacy. Good technique to find possible places to look but not a definitive answer.

The original post was based on: Toward Automated Discovery of Artistic Influence

Abstract:

Considering the huge amount of art pieces that exist, there is valuable information to be discovered. Examining a painting, an expert can determine its style, genre, and the time period that the painting belongs. One important task for art historians is to find influences and connections between artists. Is influence a task that a computer can measure? The contribution of this paper is in exploring the problem of computer-automated suggestion of influences between artists, a problem that was not addressed before in a general setting. We first present a comparative study of different classification methodologies for the task of fine-art style classification. A two-level comparative study is performed for this classification problem. The first level reviews the performance of discriminative vs. generative models, while the second level touches the features aspect of the paintings and compares semantic-level features vs. low-level and intermediate-level features present in the painting. Then, we investigate the question “Who influenced this artist?” by looking at his masterpieces and comparing them to others. We pose this interesting question as a knowledge discovery problem. For this purpose, we investigated several painting-similarity and artist-similarity measures. As a result, we provide a visualization of artists (Map of Artists) based on the similarity between their works

I first saw this in a tweet by yarapavan.

## Deep Learning for NLP (without Magic)

August 19th, 2014

Deep Learning for NLP (without Magic) by Richard Socher and Christopher Manning.

Abstract:

Machine learning is everywhere in today’s NLP, but by and large machine learning amounts to numerical optimization of weights for human designed representations and features. The goal of deep learning is to explore how computers can take advantage of data to develop features and representations appropriate for complex interpretation tasks. This tutorial aims to cover the basic motivation, ideas, models and learning algorithms in deep learning for natural language processing. Recently, these methods have been shown to perform very well on various NLP tasks such as language modeling, POS tagging, named entity recognition, sentiment analysis and paraphrase detection, among others. The most attractive quality of these techniques is that they can perform well without any external hand-designed resources or time-intensive feature engineering. Despite these advantages, many researchers in NLP are not familiar with these methods. Our focus is on insight and understanding, using graphical illustrations and simple, intuitive derivations. The goal of the tutorial is to make the inner workings of these techniques transparent, intuitive and their results interpretable, rather than black boxes labeled “magic here”. The first part of the tutorial presents the basics of neural networks, neural word vectors, several simple models based on local windows and the math and algorithms of training via backpropagation. In this section applications include language modeling and POS tagging. In the second section we present recursive neural networks which can learn structured tree outputs as well as vector representations for phrases and sentences. We cover both equations as well as applications. We show how training can be achieved by a modified version of the backpropagation algorithm introduced before. These modifications allow the algorithm to work on tree structures. Applications include sentiment analysis and paraphrase detection. 
We also draw connections to recent work in semantic compositionality in vector spaces. The principal goal, again, is to make these methods appear intuitive and interpretable rather than mathematically confusing. By this point in the tutorial, the audience members should have a clear understanding of how to build a deep learning system for word-, sentence- and document-level tasks. The last part of the tutorial gives a general overview of the different applications of deep learning in NLP, including bag of words models. We will provide a discussion of NLP-oriented issues in modeling, interpretation, representational power, and optimization.

A tutorial on deep learning from NAACL 2013, Atlanta. The webpage offers links to the slides (205), video of the tutorial, and additional resources.

Definitely a place to take a dive into deep learning.

On page 35 of the slides the following caught my eye:

The vast majority of rule-based and statistical NLP work regards words as atomic symbols: hotel, conference, walk.

In vector space terms, this is a vector with one 1 and a lot of zeroes.

[000000000010000]

Dimensionality: 20K (speech) – 50K (PTB) – 500K (big vocab) – 13M (Google 1T)

We call this a “one-hot” representation. Its problem:

motel [000000000010000] AND
hotel [000000010000000] = 0
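The problem on the slide is easy to reproduce. Here is a minimal sketch in Python; the four-word vocabulary and the helper names `one_hot` and `dot` are assumptions for illustration. The slide uses a bitwise AND, and the dot product used here gives the same answer for 0/1 vectors: zero.

```python
def one_hot(word, vocabulary):
    """Return a vector with a 1 in the word's position and 0 elsewhere."""
    return [1 if w == word else 0 for w in vocabulary]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy vocabulary; real ones run to tens of thousands of words or more.
vocabulary = ["conference", "hotel", "motel", "walk"]

motel = one_hot("motel", vocabulary)
hotel = one_hot("hotel", vocabulary)

similarity = dot(motel, hotel)       # different words never overlap
self_similarity = dot(hotel, hotel)  # a word only matches itself
```

Any two distinct words are equally (un)related under this representation, which is exactly the complaint: “motel” is no closer to “hotel” than it is to “walk”.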


Another aspect of topic maps comes to the fore!

You can have “one-hot” representations of subjects in a topic map, that is a single identifier, but that’s not required.

You can have multiple “one-hot” representations for a subject or you can have more complex collections of properties that represent a subject. Depends on your requirements, not a default of the technology.

If “one-hot” representations of subjects are insufficient for deep learning, shouldn’t they be insufficient for humans as well?

## Complete Antarctic Map

August 19th, 2014

Waterloo makes public most complete Antarctic map for climate research

From the post:

The University of Waterloo has unveiled a new satellite image of Antarctica, and the imagery will help scientists all over the world gain new insight into the effects of climate change.

Thanks to a partnership between the Canadian Space Agency (CSA), MacDonald, Dettwiler and Associates Ltd. (MDA), the prime contractor for the RADARSAT-2 program, and the Canadian Cryospheric Information Network (CCIN) at UWaterloo, the mosaic is free and fully accessible to the academic world and the public.

Using Synthetic Aperture Radar with multiple polarization modes aboard the RADARSAT-2 satellite, the CSA collected more than 3,150 images of the continent in the autumn of 2008, comprising a single pole-to-coast map covering all of Antarctica. This is the first such map of the area since RADARSAT-1 created one in 1997.

You can access the data at: Polar Data Catalogue.

From the Catalogue homepage:

The Polar Data Catalogue is a database of metadata and data that describes, indexes, and provides access to diverse data sets generated by Arctic and Antarctic researchers. The metadata records follow ISO 19115 and Federal Geographic Data Committee (FGDC) standard formats to provide exchange with other data centres. The records cover a wide range of disciplines from natural sciences and policy, to health and social sciences. The PDC Geospatial Search tool is available to the public and researchers alike and allows searching data using a mapping interface and other parameters.

What data would you associate with such a map?

I first saw this at: Most complete Antarctic map for climate research made public.

August 18th, 2014

From the post:

Inspired by Brent Yorgey, I’m finally going public with a draft of my dissertation!

My thesis is that a certain kind of data structures, which I call “lattice-based data structures” or “LVars” for short, lend themselves well to guaranteed-deterministic parallel programming. My dissertation combines material from various already-published papers, making it a three-papers-stapled-together dissertation in some sense, but I’m also retconning a lot of my work to make it tell the story I want to tell now.

When people ask what the best introduction to LVars is, I have trouble recommending the first LVars paper; even though it was only published a year ago, my thinking has changed quite a lot as my collaborators and I have figured things out since then, and the paper doesn’t match the way I like to present things now. So I’m hoping that my dissertation will be something I can point to as the definitive introduction to LVars.

The latest draft is here; it’s automatically updated every time I commit to the repo.2 Because I thought it might be useful to my committee to see my thought process, I left my “peanut gallery” comments in there: notes to myself begin with “LK: ” and are in red, and TODOs are in a brighter red and begin with “TODO: ”. And, as you can see, there are still many TODOs — but it’s more or less starting to look like a dissertation. (Unlike Brent, I’m doing this a little late in the game: I’ve actually already sent a draft to my committee, and my defense is in only three weeks, on September 8. Still, I’m happy for any feedback, even at this late date; I probably won’t turn in a final version until some time after my defense, so there’s no rush.)

I’ll echo Brent in saying that if you notice typos or grammatical errors, feel free to put in a pull request. However, if you have any more substantial comments or suggestions, please send me an email (lindsey at this domain) instead.

What do you say?

Ready to offer some eyes on a proposal for guaranteed-deterministic parallel programming?

I’m interested both from the change tracking perspective of ODF as well as parallel processing of topic maps.
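For readers who want something concrete before opening the draft, here is a toy sketch of the idea as I understand it from the LVars papers, not code from the dissertation. It assumes the simplest possible lattice (sets ordered by inclusion, with union as the join) and a single-element threshold; the names `SetLVar` and `run_once` are made up for illustration.

```python
import threading


class SetLVar:
    """Toy LVar whose states are sets ordered by inclusion (join = union)."""

    def __init__(self):
        self._state = frozenset()
        self._cond = threading.Condition()

    def put(self, element):
        # A write takes the least upper bound (here: union) of the old state
        # and the new contribution, so the state can only grow.
        with self._cond:
            self._state = self._state | {element}
            self._cond.notify_all()

    def get(self, threshold):
        # A threshold read blocks until the state is at or above `threshold`,
        # then returns the threshold itself rather than the full state, so the
        # answer does not depend on the order in which writes arrived.
        threshold = frozenset(threshold)
        with self._cond:
            self._cond.wait_for(lambda: threshold <= self._state)
            return threshold


def run_once(order):
    """Start one writer thread per element, then do a threshold read."""
    lvar = SetLVar()
    writers = [threading.Thread(target=lvar.put, args=(x,)) for x in order]
    for w in writers:
        w.start()
    result = lvar.get({"a", "b"})
    for w in writers:
        w.join()
    return result


# Two different write orders; the reader observes the same value either way.
results = {run_once(order) for order in (["a", "b", "c"], ["c", "b", "a"])}
```

The reader never sees whether “c” has landed yet, which is the point: the only observations allowed are ones that every schedule agrees on.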

## Data Science at the Command Line [Webcast Weds. 20 Aug. 2014]

August 18th, 2014

Data Science at the Command Line by Jeroen Janssens.

From the post:

Data Science at the Command Line is a new book written by Jeroen Janssens. This website currently contains information about this Wednesday’s webcast, instructions on how to install the Data Science Toolbox, and an overview of all the command-line tools discussed in the book.

I count eighty-one (81) command line tools listed with short explanations. That alone makes it worth visiting the page.

BTW, there is a webcast Wednesday:

On August 20, 2014 at 17:00 UTC, I’ll be doing a two-hour webcast hosted by O’Reilly Media. Attendance is free, but you do need to sign up. This event will be recorded and shared afterwards.

During this hands-on webcast, you’ll be able to interact not only with me, but also with other attendants. (So far, about 1200 people have signed up!) This means that in two hours, you can learn a lot about how to use the command line for doing data science.

Enjoy!

I first saw this in a tweet by Stat Fact.

## Topic Maps Are For Data Janitors

August 18th, 2014

From the post:

Yet far too much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required. Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.

“Data wrangling is a huge — and surprisingly so — part of the job,” said Monica Rogati, vice president for data science at Jawbone, whose sensor-filled wristband and software track activity, sleep and food consumption, and suggest dietary and health tips based on the numbers. “It’s something that is not appreciated by data civilians. At times, it feels like everything we do.”

“It’s an absolute myth that you can send an algorithm over raw data and have insights pop up,” said Jeffrey Heer, a professor of computer science at the University of Washington and a co-founder of Trifacta, a start-up based in San Francisco.

Data formats are one challenge, but so is the ambiguity of human language. Iodine, a new health start-up, gives consumers information on drug side effects and interactions. Its lists, graphics and text descriptions are the result of combining the data from clinical research, government reports and online surveys of people’s experience with specific drugs.

But the Food and Drug Administration, National Institutes of Health and pharmaceutical companies often apply slightly different terms to describe the same side effect. For example, “drowsiness,” “somnolence” and “sleepiness” are all used. A human would know they mean the same thing, but a software algorithm has to be programmed to make that interpretation. That kind of painstaking work must be repeated, time and again, on data projects.

Plenty of progress is still to be made in easing the analysis of data. “We really need better tools so we can spend less time on data wrangling and get to the sexy stuff,” said Michael Cavaretta, a data scientist at Ford Motor, which has used big data analysis to trim inventory levels and guide changes in car design.

Mr. Cavaretta is familiar with the work of ClearStory, Trifacta, Paxata and other start-ups in the field. “I’d encourage these start-ups to keep at it,” he said. “It’s a good problem, and a big one.”

Topic maps were only fifteen (15) years ahead of Big Data’s need for them.

How do you avoid:

That kind of painstaking work must be repeated, time and again, on data projects.

?

By annotating data once using a topic map and re-using that annotation over and over again.

By creating already annotated data using a topic map and reusing that annotation over and over again.

Recalling that topic map annotations can represent “logic” but more importantly, can represent any human insight that can be expressed about data.
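A minimal sketch of “annotate once, reuse” using the side-effect terms from Lohr’s article. This is plain Python rather than any topic map API, and the subject identifiers, source names, and the `merge_reports` helper are all invented for illustration.

```python
# Hypothetical annotation table: each variant term points at one subject.
# It is written once and can be reused by every later project.
SIDE_EFFECT_SUBJECTS = {
    "drowsiness": "side-effect/drowsiness",
    "somnolence": "side-effect/drowsiness",
    "sleepiness": "side-effect/drowsiness",
}


def merge_reports(reports, subjects=SIDE_EFFECT_SUBJECTS):
    """Group (source, term) reports under the subject each term names."""
    merged = {}
    for source, term in reports:
        key = term.strip().lower()
        # Terms nobody has annotated yet are kept apart rather than guessed at.
        subject = subjects.get(key, "unmapped/" + key)
        merged.setdefault(subject, []).append(source)
    return merged


# Made-up reports; the source labels are placeholders, not real data.
reports = [
    ("source A", "Drowsiness"),
    ("source B", "somnolence"),
    ("source C", "sleepiness"),
    ("source D", "nausea"),
]
merged = merge_reports(reports)
```

The three synonyms land under one subject, and the unannotated term is flagged instead of being silently merged. A real topic map would carry far more than a lookup table, but the reuse argument is the same.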

See Lohr’s post for startups and others who are talking about a problem the topic maps community solved fifteen years ago.

## AverageExplorer:…

August 17th, 2014

AverageExplorer: Interactive Exploration and Alignment of Visual Data Collections, Jun-Yan Zhu, Yong Jae Lee, and Alexei Efros.

Abstract:

This paper proposes an interactive framework that allows a user to rapidly explore and visualize a large image collection using the medium of average images. Average images have been gaining popularity as means of artistic expression and data visualization, but the creation of compelling examples is a surprisingly laborious and manual process. Our interactive, real-time system provides a way to summarize large amounts of visual data by weighted average(s) of an image collection, with the weights reflecting user-indicated importance. The aim is to capture not just the mean of the distribution, but a set of modes discovered via interactive exploration. We pose this exploration in terms of a user interactively “editing” the average image using various types of strokes, brushes and warps, similar to a normal image editor, with each user interaction providing a new constraint to update the average. New weighted averages can be spawned and edited either individually or jointly. Together, these tools allow the user to simultaneously perform two fundamental operations on visual data: user-guided clustering and user-guided alignment, within the same framework. We show that our system is useful for various computer vision and graphics applications.

Applying averaging to images, particularly in an interactive context with users, seems like a very suitable strategy.

What would it look like to have interactive merging of proxies based on data ranges controlled by the user?
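The core operation in the abstract, a weighted average with weights reflecting user-indicated importance, is easy to sketch. The following is not the authors’ system; it assumes tiny grayscale images stored as lists of rows, and the function name is my own.

```python
def weighted_average(images, weights):
    """Per-pixel weighted mean of equally sized grayscale images.

    Each image is a list of rows of numbers; weights are non-negative.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    rows, cols = len(images[0]), len(images[0][0])
    return [
        [
            sum(w * img[r][c] for img, w in zip(images, weights)) / total
            for c in range(cols)
        ]
        for r in range(rows)
    ]


dark = [[0, 0], [0, 0]]
light = [[100, 100], [100, 100]]

# Equal weights give the plain mean; raising one weight pulls the
# average toward that image.
plain_mean = weighted_average([dark, light], [1, 1])
favor_light = weighted_average([dark, light], [1, 3])
```

Swap “image” for “proxy” and “weight” for a user-controlled data range and you have a starting point for the interactive merging question above.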

## Value-Loss Conduits?

August 17th, 2014

Do you remove links from materials that you quote?

I ask because of the following example:

The research, led by Alexei Efros, associate professor of electrical engineering and computer sciences, will be presented today (Thursday, Aug. 14) at the International Conference and Exhibition on Computer Graphics and Interactive Techniques, or SIGGRAPH, in Vancouver, Canada.

“Visual data is among the biggest of Big Data,” said Efros, who is also a member of the UC Berkeley Visual Computing Lab. “We have this enormous collection of images on the Web, but much of it remains unseen by humans because it is so vast. People have called it the dark matter of the Internet. We wanted to figure out a way to quickly visualize this data by systematically ‘averaging’ the images.”

Which is a quote from: New tool makes a single picture worth a thousand – and more – images by Sarah Yang.

ScienceDaily reprinted those passages as:

The research, led by Alexei Efros, associate professor of electrical engineering and computer sciences, was presented Aug. 14 at the International Conference and Exhibition on Computer Graphics and Interactive Techniques, or SIGGRAPH, in Vancouver, Canada.

“Visual data is among the biggest of Big Data,” said Efros, who is also a member of the UC Berkeley Visual Computing Lab. “We have this enormous collection of images on the Web, but much of it remains unseen by humans because it is so vast. People have called it the dark matter of the Internet. We wanted to figure out a way to quickly visualize this data by systematically ‘averaging’ the images.”

Why leave out the hyperlinks for SIGGRAPH and the Visual Computing Laboratory?

Or for that matter, the link to the original paper: AverageExplorer: Interactive Exploration and Alignment of Visual Data Collections (ACM Transactions on Graphics, SIGGRAPH paper, August 2014) which appeared in the news release.

If so, we need to clue in ScienceDaily and other content repackagers: at the least, include the hyperlinks passed on to them.

If you can’t be a value-add, at least don’t be a value-loss conduit.

## TCP Stealth

August 17th, 2014

From the post:

System administrators who aren’t down with spies commandeering their servers might want to pay attention to this one: A Friday article in German security publication Heise provided technical detail on a GCHQ program called HACIENDA, which the British spy agency apparently uses to port-scan entire countries, and the authors have come up with an Internet Engineering Task Force draft for a new technique to counter this program.

The refreshing aspect of this vulnerability is that the details are being discussed in public, as is a partial solution.

Perhaps this is a step towards transparency for cybersecurity. Keeping only malicious actors and “security researchers” in the loop hasn’t worked out so well.

Whether governments fall into “malicious actors” or “security researchers” I leave to your judgement.

## Bizarre Big Data Correlations

August 17th, 2014

Chance News 99 reported the following story:

The online lender ZestFinance Inc. found that people who fill out their loan applications using all capital letters default more often than people who use all lowercase letters, and more often still than people who use uppercase and lowercase letters correctly.

ZestFinance Chief Executive Douglas Merrill says the company looks at tens of thousands of signals when making a loan, and it doesn’t consider the capital-letter factor as significant as some other factors—such as income when linked with expenses and the local cost of living.

So while it may take capital letters into consideration when evaluating an application, it hasn’t held a loan up because of it.

Submitted by Paul Alper

If it weren’t an “online lender,” ZestFinance could take into account applications signed in crayon.

Chance News collects stories with a statistical or probability angle. Some of them can be quite amusing.
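For the curious, the capital-letter “signal” takes only a few lines to compute. This is a guess at what such a feature might look like, not ZestFinance’s code; the category names are mine.

```python
def capitalization_style(text):
    """Classify a form entry by how its letters are cased."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "no letters"
    if all(ch.isupper() for ch in letters):
        return "all caps"
    if all(ch.islower() for ch in letters):
        return "all lowercase"
    # Anything else mixes cases; this does not check that the
    # capitalization is actually correct.
    return "mixed"


styles = [capitalization_style(s) for s in ("JOHN SMITH", "john smith", "John Smith")]
```

Computing the feature is the easy part. Deciding whether it means anything is where the statistics come in.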

## Titan 0.5 Released!

August 16th, 2014

Titan 0.5 Released!

From the Titan documentation:

1.1. General Titan Benefits

• Support for very large graphs. Titan graphs scale with the number of machines in the cluster.
• Support for very many concurrent transactions and operational graph processing. Titan’s transactional capacity scales with the number of machines in the cluster and answers complex traversal queries on huge graphs in milliseconds.
• Support for global graph analytics and batch graph processing through the Hadoop framework.
• Support for geo, numeric range, and full text search for vertices and edges on very large graphs.
• Native support for the popular property graph data model exposed by Blueprints.
• Native support for the graph traversal language Gremlin.
• Easy integration with the Rexster graph server for programming language agnostic connectivity.
• Numerous graph-level configurations provide knobs for tuning performance.
• Vertex-centric indices provide vertex-level querying to alleviate issues with the infamous super node problem.
• Provides an optimized disk representation to allow for efficient use of storage and speed of access.
• Open source under the liberal Apache 2 license.

A major milestone in the development of Titan!

If you are interested in serious graph processing, Titan is one of the systems that should be on your short list.
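To see why vertex-centric indices help with super nodes, here is a toy illustration in plain Python. It is not Titan’s API or storage format; it only assumes that each vertex keeps its incident edges grouped by label and sorted by one key, so a query can jump to the slice it needs instead of scanning every edge. `ToyVertex` and its methods are hypothetical.

```python
import bisect
from collections import defaultdict


class ToyVertex:
    """Keeps incident edges grouped by label and sorted by one sort key."""

    def __init__(self):
        # label -> sorted list of (sort_key, neighbor) pairs
        self._edges = defaultdict(list)

    def add_edge(self, label, sort_key, neighbor):
        bisect.insort(self._edges[label], (sort_key, neighbor))

    def edges_between(self, label, low, high):
        """Neighbors over `label` edges with low <= sort_key < high."""
        edges = self._edges.get(label, [])
        # A 1-tuple (k,) sorts before every (k, neighbor) pair, so these
        # binary searches find the boundaries without a linear scan.
        start = bisect.bisect_left(edges, (low,))
        stop = bisect.bisect_left(edges, (high,))
        return [neighbor for _, neighbor in edges[start:stop]]


# A "super node" with a thousand edges of one label, keyed by day number.
hub = ToyVertex()
for day in range(1, 1001):
    hub.add_edge("follows", day, "user-%d" % day)
hub.add_edge("likes", 5, "post-1")

recent = hub.edges_between("follows", 998, 1001)
```

The lookup touches only the “follows” edges in the requested range, however many other edges the vertex has.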

PS: Matthias Broecheler has posted Titan 0.5.0 GA Release, which has links to upgrade instructions and comments about a future Titan 1.0 release!

August 15th, 2014

our new robo-reader overlords by Alan Jacobs.

After you read this post by Jacobs, be sure to spend time with Flunk the robo-graders by Les Perelman (quoted by Jacobs).

Both raise the issue of what sort of writing can be taught by algorithms that have no understanding of writing.

In a very real sense, the outcome can only be writing that meets but does not exceed what has been programmed into an algorithm.

That is frightening enough for education, but if you are relying on AI or machine learning for intelligence analysis, your stakes may be far higher.

To be sure, software can recognize “send the atomic bomb triggers by Federal Express to this address….,” or at least I hope that is within the range of current software. But what if the message is: “The destroyer of worlds will arrive next week.” Alert? Yes/No? What if it was written in Sanskrit?

I think computers, along with AI and machine learning, can be valuable tools, but not if they are setting the standard for review. At least not if you don’t want to dumb down writing and national security intelligence to the level of an algorithm.

I first saw this in a tweet by James Schirmer.

## Data Science (StackExchange Beta)

August 15th, 2014

Data Science

Data science has a StackExchange in beta!

A great place to demonstrate your data science chops!

I first saw this in a tweet by Christophe Lalanne.

## Applauding The Ends, Not The Means

August 15th, 2014

From the post:

Microsoft is also scanning for child-abuse images.

A recent tip-off from Microsoft to the National Center for Missing & Exploited Children (NCMEC) hotline led to the arrest on 31 July 2014 of a 20-year-old Pennsylvanian man in the US.

According to the affidavit of probable cause, posted on Smoking Gun, Tyler James Hoffman has been charged with receiving and sharing child-abuse images.

Shades of the days when Kodak would censor film submitted for development.

Lisa reviews the PhotoDNA techniques used by Microsoft and concludes:

The recent successes of PhotoDNA in leading both Microsoft and Google to ferret out child predators is a tribute to Microsoft’s development efforts in coming up with a good tool in the fight against child abuse.

In this particular instance, given this particular use of hash identifiers, it sounds as though those innocent of this particular type of crime have nothing to fear from automated email scanning.

No sane person supports child abuse, so the outcome of the case doesn’t bother me.

However, the use of PhotoDNA isn’t limited to photos of abused children. The same technique could be applied to photos of police officers abusing protesters (wonder where you would find those?), etc.
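To make the point concrete: PhotoDNA itself is proprietary and I am not reproducing it here. The sketch below uses a generic “average hash” as a stand-in, on made-up 2x2 grayscale grids, to show that hash matching knows nothing about what a picture depicts. All function names are my own.

```python
def average_hash(pixels):
    """Bit tuple: 1 where a pixel is brighter than the image mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)


def hamming(a, b):
    """Number of positions where two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def matches_known(pixels, known_hashes, max_distance=1):
    """True if the image's hash is close to any hash on the list."""
    h = average_hash(pixels)
    return any(hamming(h, k) <= max_distance for k in known_hashes)


original = [[10, 200], [220, 30]]
brightened = [[20, 210], [230, 40]]  # same picture, slightly lighter
unrelated = [[200, 10], [30, 220]]

# Whoever maintains this list decides what gets flagged.
known = {average_hash(original)}

flag_brightened = matches_known(brightened, known)
flag_unrelated = matches_known(unrelated, known)
```

The matcher flags the lightly altered copy and ignores the unrelated image. Nothing in the code cares what is on the list, only that an image is on it.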

Before anyone applauds Microsoft for taking the role of censor (in the Roman sense), remember that corporate policies change. The goals of email scanning may not be so agreeable tomorrow.