## Archive for January, 2013

### …Repeated Hedonic Experiences

Monday, January 28th, 2013

The Temporal and Focal Dynamics of Volitional Reconsumption: A Phenomenological Investigation of Repeated Hedonic Experiences by Cristel Antonia Russell and Sidney J. Levy, Journal of Consumer Research, Vol. 39, No. 2 (August 2012), pp. 341-359.

Abstract:

Volitional reconsumption refers to experiences that consumers actively and consciously seek to experience again. Phenomenological interviews centered on the rereading of books, the rewatching of movies, and the revisiting of geographic places reveal the temporal and focal dimensions of hedonic volitional reconsumption phenomenon and five dominant categories therein. Consumers navigate within and between reconsumption experiences in a hyperresponsive and experientially controlled manner. The dynamics in time and focus fueled by the reconsumed object allow emotional efficiency, as consumers optimize the search for and attainment of the emotional outcomes sought in volitional reconsumption, and facilitate existential understanding, as the linkages across past, present, and future experiences enable an active synthesis of time and promote self-reflexivity. Consumers gain richer and deeper insights into the reconsumption object itself but also an enhanced awareness of their own growth in understanding and appreciation through the lens of the reconsumption object.

The research doesn’t mention any of my favorite repeated hedonic experiences but each to his own. 😉

Still, there is much here that will be useful to those investigating/testing interface designs.

Which do you think will attract more users?

An interface that is remembered with fear and loathing, or one that made you look smart or helped you impress the boss with your work?

Remember that utility, in some objective sense, is probably far down the ladder.

How else to explain users wading through search engine results every day? Search results are not without utility, but the minimum level is a pretty low bar.

### What the Blind Eye Sees…

Monday, January 28th, 2013

What the Blind Eye Sees: Incidental Change Detection as a Source of Perceptual Fluency by Stewart A. Shapiro and Jesper H. Nielsen. Journal of Consumer Research, http://www.jstor.org/stable/10.1086/667852.

Abstract:

As competition for consumer attention continues to increase, marketers must depend in part on effects from advertising exposure that result from less deliberate processing. One such effect is processing fluency. Building on the change detection literature, this research brings a dynamic perspective to fluency research. Three experiments demonstrate that brand logos and product depictions capture greater fluency when they change location in an advertisement from one exposure to the ad to the next. As a consequence, logo preference and brand choice are enhanced. Evidence shows that spontaneous detection of the location change instigates this process and that change detection is incidental in nature; participants in all three experiments were unable to accurately report which brand logos or product depictions changed location across ad exposures. These findings suggest that subtle changes to ad design across repeated exposures can facilitate variables of import to marketers, even when processing is minimal.

Does this have implications for graphic presentation of data?

That is, should parts of a data visualization that you wish to highlight change slightly upon each presentation of the data?

Or would the same be true for presentation of important content to remember?

### Building a grammar for statistical graphics in Clojure

Monday, January 28th, 2013

Building a grammar for statistical graphics in Clojure by Kevin Lynagh.

From the description:

Our data is typically optimized for use by computers; what would it be like if we optimized for humans? This talk introduces a grammar of graphics for concisely expressing rich data visualizations. The grammar, implemented in Clojure, consists of simple data structures and can be used across the JVM and via JSON. This talk will cover principles of effective data visualization and the benefits of using data structures as an “API”. There will be lots of pictures and a touch of code.

Fear not! It isn’t dull at all.

Part 1 starts with a quick overview of visualization, followed by aesthetic rules for graphics and a short discussion of D3 and solutions.

Part 2 starts with mentions of The Grammar of Graphics by Leland Wilkinson and ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham. Kevin doesn’t like the R logo (I don’t guess you can please everyone) or R in general.

Suggests that a more modern language, Clojure, which is based on Lisp (another modern language?) ;-), is easier to use. I leave religious debates about languages to others.

I do think he has a good point about decomplecting functions.
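The “data structures as an API” point is easy to demo. Here is a sketch in Python rather than Clojure, purely for illustration — the spec shape below is mine, not C2PO’s: the plot is plain data, and a tiny interpreter walks it. Anything that can produce this shape (including a JSON document) is a valid caller.

```python
# A plot specification as plain data: no objects, no method chaining.
spec = {
    "data": [{"x": 1, "y": 2}, {"x": 2, "y": 5}, {"x": 3, "y": 3}],
    "mark": "point",
    "encoding": {"x": "x", "y": "y"},
}

def render(spec):
    """Interpret the spec into a list of drawing commands
    (a stand-in for emitting SVG or D3 calls)."""
    x_field = spec["encoding"]["x"]
    y_field = spec["encoding"]["y"]
    return [(spec["mark"], row[x_field], row[y_field])
            for row in spec["data"]]

print(render(spec))  # [('point', 1, 2), ('point', 2, 5), ('point', 3, 3)]
```

Because the spec is just data, it can be generated, diffed, stored, and transformed with ordinary collection functions — which is the decomplecting argument in miniature.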

More materials:

Handout for the presentation. Useful additional references.

C2PO (in private alpha). The Clojure library demonstrated in the presentation.

### Learn how to make Data Visualizations with D3.js

Monday, January 28th, 2013

Learn how to make Data Visualizations with D3.js

Part 1 – From Zero to Binding Data

Part 2 – Using Data to Create Data Visualizations

Consider signing up for the D3.js Weekly Newsletter while at Dashing D3.js. You won’t be disappointed.

### …Everything You Always Wanted to Know About Genes

Monday, January 28th, 2013

Toward a New Model of the Cell: Everything You Always Wanted to Know About Genes

From the post:

Turning vast amounts of genomic data into meaningful information about the cell is the great challenge of bioinformatics, with major implications for human biology and medicine. Researchers at the University of California, San Diego School of Medicine and colleagues have proposed a new method that creates a computational model of the cell from large networks of gene and protein interactions, discovering how genes and proteins connect to form higher-level cellular machinery.

“Our method creates ontology, or a specification of all the major players in the cell and the relationships between them,” said first author Janusz Dutkowski, PhD, postdoctoral researcher in the UC San Diego Department of Medicine. It uses knowledge about how genes and proteins interact with each other and automatically organizes this information to form a comprehensive catalog of gene functions, cellular components, and processes.

“What’s new about our ontology is that it is created automatically from large datasets. In this way, we see not only what is already known, but also potentially new biological components and processes — the bases for new hypotheses,” said Dutkowski.

Originally devised by philosophers attempting to explain the nature of existence, ontologies are now broadly used to encapsulate everything known about a subject in a hierarchy of terms and relationships. Intelligent information systems, such as iPhone’s Siri, are built on ontologies to enable reasoning about the real world. Ontologies are also used by scientists to structure knowledge about subjects like taxonomy, anatomy and development, bioactive compounds, disease and clinical diagnosis.

A Gene Ontology (GO) exists as well, constructed over the last decade through a joint effort of hundreds of scientists. It is considered the gold standard for understanding cell structure and gene function, containing 34,765 terms and 64,635 hierarchical relations annotating genes from more than 80 species.

“GO is very influential in biology and bioinformatics, but it is also incomplete and hard to update based on new data,” said senior author Trey Ideker, PhD, chief of the Division of Genetics in the School of Medicine and professor of bioengineering in UC San Diego’s Jacobs School of Engineering.

The conclusion to A gene ontology inferred from molecular networks (Janusz Dutkowski, Michael Kramer, Michal A Surma, Rama Balakrishnan, J Michael Cherry, Nevan J Krogan & Trey Ideker, Nature Biotechnology 31, 38–45 (2013) doi:10.1038/nbt.2463), illustrates a difference between ontology in the GO sense and that produced by the authors:

The research reported in this manuscript raises the possibility that, given the appropriate tools, ontologies might evolve over time with the addition of each new network map or high-throughput experiment that is published. More importantly, it enables a philosophical shift in bioinformatic analysis, from a regime in which the ontology is viewed as gold standard to one in which it is the major result. (emphasis added)

Ontology as representing reality as opposed to declaring it.

That is a novel concept.

### Pathfinding Algorithms for Changing Graphs

Monday, January 28th, 2013

Pathfinding Algorithms for Changing Graphs

The algorithms are summarized at the end of the discussion.

The increased interest in graph based information processing promises this will be an active area of research.

I first saw this in a tweet by GraphHopper.
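The incremental idea behind algorithms for changing graphs (D* Lite being one of the usual suspects) can be stated simply: don’t re-plan from scratch unless a change actually touches the route you are following. A naive sketch of that idea — mine, not code from the discussion:

```python
import heapq

def dijkstra(graph, start, goal):
    """Plain Dijkstra; graph is {node: {neighbor: weight}}.
    Assumes goal is reachable from start."""
    dist, prev = {start: 0}, {}
    pq = [(0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path, node = [], goal
    while node != start:
        path.append(node)
        node = prev[node]
    path.append(start)
    return list(reversed(path))

def replan_if_needed(graph, path, changed_edge, start, goal):
    """Cheap incremental check: only re-plan when the changed edge
    lies on the path currently being followed."""
    on_path = any((a, b) == changed_edge for a, b in zip(path, path[1:]))
    return dijkstra(graph, start, goal) if on_path else path
```

Real incremental algorithms reuse far more state than this (they repair the distance estimates themselves), but the trigger condition — “did the change affect me?” — is the common thread.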

### Planets, Stars and Stellar Systems: Volume 2: Astronomical Techniques, Software, and Data

Monday, January 28th, 2013

A data mining book I am unlikely to ever see.

Why? The “discount” price at Amazon saves me $93.36. That should be a hint. List price? $509.00; discount price: $415.64; for a 550-page book.

From the description:

This volume on “Astronomical Techniques, Software, and Data” edited by Howard E. Bond presents accessible review chapters on Astronomical Photometry, Astronomical Spectroscopy, Sky Surveys, Absolute Calibration of Spectrophotometric Standard Stars, Astronomical Polarimetry: polarized views of stars and planets, Infrared Astronomy Fundamentals, Techniques of Radio Astronomy, Radio and Optical Interferometry: Basic Observing Techniques and Data Analysis, Statistical Methods for Astronomy, Numerical Techniques in Astrophysics, Virtual Observatories, Data Mining, and Astroinformatics.

Given the cross-fertilization between fields on data mining techniques, Springer could profit from changing its “price by institutional subscriber base” policies.

But that would require marketing of titles to readers, not simply shipping them to customers who put them on shelves.

### Paper Machines: About Cards & Catalogs, 1548-1929

Sunday, January 27th, 2013

Paper Machines: About Cards & Catalogs, 1548-1929 by Markus Krajewski, translated by Peter Krapp.

From the webpage:

Today on almost every desk in every office sits a computer. Eighty years ago, desktops were equipped with a nonelectronic data processing machine: a card file. In Paper Machines, Markus Krajewski traces the evolution of this proto-computer of rearrangeable parts (file cards) that became ubiquitous in offices between the world wars.

The story begins with Konrad Gessner, a sixteenth-century Swiss polymath who described a new method of processing data: to cut up a sheet of handwritten notes into slips of paper, with one fact or topic per slip, and arrange as desired. In the late eighteenth century, the card catalog became the librarian’s answer to the threat of information overload. Then, at the turn of the twentieth century, business adopted the technology of the card catalog as a bookkeeping tool. Krajewski explores this conceptual development and casts the card file as a “universal paper machine” that accomplishes the basic operations of Turing’s universal discrete machine: storing, processing, and transferring data. In telling his story, Krajewski takes the reader on a number of illuminating detours, telling us, for example, that the card catalog and the numbered street address emerged at the same time in the same city (Vienna), and that Harvard University’s home-grown cataloging system grew out of a librarian’s laziness; and that Melvil Dewey (originator of the Dewey Decimal System) helped bring about the technology transfer of card files to business.

Before ordering a copy, you may want to read Alistair Black’s review. Despite an overall positive impression, Alistair records:

Be warned, Paper Machines is not an easy read. It is not just that in some sections the narrative jumps around, points already firmly made are needlessly repeated, the characters in the plot are not always introduced carefully enough, and a great deal seems to have been lost in translation. More serious than these difficulties, the book is written entirely in the present tense. This is both disconcerting and distracting. I’m surprised the editorial team (the book is part of a monograph series titled “History and Foundations of Information Science”) and a publisher as reputable as the MIT Press allowed this to happen; unless, that is, the original German version was itself written in the present tense, which for a historical discourse I would find baffling.

Alistair does conclude:

My final advice with respect to this book: it is a good addition to the emerging field of information history and the reader should persevere with it, despite its deficiencies in narrative style. The excellent illustrations will help in this regard.

Taking Alistair’s comments at face value, I would have to agree that correcting the deficiencies he lists would make the book an easier read.

On the other hand, working through Paper Machines, and perhaps developing references in addition to those given, will give many hours of delight.

### …[D]emocratization of modeling, simulations, and predictions

Sunday, January 27th, 2013

Technical engine for democratization of modeling, simulations, and predictions by Justyna Zander and Pieter J. Mosterman. (Justyna Zander and Pieter J. Mosterman. 2012. Technical engine for democratization of modeling, simulations, and predictions. In Proceedings of the Winter Simulation Conference (WSC ’12). Winter Simulation Conference , Article 228 , 14 pages.)

Abstract:

Computational science and engineering play a critical role in advancing both research and daily-life challenges across almost every discipline. As a society, we apply search engines, social media, and selected aspects of engineering to improve personal and professional growth. Recently, leveraging such aspects as behavioral model analysis, simulation, big data extraction, and human computation is gaining momentum. The nexus of the above facilitates mass-scale users in receiving awareness about the surrounding and themselves. In this paper, an online platform for modeling and simulation (M&S) on demand is proposed. It allows an average technologist to capitalize on any acquired information and its analysis based on scientifically-founded predictions and extrapolations. The overall objective is achieved by leveraging open innovation in the form of crowd-sourcing along with clearly defined technical methodologies and social-network-based processes. The platform aims at connecting users, developers, researchers, passionate citizens, and scientists in a professional network and opens the door to collaborative and multidisciplinary innovations. An example of a domain-specific model of a pick and place machine illustrates how to employ the platform for technical innovation and collaboration.

It is an interesting paper but when speaking of integration of models the authors say:

The integration is performed in multiple manners. Multi-domain tools that become accessible from one common environment using the cloud-computing paradigm serve as a starting point. The next step of integration happens when various M&S execution semantics (and models of computation (cf. Lee and Sangiovanni-Vincentelli 1998; Lee 2010)) are merged and model transformations are performed.

That went by too quickly for me. You?

The question of effective semantic integration is an important one.

The U.S. federal government publishes enough data to map where some of the dark data is waiting to be found.

The good, bad, or irrelevant data churned out every week makes the amount of effort required an ever-increasing barrier to its use by the public.

Perhaps that is by design?

What do you think?

### New DataCorps Project: Refugees United

Sunday, January 27th, 2013

New DataCorps Project: Refugees United

From the post:

We are thrilled to announce the kick-off of a new DataKind project with Refugees United! Refugees United is a fantastic organization that uses mobile and web technologies to help refugees find their missing loved ones. Currently, RU’s system allows people to post descriptions of their family and friends as well as to search for them on the site. As you might imagine, lots of data flows through this system – data that could be used to greatly improve the way people find each other. Led by the ever-brilliant Max Shron, the DataKind team is collaborating with Refugees United to explore what their data can tell them about how people are using the site, how they’re connecting to one another and, ultimately, how it can be used to help people find each other more effectively.

We are incredibly excited to work on this project and will be posting updates for you all as things unfold. In the meantime, learn a bit more about Max and Refugees United.

I can’t comment on the identity practices because:

Q: 1.08 Why isn’t Refugees United open source yet?

Refugees United was born as an “offline” open source project. When we started, we were two guys (now six guys and a girl in Copenhagen, joined by a much larger team worldwide) with a great idea that had the potential to positively impact thousands, if not millions, of lives. The open source approach came from the fact that we wanted to build the world’s smallest refugee agency with the largest outreach, and to have the highest impact at the lowest cost.

One way to reach our objectives is to work with corporations around the world, including Ericsson, SAP, FedEx and others. The invaluable advice and expertise provided by these successful businesses – both the largest corporations and the smallest companies – have helped us to apply the structure and strategy of business to the passion and vision of an NGO.

Now the time has come for us to apply same structure to our software, and we have begun to collaborate with some of the wonderfully brilliant minds out there who wish to contribute and help us make a difference in the development of our technologies.

I am not sure what ‘”offline” open source’ means. The rest of the quoted prose doesn’t help.

Perhaps the software will become available online. At some point.

It would be an interesting data point to see how they are managing personal subject identity.

### Getting real-time field values in Lucene

Sunday, January 27th, 2013

Getting real-time field values in Lucene by Mike McCandless.

From the post:

We know Lucene’s near-real-time search is very fast: you can easily refresh your searcher once per second, even at high indexing rates, so that any change to the index is available for searching or faceting at most one second later. For most applications this is plenty fast.

But what if you sometimes need even better than near-real-time? What if you need to look up truly live or real-time values, so for any document id you can retrieve the very last value indexed?

Just use the newly committed LiveFieldValues class!

It’s simple to use: when you instantiate it you provide it with your SearcherManager or NRTManager, so that it can subscribe to the RefreshListener to be notified when new searchers are opened, and then whenever you add, update or delete a document, you notify the LiveFieldValues instance. Finally, call the get method to get the last indexed value for a given document id.
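The pattern Mike describes generalizes beyond Lucene: keep recent writes in a small map that is consulted before the (possibly one-second-stale) searcher, and clear it on refresh. A language-agnostic sketch in Python — the class and method names below are illustrative stand-ins, not Lucene’s actual API:

```python
class StubSearcher:
    """Stand-in for a near-real-time searcher: writes become
    visible only after an explicit refresh."""
    def __init__(self):
        self.visible, self.buffered = {}, {}

    def index_later(self, doc_id, value):
        self.buffered[doc_id] = value

    def refresh(self):
        self.visible.update(self.buffered)
        self.buffered.clear()

    def get(self, doc_id):
        return self.visible.get(doc_id)

class LiveValues:
    """Sketch of the LiveFieldValues idea: pending writes are held
    in a dict until a refresh makes them visible to the searcher."""
    def __init__(self, searcher):
        self.searcher = searcher
        self.pending = {}            # doc id -> last indexed value

    def update(self, doc_id, value):
        self.searcher.index_later(doc_id, value)  # near-real-time path
        self.pending[doc_id] = value              # truly live path

    def on_refresh(self):
        # the searcher now reflects everything indexed so far
        self.pending.clear()

    def get(self, doc_id):
        if doc_id in self.pending:   # live value wins over stale searcher
            return self.pending[doc_id]
        return self.searcher.get(doc_id)
```

The subtlety the real class handles for you is ordering: the pending map must only be cleared for values the new searcher actually covers, which is why it subscribes to the refresh listener.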

I saw a webinar by Mike McCandless that is probably the only webinar I would ever repeat watching.

Organized, high quality technical content, etc.

Compare that to a recent webinar I watched that spent fifty-five (55) minutes reviewing information known to anyone who could say the software’s name. The speaker then lamented the lack of time to get into substantive issues.

When you see a webinar like Mike’s, drop me a line. We need to promote that sort of presentation over the other.

### Information field theory

Sunday, January 27th, 2013

Information field theory

From the webpage:

Information field theory (IFT) is information theory, the logic of reasoning under uncertainty, applied to fields. A field can be any quantity defined over some space, e.g. the air temperature over Europe, the magnetic field strength in the Milky Way, or the matter density in the Universe. IFT describes how data and knowledge can be used to infer field properties. Mathematically it is a statistical field theory and exploits many of the tools developed for such. Practically, it is a framework for signal processing and image reconstruction.

IFT is fully Bayesian. How else can infinitely many field degrees of freedom be constrained by finite data?

It can be used without the knowledge of Feynman diagrams. There is a full toolbox of methods.

It reproduces many known well working algorithms. This should be reassuring.

And, there were certainly previous works in a similar spirit. See below for IFT publications and previous works.

Anyhow, in many cases IFT provides novel rigorous ways to extract information from data.

Please, have a look! The specific literature is listed below and more general highlight articles on the right hand side.

Just in case you want to be on the cutting edge of information extraction. 😉

And you might note that Feynman diagrams are graphic representations (maps) of complex mathematical equations.
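The simplest worked case of IFT is the classic Wiener filter: for data d = s + n with independent zero-mean Gaussian signal (variance S) and noise (variance N), the posterior mean field is m = S/(S+N) · d. A two-line sketch (the general IFT machinery handles correlated fields and nonlinear responses, which this does not):

```python
def wiener_filter(data, signal_var, noise_var):
    """Posterior mean m = S/(S+N) * d for d = s + n, with independent
    zero-mean Gaussian signal (variance S) and noise (variance N),
    applied per data point."""
    gain = signal_var / (signal_var + noise_var)
    return [gain * d for d in data]

# Strong signal prior relative to noise -> trust the data heavily.
print(wiener_filter([2.0, -4.0], signal_var=3.0, noise_var=1.0))
# [1.5, -3.0]
```

Note how the Bayesian answer shrinks the data toward zero (the prior mean) by exactly the signal-to-(signal+noise) ratio — infinitely many field degrees of freedom constrained by finite data, as the webpage puts it.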

### NIFTY: Numerical information field theory for everyone

Sunday, January 27th, 2013

NIFTY: Numerical information field theory for everyone

From the post:

Signal reconstruction algorithms can now be developed more elegantly because scientists at the Max Planck Institute for Astrophysics released a new software package for data analysis and imaging, NIFTY, that is useful for mapping in any number of dimensions or spherical projections without encoding the dimensional information in the algorithm itself. The advantage is that once a special method for image reconstruction has been programmed with NIFTY it can easily be applied to many other applications. Although it was originally developed with astrophysical imaging in mind, NIFTY can also be used in other areas such as medical imaging.

Behind most of the impressive telescopic images that capture events at the depths of the cosmos is a lot of work and computing power. The raw data from many instruments are not vivid enough even for experts to have a chance at understanding what they mean without the use of highly complex imaging algorithms. A simple radio telescope scans the sky and provides long series of numbers. Networks of radio telescopes act as interferometers and measure the spatial vibration modes of the brightness of the sky rather than an image directly. Space-based gamma ray telescopes identify sources by the pattern that is generated by the shadow mask in front of the detectors. There are sophisticated algorithms necessary to generate images from the raw data in all of these examples. The same applies to medical imaging devices, such as computer tomographs and magnetic resonance scanners.

Previously each of these imaging problems needed a special computer program that is adapted to the specifications and geometry of the survey area to be represented. But many of the underlying concepts behind the software are generic and ideally would just be programmed once if only the computer could automatically take care of the geometric details.

With this in mind, the researchers in Garching have developed and now released the software package NIFTY that makes this possible. An algorithm written using NIFTY to solve a problem in one dimension can just as easily be applied, after a minor adjustment, in two or more dimensions or on spherical surfaces. NIFTY handles each situation while correctly accounting for all geometrical quantities. This allows imaging software to be developed much more efficiently because testing can be done quickly in one dimension before application to higher dimensional spaces, and code written for one application can easily be recycled for use in another.

NIFTY stands for “Numerical Information Field Theory”. The relatively young field of Information Field Theory aims to provide recipes for optimal mapping, completely exploiting the information and knowledge contained in data. NIFTY now simplifies the programming of such formulas for imaging and data analysis, regardless of whether they come from the information field theory or from somewhere else, by providing a natural language for translating mathematics into software.

Your computer is more powerful than those used to develop generations of atomic bombs.

A wealth of scientific and other data is as close as the next Ethernet port.

So, what have you discovered lately?

NIFTY is a reminder that discovery is a question of will, not availability of resources.

From the NIFTY webpage:

NIFTY [1], “Numerical Information Field Theory”, is a versatile library designed to enable the development of signal inference algorithms that operate regardless of the underlying spatial grid and its resolution. Its object-oriented framework is written in Python, although it accesses libraries written in Cython, C++, and C for efficiency.

NIFTY offers a toolkit that abstracts discretized representations of continuous spaces, fields in these spaces, and operators acting on fields into classes. Thereby, the correct normalization of operations on fields is taken care of automatically without concerning the user. This allows for an abstract formulation and programming of inference algorithms, including those derived within information field theory. Thus, NIFTY permits its user to rapidly prototype algorithms in 1D and then apply the developed code in higher-dimensional settings of real world problems. The set of spaces on which NIFTY operates comprises point sets, n-dimensional regular grids, spherical spaces, their harmonic counterparts, and product spaces constructed as combinations of those.
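The central abstraction — the space owns its geometry, so field operations are written once and work in any dimension — can be caricatured in a few lines of Python. This is a toy stand-in, not NIFTY’s actual API:

```python
class RegularGrid:
    """Toy stand-in for a NIFTY-style space: it owns the geometry
    (here, just the volume element of one grid cell)."""
    def __init__(self, shape, spacing=1.0):
        self.shape = tuple(shape)
        self.spacing = spacing

    @property
    def volume_element(self):
        return self.spacing ** len(self.shape)

def integrate(space, values):
    """Integral of a field: values weighted by the space's volume
    element. The same code serves 1D, 2D, ... grids unchanged."""
    return sum(values) * space.volume_element
```

An algorithm written against `integrate` never mentions dimensionality — change `RegularGrid((100,))` to `RegularGrid((100, 100))` and the normalization is handled for you, which is the “test in 1D, deploy in nD” workflow the quote describes.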

I first saw this at: Software Package for All Types of Imaging, with the usual fun and games of running down useful links.

### Creating beautiful maps with R

Sunday, January 27th, 2013

Creating beautiful maps with R by David Smith.

From the post:

Spanish R user and solar energy lecturer Oscar Perpiñán Lamigueiro has written a detailed three-part guide to creating beautiful maps and choropleths (maps color-coded with regional data) using the R language. Motivated by the desire to recreate this graphic from the New York Times, Oscar describes how he creates similar high-quality maps using R.

David summarizes the three part series by Oscar Perpiñán Lamigueiro with links to parts, software and data.

No guarantees you will produce maps as good as the New York Times but it won’t be from a lack of instruction. 😉

### Maps in R: choropleth maps

Sunday, January 27th, 2013

Maps in R: choropleth maps by Max Marchi.

From the post:

This is the third article of the Maps in R series. After having shown how to draw a map without placing data on it and how to plot point data on a map, in this installment the creation of a choropleth map will be presented.

A choropleth map is a thematic map featuring regions colored or shaded according to the value assumed by the variable of interest in that particular region.

Another step towards becoming a map maker with R!
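Whatever the language, the heart of a choropleth is one small step: classify each region’s value into a color class. A sketch in Python (the breaks and hex palette below are made up for illustration):

```python
def color_class(value, breaks, palette):
    """Return the palette entry for the first class whose upper bound
    the value does not exceed. breaks are upper bounds of each class;
    len(palette) must equal len(breaks) + 1 (last entry is open-ended)."""
    for i, bound in enumerate(breaks):
        if value <= bound:
            return palette[i]
    return palette[-1]

# Hypothetical regional values and a 3-step sequential palette.
regions = {"North": 3.2, "South": 7.9, "East": 5.1}
palette = ["#deebf7", "#9ecae1", "#3182bd"]
shading = {name: color_class(v, breaks=[4.0, 6.0], palette=palette)
           for name, v in regions.items()}
```

Everything else — projections, polygons, legends — is presentation; the choice of breaks (equal intervals, quantiles, natural breaks) is where choropleths mislead or inform.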

### Computational Information Geometry

Sunday, January 27th, 2013

Computational Information Geometry by Frank Nielsen.

From the homepage:

Computational information geometry deals with the study and design of efficient algorithms in information spaces using the language of geometry (such as invariance, distance, projection, ball, etc). Historically, the field was pioneered by C.R. Rao in 1945 who proposed to use the Fisher information metric as the Riemannian metric. This seminal work gave birth to the geometrization of statistics (eg, statistical curvature and second-order efficiency). In statistics, invariance (by non-singular 1-to-1 reparametrization and sufficient statistics) yield the class of f-divergences, including the celebrated Kullback-Leibler divergence. The differential geometry of f-divergences can be analyzed using dual alpha-connections. Common algorithms in machine learning (such as clustering, expectation-maximization, statistical estimating, regression, independent component analysis, boosting, etc) can be revisited and further explored using those concepts. Nowadays, the framework of computational information geometry opens up novel horizons in music, multimedia, radar, and finance/economy.

Numerous resources including publications, links to conference proceedings (some with videos), software and other materials, including a tri-lingual dictionary, Japanese, English, French, of terms in information geometry.

### Dictionary of computational information geometry

Sunday, January 27th, 2013

Dictionary of computational information geometry (PDF) by Frank Nielsen. (Compiled January 23, 2013)

The title is a bit misleading.

It should read: “[Tri-Lingual] Dictionary of computational information geometry.”

Terms are defined in:

Japanese-English

English-Japanese

Japanese-French

An excellent resource in a linguistically diverse world!

### Is Google Hijacking Semantic Markup/Structured Data? [FALSE]

Saturday, January 26th, 2013

Is Google Hijacking Semantic Markup/Structured Data? by Barbara Starr.

From the post:

On December 12, 2012, Google rolled out a new tool, called the Google Data Highlighter for event data. Upon a cursory read, it seems to be a tagging tool, where a human trains the Data Highlighter using a few pages on their website, until Google can pick up enough of a pattern to do the remainder of the site itself.

Better yet, you can see all of these results in the structured data dashboard. It appears as if event data is marked up and is compatible with schema.org. However, there is a caveat here that some folks may not notice.

No actual markup is placed on the page, meaning that none of the semantic markup using this Data Highlighter tool is consumable by Bing, Yahoo or any other crawler on the Web; only Google can use it!

Google is essentially hi-jacking semantic markup so only Google can take advantage of it. Google has the global touch and the ability to execute well-thought-out and brilliantly strategic plans.

Let’s do this by the numbers:

1. Google develops a service for webmasters to add semantic annotations to their webpages.
2. Google allows webmasters to use that service at no charge.

Google used its own resources to develop a valuable service for webmasters that enhances their websites and user experience with Google, for free.

Perhaps there is a new definition of hijacking?

Webster says the traditional definition includes “to steal or rob as if by hijacking.”

The Semantic Web:

Hijacking

(a) Failing to whitewash the Semantic Web’s picket fence while providing free services to webmasters and users to enhance searching of web content.

(b) Failing to give away data from free services to webmasters and users to those who did not plant, reap, spin, weave or sew.

I don’t find the Semantic Web’s definition of “hijacking” persuasive.

You?

I first saw this at: Google’s Structured Data Take Over by Angela Guess.

### DataFu: The WD-40 of Big Data

Saturday, January 26th, 2013

DataFu: The WD-40 of Big Data by Sam Shah.

From the post:

If Pig is the “duct tape for big data”, then DataFu is the WD-40. Or something.

No, seriously, DataFu is a collection of Pig UDFs for data analysis on Hadoop. DataFu includes routines for common statistics tasks (e.g., median, variance), PageRank, set operations, and bag operations.

It’s helpful to understand the history of the library. Over the years, we developed several routines that were used across LinkedIn and were thrown together into an internal package we affectionately called “littlepiggy.” The unfortunate part, and this is true of many such efforts, is that the UDFs were ill-documented, ill-organized, and easily got broken when someone made a change. Along came PigUnit, which allowed UDF testing, so we spent the time to clean up these routines by adding documentation and rigorous unit tests. From this “datafoo” package, we thought this would help the community at large, and there you have DataFu.

So what can this library do for you? Let’s look at one of the classical examples that showcase the power and flexibility of Pig: sessionizing a click stream.

DataFu

The UDF bag and set operations are likely to be of particular interest.
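Sessionization itself is easy to sketch outside of Pig: sort a user’s clicks by time, then start a new session whenever the gap between consecutive clicks exceeds a timeout. A minimal Python sketch of that logic (the 30-minute timeout and the field layout are my assumptions for illustration, not DataFu’s actual API):

```python
from itertools import groupby
from operator import itemgetter

def sessionize(clicks, timeout=30 * 60):
    """Split (user, timestamp) click records into sessions.

    A new session starts whenever the gap between consecutive
    clicks by the same user exceeds `timeout` seconds.
    """
    sessions = []
    clicks = sorted(clicks, key=itemgetter(0, 1))  # by user, then time
    for user, group in groupby(clicks, key=itemgetter(0)):
        current = []
        last_ts = None
        for _, ts in group:
            if last_ts is not None and ts - last_ts > timeout:
                sessions.append((user, current))  # close the session
                current = []
            current.append(ts)
            last_ts = ts
        sessions.append((user, current))  # close the final session
    return sessions

clicks = [("alice", 0), ("alice", 100), ("alice", 5000), ("bob", 50)]
print(sessionize(clicks))
```

In Pig, the same grouping-and-gap logic is what the DataFu `Sessionize` UDF applies inside each user’s bag of clicks.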

### Human Computation and Crowdsourcing

Saturday, January 26th, 2013

From the conference website:

Where: Palm Springs, California (venue information coming soon)

When: November 7-9, 2013

Important dates (all deadlines are 5pm Pacific time unless otherwise noted): author rebuttal period for papers, June 21-28. The remaining paper, workshop/tutorial, and poster/demonstration deadlines appear in the announcement below.

From the post:

Announcing HCOMP 2013, the Conference on Human Computation and Crowdsourcing, Palm Springs, November 7-9, 2013. Paper submission deadline is May 1, 2013. Thanks to the HCOMP community for bringing HCOMP to life as a full conference, following on the successful workshop series.

The First AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2013) will be held November 7-9, 2013 in Palm Springs, California, USA. The conference was created by researchers from diverse fields to serve as a key focal point and scholarly venue for the review and presentation of the highest quality work on principles, studies, and applications of human computation. The conference is aimed at promoting the scientific exchange of advances in human computation and crowdsourcing among researchers, engineers, and practitioners across a spectrum of disciplines. Paper submissions are due May 1, 2013, with author notification on July 16, 2013. Workshop and tutorial proposals are due May 10, 2013. Posters & demonstrations submissions are due July 25, 2013.

I suppose it had to happen.

I first saw this at: New AAAI Conference on Human Computation and Crowdsourcing by Shar Steed.

### The Neophyte’s Guide to Scala Part [n]…

Saturday, January 26th, 2013

Daniel Westheide has a series of posts introducing Scala to Neophytes.

As of today:

The Neophyte’s Guide to Scala Part 1: Extractors

The Neophyte’s Guide to Scala Part 2: Extracting Sequences

The Neophyte’s Guide to Scala Part 3: Patterns Everywhere

The Neophyte’s Guide to Scala Part 4: Pattern Matching Anonymous Functions

The Neophyte’s Guide to Scala Part 5: The Option type

The Neophyte’s Guide to Scala Part 6: Error handling with Try

The Neophyte’s Guide to Scala Part 7: The Either type

The Neophyte’s Guide to Scala Part 8: Welcome to the Future

The Neophyte’s Guide to Scala Part 9: Promises and Futures in practice

The Neophyte’s Guide to Scala Part 10: Staying DRY with higher-order functions

Apologies for not seeing this sooner.

Makes a nice starting place for the Functional Programming Principles in Scala class by Martin Odersky, starting 25th March 2013.

I first saw this at Chris Cundill’s This week in #Scala (26/01/2013).

### Functional Programming Principles in Scala

Saturday, January 26th, 2013

Functional Programming Principles in Scala by Martin Odersky.

March 25th 2013 (7 weeks long)

From the webpage:

This course introduces the cornerstones of functional programming using the Scala programming language. Functional programming has become more and more popular in recent years because it promotes code that’s safe, concise, and elegant. Furthermore, functional programming makes it easier to write parallel code for today’s and tomorrow’s multiprocessors by replacing mutable variables and loops with powerful ways to define and compose functions.

Scala is a language that fuses functional and object-oriented programming in a practical package. It interoperates seamlessly with Java and its tools. Scala is now used in a rapidly increasing number of open source projects and companies. It provides the core infrastructure for sites such as Twitter, LinkedIn, Foursquare, Tumblr, and Klout.

In this course you will discover the elements of the functional programming style and learn how to apply them usefully in your daily programming tasks. You will also develop a solid foundation for reasoning about functional programs, by touching upon proofs of invariants and the tracing of execution symbolically.

The course is hands on; most units introduce short programs that serve as illustrations of important concepts and invite you to play with them, modifying and improving them. The course is complemented by a series of assignments, most of which are also programming projects.

In case you missed it last time.

I first saw this at Chris Cundill’s This week in #Scala (26/01/2013).

### *SEM 2013 […Independence to be Semantically Diverse]

Saturday, January 26th, 2013

*SEM 2013 : The 2nd Joint Conference on Lexical and Computational Semantics

Dates:

When: Jun 13-14, 2013
Where: Atlanta, GA, USA
Final version due: Apr 21, 2013

From the call:

The main goal of *SEM is to provide a stable forum for the growing number of NLP researchers working on different aspects of semantic processing, which has been scattered over a large array of small workshops and conferences.

Topics of interest include, but are not limited to:

• Formal and linguistic semantics
• Cognitive aspects of semantics
• Lexical semantics
• Semantic aspects of morphology and semantic processing of morphologically rich languages
• Semantic processing at the sentence level
• Semantic processing at the discourse level
• Semantic processing of non-propositional aspects of meaning
• Textual entailment
• Multiword expressions
• Multilingual semantic processing
• Social media and linguistic semantics

*SEM 2013 will feature a distinguished panel on Deep Language Understanding.

*SEM 2013 hosts the shared task on Semantic Textual Similarity.

Another workshop to join the array of “…small workshops and conferences.” 😉

Not a bad thing. Communities grow up around conferences and people you will see at one are rarely at others.

Diversity of communities, dare I say semantics?, isn’t a bad thing. It is a reflection of our diversity and we should stop beating ourselves up over it.

Our machines are capable of being uniformly monotonous. But that is because they lack the independence to be diverse on their own.

Why would anyone want to emulate being a machine?

### SPARQL with R in less than 5 minutes [Fire Data]

Saturday, January 26th, 2013

SPARQL with R in less than 5 minutes

From the post:

In this article we’ll get up and running on the Semantic Web in less than 5 minutes using SPARQL with R. We’ll begin with a brief introduction to the Semantic Web then cover some simple steps for downloading and analyzing government data via a SPARQL query with the SPARQL R package.

What is the Semantic Web?

To newcomers, the Semantic Web can sound mysterious and ominous. By most accounts, it’s the wave of the future, but it’s hard to pin down exactly what it is. This is in part because the Semantic Web has been evolving for some time but is just now beginning to take a recognizable shape (DuCharme 2011). Detailed definitions of the Semantic Web abound, but simply put, it is an attempt to structure the unstructured data on the Web and to formalize the standards that make that structure possible. In other words, it’s an attempt to create a data definition for the Web.

I will have to re-read Bob DuCharme’s “Learning SPARQL.” I didn’t realize the “Semantic Web” was beginning to “…take a recognizable shape.” After a decade of attempting to find an achievable agenda, it’s about time.

The varying interpretations of Semantic Web origin tales are quite amusing. In the first creation account, independent agents were going to schedule medical appointments and tennis matches for us. In the second account, our machines were going to reason across structured data to produce new insights. More recently, the vision is of a web of CMU Coke machines, along with other devices, connected to the WWW. (The Internet of Things.)

I suppose the next version will be computers that can exchange information using the TCP/IP protocol and various standards, like HTML, for formatting documents. Plus some declaration that semantics will be handled in a future version, sufficiently far off to keep grant managers from fearing an end to the project.

The post is a good example of issuing SPARQL queries from R, and since you will encounter data at SPARQL endpoints, it is a useful exercise.

The example data set is one of wildfires and acres burned per year, 1960-2008.
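The same exercise translates readily to other languages. A sketch in Python, using only the standard library, of parsing the W3C SPARQL JSON results format an endpoint returns (the query text, property names, and acreage values below are illustrative stand-ins, not the post’s actual government dataset):

```python
import json

# A query in the spirit of the post's wildfire example (names invented).
QUERY = """
SELECT ?year ?acres
WHERE { ?fire :year ?year ; :acresBurned ?acres . }
ORDER BY ?year
"""

def parse_sparql_json(payload):
    """Flatten the SPARQL 1.1 JSON results format into plain dicts."""
    data = json.loads(payload)
    variables = data["head"]["vars"]
    return [
        {var: row[var]["value"] for var in variables if var in row}
        for row in data["results"]["bindings"]
    ]

# A hand-made response in the standard results format (made-up values).
sample = json.dumps({
    "head": {"vars": ["year", "acres"]},
    "results": {"bindings": [
        {"year": {"type": "literal", "value": "1960"},
         "acres": {"type": "literal", "value": "4500000"}},
        {"year": {"type": "literal", "value": "1961"},
         "acres": {"type": "literal", "value": "3000000"}},
    ]},
})

print(parse_sparql_json(sample))
```

The R SPARQL package does essentially this flattening for you, returning a data frame instead of a list of dicts.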

More interesting fire data sets can be found at: Fire Detection GIS Data.

Mapping that data by date, weather conditions/trends, known impact, would require coordination between diverse data sets.

### Machine Learning Cheat Sheet (for scikit-learn)

Saturday, January 26th, 2013

Machine Learning Cheat Sheet (for scikit-learn) by Andreas Mueller.

From the post:

(The cheat sheet is a flowchart image; click through to the post for a larger version.)

BTW, scikit-learn is doing a user survey.

Take a few minutes to contribute your feedback.
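The cheat sheet is essentially a decision tree over your data size and goal, so its top-level branching can be sketched as plain conditional logic. A rough Python rendering (thresholds and estimator names are approximate, from memory of the chart, not an exact transcription):

```python
def suggest_estimator(n_samples, task, labeled=True):
    """Rough sketch of the scikit-learn cheat sheet's top-level logic."""
    if n_samples < 50:
        return "get more data"
    if task == "predict_category":
        if not labeled:
            return "clustering (e.g. KMeans)"
        return "LinearSVC" if n_samples < 100_000 else "SGDClassifier"
    if task == "predict_quantity":
        return "Ridge / SVR" if n_samples < 100_000 else "SGDRegressor"
    if task == "visualize":
        return "dimensionality reduction (e.g. PCA)"
    return "consult the full chart"

print(suggest_estimator(10_000, "predict_category"))
```

The real chart has more branches (text data, structure of the regression, etc.), but the point is the same: start from what you have and what you want, not from your favorite algorithm.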

### ADAMS: Advanced Data mining And Machine learning System

Saturday, January 26th, 2013

From the webpage:

The Advanced Data mining And Machine learning System (ADAMS) is a novel, flexible workflow engine aimed at quickly building and maintaining real-world, complex knowledge workflows.

Same source as WEKA.

What if we think about identification as workflow?

Whatever stability we attribute to an identification reflects only the absence of additional data that would change it.

Looking backwards over prior identifications, we fit them into the schema of our present identification and that eliminates any movement from the past. The past is fixed and terminates in our present identification.

That view fails to appreciate the world isn’t going to end with any of us individually. The world and its information systems will continue, as will the workflow that defines identifications.

Replacing our identifications with newer ones.

The question we face is whether our actions will support or impede re-use of our identifications in the future.

I first saw Adams Workflow at Nat Torkington’s Four short links: 24 January 2013.

### Multi-tasking with joint semantic spaces

Saturday, January 26th, 2013

From the post:

Hello, and welcome to the Paper of the Day (Po’D): Multi-tasking with joint semantic spaces edition. Today’s paper is: J. Weston, S. Bengio and P. Hamel, “Multi-tasking with joint semantic spaces for large-scale music annotation and retrieval,” J. New Music Research, vol. 40, no. 4, pp. 337-348, 2011.

This article proposes and tests a novel approach (pronounced MUSCLES but written MUSLSE) for describing a music signal along multiple directions, including semantically meaningful ones. This work is especially relevant since it applies to problems that remain unsolved, such as artist identification and music recommendation (in fact the first two authors are employees of Google). The method proposed in this article models a song (or a short excerpt of a song) as a triple in three vector spaces learned from a training dataset: one vector space is created from artists, one created from tags, and the last created from features of the audio. The benefit of using vector spaces is that they bring quantitative and well-defined machinery, e.g., projections and distances.

MUSCLES attempts to learn each vector space together so as to preserve (dis)similarity. For instance, vectors mapped from artists that are similar (e.g., Brittney Spears and Christina Aguilera) should point in nearly the same direction; while those that are not similar (e.g., Engelbert Humperdink and The Rubberbandits), should be nearly orthogonal. Similarly, so should vectors mapped from tags that are semantically close (e.g., “dark” and “moody”), and semantically disjoint (e.g., “teenage death song” and “NYC”). For features extracted from the audio, one hopes the features themselves are comparable, and are able to reflect some notion of similarity at least at the surface level of the audio. MUSCLES takes this a step further to learn the vector spaces so that one can take inner products between vectors from different spaces — which is definitely a novel concept in music information retrieval.

Bob raises a number of interesting issues but here’s one that bites:

A further problem is that MUSCLES judges similarity by magnitude inner product. In such a case, if “sad” and “happy” point in exact opposite directions, then MUSCLES will say they are highly similar.

Ouch! For all the “precision” of vector spaces, there are non-apparent biases lurking therein.
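The bias is easy to demonstrate. If similarity is judged by the magnitude of the inner product, an anti-correlated pair scores exactly as high as an identical pair (toy vectors of my own, not the paper’s learned embeddings):

```python
def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

happy = [1.0, 0.5]
sad = [-1.0, -0.5]       # points in exactly the opposite direction
neutral = [0.5, -1.0]    # orthogonal to happy

# Magnitude-of-inner-product "similarity": sad looks exactly as
# similar to happy as happy is to itself.
print(abs(dot(happy, sad)))      # 1.25
print(abs(dot(happy, happy)))    # 1.25
print(abs(dot(happy, neutral)))  # 0.0
```

A signed inner product (or cosine similarity) would separate the first two cases; taking the magnitude throws that sign information away.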

Abstract:

Music prediction tasks range from predicting tags given a song or clip of audio, predicting the name of the artist, or predicting related songs given a song, clip, artist name or tag. That is, we are interested in every semantic relationship between the different musical concepts in our database. In realistically sized databases, the number of songs is measured in the hundreds of thousands or more, and the number of artists in the tens of thousands or more, providing a considerable challenge to standard machine learning techniques. In this work, we propose a method that scales to such datasets which attempts to capture the semantic similarities between the database items by modelling audio, artist names, and tags in a single low-dimensional semantic embedding space. This choice of space is learnt by optimizing the set of prediction tasks of interest jointly using multi-task learning. Our single model learnt by training on the joint objective function is shown experimentally to have improved accuracy over training on each task alone. Our method also outperforms the baseline methods tried and, in comparison to them, is faster and consumes less memory. We also demonstrate how our method learns an interpretable model, where the semantic space captures well the similarities of interest.

Just to tempt you into reading the article, consider the following passage:

Artist and song similarity is at the core of most music recommendation or playlist generation systems. However, music similarity measures are subjective, which makes it difficult to rely on ground truth. This makes the evaluation of such systems more complex. This issue is addressed in Berenzweig (2004) and Ellis, Whitman, Berenzweig, and Lawrence (2002). These tasks can be tackled using content-based features or meta-data from human sources. Features commonly used to predict music similarity include audio features, tags and collaborative filtering information.

Meta-data such as tags and collaborative filtering data have the advantage of considering human perception and opinions. These concepts are important to consider when building a music similarity space. However, meta-data suffers from a popularity bias, because a lot of data is available for popular music, but very little information can be found on new or less known artists. In consequence, in systems that rely solely upon meta-data, everything tends to be similar to popular artists. Another problem, known as the cold-start problem, arises with new artists or songs for which no human annotation exists yet. It is then impossible to get a reliable similarity measure, and is thus difficult to correctly recommend new or less known artists.

“…[H]uman perception[?]…” Is there some other form I am unaware of? Some other measure of similarity than our own? Recalling that vector spaces are a pale mockery of our more subtle judgments.

Suggestions?

### 2013: What’s Coming Next in Neo4j!

Friday, January 25th, 2013

2013: What’s Coming Next in Neo4j! by Philip Rathle.

From the post:

Even though roadmaps can change, and it’s nice not to spoil all of the surprises, we do feel it’s important to discuss priorities within our community. We’ve spent a lot of time over the last year taking to heart all of the discussions we’ve had, publicly and privately, with our users, and closely looking at the various ways in which Neo4j is used. Our aim in 2013 is to build upon the strengths of today’s Neo4j database, and make a great product even better.

The 2013 product plan breaks down into a few main themes. This post is dedicated to the top two, which are:

1. Ease of Use. Making the product easier to learn, use, and maintain, for new & existing users, and

2. Big(ger) Data. Handling ever-bigger data and transaction volumes.

Philip shares some details (but not all) in the post.

It sounds like 2013 is going to be a good year for Neo4j (and by extension, its users)!

### Linkurious: Visualize Graph Data Easily

Friday, January 25th, 2013

Linkurious: Visualize Graph Data Easily by Alex Popescu.

Alex points to Linkurious, a tool for visualization and exploration of graph databases (currently only Neo4j).

An open “beta.”

### Chemical datuments as scientific enablers

Friday, January 25th, 2013

Chemical datuments as scientific enablers by Henry S Rzepa. (Journal of Cheminformatics 2013, 5:6 doi:10.1186/1758-2946-5-6)

Abstract:

This article is an attempt to construct a chemical datument as a means of presenting insights into chemical phenomena in a scientific journal. An exploration of the interactions present in a small fragment of duplex Z-DNA and the nature of the catalytic centre of a carbon-dioxide/alkene epoxide alternating co-polymerisation is presented in this datument, with examples of the use of three software tools, one based on Java, the other two using Javascript and HTML5 technologies. The implications for the evolution of scientific journals are discussed.

From the background:

Chemical sciences are often considered to stand at the crossroads of paths to many disciplines, including molecular and life sciences, materials and polymer sciences, physics, mathematical and computer sciences. As a research discipline, chemistry has itself evolved over the last few decades to focus its metaphorical microscope on both far larger and more complex molecular systems than previously attempted, as well as uncovering a far more subtle understanding of the quantum mechanical underpinnings of even the smallest of molecules. Both these extremes, and everything in between, rely heavily on data. Data in turn is often presented in the form of visual or temporal models that are constructed to illustrate molecular behaviour and the scientific semantics. In the present article, I argue that the mechanisms for sharing both the underlying data, and the (semantic) models between scientists need to evolve in parallel with the increasing complexity of these models. Put simply, the main exchange mechanism, the scientific journal, is accepted [1] as seriously lagging behind in its fitness for purpose. It is in urgent need of reinvention; one experiment in such was presented as a data-rich chemical exploratorium [2]. My case here in this article will be based on my recent research experiences in two specific areas. The first involves a detailed analysis of the inner kernel of the Z-DNA duplex using modern techniques for interpreting the electronic properties of a molecule. The second recounts the experiences learnt from modelling the catalysed alternating co-polymerisation of an alkene epoxide and carbon dioxide.

Effective sharing of data, in scientific journals or elsewhere, requires either a common semantic (we know that’s uncommon) or a mapping between semantics (how many times must we repeat the same mappings, separately?).
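At its simplest, such a mapping is a reusable translation table between two datasets’ identifications of the same subjects. A toy Python sketch (all field names invented for illustration):

```python
# A reusable mapping between two datasets' names for the same
# properties (all names are hypothetical, for illustration only).
MAPPING = {"mol_wt": "molecular_weight", "bp": "boiling_point_c"}

def remap(record, mapping):
    """Translate a record's keys via a shared mapping, once,
    instead of every consumer re-deriving the correspondence."""
    return {mapping.get(k, k): v for k, v in record.items()}

print(remap({"mol_wt": 18.02, "bp": 100}, MAPPING))
```

Publishing the mapping alongside the data is what lets later readers, and machines, reuse the identification work instead of repeating it.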

Embedding notions of subject identity and mapping between identifications in chemical datuments could increase the reuse of data, as well as its longevity.