Archive for February, 2014

CDH 4.6, Cloudera Manager 4.8.2, and Search 1.2

Friday, February 28th, 2014

Announcing: CDH 4.6, Cloudera Manager 4.8.2, and Search 1.2 by Justin Kestelyn.

Mostly bug-fix releases, but now is as good a time as any to upgrade before you are in crunch mode.

New Hue Demos:…

Friday, February 28th, 2014

New Hue Demos: Spark UI, Job Browser, Oozie Scheduling, and YARN Support by Justin Kestelyn.

From the post:

Hue, the open source Web UI that makes Apache Hadoop easier to use, is now a standard across the ecosystem — shipping within multiple software distributions and sandboxes. One of the reasons for its success is an agile developer community behind it that is constantly rolling out new features to its users.

Just as important, the Hue team is diligent in its documentation and demonstration of those new features via video demos. In this post, for your convenience, I bring you the most recent examples (released since December):

  • The new Spark Igniter App
  • Using YARN and Job Browser
  • Job Browser with YARN Security
  • Apache Oozie crontab scheduling

All short but all worthwhile. Nice way to start off your Saturday morning. The kids have cartoons and you have Hue. 😉

Cool Infographics: Best Practices Group on LinkedIn

Friday, February 28th, 2014

Cool Infographics: Best Practices Group on LinkedIn by Randy Krum.

From the post:

I am excited to announce the launch of a new LinkedIn Group, Cool Infographics: Best Practices. I have personally been a part of many great discussion groups over the years and believe that this group fills an unmet need. Please accept this invitation to join the group to share your own experiences and wisdom.

There are many groups that share infographics, but I felt that a discussion group dedicated to the craft of infographics and data visualization was missing. This group will feature questions and case studies about how companies are leveraging infographics and data visualization as a communication tool. Any posts that are just links to infographics will be moderated to keep the focus on engaging discussions. Topics and questions from the Cool Infographics book will also be discussed.

Join us in a professional dialogue surrounding case studies and strategies for designing infographics and using them as a part of an overall marketing strategy. We welcome both beginning and established professionals to share valuable tactics and experiences as well as fans of infographics to learn more about this growing field.

Anyone with a drawing program can create an infographic.

This group is where you may learn to make “cool” infographics.

Think of it as the difference between failing to communicate and communicating.

If you are trying to market an idea, a service or a product, the latter should be your target.

A Gresham’s Law for Crowdsourcing and Scholarship?

Friday, February 28th, 2014

A Gresham’s Law for Crowdsourcing and Scholarship? by Ben W. Brumfield.

Ben examines the difficulties of involving both professionals and “amateurs” in crowd-sourced projects.

The point of controversy is whether professionals will decline to be identified with projects that include amateurs.

There isn’t any smoking gun evidence and I suspect the reaction of both professionals and amateurs varies from field to field.

Still, it is something you may run across if you use crowd-sourcing to build semantic annotations and/or data archives.

Dotty open-sourced

Friday, February 28th, 2014

Dotty open-sourced by Martin Odersky.

From the post:

A couple of days ago we open sourced Dotty, a research platform for new language concepts and compiler technologies for Scala.

The idea is to provide a platform where new ideas can be tried out without the stringent backwards compatibility constraints of the regular Scala releases. At the same time this is no “castle-in-the-sky” project. We will look only at technologies that have a very good chance of being beneficial to Scala and its ecosystem.

My goal is that the focus of our research and development efforts lies squarely on simplification. In my opinion, Scala has been very successful in its original goal of unifying OOP and FP. But to get there it has acquired some features that in retrospect turned out to be inessential for the main goal, even if they are useful in some applications. XML literals come to mind, as do existential types. In Dotty we try to identify a much smaller set of core features and will then represent other features by encodings into that core.

Right now, there’s a (very early) compiler frontend for a subset of Scala. We’ll work on fleshing this out and testing it against more sources (the only large source it was tested on so far is the dotty compiler itself). We’ll also work on adding transformation and backend phases to make this into a full compiler.

Are you interested in functional programming and adventurous?

If so, this is your stop. 😉

OCW Bookshelf

Friday, February 28th, 2014

OCW Bookshelf

From the post:

MIT OpenCourseWare shares the course materials from classes taught on the MIT campus. In most cases, this takes the form of course documents such as syllabi, lecture notes, assignments and exams.

Occasionally, however, we come across textbooks we can share openly. This page provides an index of textbooks (and textbook-like course notes) that can be found throughout the OCW site. (Note that in most cases, resources are listed below by the course they are associated with.)

Covers a wide range of courses from computer science and engineering to languages and math.

I expect this resource will keep growing so remember to check back from time to time.

R resources for Hydrologists

Thursday, February 27th, 2014

R resources for Hydrologists by Riccardo Rigon.

From the post:

R is my statistical software of choice. I had a hard time convincing my Ph.D. students to adopt it, but finally they did, and, as usually happens, many of them became more proficient than me in the field. Now it seems natural to use it for everything, but this was not always the case.

A list of R resources for hydrologists, annotated by Riccardo with his own comments.

Great for hydrologists but also good for anyone who wants to participate in water planning issues for rural or urban areas.

I first saw this in a tweet by Ben Gillespie.

Fractals in D3: Dragon Curves

Thursday, February 27th, 2014

Fractals in D3: Dragon Curves by Stephen Hall.

From the post:

Dragon fractal

This week I am continuing to experiment with rendering fractals in D3. In this post we’re looking at examples of generating some really cool fractals called dragon curves (also referred to as Heighway dragons). This post is a continuation of the previous one on fractal ferns. Take a look at that post if you want some basic info on fractals and some links I found useful. Fractals are a world unto themselves, so there are plenty of interesting things to be investigated in this area. We are just scratching the surface with these two posts.
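Stephen's examples are in D3/JavaScript, but the turn sequence behind a dragon curve is easy to sketch on its own. Here is a rough Python version of the standard unfolding rule (my sketch, not code from the post):

```python
def dragon_turns(iterations):
    """Turn sequence for a Heighway dragon: each iteration appends a
    right turn, then a reversed, mirrored copy of the sequence so far."""
    turns = []
    for _ in range(iterations):
        mirrored = ["L" if t == "R" else "R" for t in reversed(turns)]
        turns = turns + ["R"] + mirrored
    return turns

# The sequence grows as 1, 3, 7, 15, ... turns per iteration.
print(dragon_turns(3))  # ['R', 'R', 'L', 'R', 'R', 'L', 'L']
```

Feed the resulting turns to any line-drawing routine (D3, turtle graphics, SVG paths) and the dragon appears.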

Great images, complete with source code and explanation.

See the Fractal entry at Wikipedia for more links on fractals.

Introducing Google Maps Gallery…

Thursday, February 27th, 2014

Introducing Google Maps Gallery: Unlocking the World’s Maps by Jordan Breckenridge.

From the post:

Governments, nonprofits and businesses have some of the most valuable mapping data in the world, but it’s often locked away and not accessible to the public. With the goal of making this information more readily available to the world, today we’re launching Google Maps Gallery, a new way for organizations to share and publish their maps online via Google Maps Engine.

Google Maps Gallery

Maps Gallery works like an interactive, digital atlas where anyone can search for and find rich, compelling maps. Maps included in the Gallery can be viewed in Google Earth and are discoverable through major search engines, making it seamless for citizens and stakeholders to access diverse mapping data, such as locations of municipal construction projects, historic city plans, population statistics, deforestation changes and up-to-date emergency evacuation routes. Organizations using Maps Gallery can communicate critical information, build awareness and inform the public at-large.

A great site as you would expect from Google.

I happened upon US Schools with GreatSchools Ratings. Created by

There has been a rash of 1950s-style legislative efforts this year in the United States seeking to permit businesses to discriminate on the basis of their religious beliefs, recalling the days when stores sported “We Reserve the Right to Refuse Service to Anyone” signs.

I remember those signs and how they were used.

With that in mind, scroll around the GreatSchools Ratings map and tell me: what do you think the demographics of the non-rated schools look like?

That’s what I thought too.

Same Documents – Multiple Trees

Thursday, February 27th, 2014

View the same documents in different ways with multiple trees by Jonathan Stray.

From the post:

Starting today Overview supports multiple trees for each document set. That is, you can tell Overview to re-import your documents — or a subset of them — with different options, without uploading them again. You can use this to:

  • Focus on a subset of your documents, such as those with a particular tag or containing a specific word.
  • Use ignored and important words to try sorting your documents in different ways.

You create a new tree using this button next to your document set list page:

OK, overlapping markup it's not, but this looks like a very useful feature!

Mortar PIG Cheat Sheet

Thursday, February 27th, 2014

Mortar PIG Cheat Sheet

From the cheatsheet:

We love Apache Pig for data processing—it’s easy to learn, it works with all kinds of data, and it plays well with Python, Java, and other popular languages. And, of course, Pig runs on Hadoop, so it’s built for high-scale data science.

Whether you’re just getting started with Pig or you’ve already written a variety of Pig scripts, this compact reference gathers in one place many of the tools you’ll need to make the most of your data using Pig 0.12.

Easier on the eyes than a one pager!

Not to mention being a good example of how to write and format a cheat sheet.

Eric Holder (AG) Doesn’t Get It By 70%

Thursday, February 27th, 2014

US Attorney General calls for unified data breach notification laws by John Hawes.

From the post:

US Attorney General Eric Holder has put his weight behind a growing wave of pressure to improve how data leaks are handled by companies and institutions.

Interest in improving ways to ensure people are protected from leakage of personal data, and kept informed when such breaches do occur, has boomed since the recent barrage of large-scale, headline-making compromises in retail and tech firms.

Holder used the platform of his weekly video message, posted on the website, to talk about “Protecting Consumers from Cybercrime”.

Responding explicitly to the recent Target and Neiman Marcus leaks, the Attorney General demanded Congress get busy developing a “strong national standard” for breach notifications.

He claimed this would make it easier for law enforcement to investigate breaches, make breached entities more accountable for any sloppy security practices and help those whose data has been leaked.

So, breach notifications in response to:

Over time, we have built an IT landscape which consists of many rotten building blocks. Gerald M. Weinberg’s Second Law is often quoted: ‘If builders built buildings the way programmers write programs, then the first woodpecker that came along would destroy civilization.’[1] When it comes to CND, this situation is aggravated by the fact that so-called security software—the very building blocks that we try to use for our defenses—are, by far, of worse quality than anything else.[2] Statistically, not actually using it would be more secure. (Back to Basics: Beyond Network Hygiene by Felix ‘FX’ Lindner and Sandro Gaycken.)

Do you get the impression that Holder doesn’t have a clue about the current security state of the IT landscape?

BTW, the comment about security software being the worst of the lot, footnote [2] in Lindner and Gaycken, is a reference to: Veracode, 2011. State of Software Security Report: The Intractable Problem of Insecure Software. Burlington, MA: Veracode.

If you think Veracode’s 2011 report is grim reading, visit and request a copy of the 2013 report.

First finding from Veracode State of Software Security Report: Volume 5:

70% of applications failed to comply with enterprise security policies on first submission.
This represents a significant increase over the failure rate of 60% reported in Volume 4. While the applications may eventually become compliant, the high initial failure rate validates the concerns CISOs have regarding application security risks, since insecure applications are a leading cause of security breaches and data loss for organizations of all types and sizes.

Hard to say which is worse: a current failure rate of 70% or an increase of 10 points since 2011.
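For clarity on what that 10-point jump amounts to, a quick back-of-the-envelope check distinguishing percentage points from relative growth:

```python
v4_failure = 0.60  # Volume 4 first-submission failure rate
v5_failure = 0.70  # Volume 5 first-submission failure rate

# Two ways to read the change:
point_increase = v5_failure - v4_failure         # 10 percentage points
relative_increase = point_increase / v4_failure  # ~16.7% relative growth

print(round(point_increase, 2), round(relative_increase, 3))
```

Either way you slice it, first-submission failures got worse, not better.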

If you can, please point out how national laws on “breach notifications” can change that 70% failure rate.

If you can’t, start a conversation on how to change coding and other practices to make security integral to software rather than a missing add-on feature.

Graph Gist Winter Challenge Winners

Wednesday, February 26th, 2014

Graph Gist Winter Challenge Winners by Michael Hunger.

From the post:

We received 65 submissions in the 10+ categories. Well done! (emphasis in original)

There were thirty-three (33) winners across these categories:

  • Education
  • Finance
  • Life Science
  • Manufacturing
  • Resources
  • Retail
  • Telecommunication
  • Transport
  • Advanced Graph Gists
  • Other

Great place to start collecting graph patterns for solving particular data modeling issues!

Atom

Wednesday, February 26th, 2014

Atom: A hackable text editor for the 21st Century.

From the webpage:

At GitHub, we’re building the text editor we’ve always wanted. A tool you can customize to do anything, but also use productively on the first day without ever touching a config file. Atom is modern, approachable, and hackable to the core. We can’t wait to see what you build with it.

I can’t imagine anyone improving on Emacs but we might learn something from watching people try. 😉

Some other links of interest:


Twitter: @AtomEditor

Data Access for the Open Access Literature: PLOS’s Data Policy

Wednesday, February 26th, 2014

Data Access for the Open Access Literature: PLOS’s Data Policy by Theo Bloom.

From the post:

Data are any and all of the digital materials that are collected and analyzed in the pursuit of scientific advances. In line with Open Access to research articles themselves, PLOS strongly believes that to best foster scientific progress, the underlying data should be made freely available for researchers to use, wherever this is legal and ethical. Data availability allows replication, reanalysis, new analysis, interpretation, or inclusion into meta-analyses, and facilitates reproducibility of research, all providing a better ‘bang for the buck’ out of scientific research, much of which is funded from public or nonprofit sources. Ultimately, all of these considerations aside, our viewpoint is quite simple: ensuring access to the underlying data should be an intrinsic part of the scientific publishing process.

PLOS journals have requested data be available since their inception, but we believe that providing more specific instructions for authors regarding appropriate data deposition options, and providing more information in the published article as to how to access data, is important for readers and users of the research we publish. As a result, PLOS is now releasing a revised Data Policy that will come into effect on March 1, 2014, in which authors will be required to include a data availability statement in all research articles published by PLOS journals; the policy can be found below. This policy was developed after extensive consultation with PLOS in-house professional and external Academic Editors and Editors in Chief, who are practicing scientists from a variety of disciplines.

We now welcome input from the larger community of authors, researchers, patients, and others, and invite you to comment before March. We encourage you to contact us collectively at; feedback via Twitter and other sources will also be monitored. You may also contact individual PLOS journals directly.

That is a large step towards verifiable research and was taken by PLOS in December of 2013.

That has been supplemented with details that do not change the December announcement in: PLOS’ New Data Policy: Public Access to Data by Liz Silva, which reads in part:

A flurry of interest has arisen around the revised PLOS data policy that we announced in December and which will come into effect for research papers submitted next month. We are gratified to see a huge swell of support for the ideas behind the policy, but we note some concerns about how it will be implemented and how it will affect those preparing articles for publication in PLOS journals. We’d therefore like to clarify a few points that have arisen and once again encourage those with concerns to check the details of the policy or our FAQs, and to contact us with concerns if we have not covered them.

I think the bottom line is: Don’t Panic, Ask.

There are always going to be unanticipated details or concerns but as time goes by and customs develop for how to solve those issues, the questions will become fewer and fewer.

Over time, and not that much time, our history of arrangements other than open access is going to puzzle present and future generations of researchers.

The Small Tools Manifesto For Bioinformatics [Security Too?]

Wednesday, February 26th, 2014

The Small Tools Manifesto For Bioinformatics

From the post:

This MANIFESTO describes motives, rules and recommendations for designing software and pipelines for current day biological and biomedical research.

Large scale data acquisition in research has led to fundamental challenges in (1) scaling of calculations, (2) full data integration and (3) data interaction and visualisation. We think that, because of researchers reaching out to turn-key solutions, the research community is losing sight of the importance of building software on the shoulders of giants and providing solutions in a modular, flexible and open way.

This MANIFESTO counters current trends in bioinformatics where institutes and companies are creating monolithic software solutions aimed mostly at end-users. This MANIFESTO builds on the Unix computer tradition of providing small tools that can be used in a modular and pluggable way to create efficient computational solutions where individual parts can be easily replaced. The manifesto also counters current trends in software licensing which are not truly free and open source (FOSS). We think such a MANIFESTO is necessary, even though history suggests that software created with true FOSS licenses will ultimately prevail over less open licenses, including those licenses for academic use only.

Interesting that I should encounter this less than a week after Back to Basics: Beyond Network Hygiene by Felix ‘FX’ Lindner and Sandro Gaycken.

Linder and Gaycken’s first recommendation:

Therefore, our first recommendation is to significantly increase the granularity of the building blocks, making the individual building blocks significantly smaller than what is done today. (emphasis in original)

Think about it.

More granular building blocks means smaller blocks of code and fewer places for bugs to hide. Messaging between blocks allows for easy tracing of bad messages. Not to mention that small blocks can be repaired and/or replaced more easily than large monoliths.
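A toy illustration of the idea in Python (my sketch, not from the manifesto): three tiny, single-purpose "tools" composed Unix-pipe style, each trivially testable and replaceable on its own:

```python
def compose(*steps):
    """Chain small line-processing tools, Unix-pipe style."""
    def pipeline(lines):
        for step in steps:
            lines = step(lines)
        return lines
    return pipeline

# Each 'tool' does exactly one job and can be swapped out independently,
# which is the manifesto's point about small, pluggable building blocks.
drop_comments = lambda lines: [l for l in lines if not l.startswith("#")]
strip_whitespace = lambda lines: [l.strip() for l in lines]
sort_unique = lambda lines: sorted(set(lines))

clean = compose(drop_comments, strip_whitespace, sort_unique)
print(clean(["# header", "  chr2 ", "chr1", "chr1"]))  # ['chr1', 'chr2']
```

If `sort_unique` turns out to be buggy, you replace one three-line tool, not a monolith.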

The manifesto for small tools in bioinformatics is a great idea.

Shouldn’t we do the same for programming in general to enable robust computer security?*

*Note that computer security isn’t “in addition to” a computer system but is enabled by the granular architecture itself.

I first saw this in a tweet by Vince Buffalo.

715 New Worlds

Wednesday, February 26th, 2014

NASA’s Kepler Mission Announces a Planet Bonanza, 715 New Worlds by Michele Johnson and J.D. Harrington.

From the post:

NASA’s Kepler mission announced Wednesday the discovery of 715 new planets. These newly-verified worlds orbit 305 stars, revealing multiple-planet systems much like our own solar system.

Nearly 95 percent of these planets are smaller than Neptune, which is almost four times the size of Earth. This discovery marks a significant increase in the number of known small-sized planets more akin to Earth than previously identified exoplanets, which are planets outside our solar system.

“The Kepler team continues to amaze and excite us with their planet hunting results,” said John Grunsfeld, associate administrator for NASA’s Science Mission Directorate in Washington. “That these new planets and solar systems look somewhat like our own, portends a great future when we have the James Webb Space Telescope in space to characterize the new worlds.”

Since the discovery of the first planets outside our solar system roughly two decades ago, verification has been a laborious planet-by-planet process. Now, scientists have a statistical technique that can be applied to many planets at once when they are found in systems that harbor more than one planet around the same star.
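The technique, known as "verification by multiplicity," rests on a simple observation: false positives scatter roughly independently across stars, so a star with several candidates is very unlikely to host several false positives at once. A toy calculation, with invented numbers purely for illustration:

```python
# Invented numbers for illustration only; not the mission's actual rates.
false_positive_rate = 0.10  # assumed chance a single candidate is spurious

# If false positives land on stars independently, the chance that ALL
# candidates around a three-candidate star are spurious falls off
# geometrically with the candidate count.
p_all_false = false_positive_rate ** 3
print(round(p_all_false, 4))  # 0.001
```

That geometric fall-off is what lets the team validate whole batches of multi-candidate systems statistically instead of planet by planet.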

What have you discovered lately? 😉

The papers:

More about Kepler:

Great discoveries but what else is in the Kepler data that no one is looking for?

Secrets of Cloudera Support:…

Wednesday, February 26th, 2014

Secrets of Cloudera Support: Inside Our Own Enterprise Data Hub by Adam Warrington.

From the post:

Here at Cloudera, we are constantly pushing the envelope to give our customers world-class support. One of the cornerstones of this effort is the Cloudera Support Interface (CSI), which we’ve described in prior blog posts (here and here). Through CSI, our support team is able to quickly reason about a customer’s environment, search for information related to a case currently being worked, and much more.

In this post, I’m happy to write about a new feature in CSI, which we call Monocle Stack Trace.

Stack Trace Exploration with Search

Hadoop log messages and the stack traces in those logs are critical information in many of the support cases Cloudera handles. We find that our customer operation engineers (COEs) will regularly search for stack traces they find referenced in support cases to try to determine where else that stack trace has shown up, and in what context it would occur. This could be in the many sources we were already indexing as part of Monocle Search in CSI: Apache JIRAs, Apache mailing lists, internal Cloudera JIRAs, internal Cloudera mailing lists, support cases, Knowledge Base articles, Cloudera Community Forums, and the customer diagnostic bundles we get from Cloudera Manager.

It turns out that doing routine document searches for stack traces doesn’t always yield the best results. Stack traces are relatively long compared to normal search terms, so search indexes won’t always return the relevant results in the order you would expect. It’s also hard for a user to churn through the search results to figure out if the stack trace was actually an exact match in the document to figure out how relevant it actually is.

To solve this problem, we took an approach similar to what Google does when it wants to allow searching over a type that isn’t best suited for normal document search (such as images): we created an independent index and search result page for stack-trace searches. In Monocle Stack Trace, the search results show a list of unique stack traces grouped with every source of data in which the unique stack trace was discovered. Each source can be viewed in-line in the search result page, or the user can go to it directly by following a link.

We also give visual hints as to how the stack trace for which the user searched differs from the stack traces that show up in the search results. A green highlighted line in a search result indicates a matching call stack line. Yellow indicates a call stack line that only differs in line number, something that may indicate the same stack trace on a different version of the source code. A screenshot showing the grouping of sources and visual highlighting is below:
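The frame-by-frame comparison Adam describes could be sketched roughly like this (my guess at an implementation, not Cloudera's code):

```python
import re

LINE_NUM = re.compile(r":\d+\)")

def classify_frames(trace_a, trace_b):
    """Compare two Java-style stack traces frame by frame: 'match'
    (green), 'line_only' (yellow: same frame, different line number,
    suggesting a different source version), or 'differ'."""
    labels = []
    for a, b in zip(trace_a.splitlines(), trace_b.splitlines()):
        if a == b:
            labels.append("match")
        elif LINE_NUM.sub(":N)", a) == LINE_NUM.sub(":N)", b):
            labels.append("line_only")
        else:
            labels.append("differ")
    return labels

query = "at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1234)\n" \
        "at org.apache.hadoop.ipc.Client.call(Client.java:56)"
hit   = "at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1234)\n" \
        "at org.apache.hadoop.ipc.Client.call(Client.java:99)"
print(classify_frames(query, hit))  # ['match', 'line_only']
```

Masking out the line numbers before comparing is what makes the "same code, different version" case detectable at all.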

See Adam’s post for the details.

I like the imaginative modification of standard search.

Not all data is the same and searching it as if it were, leaves a lot of useful data unfound.

Open science in machine learning

Wednesday, February 26th, 2014

Open science in machine learning by Joaquin Vanschoren, Mikio L. Braun, and Cheng Soon Ong.


We present OpenML and mldata, open science platforms that provide easy access to machine learning data, software and results to encourage further study and application. They go beyond the more traditional repositories for data sets and software packages in that they allow researchers to also easily share the results they obtained in experiments and to compare their solutions with those of others.

From 2 OpenML:

OpenML ( is a website where researchers can share their data sets, implementations and experiments in such a way that they can easily be found and reused by others. It offers a web API through which new resources and results can be submitted automatically, and is being integrated in a number of popular machine learning and data mining platforms, such as Weka, RapidMiner, KNIME, and data mining packages in R, so that new results can be submitted automatically. Vice versa, it enables researchers to easily search for certain results (e.g. evaluations of algorithms on a certain data set), to directly compare certain techniques against each other, and to combine all submitted data in advanced queries.

From 3 mldata:

mldata ( is a community-based website for the exchange of machine learning data sets. Data sets can either be raw data files or collections of files, or use one of the supported file formats like HDF5 or ARFF in which case mldata looks at meta data contained in the files to display more information. Similar to OpenML, mldata can define learning tasks based on data sets, where mldata currently focuses on supervised learning data. Learning tasks identify which features are used for input and output and also which score is used to evaluate the functions. mldata also allows to create learning challenges by grouping learning tasks together, and lets users submit results in the form of predicted labels which are then automatically evaluated.

Interesting sites.

It does raise the question, though, of who will index the indexers of datasets.

I first saw this in a tweet by Stefano Betolo.

Lucene and Solr 4.7

Wednesday, February 26th, 2014

Lucene and Solr 4.7

From the post:

Today the Apache Lucene and Solr PMC announced a new version of the Apache Lucene library and the Apache Solr search server – 4.7. This is another release from the 4.x branch, bringing new functionality and bugfixes.

Apache Lucene 4.7 library can be downloaded from the following address: Apache Solr 4.7 can be downloaded at the following URL address: Release note for Apache Lucene 4.7 can be found at:, Solr release notes can be found at:

Time to upgrade, again!

How to uncover a scandal from your couch

Wednesday, February 26th, 2014

How to uncover a scandal from your couch by Brad Racino and Joe Yerardi.

From the post:

News broke in San Diego last week about a mysterious foreign national bent on influencing San Diego politics by illegally funneling money to political campaigns through a retired San Diego police detective and an undisclosed “straw donor.” Now, the politicians on the receiving end of the tainted funds are scrambling to distance themselves from the scandal.

The main piece of information that started it all was an unsealed FBI complaint.

Fifteen pages in length, the report named only the detective, Ernesto Encinas, and a self-described “campaign guru” in DC, Ravneet Singh, as conspirators. The names of everyone else involved were not disclosed. Instead, politicians were “candidates,” the moneymen were called “the Straw Donor” and “the Foreign National,” and informants were labeled “CIs.”

It didn’t take long for reporters to piece together clues — mainly by combing through publicly-accessible information from the San Diego City Clerk’s website — to uncover who was involved.

A great post that walks you through the process of taking a few facts and hints from an FBI complaint and fleshing it out with details.

This could make an exciting exercise for a library class.

I first saw this at: Comment and a link to a SearchResearch-like story by Daniel M. Russell.

Beyond Monads: Practical Category Theory

Wednesday, February 26th, 2014

Beyond Monads: Practical Category Theory by Jim Duey.

From the description:

Category Theory is a rich toolbox of techniques to deal with complexity in building software. Unfortunately, the learning curve is steep, with many abstract concepts that must be grasped before their utility is seen. This, coupled with an impenetrable vocabulary, is a barrier to the average programmer leveling up their capabilities, so many of these benefits are not realized in industry.

This talk gives the working programmer an avenue to begin using the tools Category Theory gives us by building a series of small steps starting from pure functions. The stage is set by describing how complexity arises from functional side effects. It then moves on to various forms of composition, gradually adding more structure to slice the accidental complexity into manageable pieces. Some of the vocabulary used in Category Theory will be explained to give listeners some signposts in their future study of these concepts.

Listeners will leave this talk with a new perspective on the problems they face in their day-to-day programming tasks and some tools on how to address them, regardless of their chosen programming language.
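As a flavor of the "small steps starting from pure functions" approach, here is my own sketch (not from the slides) of composing functions whose results may be absent, without nested error checks:

```python
def bind(value, fn):
    """Maybe-style bind: short-circuit on None instead of raising."""
    return None if value is None else fn(value)

def kleisli(*fns):
    """Compose functions that may return None, a poor man's Kleisli
    composition for the Maybe monad."""
    def composed(x):
        for fn in fns:
            x = bind(x, fn)
        return x
    return composed

def half(n):
    return n // 2 if n % 2 == 0 else None  # 'fails' on odd input

quarter = kleisli(half, half)
print(quarter(8), quarter(6))  # 2 None
```

The failure handling lives in `bind` once, rather than being repeated at every call site; that is the kind of complexity-slicing the talk is about.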

I would dearly love to see the presentation that went along with these slides!

Jim also lists further references for which I have supplied links:

Monads – Wadler “Monads for Functional Programming

Applicative Functors – McBride/Paterson “Applicative Programming with Effects

Comonads – Kieburtz “Codata and Comonads in Haskell

Arrows – Hughes “Generalizing Monads to Arrows

Jim does supply the link for: Haskell Typeclassopedia

I guess it comes down to this question: Are you now or do you plan to be a parallel programmer?

RDF 1.1: On Semantics of RDF Datasets

Wednesday, February 26th, 2014

RDF 1.1: On Semantics of RDF Datasets


RDF defines the concept of RDF datasets, a structure composed of a distinguished RDF graph and zero or more named graphs, the latter being pairs comprising an IRI or blank node and an RDF graph. While RDF graphs have a formal model-theoretic semantics that determines what arrangements of the world make an RDF graph true, no agreed formal semantics exists for RDF datasets. This document presents some issues to be addressed when defining a formal semantics for datasets, as they have been discussed in the RDF 1.1 Working Group, and specifies several semantics in terms of model theory, each corresponding to a certain design choice for RDF datasets.

I can see how not knowing the semantics of a dataset could be problematic.

What puzzles me about this particular effort is that it appears to be an attempt to define the semantics of RDF datasets for others. Yes?

That activity depends either upon semantics being inherent in an RDF dataset, so that everyone can “discover” the same semantics, or upon such uniform semantics being conferred upon an RDF dataset by decree.

The first possibility, that RDF datasets have inherent semantics, need not delay us: this activity started because different people saw different semantics in RDF datasets. That alone is sufficient to defeat any proposal based on “inherent” semantics.

The second possibility, that of defining and conferring semantics, seems equally problematic to me.

In part because there is no enforcement mechanism that can prevent users of RDF datasets from assigning any semantics they like to a dataset.

This remains important work, but I would change the emphasis to defining what this group considers to be the semantics of RDF datasets, together with a mechanism that allows others to signal their agreement with those semantics for a particular dataset.

That has the advantage that other users can adopt wholesale an entire set of semantics for an RDF dataset, one which hopefully reflects the semantics with which the dataset should be processed.

Declaring semantics may help avoid users silently using inconsistent semantics for the same datasets.

Web Scraping part2: Digging deeper

Tuesday, February 25th, 2014

Web Scraping part2: Digging deeper by Rolf Fredheim.

From the post:

Slides from the second web scraping through R session: Web scraping for the humanities and social sciences.

In which we make sure we are comfortable with functions, before looking at XPath queries to download data from newspaper articles. Examples including BBC news and Guardian comments.

Download the .Rpres file to use in Rstudio here.

A regular R script with the code only can be accessed here.

A great part 2 on web scrapers!
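The sessions use R, but the XPath idea carries over directly. A stdlib-only Python sketch on a toy document (the element names are hypothetical, not the BBC's or Guardian's actual markup):

```python
import xml.etree.ElementTree as ET

# A toy, well-formed page standing in for a news article listing.
page = """
<html>
  <body>
    <div class="article"><h2>Headline one</h2><p>Body text.</p></div>
    <div class="article"><h2>Headline two</h2><p>More text.</p></div>
    <div class="sidebar"><h2>Ignore me</h2></div>
  </body>
</html>
"""

root = ET.fromstring(page)

# ElementTree supports a useful XPath subset, including
# attribute predicates like [@class='article'].
headlines = [div.find("h2").text
             for div in root.findall(".//div[@class='article']")]

print(headlines)  # ['Headline one', 'Headline two']
```

Real pages are rarely well-formed XML, which is why the R code (and Python equivalents such as lxml) lean on tolerant HTML parsers; the XPath expressions themselves are the portable part.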

The Data Mining Group releases PMML v 4.2

Tuesday, February 25th, 2014

The Data Mining Group releases PMML v 4.2

From the announcement:

“As a standard, PMML provides the glue to unify data science and operational IT. With one common process and standard, PMML is the missing piece for Big Data initiatives to enable rapid deployment of data mining models. Broad vendor support and rapid customer adoption demonstrates that PMML delivers on its promise to reduce cost, complexity and risk of predictive analytics,” says Alex Guazzelli, Vice President of Analytics, Zementis. “You can not build and deploy predictive models over big data without using multiple models and no one should build multiple models without PMML,” says Bob Grossman, Founder and Partner at Open Data Group.

Some of the elements that are new to PMML v4.2 include:

  • Improved support for post-processing, model types, and model elements
  • A completely new element for text mining
  • Scorecards now introduce the ability to compute points based on expressions
  • New built-in functions, including “matches” and “replace” for the use of regular expressions

(emphasis added)

Hmmm, do you think they meant that before 4.2 they didn’t have “matches” and “replace”? (I checked; they didn’t.)
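For the curious, the new built-ins are invoked like any other PMML function. A sketch of a derived field that strips non-digits from a string, assuming the usual DerivedField/Apply pattern (the field names are hypothetical; check the PMML 4.2 built-in functions page for the exact signatures):

```xml
<DerivedField name="cleanPhone" dataType="string" optype="categorical">
  <!-- replace(input, pattern, replacement) -->
  <Apply function="replace">
    <FieldRef field="rawPhone"/>
    <Constant>[^0-9]</Constant>
    <Constant></Constant>
  </Apply>
</DerivedField>
```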

However, kudos on the presentation of their schema, both current and prior versions.

If only more XML schemas had such documentation/presentation.

See PMML v4.2 General Structure.

I first saw this at: The Data Mining Group releases PMML v4.2 Predictive Modeling Standard.

SEC Filings for Humans

Tuesday, February 25th, 2014

SEC Filings for Humans by Meris Jensen.

After recounting the long and sad history of the SEC’s failure to make EDGAR useful, Meris writes:

Rank and Filed gathers data from EDGAR, indexes it, and returns it in formats meant to help investors research, investigate and discover companies on their own. I started googling ‘How to build a website’ seven months ago. The SEC has web developers, software developers, database administrators, XBRL experts, legions of academics who specialize in SEC filings, and all this EDGAR data already cached in the cloud. The Commission’s mission is to protect investors, maintain fair, orderly and efficient markets, and facilitate capital formation. Why did I have to build this? (emphasis added)

I don’t know the answer to Meris’ question but I can tell you that Rank and Filed is an incredible resource for financial information.

And yet another demonstration that government should not manage open data. Make it available. (full stop)

I first saw this at Nathan Yau’s A human-readable explorer for SEC filings.

Project Open Data

Tuesday, February 25th, 2014

Project Open Data

From the webpage:

Data is a valuable national resource and a strategic asset to the U.S. Government, its partners, and the public. Managing this data as an asset and making it available, discoverable, and usable – in a word, open – not only strengthens our democracy and promotes efficiency and effectiveness in government, but also has the potential to create economic opportunity and improve citizens’ quality of life.

For example, when the U.S. Government released weather and GPS data to the public, it fueled an industry that today is valued at tens of billions of dollars per year. Now, weather and mapping tools are ubiquitous and help everyday Americans navigate their lives.

The ultimate value of data can often not be predicted. That’s why the U.S. Government released a policy that instructs agencies to manage their data, and information more generally, as an asset from the start and, wherever possible, release it to the public in a way that makes it open, discoverable, and usable.

The White House developed Project Open Data – this collection of code, tools, and case studies – to help agencies adopt the Open Data Policy and unlock the potential of government data. Project Open Data will evolve over time as a community resource to facilitate broader adoption of open data practices in government. Anyone – government employees, contractors, developers, the general public – can view and contribute. Learn more about Project Open Data Governance and dive right in and help to build a better world through the power of open data.

An impressive list of tools and materials for federal (United States) agencies seeking to release data.

And as Ryan Swanstrom says in his post:

Best of all, the entire project is available on GitHub and contributions are welcomed.

Thoughts on possible contributions?

OCLC Preview 194 Million…

Tuesday, February 25th, 2014

OCLC Preview 194 Million Open Bibliographic Work Descriptions by Richard Wallis.

From the post:

I have just been sharing a platform, at the OCLC EMEA Regional Council Meeting in Cape Town South Africa, with my colleague Ted Fons. A great setting for a great couple of days of the OCLC EMEA membership and others sharing thoughts, practices, collaborative ideas and innovations.

Ted and I presented our continuing insight into The Power of Shared Data, and the evolving data strategy for the bibliographic data behind WorldCat. If you want to see a previous view of these themes you can check out some recordings we made late last year on YouTube, from Ted – The Power of Shared Data – and me – What the Web Wants.

Today, demonstrating on-going progress towards implementing the strategy, I had the pleasure to preview two upcoming significant announcements on the WorldCat data front:

  1. The release of 194 Million Linked Data Bibliographic Work descriptions
  2. The WorldCat Linked Data Explorer interface

A preview release to be sure but one worth following!

Particularly with 194 million bibliographic work descriptions!

See Richard’s post for the details.

NOAA Moves to Unleash “Big Data”…

Tuesday, February 25th, 2014

NOAA Moves to Unleash “Big Data” and Calls Upon American Companies to Help by Kathryn Sullivan, Ph.D., Acting Undersecretary of Commerce for Oceans and Atmosphere and Acting NOAA Administrator.

RFI: Deadline March 24, 2014.

From the post:

From the surface of the sun to the depths of the ocean floor, the National Oceanic and Atmospheric Administration (NOAA), part of the Department of Commerce, works to keep citizens informed about the changing environment around them. Our vast network of radars, satellites, buoys, ships, aircraft, tide gauges, and supercomputers keeps tabs on the condition of our planet’s health and provides critical data that are used to predict changes in climate, weather, oceans, and coastlines. As we continue to witness changes on this dynamic planet we call home, the demand for NOAA’s data is only increasing.

Quite simply, NOAA is the quintessential big data agency. Each day, NOAA collects, analyzes, and generates over 20 terabytes of data – twice the amount of data than what is in the United States Library of Congress’ entire printed collection. However, only a small percentage is easily accessible to the public.

NOAA is not the only Commerce agency with a treasure trove of valuable information. The economic and demographic statistics from the Census Bureau, for example, inform business decisions every day. According to a 2013 McKinsey Global Institute Report, open data could add more than $3 trillion in total value annually to the education, transportation, consumer products, electricity, oil and gas, health care, and consumer finance sectors worldwide. That is why U.S. Secretary of Commerce Penny Pritzker has made unleashing the power of Commerce data one of the top priorities of the Department’s “Open for Business Agenda.”

All of that to lead up to:

That’s why we have released a Request for Information (RFI) to help us explore the feasibility of this concept and the range of possibilities to accomplish our goal. At no cost to taxpayers, this RFI calls upon the talents of America’s best minds to help us find the data and IT delivery solutions they need and should have.

This was released on February 21, 2014, so potential responders had at most thirty-two (32) days to respond to an RFI that describes the need and the data sets in the broadest possible terms.

The “…no cost to taxpayers…” is particularly ironic, since anyone re-marketing the data to the public isn’t going to do so for free. Some public projects may, but the commercial vendors won’t.

A better strategy would be for NOAA to release 10% of each distinct data set collected over the past two years at a cloud download location, along with its documentation, an indication of how much data exists for each data set, the project it belongs to, and contact details.

Let real users and commercial vendors rummage through the 10% data to see what is of interest, how it can be processed, etc.
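Producing such a sample is cheap and can be made reproducible. A deterministic, hash-based sketch (the record IDs and the 10% threshold are illustrative):

```python
import hashlib

def in_sample(record_id, percent=10):
    """Deterministically select ~percent% of records by hashing their IDs.

    The same record is always in or out of the sample, so the
    published subset is stable across re-runs.
    """
    digest = hashlib.sha256(record_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent

records = [f"obs-{i:05d}" for i in range(10_000)]
sample = [r for r in records if in_sample(r)]

print(len(sample))  # roughly 1,000 of the 10,000
```

Because membership depends only on the record ID, NOAA could later publish the remaining 90% without invalidating anything built against the sample.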

If NOAA wants real innovation, stop trying to manage it.

Managed innovation gets you Booz Allen type results. Is that what you want?

Starting to Demo the Wolfram Language [Echoes of Veg-o-matic?]

Tuesday, February 25th, 2014

Starting to Demo the Wolfram Language by Stephen Wolfram.

From the post:

We’re getting closer to the first official release of the Wolfram Language—so I am starting to demo it more publicly.

Here’s a short video demo I just made. It’s amazing to me how much of this is based on things I hadn’t even thought of just a few months ago. Knowledge-based programming is going to be much bigger than I imagined…

You really need to watch this video.

Impressive demo, but how does it run on ordinary hardware and network connectivity?

Not unlike the old Veg-o-matic commercial:

Notice where the table top is located relative to his waist. Do you have a table like that at home? If not, the Veg-o-matic isn’t going to work the same for you.

Was Stephen on his local supercomputer cluster or a laptop?

Excited but more details needed.