Archive for October, 2012

Coming soon on JAXenter: videos from JAX London [What Does Hardware Know?]

Wednesday, October 31st, 2012

Coming soon on JAXenter: videos from JAX London by Elliot Bentley.

From the post:

Can you believe it’s only been two weeks since JAX London? We’re already planning for the next one at JAX Towers (yes, really).

Yet if you’re already getting nostalgic, never fear – JAXenter is on hand to help you relive those glorious yet fleeting days, and give a taste of what you may have missed.

For a start, we’ve got videos of almost every session in the main room, including keynotes from Doug Cutting, Patrick Debois, Steve Poole and Martijn Verburg & Kirk Pepperdine, which we’ll be releasing gradually onto the site over the coming weeks. Slides for the rest of JAX London’s sessions are already freely available on SlideShare.

In “Java and the Machine,” Pepperdine and Verburg remark:

There’s no such thing as a process as far as the hardware is concerned.

A riff I need to steal to say:

There’s no such thing as semantics as far as the hardware is concerned.

We attribute semantics to data for input, we attribute semantics to processing of data by hardware, we attribute semantics to computational results.

I didn’t see a place for hardware in that statement. Do you?

One To Watch: Apache Crunch

Wednesday, October 31st, 2012

One To Watch: Apache Crunch by Chris Mayer.

From the post:

Over the past few years, the Apache Software Foundation has become the hub for big data-focused projects. An array of companies have recognised the worth of housing their latest innovative projects at the ASF, with Apache Hadoop and Apache Cassandra two shining examples.

Amongst the number of projects arriving in the Apache Incubator was Apache Crunch. Crunch is a Java library created to eliminate the tedium of writing a MapReduce pipeline. It aims to take hold of the entire process, making writing, testing, and running MapReduce pipelines more efficient and “even fun” (if this Cloudera blog post is to be believed).

That’s a tall order, to make MapReduce pipelines “even fun.” On the other hand, remarkable things have emerged from Apache for decades now.

A project to definitely keep in sight.

Artificial Intelligence – Fall 2012 – CMU

Wednesday, October 31st, 2012

From the course overview:

Topics:

This course is about the theory and practice of Artificial Intelligence. We will study modern techniques for computers to represent task-relevant information and make intelligent (i.e. satisfying or optimal) decisions towards the achievement of goals. The search and problem solving methods are applicable throughout a large range of industrial, civil, medical, financial, robotic, and information systems. We will investigate questions about AI systems such as: how to represent knowledge, how to effectively generate appropriate sequences of actions and how to search among alternatives to find optimal or near-optimal solutions. We will also explore how to deal with uncertainty in the world, how to learn from experience, and how to learn decision rules from data. We expect that by the end of the course students will have a thorough understanding of the algorithmic foundations of AI, how probability and AI are closely interrelated, and how automated agents learn. We also expect students to acquire a strong appreciation of the big-picture aspects of developing fully autonomous intelligent agents. Other lectures will introduce additional aspects of AI, including unsupervised and on-line learning, autonomous robotics, and economic/game-theoretic decision making.

Learning Objectives

By the end of the course, students should be able to:

1. Identify the type of an AI problem (search, inference, decision making under uncertainty, game theory, etc).
2. Formulate the problem as a particular type. (Example: define a state space for a search problem)
3. Compare the difficulty of different versions of AI problems, in terms of computational complexity and the efficiency of existing algorithms.
4. Implement, evaluate and compare the performance of various AI algorithms. Evaluation could include empirical demonstration or theoretical proofs.

Textbook:

It is helpful, but not required, to have Artificial Intelligence: A Modern Approach / Russell and Norvig.

Judging from the materials on the website, this is a very good course.
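To make the second learning objective concrete (“define a state space for a search problem”), here is a minimal sketch. It assumes a toy problem that is not from the course materials: two water jugs of capacity 4 and 3, with the goal of measuring out 2. The state space is a start state, a successor function and a goal test, handed to breadth-first search. All names are hypothetical.

```python
from collections import deque

CAPACITIES = (4, 3)   # hypothetical jug sizes, chosen for illustration
GOAL = 2              # target amount in either jug

def successors(state):
    """Yield every state reachable in one action (fill, empty, pour)."""
    a, b = state
    cap_a, cap_b = CAPACITIES
    yield (cap_a, b)              # fill jug A
    yield (a, cap_b)              # fill jug B
    yield (0, b)                  # empty jug A
    yield (a, 0)                  # empty jug B
    pour = min(a, cap_b - b)      # pour A into B
    yield (a - pour, b + pour)
    pour = min(b, cap_a - a)      # pour B into A
    yield (a + pour, b - pour)

def is_goal(state):
    return GOAL in state

def bfs(start):
    """Return a shortest list of states from start to a goal, or None."""
    frontier = deque([start])
    parent = {start: None}        # also serves as the visited set
    while frontier:
        state = frontier.popleft()
        if is_goal(state):
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = state
                frontier.append(nxt)
    return None

path = bfs((0, 0))
print(path)
```

Swapping the deque for a priority queue keyed on path cost plus a heuristic would turn the same formulation into A*, which is the point of separating the state space from the search method.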

Building apps with rail data

Wednesday, October 31st, 2012

Building apps with rail data by Jamie Andrews.

From the post:

Recently we ran a hack day Off the Rails to take the best rail data and see what can be built with it. I remain stunned at the range and quality of the output, particularly because of the complexity of the subject matter, and the fact that a lot of the developers hadn’t built any train-related software before.

So check out all of the impressive, useful and fun train hacks, and marvel at what can be done when data is opened and great minds work together…

The hacks really are impressive so I will just list the titles and hope that induces you to visit Jamie’s post:

Hack #1 – Trainspot.in… FourSquare for trains
Hack #2 – Journey planner maps with lines that follow the tracks
Hack #3 – Scenic railways
Hack #4 – Realtime Dutch trains
Hack #5 – ChooChooTune
Hack #6 – Realtimetrains
Hack #7 – Follow the rails
Hack #8 – [cycling in the UK]
Hack #9 – [I’ll meet you half-way?]
Hack #10 – [train delays, most delayed]

(No titles given for 8-10 so I made up titles.)

MongoSV 2012

Wednesday, October 31st, 2012

MongoSV 2012

From the webpage:

December 4th Santa Clara, CA

MongoSV is an annual one-day conference in Silicon Valley, CA, dedicated to the open source, non-relational database MongoDB.

There are five (5) tracks, morning and afternoon sessions, and a final session followed by a conference party from 5:30 PM to 8 PM.

Any summary is going to miss something of interest for someone. Take the time to review the schedule.

While you are there, register for the conference as well. A unique annual opportunity to mix-n-meet with MongoDB enthusiasts!

Make your own buckyball

Wednesday, October 31st, 2012

Make your own buckyball by John D. Cook.

From the post:

This weekend a couple of my daughters and I put together a buckyball from a Zometool kit. The shape is named for Buckminster Fuller of geodesic dome fame. Two years after Fuller’s death, scientists discovered that the shape appears naturally in the form of a C60 molecule, named Buckminsterfullerene in his honor. In geometric lingo, the shape is a truncated icosahedron. It’s also the shape of many soccer balls.

Don’t be embarrassed to use these at the office.

According to the PR, Roger Penrose does.
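If you want to check the geometry before you build, the counts for a truncated icosahedron follow from its faces. The sketch below relies on a fact not stated in John’s post, that the shape has 12 pentagonal and 20 hexagonal faces, with three faces meeting at every vertex and two at every edge.

```python
# Face counts for a truncated icosahedron (standard geometry, not from the post).
pentagons, hexagons = 12, 20

faces = pentagons + hexagons
corners = 5 * pentagons + 6 * hexagons   # face corners, counted once per face

vertices = corners // 3   # each vertex is shared by three faces
edges = corners // 2      # each edge is shared by two faces

euler = vertices - edges + faces         # Euler's formula for convex polyhedra
print(vertices, edges, faces, euler)     # 60 90 32 2
```

The 60 vertices are the 60 carbon atoms of the C60 molecule.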

Wikidata

Wednesday, October 31st, 2012

Wikidata

From the webpage:

Wikidata is a free knowledge base that can be read and edited by humans and machines alike. It is for data what Wikimedia Commons is for media files: it centralizes access and management of structured data, such as interwiki references and statistical information. Wikidata contains data in all languages for which there are Wikimedia projects.

Not fully operational but still quite interesting.

Particularly the re-use of information aspects.

Re-use of data being one advantage commonly found in topic maps.

MongoDB and Fractal Tree Indexes (Webinar) [13 November 2012]

Wednesday, October 31st, 2012

Webinar: MongoDB and Fractal Tree Indexes by Tim Callaghan.

From the post:

This webinar covers the basics of B-trees and Fractal Tree Indexes, the benchmarks we’ve run so far, and the development road map going forward.

Date: November 13th
Time: 2 PM EST / 11 AM PST
REGISTER TODAY

If you aren’t familiar with Fractal Tree Indexes and MongoDB, this is your opportunity to catch up!

LEARN programming by visualizing code execution

Wednesday, October 31st, 2012

LEARN programming by visualizing code execution

From the webpage:

Online Python Tutor is a free educational tool that helps students overcome a fundamental barrier to learning programming: understanding what happens as the computer executes each line of a program’s source code. Using this tool, a teacher or student can write a Python program directly in the web browser and visualize what the computer is doing step-by-step as it executes the program.

Of immediate significance for anyone learning or teaching Python.
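For a rough idea of what “visualize what the computer is doing step-by-step” involves, here is a minimal sketch built on Python’s standard sys.settrace hook. It is not how Online Python Tutor is implemented (I have not looked), and the helper names are hypothetical. It records the local variables just before each line of one function runs.

```python
import sys

def trace_lines(func, *args):
    """Run func(*args); return its result and a list of
    (line offset within func, snapshot of locals) taken before each line runs."""
    steps = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None                      # ignore frames of other functions
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            steps.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()                # restore any existing tracer afterwards
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)
    return result, steps

def total(n):
    acc = 0
    for i in range(n):
        acc += i
    return acc

result, steps = trace_lines(total, 3)
for offset, local_vars in steps:
    print(f"line +{offset}: {local_vars}")
```

Each printed row is one frame of the “movie” the tutor draws: which line is about to run and what every variable holds at that moment.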

Longer range, something similar for merging data from different sources could be useful as well.

At its simplest, color could represent the percentage of information drawn from particular sources, for the map as a whole or for items in the map. That would also illustrate “what if we take away X?” as a form of source analysis.
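The numbers behind such a display are easy to sketch. The example below assumes merged items that each record a single contributing source; the item layout and source names are hypothetical.

```python
from collections import Counter

# Hypothetical merged items, each tagged with the source that contributed it.
items = [
    {"id": "t1", "source": "wikidata"},
    {"id": "t2", "source": "wikidata"},
    {"id": "t3", "source": "marc"},
    {"id": "t4", "source": "tweets"},
]

def source_shares(items):
    """Percentage of items contributed by each source."""
    counts = Counter(item["source"] for item in items)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {src: 100.0 * n / total for src, n in counts.items()}

def without_source(items, source):
    """The 'what if we take away X' view: drop everything from one source."""
    return [item for item in items if item["source"] != source]

print(source_shares(items))
print(source_shares(without_source(items, "wikidata")))
```

Mapping each percentage to a color is then a display decision, separate from the accounting.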

I first saw this at Christophe Lalanne’s A bag of tweets / October 2012.

MDM: It’s Not about One Version of the Truth

Wednesday, October 31st, 2012

MDM: It’s Not about One Version of the Truth by Michele Goetz.

From the post:

Here is why I am not a fan of the “single source of truth” mantra. A person is not one-dimensional; they can be a parent, a friend, a colleague and each has different motivations and requirements depending on the environment. A product is as much about the physical aspect as it is the pricing, message, and sales channel it is sold through. Or, it is also faceted by the fact that it is put together from various products and parts from partners. In no way is a master entity unique or has a consistency depending on what is important about the entity in a given situation. What MDM provides are definitions and instructions on the right data to use in the right engagement. Context is a key value of MDM.

When organizations have implemented MDM to create a golden record and single source of truth, domain models are extremely rigid and defined only within a single engagement model for a process or reporting. The challenge is the master entity is global in nature when it should have been localized. This model does not allow enough points of relationship to create the dimensions needed to extend beyond the initial scope. If you want to now extend, you need to rebuild your MDM model. This is essentially starting over or you ignore and build a layer of redundancy and introduce more complexity and management.

The line:

The challenge is the master entity is global in nature when it should have been localized.

stopped me cold.

What if I said:

“The challenge is a subject proxy is global in nature when it should have been localized.”

Would your reaction be the same?

Shouldn’t subject identity always be local?

Or perhaps better, have you ever experienced a subject identification that wasn’t local?

We may talk about a universal notion of subject but even so we are using a localized definition of universal subject.

If a subject proxy is a container for local identifications, thought to be identifications of the same subject, need we be concerned if it doesn’t claim to be a universal representative for some subject? Or is it sufficient that it is a faithful representative of one or more identifications, thought by some collector to identify the same subject?

I am leaning towards the latter because it jettisons the doubtful baggage of universality.

That is, a subject may have more than one collection of local identifications (such collections being subject proxies), none of which is the universal representative for that subject.

Even if we think another collection represents the same subject, merging those collections is a question of your requirements.
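A minimal sketch of that view, under my own assumptions: a subject proxy is just a collector’s set of local identifications, and merging happens only when a caller-supplied requirement says two proxies identify the same subject. The class, function names and identifiers are hypothetical, not taken from any topic map standard or MDM product.

```python
from dataclasses import dataclass, field

@dataclass
class SubjectProxy:
    """One collector's container of local identifications.
    It makes no claim to be the universal representative of a subject."""
    collector: str
    identifications: set = field(default_factory=set)

def merge_if(a, b, same_subject):
    """Merge two proxies only when the caller's requirement is met."""
    if not same_subject(a, b):
        return None
    return SubjectProxy(
        collector=f"{a.collector}+{b.collector}",
        identifications=a.identifications | b.identifications,
    )

def shares_identifier(a, b):
    """One possible requirement: at least one identification in common."""
    return bool(a.identifications & b.identifications)

crm = SubjectProxy("crm", {"email:pat@example.com", "cust:1042"})
billing = SubjectProxy("billing", {"cust:1042", "acct:77"})
support = SubjectProxy("support", {"ticket-user:pat"})

merged = merge_if(crm, billing, shares_identifier)
unmerged = merge_if(crm, support, shares_identifier)
print(merged)
print(unmerged)
```

A different requirement (say, a human reviewer’s sign-off) would plug in as a different same_subject function, and the merged result would still be one more local collection rather than a golden record.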

PS: You need to read Michele’s original post to discover what could entice management to fund an MDM project. Interoperability of data isn’t it.

The one million tweet map

Tuesday, October 30th, 2012

The one million tweet map

Displays the last one million tweets by geographic location, plus the top five (5) hashtags.

So tweets are not just strings of 140 or fewer characters, they are locations as well. Wondering how far you can take re-purposing of a tweet?
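As a rough sketch of that re-purposing, the same records can be read two ways: as text (top hashtags) and as locations (counts per map cell). The tweet records and field names below are made up for illustration and are not Twitter’s API schema.

```python
import re
from collections import Counter

# Hypothetical tweet records; field names are assumptions.
tweets = [
    {"text": "Flooding downtown #sandy #nyc", "lat": 40.71, "lon": -74.01},
    {"text": "Power is out #sandy", "lat": 40.73, "lon": -73.99},
    {"text": "Talks start today #jaxlondon", "lat": 51.51, "lon": -0.13},
    {"text": "Stay safe everyone #sandy #nyc", "lat": 40.65, "lon": -73.95},
]

HASHTAG = re.compile(r"#(\w+)")

def top_hashtags(tweets, n=5):
    """The n most frequent hashtags, case-insensitive."""
    counts = Counter(
        tag.lower() for t in tweets for tag in HASHTAG.findall(t["text"])
    )
    return counts.most_common(n)

def grid_cell(tweet, size=1.0):
    """Bucket a tweet into a size-degree latitude/longitude cell."""
    return (int(tweet["lat"] // size), int(tweet["lon"] // size))

by_cell = Counter(grid_cell(t) for t in tweets)

print(top_hashtags(tweets))
print(by_cell)
```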

I first saw this at Mashable.com.

BTW, I don’t find the Adobe Social ad (part of the video at Mashable) all that convincing.

You?

Solr vs ElasticSearch: Part 4 – Faceting

Tuesday, October 30th, 2012

Solr vs ElasticSearch: Part 4 – Faceting by Rafał Kuć.

From the post:

Solr 4 (aka SolrCloud) has just been released, so it’s the perfect time to continue our ElasticSearch vs. Solr series. In the last three parts of the ElasticSearch vs. Solr series we gave a general overview of the two search engines, about data handling, and about their full text search capabilities. In this part we look at how these two engines handle faceting.

Rafał continues his excellent comparison of Solr and ElasticSearch.

Understanding your software options is almost as important as understanding your data.
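If faceting is new to you, the core idea fits in a few lines: run a query, then count the values of a field across the matching documents. The sketch below is plain Python over made-up documents. It does not use the Solr or ElasticSearch APIs, and every name in it is hypothetical.

```python
from collections import Counter

# Made-up documents standing in for an index.
docs = [
    {"title": "Solr search cookbook", "category": "search", "year": 2012},
    {"title": "Search with ElasticSearch", "category": "search", "year": 2012},
    {"title": "Hadoop basics", "category": "bigdata", "year": 2011},
    {"title": "Search patterns", "category": "ux", "year": 2010},
]

def search(docs, term):
    """Naive full-text match on the title field."""
    return [d for d in docs if term.lower() in d["title"].lower()]

def facet(hits, field_name):
    """Field facet: how many hits carry each value of field_name."""
    return Counter(d[field_name] for d in hits)

hits = search(docs, "search")
print(facet(hits, "category"))
print(facet(hits, "year"))
```

The engines Rafał compares do the same counting at index scale and add range, date and query facets on top.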

7 Symptoms you are turning into a Hadoop nerd

Tuesday, October 30th, 2012

7 Symptoms you are turning into a Hadoop nerd

Very funny!

Although, as you might imagine, my answer for #2 differs. 😉

Enjoy!

Why I subscribe to the Ann Arbor Chronicle

Tuesday, October 30th, 2012

Why I subscribe to the Ann Arbor Chronicle by Jon Udell.

At one level, Jon’s post describes why he subscribes to a free online newspaper.

At another level, Jon is describing the value-add that makes content so valuable it attracts voluntary support.

The newspaper has built a context for reporting news: each report is situated in the context of prior reports and in a context built from other sources.

As opposed to reporting allegations, rumors or even facts, with no useful context in which to evaluate them.

If you prefer cartoons, visit Dilbert.com. Use the calendar icon to search for: February 7, 1993.

Building superior integrated applications with open source Apache Camel (Webinar)

Tuesday, October 30th, 2012

From the post:

I am scheduled to host a free webinar on building integrated applications using Apache Camel.

Date: November 6th, 2012 (moved due to Hurricane Sandy)
Time: 3:00 PM (Central European Time) – 10:00 AM (EDT)
Duration: 1h15m

This webinar will show you how to build integrated applications with open source Apache Camel. Camel is one of the most frequently downloaded projects, and it is changing the way teams approach integration. The webinar will start with the basics, continue with examples and how to get started, and conclude with a live demo. We will cover:

• Enterprise Integration Patterns
• Domain Specific Languages
• Maven and Eclipse tooling
• Java, Spring, OSGi Blueprint, Scala and Groovy
• Deployment options
• Extending Camel by building custom Components
• Q and A

Before we open for QA at the end of the session, we will share links where you can go and read and learn more about Camel. Don’t miss this informative session!

You can register for the webinar at this link.

Definitely on my list to attend.

You?

Apache Camel 2.11 – Neo4j and more new components

Tuesday, October 30th, 2012

Apache Camel 2.11 – Neo4j and more new components by Claus Ibsen.

From the post:

As usual the Camel community continues to be very active. For the upcoming Camel 2.11 release we have already five new components in the works

All five components were started by members of the community, and not by people from the Camel team. For example, the camel-neo4j and camel-couchdb components were kindly donated to the ASF by Stephen Samuel. Bilgin Ibryam contributed the camel-cmis component, Cedric Vidal donated the camel-elasticsearch component, and Scott Sullivan donated the camel-sjms component.

Just in case you live in a world where Enterprise Integration Patterns are relevant. 😉

If you are not familiar with Camel, see Camel in Action, Chapter 1 (direct link), the free first chapter of the Camel in Action book.

I first saw this at DZone.

Bibliographic Framework Transition Initiative

Tuesday, October 30th, 2012

Bibliographic Framework Transition Initiative

The original announcement for this project lists its requirements but the requirements are not listed on the homepage.

The requirements are found at: The Library of Congress issues its initial plan for its Bibliographic Framework Transition Initiative for dissemination, sharing, and feedback (October 31, 2011). Nothing in the link text says “requirements here” to me.

Requirements as of the original announcement:

Requirements for a New Bibliographic Framework Environment

Although the MARC-based infrastructure is extensive, and MARC has been adapted to changing technologies, a major effort to create a comparable exchange vehicle that is grounded in the current and expected future shape of data interchange is needed. To assure a new environment will allow reuse of valuable data and remain supportive of the current one, in addition to advancing it, the following requirements provide a basis for this work. Discussion with colleagues in the community has informed these requirements for beginning the transition to a "new bibliographic framework". Bibliographic framework is intended to indicate an environment rather than a "format".

• Broad accommodation of content rules and data models. The new environment should be agnostic to cataloging rules, in recognition that different rules are used by different communities, for different aspects of a description, and for descriptions created in different eras, and that some metadata are not rule based. The accommodation of RDA (Resource Description and Access) will be a key factor in the development of elements, as will other mainstream library, archive, and cultural community rules such as Anglo-American Cataloguing Rules, 2nd edition (AACR2) and its predecessors, as well as DACS (Describing Archives, a Content Standard), VRA (Visual Resources Association) Core, CCO (Cataloging Cultural Objects).
• Provision for types of data that logically accompany or support bibliographic description, such as holdings, authority, classification, preservation, technical, rights, and archival metadata. These may be accommodated through linking technological components in a modular way, standard extensions, and other techniques.
• Accommodation of textual data, linked data with URIs instead of text, and both. It is recognized that a variety of environments and systems will exist with different capabilities for communicating and receiving and using textual data and links.
• Consideration of the relationships between and recommendations for communications format tagging, record input conventions, and system storage/manipulation. While these environments tend to blur with today’s technology, a future bibliographic framework is likely to be seen less by catalogers than the current MARC format. Internal storage, displays from communicated data, and input screens are unlikely to have the close relationship to a communication format that they have had in the past.
• Consideration of the needs of all sizes and types of libraries, from small public to large research. The library community is not homogeneous in the functionality needed to support its users in spite of the central role of bibliographic description of resources within cultural institutions. Although the MARC format became a key factor in the development of systems and services, libraries implement services according to the needs of their users and their available resources. The new bibliographic framework will continue to support simpler needs in addition to those of large research libraries.
• Continuation of maintenance of MARC until no longer necessary. It is recognized that systems and services based on the MARC 21 communications record will be an important part of the infrastructure for many years. With library budgets already stretched to cover resource purchases, large system changes are difficult to implement because of the associated costs. With the migration in the near term of a large segment of the library community from AACR to RDA, we will need to have RDA-adapted MARC available. While that need is already being addressed, it is recognized that RDA is still evolving and additional changes may be required. Changes to MARC not associated with RDA should be minimal as the energy of the community focuses on the implementation of RDA and on this initiative.
• Compatibility with MARC-based records. While a new schema for communications could be radically different, it will need to enable use of data currently found in MARC, since redescribing resources will not be feasible. Ideally there would be an option to preserve all data from a MARC record.
• Provision of transformation from MARC 21 to a new bibliographic environment. A key requirement will be software that converts data to be moved from MARC to the new bibliographic framework and back, if possible, in order to enable experimentation, testing, and other activities related to evolution of the environment.

The Library of Congress (LC) and its MARC partners are interested in a deliberate change that allows the community to move into the future with a more robust, open, and extensible carrier for our rich bibliographic data, and one that better accommodates the library community’s new cataloging rules, RDA. The effort will take place in parallel with the maintenance of MARC 21 as new models are tested. It is expected that new systems and services will be developed to help libraries and provide the same cost savings they do today. Sensitivity to the effect of rapid change enables gradual implementation by systems and infrastructures, and preserves compatibility with existing data.

Ongoing discussion at: Bibliographic Framework Transition Initiative Forum, BIBFRAME@LISTSERV.LOC.GOV.

The requirements recognize a future of semantic and technological heterogeneity.

Similar to the semantic and technological heterogeneity we have now and have had in the past.

A warning to those expecting a semantic and technological rapture of homogeneity.

(I first saw this initiative at: NoSQL Bibliographic Records: Implementing a Native FRBR Datastore with Redis.)

NoSQL Bibliographic Records:…

Tuesday, October 30th, 2012

From the background:

Using the Library of Congress Bibliographic Framework for the Digital Age as the starting point for software development requirements; the FRBR-Redis-Datastore project is a proof-of-concept for a next-generation bibliographic NoSQL system within the context of improving upon the current MARC catalog and digital repository of a small academic library at a top-tier liberal arts college.

The FRBR-Redis-Datastore project starts with a basic understanding of the MARC, MODS, and FRBR implemented using a NoSQL technology called Redis.

This presentation guides you through the theories and technologies behind one such proof-of-concept bibliographic framework for the 21st century.

Hadoop was just too complicated compared to the simple three-step Redis server set-up.

Refreshing.

That a technology is popular doesn’t mean it meets your requirements, such as administration by people who are not full-time technical experts.

An Oracle database supports applications that could manage garden club finances but that’s a poor choice under most circumstances.

The Redis part of the presentation is apparently not working (I get Python errors) as of today and I have sent a note with the error messages.

A “proof-of-concept” that merits your attention!

Kepler Telescope Data Release: The Power of Sharing Data

Tuesday, October 30th, 2012

Additional Kepler Data Now Available to All Planet Hunters

From the post:

The Space Telescope Science Institute in Baltimore, Md., is releasing 12 additional months worth of planet-searching data meticulously collected by one of the most prolific planet-hunting endeavors ever conceived, NASA’s Kepler Mission.

As of Oct. 28, 2012, every observation from the extrasolar planet survey made by Kepler since its launch in 2009 through June 27, 2012, is available to scientists and the public. This treasure-trove contains more than 16 terabytes of data and is housed at the Barbara A. Mikulski Archive for Space Telescopes, or MAST, at the Space Telescope Science Institute. MAST is a huge data archive containing astronomical observations from 16 NASA space astronomy missions, including the Hubble Space Telescope. It is named in honor of Maryland U.S. Senator Barbara A. Mikulski.

Over the past three years the Kepler science team has discovered 77 confirmed planets and 2,321 planet candidates. All of Kepler’s upcoming observations will be no longer exclusive to the Kepler science team, its guest observers, and its asteroseismology consortium members and will be available immediately to the public.

…..

In addition to yielding evidence for planets circling some of the target stars, the Kepler data also reveal information about the behavior of many of the other stars being monitored. Kepler astronomers have discovered star spots, flaring stars, double-star systems, and “heartbeat” stars, a class of eccentric binary systems undergoing dynamic tidal distortions and tidally induced pulsations.

There is far more data in the Kepler archives than astronomers have time to analyze quickly. Avid volunteer astronomers are invited to make Kepler discoveries by perusing the archive through a website called “Planet Hunters,” (http://www.planethunters.org/). A tutorial informs citizen scientists how to analyze the Kepler data, so they may assist with the research. Visitors to the website cannot actually see individual planets. Instead, they look for the effects of planets as they sweep across the face of their parent stars. Volunteer scientists have analyzed over 14 million observations so far. Just last week citizen scientists announced the discovery of the first planet to be found in a quadruple-star system.

The additional analysis by volunteer scientists, especially: “the first planet to be found in a quadruple-star system,” illustrates the power of sharing “big data.”
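What the volunteers look for, “the effects of planets as they sweep across the face of their parent stars,” is a small periodic dip in a star’s brightness. Here is a minimal sketch of flagging such dips, assuming a clean synthetic light curve rather than real Kepler data. The function name and the 0.5% depth threshold are my own choices, and real pipelines must cope with noise and stellar variability that this ignores.

```python
from statistics import median

def find_dips(flux, depth=0.005):
    """Return (start, end) index ranges where brightness falls more than
    `depth` (as a fraction) below the median brightness."""
    baseline = median(flux)
    threshold = baseline * (1.0 - depth)
    dips, start = [], None
    for i, value in enumerate(flux):
        if value < threshold and start is None:
            start = i                        # a dip begins
        elif value >= threshold and start is not None:
            dips.append((start, i - 1))      # the dip ended at the previous sample
            start = None
    if start is not None:                    # dip still open at the end of the data
        dips.append((start, len(flux) - 1))
    return dips

# Synthetic light curve, not Kepler data: flat brightness with two 1% dips.
flux = [1.0] * 30
for i in (5, 6, 7, 20, 21, 22):
    flux[i] = 0.99

print(find_dips(flux))   # [(5, 7), (20, 22)]
```

Evenly spaced dips of equal depth are what suggest an orbiting body; the spacing between them gives a candidate orbital period.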

Summary and Links for CAP Articles on IEEE Computer Issue

Tuesday, October 30th, 2012

Summary and Links for CAP Articles on IEEE Computer Issue by Alex Popescu.

From the post:

Daniel Abadi has posted a quick summary of the articles signed by Eric Brewer, Seth Gilbert and Nancy Lynch, Daniel Abadi, Raghu Ramakrishnan, Ken Birman, Daniel Freedman, Qi Huang, and Patrick Dowell for the IEEE Computer issue dedicated to the CAP theorem. Plus links to most of them:

Linear Algebra: As an Introduction to Abstract Mathematics

Monday, October 29th, 2012

Linear Algebra: As an Introduction to Abstract Mathematics by Isaiah Lankham, Bruno Nachtergaele and Anne Schilling.

From the cover page:

Lecture Notes for MAT67, University of California, Davis, written Fall 2007, last updated October 9, 2011.

Organized to teach both the computational and the abstract (proof) sides of linear algebra.

Organizing a topic maps textbook along the same lines, computing (merging) between subject proxies as well as abstract theories of identification, could be quite useful.

I first saw this at Christophe Lalanne’s A bag of tweets / October 2012.

Introduction to Game Theory – Course

Monday, October 29th, 2012

Introduction to Game Theory – Course by John Duffy.

From the description:

This course is an introduction to game theory, the study of strategic behavior among parties having opposed, mixed or similar interests. This course will sharpen your understanding of strategic behavior in encounters with other individuals–modeled as games–and as a participant in broader markets involving many individuals. You will learn how to recognize and model strategic situations, to predict when and how your actions will influence the decisions of others and to exploit strategic situations for your own benefit.

Slides, homework (with answers), pointers to texts.

The notion of “rational” actors in the marketplace took a real blow from Alan Greenspan confessing before Congress that market players had not acted rationally.

Still, a highly developed field that may give you some insight into the “games” you are likely to encounter in social situations.
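For a taste of what modeling a strategic situation looks like, here is a minimal sketch that finds the pure-strategy Nash equilibria of a two-player game by checking every cell for profitable unilateral deviations. The game is a standard prisoner’s dilemma with illustrative payoffs of my choosing, not numbers from the course.

```python
from itertools import product

ACTIONS = ("cooperate", "defect")

# (row player's payoff, column player's payoff); illustrative numbers only.
PAYOFFS = {
    ("cooperate", "cooperate"): (-1, -1),
    ("cooperate", "defect"): (-3, 0),
    ("defect", "cooperate"): (0, -3),
    ("defect", "defect"): (-2, -2),
}

def pure_nash_equilibria(actions, payoffs):
    """Action pairs where neither player gains by deviating alone."""
    equilibria = []
    for row, col in product(actions, repeat=2):
        row_pay, col_pay = payoffs[(row, col)]
        row_ok = all(payoffs[(alt, col)][0] <= row_pay for alt in actions)
        col_ok = all(payoffs[(row, alt)][1] <= col_pay for alt in actions)
        if row_ok and col_ok:
            equilibria.append((row, col))
    return equilibria

print(pure_nash_equilibria(ACTIONS, PAYOFFS))   # [('defect', 'defect')]
```

Both players would do better cooperating, yet the only equilibrium is mutual defection, which is a fair warning about how far “rational” takes you.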

I first saw this at Christophe Lalanne’s A bag of tweets / October 2012.

Interactive data visualization with cranvas

Monday, October 29th, 2012

Interactive data visualization with cranvas by Christophe Lalanne.

From the post:

One of the advantages of R over other popular statistical packages is that it now has “natural” support for interactive and dynamic data visualization. This is, for instance, something that is lacking with the Python ecosystem for scientific computing (Mayavi or Enthought Chaco are just too complex for what I have in mind).

Some time ago, I started drafting some tutors on interactive graphics with R. The idea was merely to give an overview of existing packages for interactive and dynamic plotting, and it was supposed to be a three-part document: first part presents basic capabilities like rgl, aplpack, and iplot (alias Acinonyx)–this actually ended up as a very coarse draft; second part should present ggobi and its R interface; third and last part would be about the Qt interface, with qtpaint and cranvas. I hope I will find some time to finish this project as it might provide useful complements to my introductory statistical course on data visualization and statistics with R.

An update on interactive data visualization with R (cranvas).

High end visualization is within the reach of anyone with perseverance and a computer.

Per Christophe: Sometimes the visualization you need is the digital equivalent of pen and napkin.

Exploring Data in Engineering, the Sciences, and Medicine

Monday, October 29th, 2012

Exploring Data in Engineering, the Sciences, and Medicine by Ronald Pearson.

From the description:

The recent dramatic rise in the number of public datasets available free from the Internet, coupled with the evolution of the Open Source software movement, which makes powerful analysis packages like R freely available, have greatly increased both the range of opportunities for exploratory data analysis and the variety of tools that support this type of analysis.

This book will provide a thorough introduction to a useful subset of these analysis tools, illustrating what they are, what they do, and when and how they fail. Specific topics covered include descriptive characterizations like summary statistics (mean, median, standard deviation, MAD scale estimate), graphical techniques like boxplots and nonparametric density estimates, various forms of regression modeling (standard linear regression models, logistic regression, and highly robust techniques like least trimmed squares), and the recognition and treatment of important data anomalies like outliers and missing data. The unique combination of topics presented in this book separates it from any other book of its kind.

Intended for use as an introductory textbook for an exploratory data analysis course or as self-study companion for professionals and graduate students, this book assumes familiarity with calculus and linear algebra, though no previous exposure to probability or statistics is required. Both simulation-based and real data examples are included, as are end-of-chapter exercises and both R code and datasets.
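The four summary statistics named in the description make a nice small example of “when and how they fail.” The sketch below is in Python rather than the book’s R, uses made-up data with one gross outlier, and takes the MAD scale estimate to be the usual 1.4826 times the median absolute deviation from the median. The function name is hypothetical.

```python
from statistics import mean, median, stdev

def mad_scale(values):
    """MAD scale estimate: 1.4826 * median(|x - median(x)|).
    The constant makes it comparable to the standard deviation for normal data."""
    center = median(values)
    return 1.4826 * median([abs(v - center) for v in values])

# Made-up sample with one gross outlier.
data = [2.0, 3.0, 3.0, 4.0, 5.0, 100.0]

print("mean:  ", mean(data))       # pulled far upward by the outlier
print("median:", median(data))     # barely affected
print("stdev: ", stdev(data))      # inflated by the outlier
print("MAD:   ", mad_scale(data))  # stays near the spread of the bulk of the data
```

The mean and standard deviation describe the outlier more than the data; the median and MAD describe the data despite the outlier.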

I encountered this while reading Characterizing a new dataset by the same author.

If you think of topic maps as a means to capture the results of exploring data sets, then explorations by different explorers, possibly for different reasons, become grist for a topic map mill.

There are no reader reviews at Amazon but I would be happy to correct that. 😉

Characterizing a new dataset

Monday, October 29th, 2012

Characterizing a new dataset by Ronald Pearson.

From the post:

In my last post, I promised a further examination of the spacing measures I described there, and I still promise to do that, but I am changing the order of topics slightly.  So, instead of spacing measures, today’s post is about the DataframeSummary procedure to be included in the ExploringData package, which I also mentioned in my last post and promised to describe later.  My next post will be a special one on Big Data and Data Science, followed by another one about the DataframeSummary procedure (additional features of the procedure and the code used to implement it), after which I will come back to the spacing measures I discussed last time.

A task that arises frequently in exploratory data analysis is the initial characterization of a new dataset.  Ideally, everything we could want to know about a dataset should come from the accompanying metadata, but this is rarely the case.  As I discuss in Chapter 2 of Exploring Data in Engineering, the Sciences, and Medicine, metadata is the available “data about data” that (usually) accompanies a data source.  In practice, however, the available metadata is almost never as complete as we would like, and it is sometimes wrong in important respects.  This is particularly the case when numeric codes are used for missing data, without accompanying notes describing the coding.  An example, illustrating the consequent problem of disguised missing data is described in my paper The Problem of Disguised Missing Data.  (It should be noted that the original source of one of the problems described there – a comment in the UCI Machine Learning Repository header file for the Pima Indians diabetes dataset that there were no missing data records – has since been corrected.)

A rich post on using R to explore data sets.
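
To make the idea of an initial characterization concrete, here is a rough sketch in Python. It is not Ronald's DataframeSummary procedure, which is written in R and which I have not reproduced. The function name, the column names and the records are all hypothetical. The point is only that a per-column count of the most frequent value can expose a numeric code standing in for missing data.

```python
from collections import Counter

def summarize_columns(rows):
    """Per column: record count, explicit missing count, distinct values, most common value."""
    report = {}
    columns = list(rows[0].keys()) if rows else []
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        if present:
            top_value, top_count = Counter(present).most_common(1)[0]
        else:
            top_value, top_count = None, 0
        report[col] = {
            "records": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "top_value": top_value,
            "top_count": top_count,
        }
    return report

# Hypothetical records: a glucose reading of 0 is physically implausible,
# so a run of zeros suggests a code for "not recorded".
rows = [
    {"age": 31, "glucose": 148},
    {"age": 45, "glucose": 0},
    {"age": 29, "glucose": 85},
    {"age": None, "glucose": 0},
    {"age": 52, "glucose": 0},
]

report = summarize_columns(rows)
for column, stats in report.items():
    print(column, stats)
```

The glucose column reports no missing values, yet its most common value is 0. That mismatch is the kind of thing the metadata should have told us and did not.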

The observation that ‘metadata is the available “data about data”’ should remind us that we use subjects to talk about other subjects. There isn’t any place to stand where subjects are not all around us.

Some metadata may be unspoken or missing, as Ronald observes, but that doesn’t make it any less important.

How do you record your discoveries about data sets for future re-use?

Or merge them with discoveries by others about the same data sets?

Top 5 Challenges for Hadoop MapReduce… [But Semantics Isn’t One Of Them]

Monday, October 29th, 2012

Top 5 Challenges for Hadoop MapReduce in the Enterprise

IBM sponsored content at Datanami.com lists these challenges for Hadoop MapReduce in enterprise settings:

• Lack of performance and scalability….
• Lack of flexible resource management….
• Lack of application deployment support….
• Lack of quality of service assurance….
• Lack of multiple data source support….

Who would know enterprise requirements better than IBM? They have been in the enterprise business long enough to be an enterprise themselves.

If IBM says these are the top 5 challenges for Hadoop MapReduce in enterprises, it’s a good list.

But I don’t see “semantics” in that list.

Do you?

Semantics make it possible to combine data from different sources, process it and report a useful answer.

Or rather, understanding data semantics and mapping between them makes a useful answer possible.

Try pushing data from different sources together without understanding and mapping their semantics.

It won’t take long for you to decide which way you prefer.
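
A small sketch of what "mapping their semantics" can mean in practice, in Python. Everything here is hypothetical: the two sources ("crm" and "billing"), their field names, and the shared vocabulary are invented for illustration and have nothing to do with any IBM product.

```python
# Hypothetical mappings from each source's field names to a shared vocabulary.
FIELD_MAPS = {
    "crm": {"cust_id": "customer_id", "nm": "name"},
    "billing": {"CustomerNumber": "customer_id", "RevenueCents": "revenue_usd"},
}

# Hypothetical unit conversions: billing stores cents, the shared field is dollars.
CONVERTERS = {
    ("billing", "RevenueCents"): lambda cents: cents / 100,
}

def normalize(source, record):
    """Rename a record's fields to the shared vocabulary, converting units where needed."""
    mapping = FIELD_MAPS[source]
    normalized = {}
    for field, value in record.items():
        if field not in mapping:
            # Refuse to guess: an unmapped field has unknown semantics.
            raise KeyError(f"no mapping for {source}.{field}")
        convert = CONVERTERS.get((source, field), lambda v: v)
        normalized[mapping[field]] = convert(value)
    return normalized

def merge_by_customer(tagged_records):
    """Combine normalized records from all sources, keyed on customer_id."""
    merged = {}
    for source, record in tagged_records:
        row = normalize(source, record)
        merged.setdefault(row["customer_id"], {}).update(row)
    return merged

merged = merge_by_customer([
    ("crm", {"cust_id": 7, "nm": "Acme"}),
    ("billing", {"CustomerNumber": 7, "RevenueCents": 4550}),
])
print(merged)
```

Without the two mapping tables, nothing tells the code that cust_id and CustomerNumber identify the same subject, or that one revenue figure is in cents. The mappings are also plain data, so they can be saved and reused on the next project.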

If semantics are critical to any data operation, including combining data from diverse sources, why do they get so little attention?

I doubt your IBM representative would know, but you could ask them while trying out the IBM solution to the “top 5 challenges for Hadoop MapReduce”:

How should you discover and then map the semantics of diverse data sources?

Having mapped them once, can you re-use that mapping for future projects with the IBM solution?

LTM – Cheat-Sheet

Sunday, October 28th, 2012

LTM – Cheat-Sheet

My marketing staff advised: “The customer is always right.” 😉

I created this LTM cheat-sheet, based on “The Linear Topic Map Notation, version 1.3,” by Lars Marius Garshol.

Thought it might be of interest.

Update: 3 November 2012. Latest version is LTM — Cheat-Sheet 0.3. Post announcing it: LTM — Cheat-Sheet Update (One update begats another).

1MB Sorting Explained

Sunday, October 28th, 2012

1 MB Sorting Explained by Jeff Preshing.

From the post:

In my previous post, I shared some source code to sort one million 8-digit numbers in 1MB of RAM as an answer to this Stack Overflow question. The program works, but I didn’t explain how, leaving it as a kind of puzzle for the reader.

(image omitted)

I had promised to explain it in a followup post, and in the meantime, there’s been a flurry of discussion in the comments and on Reddit. In particular, commenter Ben Wilhelm (aka ZorbaTHut) already managed to explain most of it (Nice work!), and by now, I think quite a few people already get it. Nonetheless, I’ll write up another explanation as promised.

You may also want to review the answers and comments at Stack Overflow.
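
A back-of-the-envelope check, sketched in Python, shows why the problem is a puzzle in the first place. This is my own arithmetic, not Jeff's code, and it assumes that "1MB" means 2^20 bytes and that duplicate numbers are allowed. It compares storing each number in a fixed-width field against the smallest number of bits that can distinguish every possible sorted list.

```python
import math

COUNT = 1_000_000      # numbers to sort
RANGE = 10**8          # distinct 8-digit values, 00000000 through 99999999
BUDGET_BYTES = 2**20   # assumption: "1MB" means 1 MiB

# Fixed-width storage: every number needs enough bits to cover the range.
bits_per_number = math.ceil(math.log2(RANGE))
naive_bytes = COUNT * bits_per_number / 8

def log2_multisets(n_values, n_items):
    """log2 of C(n_values + n_items - 1, n_items), computed with log-gamma."""
    natural_log = (math.lgamma(n_values + n_items)
                   - math.lgamma(n_items + 1)
                   - math.lgamma(n_values))
    return natural_log / math.log(2)

# A sorted list with duplicates is a multiset, so this is the minimum size
# of any encoding that can represent every possible input.
min_bytes = log2_multisets(RANGE, COUNT) / 8

print(f"bits per number (fixed width): {bits_per_number}")
print(f"fixed-width storage: {naive_bytes:,.0f} bytes")
print(f"lower bound: {min_bytes:,.0f} bytes")
print(f"budget: {BUDGET_BYTES:,} bytes")
```

Fixed-width storage needs more than three times the budget, while the lower bound comes out a little over one million bytes, just under 2^20. So the numbers can fit, but only in a compressed form with very little room to spare.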

Sorting is one of those fundamental operations you will encounter time and again.

Even in topic maps.

Acknowledging Errors in Data Quality

Sunday, October 28th, 2012

Acknowledging Errors in Data Quality by Jim Harris.

From the post:

The availability heuristic is a mental shortcut that occurs when people make judgments based on the ease with which examples come to mind. Although this heuristic can be beneficial, such as when it helps us recall examples of a dangerous activity to avoid, sometimes it leads to availability bias, where we’re affected more strongly by the ease of retrieval than by the content retrieved.

In his thought-provoking book “Thinking, Fast and Slow,” Daniel Kahneman explained how availability bias works by recounting an experiment where different groups of college students were asked to rate a course they had taken the previous semester by listing ways to improve the course — while varying the number of improvements that different groups were required to list.

Jim applies the result of Kahneman’s experiment to data quality issues and concludes:

• Isolated errors – Management chooses one-time data cleaning projects.
• Ten errors – Management concludes overall data quality must not be too bad (availability heuristic).

I need to re-read Kahneman but have you seen suggestions for overcoming the availability heuristic?

Rethinking the Basics of Financial Reporting

Sunday, October 28th, 2012

Rethinking the Basics of Financial Reporting by Timo Elliott.

From the post:

The chart of accounts is one of the fundamental building blocks of finance – but it’s time to rethink it from scratch.

To organize corporate finances and track financial health, traditional financial systems typically use complex, rigid general ledger structures. The result is painful, unwieldy systems that are not agile enough to support the requirements of modern finance.

In the financial engines of the future, rigid “code block” architectures are eliminated, replaced by flexible in-memory structures. The result is a dramatic increase in the flexibility and speed general ledger entries can be stored and retrieved. Organizations can vastly simplify their chart of accounts and minimize or eliminate time-consuming and complex reconciliation, while retaining virtually unlimited flexibility to report on any business dimension they choose.

Timo’s point is quite sound.

Except that we all face the same “…complex, rigid general ledger structures.” None of us has an advantage over another in that regard.

Once we have the information in hand, we can and do create more useful representations of the same data, but current practice gets everyone off to an even start.

Or evenly disadvantaged if you prefer.

As regulators start to demand reporting that takes advantage of modern information techniques, how will “equality” of access be defined?