Archive for June, 2011

Enterprise Federated Query is a hoax!

Thursday, June 30th, 2011

Enterprise Federated Query is a hoax! by Enoch Moses, January 9, 2008.

Everyone uses the federated query example as a great application in the Service Oriented Architecture (SOA) paradigm. Via SOA paradigm, the application developers can integrate their client application to numerous services and this will allow the client application users to search and browse various data sources. This sounds great in theory however an enterprise federated query application is not possible. Before we look into why an enterprise federated query application will not be a reality, we need to understand what is a federated query application (fqa).

Problems listed:

  • High number or infinity number of data sources
  • Security
  • Data Source variation
  • Result set Aggregation
  • Governance

It has been a little over three (3) years since that post.

True? Still true? Ever true? True then but not now? True then but not in the near future (our next IPO)?

Guide to Programming in Clojure for Beginners

Thursday, June 30th, 2011

Guide to Programming in Clojure for Beginners

Now there’s a learn-a-programming-language exercise! Write a blogging platform in it!

Any Clojure experts want to venture an evaluation?

pyblueprints 0.1

Thursday, June 30th, 2011

pyblueprints 0.1

From the webpage:

Following the set of interfaces provided by tinkerpop for Blueprints, this project aims to give Python developers a similar functionality. A set of abstract classes are defined in order to guide the design of implementations for the different graph database engines.

If you are not already familiar with Blueprints, a common API for graph databases, you might want to start at TinkerPop. Or you can jump directly to Blueprints if you are checking a detail or want to help with the code.

From the Blueprints webpage:

Blueprints is a property graph model interface. It provides implementations, test suites, and supporting extensions. Graph databases and frameworks that implement the Blueprints interfaces automatically support Blueprints-enabled applications. Likewise, Blueprints-enabled applications can plug-and-play different Blueprints-enabled graph backends.
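As a rough illustration of what “a set of abstract classes” for a property graph could look like in Python, here is a minimal sketch. The names used (Graph, InMemoryGraph, add_vertex, add_edge, out_neighbors, friends_of) are invented for this example and are not taken from pyblueprints or Blueprints; consult the projects for their real interfaces.

```python
from abc import ABC, abstractmethod


class Graph(ABC):
    """Hypothetical property-graph interface (not the pyblueprints API)."""

    @abstractmethod
    def add_vertex(self, vertex_id, **properties):
        """Create a vertex carrying key/value properties."""

    @abstractmethod
    def add_edge(self, out_id, in_id, label, **properties):
        """Create a labeled, directed edge carrying properties."""

    @abstractmethod
    def out_neighbors(self, vertex_id, label=None):
        """Return ids reachable over outgoing edges, optionally by label."""


class InMemoryGraph(Graph):
    """Toy backend; a real one would wrap a graph database engine."""

    def __init__(self):
        self.vertices = {}   # vertex id -> property dict
        self.edges = []      # (out id, in id, label, property dict)

    def add_vertex(self, vertex_id, **properties):
        self.vertices[vertex_id] = dict(properties)

    def add_edge(self, out_id, in_id, label, **properties):
        if out_id not in self.vertices or in_id not in self.vertices:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.append((out_id, in_id, label, dict(properties)))

    def out_neighbors(self, vertex_id, label=None):
        return [i for (o, i, l, _) in self.edges
                if o == vertex_id and (label is None or l == label)]


def friends_of(graph, vertex_id):
    """Application code written against the interface, not a backend."""
    return graph.out_neighbors(vertex_id, label="knows")


g = InMemoryGraph()
g.add_vertex("marko", age=29)
g.add_vertex("josh", age=32)
g.add_edge("marko", "josh", "knows", weight=1.0)
```

Because friends_of only touches the abstract interface, swapping InMemoryGraph for another backend that implements the same three methods would not change the application code, which is the plug-and-play idea the Blueprints page describes.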

MayBMS – A Probabilistic Database Management System

Thursday, June 30th, 2011

MayBMS – A Probabilistic Database Management System

From the homepage:

MayBMS is a state-of-the-art probabilistic database management system developed as an extension of the Postgres server backend (download).

The MayBMS project is founded on the thesis that a principled effort to use and extend mature relational database technology will be essential for creating robust and scalable systems for managing and querying large uncertain datasets.

MayBMS stands alone as a complete probabilistic database management system that supports a very powerful, compositional query language (examples) for which nevertheless worst-case efficiency and result quality guarantees can be made. Central to this is our choice of essentially using probabilistic versions of conditional tables as the representation system, but in a form engineered for admitting the efficient evaluation and automatic optimization of most operations of our language using robust and mature relational database technology.

Another probabilistic system.

I wonder about the consistency leg of CAP as a database principle. Is it a database principle only because we have had such locally located and small data sets that consistency was possible?

Think about any of the sensor arrays and memory banks located light seconds or even minutes away from data stores on Earth. As a practical matter they are always inconsistent with Earth bound data stores. Physical remoteness is the cause of inconsistency in that case. But what of something as simple as not all data having first priority for processing? Or varying priorities for processing depending upon system load? Or even analysis or processing of data that causes a lag between the states of data at different locations?

I’m not suggesting the usual cop-out of eventual consistency because the data may never be consistent. At least in the sense that we use the term for a database located on a single machine or local cluster. We may have to ask, “How consistent do you want the data to be upon delivery?,” knowing the data may be inconsistent on delivery with other data already in existence.

Faceting Module for Lucene!

Thursday, June 30th, 2011

Faceting Module for Lucene!

Reading the log for this issue is an education on how open source projects proceed at their best.

Oh, worth reading about the faceting aspects that you want to include in a topic map or other application as well.

Providing and discovering definitions of URIs

Wednesday, June 29th, 2011

Providing and discovering definitions of URIs by Jonathan A. Rees.

Abstract:

The specification governing Uniform Resource Identifiers (URIs) [rfc3986] allows URIs to mean anything at all, and this unbounded flexibility is exploited in a variety of contexts, notably the Semantic Web and Linked Data. To use a URI to mean something, an agent (a) selects a URI, (b) provides a definition of the URI in a manner that permits discovery by agents who encounter the URI, and (c) uses the URI. Subsequently other agents may not only understand the URI (by discovering and consulting the definition) but may also use the URI themselves.

A few widely known methods are in use to help agents provide and discover URI definitions, including RDF fragment identifier resolution and the HTTP 303 redirect. Difficulties in using these methods have led to a search for new methods that are easier to deploy, and perform better, than the established ones. However, some of the proposed methods introduce new problems, such as incompatible changes to the way metadata is written. This report brings together in one place information on current and proposed practices, with analysis of benefits and shortcomings of each.

The purpose of this report is not to make recommendations but rather to initiate a discussion that might lead to consensus on the use of current and/or new methods.

The criteria for success:

  1. Simple. Having too many options or too many things to remember makes discovery fragile and impedes uptake.
  2. Easy to deploy on Web hosting services. Uptake of linked data depends on the technology being accessible to as many Web publishers as possible, so should not require control over Web server behavior that is not provided by typical hosting services.
  3. Easy to deploy using existing Web client stacks. Discovery should employ a widely deployed network protocol in order to avoid the need to deploy new protocol stacks.
  4. Efficient. Accessing a definition should require at most one network round trip, and definitions should be cacheable.
  5. Browser-friendly. It should be possible to configure a URI that has a discoverable definition so that ‘browsing’ to it yields information useful to a human.
  6. Compatible with Web architecture. A URI should have a single agreed meaning globally, whether it’s used as a protocol element, hyperlink, or name.
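To make the “at most one network round trip” criterion concrete, here is a small sketch of how a client might plan a lookup under the two established methods the abstract names. The helper discovery_plan and the example.org URIs are invented for illustration, and no network access is performed.

```python
from urllib.parse import urldefrag


def discovery_plan(uri):
    """Return (document_uri, method, round_trips) for a URI.

    Hypothetical helper: it only models the two established methods
    named in the abstract and performs no network access.
    """
    document, fragment = urldefrag(uri)
    if fragment:
        # Fragments are never sent to the server: the client fetches the
        # document once and looks for the fragment inside it.
        return document, "fragment", 1
    # Without a fragment, a server following the 303 pattern answers the
    # first request with a redirect, so a second request is needed.
    return uri, "303 redirect", 2


hash_plan = discovery_plan("http://example.org/vocab#Person")
slash_plan = discovery_plan("http://example.org/id/Person")
```

Under these assumptions the fragment style meets criterion 4 and the 303 style does not, which is one of the trade-offs the report weighs.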


I had to look it up to get the page number but I remembered Karl Wiegers in Software Requirements saying:

Feasible

It must be possible to implement each requirement within the known capabilities and limitations of the system and its environment.

The requirement of a “single agreed meaning globally, whether it’s used as a protocol element, hyperlink, or name” is not feasible. It will stymie this project, despite the array of talent on hand, until it is no longer a requirement.

Need proof? Name one URI with a single agreed meaning globally, whether it’s used as a protocol element, hyperlink, or name.

Not one that the W3C TAG, or TBL or anyone else thinks/wants/prays has a single agreed meaning globally, … but one that in fact has such a global meaning.

It’s been more than ten years. Let’s drop the last requirement and let the rather talented group working on this come up with a solution that meets the other five (5) requirements.

It won’t be a universal solution but then neither is the WWW.

Topic Modeling Sarah Palin’s Emails

Wednesday, June 29th, 2011

Topic Modeling Sarah Palin’s Emails from Edwin Chen.

From the post:

LDA-based Email Browser

Earlier this month, several thousand emails from Sarah Palin’s time as governor of Alaska were released. The emails weren’t organized in any fashion, though, so to make them easier to browse, I did some topic modeling (in particular, using latent Dirichlet allocation) to separate the documents into different groups.

Interesting analysis and promise of more to follow.

With a US presidential election next year, there is little doubt there will be friendly as well as hostile floods of documents.

Time to sharpen your data extraction tools.

Path Finding with Neo4j

Wednesday, June 29th, 2011

Path Finding with Neo4j by Josh Adell.

From the post:

In my previous post I talked about graphing databases (Neo4j in particular) and how they can be applied to certain classes of problems where data may have multiple degrees of separation in their relationships.

The thing that makes graphing databases useful is the ability to find relationship paths from one node to another. There are many algorithms for finding paths efficiently, depending on the use case.
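For readers who have not met path finding before, here is a minimal sketch of one of the simplest algorithms, breadth-first search, which finds a path with the fewest hops in an unweighted graph. It is plain Python over a dictionary, not the Neo4j API, and the “knows” data is made up.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search over an adjacency dict; returns a node list or None."""
    if start == goal:
        return [start]
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in previous:
                continue
            previous[neighbor] = node
            if neighbor == goal:
                # Walk the predecessor links back to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Made-up "knows" relationships, stored as directed edges.
knows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": ["erin"],
}
```

Calling shortest_path(knows, "alice", "erin") walks three relationships, which is the “multiple degrees of separation” case the post describes. Weighted graphs would call for a different algorithm such as Dijkstra’s.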

When they say “multiple degrees of separation in their relationships” that sounds a lot like topic maps to me. Or at least topic maps in some use cases I should say.

Enjoy the post and what I anticipate to follow.

LarKC: The Large Knowledge Collider

Wednesday, June 29th, 2011

LarKC: The Large Knowledge Collider

A tweet about a video on LarKC sent me looking for the project. From the webpage:

The aim of the EU FP 7 Large-Scale Integrating Project LarKC is to develop the Large Knowledge Collider (LarKC, for short, pronounced “lark”), a platform for massive distributed incomplete reasoning that will remove the scalability barriers of currently existing reasoning systems for the Semantic Web.

This will be achieved by:

  • Enriching the current logic-based Semantic Web reasoning methods with methods from information retrieval, machine learning, information theory, databases, and probabilistic reasoning,
  • Employing cognitively inspired approaches and techniques such as spreading activation, focus of attention, reinforcement, habituation, relevance reasoning, and bounded rationality.
  • Building a distributed reasoning platform and realizing it both on a high-performance computing cluster and via “computing at home”.

I was listening to the video while writing this post, but did I hear correctly that data would have to be transformed into a uniform format or vocabulary? The video is at http://videolectures.net/larkcag09_vanharmelen_llkc/; try around time mark 12:00 and following.

I also noticed on the project homepage:


Start: 01-April-08
End: 30-Sep-11
Duration 42 months

So, what happens to LarKC on 1-Oct-11?

NoSQL should be in your business…

Wednesday, June 29th, 2011

NoSQL should be in your business, and MongoDB could lead the way by Savio Rodrigues.

From the post:

NoSQL is still not well understood, as a term or a database market category, by IT decision makers. However, one NoSQL vendor — 10gen, creators of the open source MongoDB — appears to be growing into enterprise accounts and distancing itself from competitors. If you’re considering, or curious about, NoSQL databases, I recommend you spend some time looking at MongoDB.

One important fact is that the demand for Mongo and MongoDB skills is getting larger.

If you are looking for more information on MongoDB, check out www.mongodb.org and www.10gen.com but in particular see: www.10gen.com/presentations.

I saw the notice about the videos in an email alert so had to delete all the tracking URL crap, then delete the path to the specific video with all its tracking crap, then I was able to give you a link to the page with the videos so you could make your own choice. Less tracking, more choice. That sounds like a better plan.

R2R Framework

Wednesday, June 29th, 2011

R2R Framework

The R2R Framework is used by the LDIF – Linked Data Integration Framework.

The R2R User Manual contains the specification and will likely be of more use than the website.

LDIF – Linked Data Integration Framework

Wednesday, June 29th, 2011

LDIF – Linked Data Integration Framework 0.1

From the webpage:

The Web of Linked Data grows rapidly and contains data from a wide range of different domains, including life science data, geographic data, government data, library and media data, as well as cross-domain datasets such as DBpedia or Freebase. Linked Data applications that want to consume data from this global data space face the challenges that:

  1. data sources use a wide range of different RDF vocabularies to represent data about the same type of entity.
  2. the same real-world entity, for instance a person or a place, is identified with different URIs within different data sources.

This usage of different vocabularies as well as the usage of URI aliases makes it very cumbersome for an application developer to write SPARQL queries against Web data which originates from multiple sources. In order to ease using Web data in the application context, it is thus advisable to translate data to a single target vocabulary (vocabulary mapping) and to replace URI aliases with a single target URI on the client side (identity resolution), before starting to ask SPARQL queries against the data.

Up-till-now, there have not been any integrated tools that help application developers with these tasks. With LDIF, we try to fill this gap and provide an initial alpha version of an open-source Linked Data Integration Framework that can be used by Linked Data applications to translate Web data and normalize URI aliases.
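The two steps named above, vocabulary mapping and identity resolution, can be sketched over plain tuples. This is only an illustration of the idea: the URIs, the two lookup tables, and the normalize function are invented here and have nothing to do with how LDIF or R2R are actually configured.

```python
# All URIs below are invented for illustration.
VOCAB_MAP = {
    "http://a.example/ns#fullName": "http://target.example/ns#name",
    "http://b.example/terms/label": "http://target.example/ns#name",
}
URI_ALIASES = {
    "http://b.example/people/42": "http://a.example/person/jdoe",
}


def normalize(triples, vocab_map, aliases):
    """Rewrite predicates to the target vocabulary and collapse URI aliases."""
    result = []
    for s, p, o in triples:
        result.append((aliases.get(s, s), vocab_map.get(p, p), aliases.get(o, o)))
    return result


source = [
    ("http://a.example/person/jdoe", "http://a.example/ns#fullName", "J. Doe"),
    ("http://b.example/people/42", "http://b.example/terms/label", "Jane Doe"),
]
integrated = normalize(source, VOCAB_MAP, URI_ALIASES)
```

After the rewrite both statements share one subject URI and one predicate, so a single query pattern matches both. Note that the original predicates and the alias URI no longer appear in the output.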

More comments will follow, but…

Isn’t this the reverse of the well-known synonym table in IR?

Instead of substituting synonyms in the query expression, the underlying data is being transformed to produce…a lack of synonyms?

No, not the reverse of a synonym table. In synonym table terms, we would lose the synonym table and transform the underlying textual data to use only a single term where before there were N terms, all of which occurred in the synonym table.

If I search for a term previously listed in the synonym table, but one replaced by a common term, my search result will be empty.

No more synonyms? That sounds like a bad plan to me.
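A toy comparison makes the difference visible. The documents, the synonym table, and the function names below are all made up; the sketch contrasts query-time synonym expansion with rewriting the data to a single term.

```python
SYNONYMS = {"car": {"car", "automobile", "auto"}}  # made-up synonym table

DOCS = {
    1: "used automobile for sale",
    2: "car rental downtown",
    3: "auto parts store",
}


def search(docs, terms):
    """Return ids of documents containing any of the given terms."""
    return sorted(i for i, text in docs.items() if set(text.split()) & set(terms))


def expand(term, synonyms):
    """Query-time expansion: look the term up in every synonym set."""
    for group in synonyms.values():
        if term in group:
            return group
    return {term}


def normalize(docs, synonyms):
    """Data-side rewrite: replace every synonym with its single target term."""
    to_target = {s: target for target, group in synonyms.items() for s in group}
    return {i: " ".join(to_target.get(w, w) for w in text.split())
            for i, text in docs.items()}


expanded_hits = search(DOCS, expand("automobile", SYNONYMS))
normalized_hits = search(normalize(DOCS, SYNONYMS), {"automobile"})
```

With the synonym table in place, a search for “automobile” finds all three documents. Once the data has been rewritten to use only “car” and the table is gone, the same search comes back empty.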

Marketing What Users Want…And An Example

Tuesday, June 28th, 2011

Dick Weisinger in Information Overload: The Data Management Challenge cites the following numbers from a data management survey:

  • 36 percent of organizations say that email overload is their biggest data management problem
  • 28 percent say that document and content management is their biggest issue
  • 15 percent cite information access controls
  • 13 percent point to compliance issues that they must deal with
  • 8 percent say that social media is an area that causes them headaches

Hmmm, not even one percent (1%) said semantic integration was an issue.

Maybe they haven’t heard that semantic integration is all the rage in IT circles? Don’t they read Wired or Scientific American?

I am sure most of them do. Probably the same percentage as you would find at a semantic technology conference.

The difference is they are facing specific problems in an enterprise context. Problems for which they need solutions they can sell to their management as cost effective and doable. By yesterday. The full generality of semantic integration makes nice weekend reading but their management won’t sit long for it on the following Monday.

I am not warranting the following example is feasible or even useful but pose it as a thought experiment.

Assume management agrees email overload is a serious problem and suspects it stems from too many cc’s on posts. There are any number of ways to track such posts but let me outline a topic map solution.

First, create a topic map of the organizational structure, along with approval and informational relationships. This could become more fine grained but for purposes of illustration let’s start with those two relationships. The email addresses for the various actors are included for each person.

Second, since IT runs the SMTP servers that process all the email sent by employees, a copy of every message is stored with associations in the topic map between sender and its recipient(s).

Third, after a month, a graphical map is presented to management showing emails inside/outside of approval/informational paths, along with senders and recipients of those posts.

Fourth, I would suggest discovering what functions are being performed by the targets of large numbers of out-of-band posts. They may be informal information hubs who need more formalized roles or greater responsibility. Or the approval/informational structures need revising.

Fifth, for the truly bold, the IT department can filter email to decision makers to allow only a restricted set of staff to reach them by email, thereby reducing their information load from intra- as well as inter-company email. I am sure they will be thankful to not have to set up their own email filters.
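In keeping with the thought experiment, the first four steps can be sketched with ordinary data structures standing in for the topic map. The people, the relationship sets, and the message log are all hypothetical, and a real topic map would model these as topics and associations rather than Python sets.

```python
from collections import Counter

# Hypothetical organization: (sender, recipient) pairs for each relationship.
APPROVAL = {("ann", "bob"), ("bob", "cho")}
INFORMATIONAL = {("ann", "cho")}

# Hypothetical message log: (sender, [recipients]) as copied from the SMTP server.
MESSAGES = [
    ("ann", ["bob", "cho"]),
    ("bob", ["cho", "dee"]),
    ("cho", ["dee"]),
    ("ann", ["dee"]),
]


def classify(messages, approval, informational):
    """Count sender/recipient pairs by path type and tally out-of-band targets."""
    by_path = Counter()
    out_of_band_targets = Counter()
    for sender, recipients in messages:
        for recipient in recipients:
            pair = (sender, recipient)
            if pair in approval:
                by_path["approval"] += 1
            elif pair in informational:
                by_path["informational"] += 1
            else:
                by_path["out_of_band"] += 1
                out_of_band_targets[recipient] += 1
    return by_path, out_of_band_targets


by_path, hubs = classify(MESSAGES, APPROVAL, INFORMATIONAL)
```

In this made-up log, “dee” receives every out-of-band message, which makes her the kind of informal hub the fourth step asks about.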

You don’t have to use “topic map,” or “semantic integration,” or other such buzz words in selling such a solution. You can insert those as appropriate in the email story at your next semantic integration conference presentation.

Explore the Marvel Universe Social Graph

Tuesday, June 28th, 2011

Explore the Marvel Universe Social Graph

From the post (but be sure to see the images):

From Friday evening to Sunday afternoon, Kai Chang, Tom Turner, and Jefferson Braswell were tuning their visualizations and had a lot of fun exploring Spiderman or Captain America ego network. They came with these beautiful snapshots and created a zoomable web version using the Seadragon plugin. They won the “Most aesthetically pleasing visualization” category, congratulations to Kai, Tom and Jefferson for their amazing work!

The datasets have been added to the wiki Datasets page, so you can play with it and maybe calculate some metrics like centrality on the network. The graph is pretty large, so be sure to increase your Gephi memory settings with > 2GB.
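If “ego network” and “centrality” are new terms, the following sketch shows both on a handful of made-up edges. It uses plain Python rather than Gephi, the edge list is invented and is not the Marvel dataset, and degree centrality is only one of several centrality measures.

```python
from itertools import combinations

# Made-up co-appearance edges, not taken from the real dataset.
EDGES = [
    ("Spider-Man", "Captain America"),
    ("Spider-Man", "Iron Man"),
    ("Captain America", "Iron Man"),
    ("Captain America", "Thor"),
    ("Thor", "Hulk"),
]


def adjacency(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def degree_centrality(adj):
    """Degree divided by the maximum possible degree (n - 1)."""
    n = len(adj)
    return {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}


def ego_network(adj, ego):
    """Edges among the ego node and its direct neighbors."""
    members = adj[ego] | {ego}
    return sorted((a, b) for a, b in combinations(sorted(members), 2) if b in adj[a])


ADJ = adjacency(EDGES)
CENTRALITY = degree_centrality(ADJ)
```

An ego network is simply one node, its direct neighbors, and the edges among them, so ego_network(ADJ, "Spider-Man") returns the triangle linking Spider-Man, Captain America and Iron Man.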

I am sure the Marvel Comic graph is a lot more amusing but I can’t help but wonder about ego networks that combined:

  • Lobbyists registered with the US government
  • Elected and appointed officials and their staffs, plus staff’s families
  • Washington social calendar reports
  • Political donations
  • The Green Book

Topic maps could play a role in layering contracts, legislation and other matters onto the various ego networks.

Neo4j 1.4 M05 “Kiruna Stol”

Tuesday, June 28th, 2011

Neo4j 1.4 M05 “Kiruna Stol” – MidSummer Celebration

From the post:

Extending the festive atmosphere of Midsummer here in Sweden (though sadly not the copious amounts of beer and strawberries), we’re releasing the final milestone build of Neo4j 1.4. The celebration includes: Auto Indexing, neat new features to the REST API, even cooler Cypher query language features, and a bunch of performance improvements. We’ve also rid ourselves of the 3rd-party service wrapper code (yay!) that caused us and our fellow (mostly Mac) users in the community so much anguish!

Hooray!

Big Data Genomics – How to efficiently store and retrieve mutation

Tuesday, June 28th, 2011

Big Data Genomics – How to efficiently store and retrieve mutation data by David Suvee.

About the post:

This blog post is the first one in a series of articles that describe the use of NoSQL databases to efficiently store and retrieve mutation data. Part one introduces the notion of mutation data and describes the conceptual use of the Cassandra NoSQL datastore.

From the post:

The only way to learn a new technology is by putting it into practice. Just try to find a suitable use case in your immediate working environment and give it go. In my case, it was trying to efficiently store and retrieve mutation data through a variety of NoSQL data stores, including Cassandra, MongoDB and Neo4J.

Promises to be an interesting series of posts that focus on a common data set and problem!

GoldenOrb – Released

Tuesday, June 28th, 2011

GoldenOrb

From the webpage:

GoldenOrb is a cloud-based open source project for massive-scale graph analysis, built upon best-of-breed software from the Apache Hadoop project modeled after Google’s Pregel architecture. Our goal is to foster solutions to complex data problems, remove limits to innovation and contribute to the emerging ecosystem that spans all aspects of big data analysis.

Anticipated for some time, see: Beyond MapReduce – Large Scale Graph Processing With GoldenOrb.

Get started with Hadoop…

Tuesday, June 28th, 2011

Get started with Hadoop: From evaluation to your first production cluster by Brett Sheppard.

From the introduction:

This piece provides tips, cautions and best practices for an organization that would like to evaluate Hadoop and deploy an initial cluster. It focuses on the Hadoop Distributed File System (HDFS) and MapReduce. If you are looking for details on Hive, Pig or related projects and tools, you will be disappointed in this specific article, but I do provide links for where you can find more information.

Highly recommended!

HortonWorks

Tuesday, June 28th, 2011

I read in Alex Popescu’s myNoSQL post, Yahoo Launches Hadoop Spinoff, which pointed to GigaOm’s Exclusive: Yahoo launching Hadoop spinoff this week (Dick Harris), that the announcement is anticipated Tuesday (June 28th) or at the Hadoop Summit 2011 on Wednesday, June 29th.

Informational only, no release at this point.

Mapreduce & Hadoop Algorithms in Academic Papers (4th update)

Tuesday, June 28th, 2011

Mapreduce & Hadoop Algorithms in Academic Papers (4th update – May 2011)

From the post:

It’s been a year since I updated the mapreduce algorithms posting last time, and it has been truly an excellent year for mapreduce and hadoop – the number of commercial vendors supporting it has multiplied, e.g. with 5 announcements at EMC World only last week (Greenplum, Mellanox, Datastax, NetApp, and Snaplogic) and today’s Datameer funding announcement , which benefits the mapreduce and hadoop ecosystem as a whole (even for small fish like us here in Atbrox). The work-horse in mapreduce is the algorithm, this update has added 35 new papers compared to the prior posting, new ones are marked with *. I’ve also added 2 new categories since the last update – astronomy and social networking.

Spark – Lightning-Fast Cluster Computing

Monday, June 27th, 2011

Spark – Lightning-Fast Cluster Computing

From the webpage:

What is Spark?

Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark provides primitives for in-memory cluster computing: your job can load data into memory and query it repeatedly much quicker than with disk-based systems like Hadoop MapReduce.

To make programming faster, Spark integrates into the Scala language, letting you manipulate distributed datasets like local collections. You can also use Spark interactively to query big data from the Scala interpreter.

What can it do?

Spark was initially developed for two applications where keeping data in memory helps: iterative algorithms, which are common in machine learning, and interactive data mining. In both cases, Spark can outperform Hadoop by 30x. However, you can use Spark’s convenient API for general data processing too. Check out our example jobs.

Spark runs on the Mesos cluster manager, so it can coexist with Hadoop and other systems. It can read any data source supported by Hadoop.

Who uses it?

Spark was developed in the UC Berkeley AMP Lab. It’s used by several groups of researchers at Berkeley to run large-scale applications such as spam filtering, natural language processing and road traffic prediction. It’s also used to accelerate data analytics at Conviva. Spark is open source under a BSD license, so download it to check it out!

Hadoop must be doing something right to be treated as the solution to beat.

Still, depending on your requirements, Spark definitely merits your consideration.

Introduction to Cypher

Monday, June 27th, 2011

Introduction to Cypher

From the webpage:

Michael Hunger introduces basic graph queries on a movie dataset using Neo4j’s Cypher language.

Short but impressive! Show this one to anyone you want to convince about using graph databases.

TinySearchEngine

Monday, June 27th, 2011

TinySearchEngine

A search engine written in 30 lines of Scala.

Features:

  • in-memory index
  • norms and IDF calculated online
  • default OR operator between query terms
  • index a document per line from a single file
  • read stopwords from a file
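The feature list is short enough to sketch in Python as well. This is not a port of the Scala code: the class name TinyIndex and the scoring formula (term frequency times IDF times a length norm) are assumptions made for this example. Lines and stopwords are passed as any iterable, so an open file object works.

```python
import math
from collections import Counter, defaultdict


class TinyIndex:
    """In-memory inverted index: one document per line, OR between query terms."""

    def __init__(self, stopwords=()):
        self.stopwords = set(stopwords)
        self.postings = defaultdict(dict)  # term -> {doc id: term frequency}
        self.docs = []

    def tokens(self, text):
        return [t for t in text.lower().split() if t not in self.stopwords]

    def add_lines(self, lines):
        """Index any iterable of lines, e.g. an open file object."""
        for line in lines:
            doc_id = len(self.docs)
            self.docs.append(line.strip())
            for term, tf in Counter(self.tokens(line)).items():
                self.postings[term][doc_id] = tf

    def idf(self, term):
        # Computed at query time from the current index ("online").
        df = len(self.postings.get(term, ()))
        return math.log(1 + len(self.docs) / df) if df else 0.0

    def norm(self, doc_id):
        # Length normalization, also computed at query time.
        return 1 / math.sqrt(len(self.tokens(self.docs[doc_id])) or 1)

    def search(self, query):
        """Score every document matching at least one query term (OR)."""
        scores = Counter()
        for term in self.tokens(query):
            for doc_id, tf in self.postings.get(term, {}).items():
                scores[doc_id] += tf * self.idf(term) * self.norm(doc_id)
        return [doc_id for doc_id, _ in scores.most_common()]


index = TinyIndex(stopwords=["the", "a", "in"])
index.add_lines(["the quick brown fox", "a lazy dog", "the dog in the fog"])
```

A query such as index.search("fox dog") returns every document containing either term, with the rarer term “fox” weighted more heavily by IDF.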

FPGA Based Face Detection System Using Haar Classifiers

Monday, June 27th, 2011

FPGA Based Face Detection System Using Haar Classifiers

From the abstract:

This paper presents a hardware architecture for face detection based system on AdaBoost algorithm using Haar features. We describe the hardware design techniques including image scaling, integral image generation, pipelined processing as well as classifier, and parallel processing multiple classifiers to accelerate the processing speed of the face detection system. Also we discuss the optimization of the proposed architecture which can be scalable for configurable devices with variable resources. The proposed architecture for face detection has been designed using Verilog HDL and implemented in Xilinx Virtex-5 FPGA. Its performance has been measured and compared with an equivalent software implementation. We show about 35 times increase of system performance over the equivalent software implementation.
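The “integral image generation” step mentioned in the abstract is what makes Haar features cheap to evaluate: once the table is built, the sum over any rectangle takes four lookups. Here is a minimal software sketch, with a made-up 3×4 image and function names chosen for this example; it says nothing about how the paper’s hardware pipeline is organized.

```python
def integral_image(image):
    """Summed-area table with an extra zero row and column."""
    rows, cols = len(image), len(image[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            ii[r + 1][c + 1] = image[r][c] + ii[r][c + 1] + ii[r + 1][c] - ii[r][c]
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of pixels in a rectangle using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_feature(ii, top, left, height, width):
    """Haar-like edge feature: left half minus right half (width must be even)."""
    half = width // 2
    return (rect_sum(ii, top, left, height, half)
            - rect_sum(ii, top, left + half, height, half))


# Made-up image: bright on the left, dark on the right.
IMAGE = [
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [9, 9, 1, 1],
]
II = integral_image(IMAGE)
```

A large value from two_rect_feature signals a vertical edge between a bright and a dark region, which is the kind of weak evidence AdaBoost combines into a classifier.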

Of interest for topic map applications designed to associate data with particular individuals.

The page offers useful links to other face recognition material.

Scalable Query Processing in Probabilistic Databases

Monday, June 27th, 2011

Scalable Query Processing in Probabilistic Databases

From the webpage:

Today, uncertainty is commonplace in data management scenarios dealing with data integration, sensor readings, information extraction from unstructured sources, and whenever information is manually entered and therefore prone to inaccuracy or partiality. Key challenges in probabilistic data management are to design probabilistic database formalisms that can compactly represent large sets of possible interpretations of uncertain data together with their probability distributions, and to efficiently evaluate queries on very large probabilistic data. Such queries could ask for confidences in data patterns possibly in the presence of additional evidence. The problem of query evaluation in probabilistic databases is still in its infancy. Little is known about which queries can be evaluated in polynomial time, and the few existing evaluation methods employ expensive main-memory algorithms.

The aim of this project is to develop techniques for scalable query processing in probabilistic databases and use them to build a robust query engine called SPROUT (Scalable PROcessing on Tables). We are currently exploring three main research directions.


  • We are investigating open problems in efficient query evaluation. In particular, we aim at discovering classes of tractable (i.e., computable in polynomial time wrt data complexity) queries on probabilistic databases. The query language under investigation is SQL (and its formal core, relational algebra) extended with uncertainty-aware query constructs to create probabilistic data under various probabilistic data models (such as tuple-independent databases, block-independent disjoint databases, or U-relations of MayBMS).
  • For the case of intractable queries, we investigate approximate query evaluation. In contrast to exact evaluation, which computes query answers together with their exact confidences, approximate evaluation computes the query answers with approximate confidences. We are working on new techniques for approximate query evaluation that are aware of the query and the input probabilistic database model (tuple-independent, block-independent disjoint, etc).
  • Our open-source query engine for probabilistic data management systems uses the insights gained from the first two directions. This engine is based on efficient secondary-storage exact and approximate evaluation algorithms for arbitrary queries.
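To give a feel for “query answers together with their exact confidences,” here is a sketch of one easy case: a duplicate-eliminating projection over a tuple-independent table, where the confidence of an answer is one minus the probability that every supporting tuple is absent. The table and function names are made up, and this is not SPROUT’s algorithm.

```python
from math import prod

# Made-up tuple-independent table: each row is present independently
# with the given probability.
SIGHTINGS = [
    # (sensor, species, probability)
    ("s1", "owl", 0.5),
    ("s2", "owl", 0.4),
    ("s3", "fox", 0.9),
]


def confidence(rows, species):
    """P(at least one matching row exists) = 1 - prod(1 - p_i)."""
    return 1 - prod(1 - p for _, sp, p in rows if sp == species)


def answers_with_confidence(rows):
    """Evaluate 'SELECT DISTINCT species' and attach a confidence to each answer."""
    return {sp: confidence(rows, sp) for sp in sorted({sp for _, sp, _ in rows})}


ANSWERS = answers_with_confidence(SIGHTINGS)
```

The independence assumption is what keeps this case simple. Queries that join tables can correlate the supporting tuples, and that is where the hard cases described above arise.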

As of June 2, 2011, you can order Probabilistic Databases by Dan Suciu, Dan Olteanu, Christopher Re, and Christoph Koch from Amazon.

Exciting work!

It occurs to me that semantics are always “probabilistic.”

What does that say about the origin of the semantics of a term?

If semantics are probabilistic, is it ever possible to fix the semantic of a term?

If so, how?

Implementation of Functional Programming Languages – Graph Reduction

Monday, June 27th, 2011

The Implementation of Functional Programming Languages by Simon L Peyton-Jones. (1987)

From the Preface:

This book is about implementing functional programming languages using graph reduction.

There appear to be two main approaches to the efficient implementation of functional languages. The first is an environment-based scheme, exemplified by Cardelli’s ML implementation, which derives from the experience of the Lisp community. The other is graph reduction, a much newer technique first invented by Wadsworth [Wadsworth, 1971], and on which the Ponder and Lazy ML implementations are founded. Despite the radical differences in beginnings, the most sophisticated examples from each approach show remarkable similarities.

This book is intended to have two main applications:

(i) As a course text for part of an undergraduate or postgraduate course on the implementation of functional languages.

(ii) As a handbook for those attempting to write a functional language implementation based on graph reduction.
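One idea behind graph reduction can be shown in a few lines: an expression is held as a graph, so a shared subexpression is reduced once and its node is overwritten with the result. The sketch below is a heavily simplified illustration with invented names; it covers only arithmetic and leaves out lambda abstraction, laziness, and everything else the book treats.

```python
class Node:
    """A graph node: a number, or an operator applied to two child nodes."""

    def __init__(self, op=None, left=None, right=None, value=None):
        self.op, self.left, self.right, self.value = op, left, right, value


REDUCTIONS = 0  # counts how many operator nodes were actually reduced


def reduce_node(node):
    """Reduce a node to a number and overwrite it in place with the result."""
    global REDUCTIONS
    if node.op is None:
        return node.value          # already reduced to a number
    a, b = reduce_node(node.left), reduce_node(node.right)
    node.value = a + b if node.op == "+" else a * b
    node.op = node.left = node.right = None  # update the graph with the result
    REDUCTIONS += 1
    return node.value


# (3 + 4) * (3 + 4), with the subexpression shared rather than copied.
shared = Node("+", Node(value=3), Node(value=4))
root = Node("*", shared, shared)
RESULT = reduce_node(root)
```

Only two reductions are performed here, where a tree with two copies of (3 + 4) would need three, because the second visit to the shared node finds it already overwritten with 7.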

You may also enjoy:

Implementing functional languages: a tutorial by Simon Peyton-Jones and David Lester. (1992)

You may be thinking that 1971 is a bit old for a newer technique (it’s not) but in that case you need to look at the homepage of Simon Peyton-Jones. Current work includes Parallel Haskell, Haskell in the Cloud, etc. Be prepared to get lost for some time.

Data-gov Wiki

Monday, June 27th, 2011

Data-gov Wiki

From the wiki:

The Data-gov Wiki is a project being pursued in the Tetherless World Constellation at Rensselaer Polytechnic Institute. We are investigating open government datasets using semantic web technologies. Currently, we are translating such datasets into RDF, getting them linked to the linked data cloud, and developing interesting applications and demos on linked government data. Most of the datasets shown on this page come from the US government’s data.gov Web site, although some are from other countries or non-government sources.

Try out their Drupal site with new demos:

Linking Open Government Data

Setting my misgivings about the “openness” that releasing government data brings to one side, the Drupal site is a job well done and merits your attention.

Gartner Restates The Obvious, Again

Monday, June 27th, 2011

The press release, Gartner Says Solving ‘Big Data’ Challenge Involves More Than Just Managing Volumes of Data, did not take anyone interested in ‘Big Data’ by surprise.

From the news release:

Worldwide information volume is growing annually at a minimum rate of 59 percent annually, and while volume is a significant challenge in managing big data, business and IT leaders must focus on information volume, variety and velocity.

Volume: The increase in data volumes within enterprise systems is caused by transaction volumes and other traditional data types, as well as by new types of data. Too much volume is a storage issue, but too much data is also a massive analysis issue.

Variety: IT leaders have always had an issue translating large volumes of transactional information into decisions — now there are more types of information to analyze — mainly coming from social media and mobile (context-aware). Variety includes tabular data (databases), hierarchical data, documents, e-mail, metering data, video, still images, audio, stock ticker data, financial transactions and more.

Velocity: This involves streams of data, structured record creation, and availability for access and delivery. Velocity means both how fast data is being produced and how fast the data must be processed to meet demand.

While big data is a significant issue, Gartner analysts said the real issue is making sense of big data and finding patterns in it that help organizations make better business decisions.

Whether data is ‘big’ or ‘small,’ the real issue has always been making sense of it and using it to make business decisions. Did anyone ever contend otherwise?

As far as ‘big data’ goes, I think there are two not entirely obvious impacts it may have on analysis:

1) The streetlamp effect: We have all heard of or seen the cartoon with the guy searching for his car keys under a streetlamp. When someone stops to help and asks where he lost them, he points off into the darkness. When asked why he is searching here, the reply is “The light is better over here.”

With “big data,” there can be a tendency, having collected it, to assume the answer must lie in its analysis. Perhaps so, but having gathered “big data” is no guarantee that you have the right big data or that it is the data that can answer the question being posed. Start with your question and not the “big data” you happen to have on hand.

2) Similar to the first: data that does not admit of easy processing, whether because it is semantically diverse or simply not readily available/processable, may be ignored. That may lead to a false sense of confidence in the data that is analyzed. This danger is particularly real when preliminary results with available data confirm current management plans or understandings.

Making sense out of data (big, small, or in-between) has always been the first step in its use in a business decision process. Even non-Gartner clients know that much.

DataCaml – a first look at distributed dataflow programming in OCaml

Sunday, June 26th, 2011

DataCaml – a first look at distributed dataflow programming in OCaml

From the post:

Distributed programming frameworks like Hadoop and Dryad are popular for performing computation over large amounts of data. The reason is programmer convenience: they accept a query expressed in a simple form such as MapReduce, and automatically take care of distributing computation to multiple hosts, ensuring the data is available at all nodes that need it, and dealing with host failures and stragglers.

A major limitation of Hadoop and Dryad is that they are not well-suited to expressing iterative algorithms or dynamic programming problems. These are very commonly found patterns in many algorithms, such as k-means clustering, binomial options pricing or Smith Waterman for sequence alignment.

Over in the SRG in Cambridge, we developed a Turing-powerful distributed execution engine called CIEL that addresses this. The NSDI 2011 paper describes the system in detail, but here’s a shorter introduction.

The post gives an introduction to the OCaml API.

The CIEL Execution Engine description begins with:

CIEL consists of a master coordination server and workers installed on every host. The engine is job-oriented: a job consists of a graph of tasks which results in a deterministic output. CIEL tasks can run in any language and are started by the worker processes as needed. Data flows around the cluster in the form of references that are fed to tasks as dependencies. Tasks can publish their outputs either as concrete references if they can finish the work immediately or as a future reference. Additionally, tasks can dynamically spawn more tasks and delegate references to them, which makes the system Turing-powerful and suitable for iterative and dynamic programming problems where the task graph cannot be computed statically.

BTW, you can also have opaque references, which progress for a while, then stop.
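To make the "task graph cannot be computed statically" point concrete, here is a toy, single-process sketch in Python. It is not the CIEL or DataCaml API: `Spawn`, `run` and `halve_until_small` are names I made up for illustration. A task either returns a plain value (standing in for a concrete reference) or returns a `Spawn` that delegates its output to a new task (standing in for a future reference), so the number of tasks depends on the data.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class Spawn:
    """Hypothetical marker: 'my output is whatever this other task produces'."""
    fn: Callable[..., Any]
    args: Tuple[Any, ...] = ()


def run(fn, *args):
    """Run a task, following every task it spawns until a plain value appears.

    Returns the final value and the number of tasks that were executed.
    """
    result = fn(*args)
    tasks_run = 1
    while isinstance(result, Spawn):      # output was delegated to a child task
        result = result.fn(*result.args)
        tasks_run += 1
    return result, tasks_run


def halve_until_small(x, threshold):
    """Iterative task: how many tasks run is only known once the data is seen."""
    if x <= threshold:
        return x                          # finished: publish a concrete value
    return Spawn(halve_until_small, (x / 2, threshold))


value, tasks_run = run(halve_until_small, 100.0, 1.0)
print(value, tasks_run)
```

A real engine would distribute the tasks over workers and cope with failures. The sketch only shows why a fixed map-then-reduce shape is not enough when each step decides whether another step is needed.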

Deeply interesting work.

Hickey and the Associative Data Model

Sunday, June 26th, 2011

Rich Hickey Q&A by Michael Fogus appears in Code Quarterly, The Hackademic Journal.

The interview is entertaining but I mention it because of Hickey’s remarks on associative models, which read in part:

When we drop down to the algorithm level, I think OO can seriously thwart reuse. In particular, the use of objects to represent simple informational data is almost criminal in its generation of per-piece-of-information micro-languages, i.e. the class methods, versus far more powerful, declarative, and generic methods like relational algebra. Inventing a class with its own interface to hold a piece of information is like inventing a new language to write every short story. This is anti-reuse, and, I think, results in an explosion of code in typical OO applications. Clojure eschews this and instead advocates a simple associative model for information. With it, one can write algorithms that can be reused across information types.
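A rough illustration of the point, in Python rather than Clojure and with sample records I put together for the purpose: if information is held in plain maps (dicts here), two small generic functions, loosely modeled on relational selection and projection, work unchanged across unrelated information types. The names `select` and `project` are my own, not anything from Clojure's libraries.

```python
def select(rows, **criteria):
    """Keep the rows whose fields equal every key=value pair in criteria."""
    return [r for r in rows
            if all(r.get(k) == v for k, v in criteria.items())]


def project(rows, *keys):
    """Keep only the named fields of each row."""
    return [{k: r[k] for k in keys if k in r} for r in rows]


# Two unrelated "information types", both just lists of maps (illustrative data).
posts = [
    {"title": "Data-gov Wiki", "day": 27},
    {"title": "DataCaml", "day": 26},
    {"title": "Hickey and the Associative Data Model", "day": 26},
]
people = [
    {"name": "Rich Hickey", "language": "Clojure"},
    {"name": "Simon Peyton-Jones", "language": "Haskell"},
]

# The same two functions serve both, with no Post or Person class in sight.
print(project(select(posts, day=26), "title"))
print(project(select(people, language="Clojure"), "name"))
```

With a `Post` class and a `Person` class, each would need its own accessors and its own filtering code, which is the "micro-language per piece of information" Hickey objects to.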

Topic maps are about semantic reuse.