Archive for the ‘Data Management’ Category

brename – data munging tool

Thursday, August 31st, 2017

brename — a practical cross-platform command-line tool for safely batch renaming files/directories via regular expression

Renaming files is a daily activity when data munging. Wei Shen has created a batch renaming tool with these features:

  • Cross-platform. Supporting Windows, Mac OS X and Linux.
  • Safe. By checking potential conflicts and errors.
  • File filtering. Supporting including and excluding files via regular expression.
    No need to run commands like find ./ -name "*.html" -exec CMD.
  • Renaming submatch with corresponding value via key-value file.
  • Renaming via ascending integer.
  • Recursively renaming both files and directories.
  • Supporting dry run.
  • Colorful output.

Binaries are available for Linux, OS X and Windows, both 32 and 64-bit versions.
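
The two safety features in the list above, dry run and conflict checking, can be illustrated with a short Python sketch. This is not brename's code and does not use its command-line options; the function names (plan_renames, find_conflicts, batch_rename) are hypothetical, and the sketch treats any already existing target as a conflict.

```python
import re
from pathlib import Path


def plan_renames(directory, pattern, replacement):
    """Return a list of (old_path, new_path) pairs; nothing is renamed here."""
    regex = re.compile(pattern)
    plan = []
    for path in sorted(Path(directory).iterdir()):
        new_name = regex.sub(replacement, path.name)
        if new_name != path.name:
            plan.append((path, path.with_name(new_name)))
    return plan


def find_conflicts(plan):
    """Return targets that two sources map to, or that already exist on disk."""
    seen, conflicts = set(), set()
    for _, new in plan:
        if new in seen or new.exists():
            conflicts.add(new)
        seen.add(new)
    return conflicts


def batch_rename(directory, pattern, replacement, dry_run=True):
    """Rename matching entries; with dry_run=True only return the plan."""
    plan = plan_renames(directory, pattern, replacement)
    conflicts = find_conflicts(plan)
    if conflicts:
        raise FileExistsError(f"conflicting targets: {sorted(map(str, conflicts))}")
    if not dry_run:
        for old, new in plan:
            old.rename(new)
    return plan
```

Calling batch_rename(".", r"\.html$", ".htm") returns the planned renames without touching any file; passing dry_run=False applies them only when no conflict was found.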

Linux has a variety of batch file renaming options but I didn’t see any shortcomings in brename that jumped out at me.


HT, Stephen Turner.

Unmet Needs for Analyzing Biological Big Data… [Data Integration #1 – Spells Market Opportunity]

Wednesday, February 15th, 2017

Unmet Needs for Analyzing Biological Big Data: A Survey of 704 NSF Principal Investigators by Lindsay Barone, Jason Williams, David Micklos.


In a 2016 survey of 704 National Science Foundation (NSF) Biological Sciences Directorate principal investigators (BIO PIs), nearly 90% indicated they are currently or will soon be analyzing large data sets. BIO PIs considered a range of computational needs important to their work, including high performance computing (HPC), bioinformatics support, multi-step workflows, updated analysis software, and the ability to store, share, and publish data. Previous studies in the United States and Canada emphasized infrastructure needs. However, BIO PIs said the most pressing unmet needs are training in data integration, data management, and scaling analyses for HPC, acknowledging that data science skills will be required to build a deeper understanding of life. This portends a growing data knowledge gap in biology and challenges institutions and funding agencies to redouble their support for computational training in biology.

In particular, the needs that topic maps can address rank #1, #2, #6, #7, and #10, or as found by the authors:

A majority of PIs—across bioinformatics/other disciplines, larger/smaller groups, and the four NSF programs—said their institutions are not meeting nine of 13 needs (Figure 3). Training on integration of multiple data types (89%), on data management and metadata (78%), and on scaling analysis to cloud/HP computing (71%) were the three greatest unmet needs. High performance computing was an unmet need for only 27% of PIs—with similar percentages across disciplines, different sized groups, and NSF programs.

or graphically (figure 3):

So, cloud, distributed, parallel, pipelining, etc., processing is insufficient?

Pushing undocumented and unintegratable data at ever increasing speeds is impressive but gives no joy?

This report will provoke another round of Esperanto fantasies, that is, the creation of “universal” vocabularies, which if used by everyone and back-mapped to all existing literature, would solve the problem.

The number of Esperanto fantasies and the cost/delay of back-mapping to legacy data defeat all such efforts. Those defeats haven’t prevented repeated funding of such fantasies in the past, present and no doubt the future.

Perhaps those defeats are a question of scope.

That is, rather than even attempting some “universal” interchange of data, why not approach it incrementally?

I suspect the PIs surveyed each had some particular data set in mind when they mentioned data integration (which itself is a very broad term).

Why not seek out, develop and publish data integrations in particular instances, as opposed to attempting to theorize what might work for data yet unseen?

The need topic maps wanted to meet remains unmet, with no signs of lessening.

Opportunity knocks. Will we answer?

NiFi 1.0

Wednesday, August 31st, 2016

NiFi 1.0 (download page)

NiFi 1.0 dropped today!

From the NiFi homepage:

Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. Some of the high-level capabilities and objectives of Apache NiFi include:

  • Web-based user interface
    • Seamless experience between design, control, feedback, and monitoring
  • Highly configurable
    • Loss tolerant vs guaranteed delivery
    • Low latency vs high throughput
    • Dynamic prioritization
    • Flow can be modified at runtime
    • Back pressure
  • Data Provenance
    • Track dataflow from beginning to end
  • Designed for extension
    • Build your own processors and more
    • Enables rapid development and effective testing
  • Secure
    • SSL, SSH, HTTPS, encrypted content, etc…
    • Multi-tenant authorization and internal authorization/policy management

I haven’t been following this project but the expression language for manipulation of data in a flow looks especially interesting.

Wrangler Conference 2015

Friday, November 13th, 2015

Wrangler Conference 2015


Given the panel nature of some of the presentations, ordering these videos by speaker would not be terribly useful.

However, I have exposed the names of the participants in a single list of all the videos.


Polyglot Data Management – Big Data Everywhere Recap

Monday, March 23rd, 2015

Polyglot Data Management – Big Data Everywhere Recap by Michele Nemschoff.

From the post:

At the Big Data Everywhere conference held in Atlanta, Senior Software Engineer Mike Davis and Senior Solution Architect Matt Anderson from Liaison Technologies gave an in-depth talk titled “Polyglot Data Management,” where they discussed how to build a polyglot data management platform that gives users the flexibility to choose the right tool for the job, instead of being forced into a solution that might not be optimal. They discussed the makeup of an enterprise data management platform and how it can be leveraged to meet a wide variety of business use cases in a scalable, supportable, and configurable way.

Matt began the talk by describing the three components that make up a data management system: structure, governance and performance. “Person data” was presented as a good example when thinking about these different components, as it includes demographic information, sensitive information such as social security numbers and credit card information, as well as public information such as Facebook posts, tweets, and YouTube videos. The data management system components include:

It’s a vendor pitch so read with care but it comes closer than any other pitch I have seen to capturing the dynamic nature of data. Data isn’t the same from every source and you treat it the same at your peril.

If I had to say the pitch has a theme it is to adapt your solutions to your data and goals, not the other way around.

The one place where I may depart from the pitch is on the meaning of “normalization.” True enough we may want to normalize data a particular way this week, this month, but that should not preclude us from other “normalizations” should our data or requirements change.

The danger I see in “normalization” is that the cost of changing static ontologies, schemas, etc., leads to their continued use long after they have passed their discard dates. If you are as flexible with regard to your information structures as you are your data, then new data or requirements are easier to accommodate.

Or to put it differently, what is the use of being flexible with data if you intend to imprison it in a fixed labyrinth?

What Counts: Harnessing Data for America’s Communities

Friday, January 16th, 2015

What Counts: Harnessing Data for America’s Communities Senior Editors: Naomi Cytron, Kathryn L.S. Pettit, & G. Thomas Kingsley. (new book, free pdf)

From: A Roadmap: How To Use This Book

This book is a response to the explosive interest in and availability of data, especially for improving America’s communities. It is designed to be useful to practitioners, policymakers, funders, and the data intermediaries and other technical experts who help transform all types of data into useful information. Some of the essays—which draw on experts from community development, population health, education, finance, law, and information systems—address high-level systems-change work. Others are immensely practical, and come close to explaining “how to.” All discuss the incredibly exciting opportunities and challenges that our ever-increasing ability to access and analyze data provide.

As the book’s editors, we of course believe everyone interested in improving outcomes for low-income communities would benefit from reading every essay. But we’re also realists, and know the demands of the day-to-day work of advancing opportunity and promoting well-being for disadvantaged populations. With that in mind, we are providing this roadmap to enable readers with different needs to start with the essays most likely to be of interest to them.

For everyone, but especially those who are relatively new to understanding the promise of today’s data for communities, the opening essay is a useful summary and primer. Similarly, the final essay provides both a synthesis of the book’s primary themes and a focus on the systems challenges ahead.

Section 2, Transforming Data into Policy-Relevant Information (Data for Policy), offers a glimpse into the array of data tools and approaches that advocates, planners, investors, developers and others are currently using to inform and shape local and regional processes.

Section 3, Enhancing Data Access and Transparency (Access and Transparency), should catch the eye of those whose interests are in expanding the range of data that is commonly within reach and finding ways to link data across multiple policy and program domains, all while ensuring that privacy and security are respected.

Section 4, Strengthening the Validity and Use of Data (Strengthening Validity), will be particularly provocative for those concerned about building the capacity of practitioners and policymakers to employ appropriate data for understanding and shaping community change.

The essays in section 5, Adopting More Strategic Practices (Strategic Practices), examine the roles that practitioners, funders, and policymakers all have in improving the ways we capture the multi-faceted nature of community change, communicate about the outcomes and value of our work, and influence policy at the national level.

There are of course interconnections among the essays in each section. We hope that wherever you start reading, you’ll be inspired to dig deeper into the book’s enormous richness, and will join us in an ongoing conversation about how to employ the ideas in this volume to advance policy and practice.

Thirty-one (31) essays by dozens of authors on data and its role in public policy making.

From the acknowledgements:

This book is a joint project of the Federal Reserve Bank of San Francisco and the Urban Institute. The Robert Wood Johnson Foundation provided the Urban Institute with a grant to cover the costs of staff and research that were essential to this project. We also benefited from the field-building work on data from Robert Wood Johnson grantees, many of whom are authors in this volume.

If you are pitching data and/or data projects where the Federal Reserve Bank of San Francisco/Urban Institute set the tone of policy making conversations, a must read. It is likely to have an impact on other policy discussions, but adjusted for local concerns and conventions. You could also use it to shape your local policy discussions.

I first saw this in There is no seamless link between data and transparency by Jennifer Tankard.

Special Issue on Visionary Ideas in Data Management

Friday, January 9th, 2015

SIGMOD Record Call For Papers: Special Issue on Visionary Ideas in Data Management Guest Editor: Jun Yang Editor-in-Chief: Yanlei Diao.

Important Dates

Submission deadline: March 15, 2015
Publication of the special issue: June 2015

From the announcement:

This special issue of SIGMOD Record seeks papers describing visions of future systems, frameworks, algorithms, applications, and technology related to the management or use of data. The goal of this special issue is to promote the discussion and sharing of challenges and ideas that are not necessarily well-explored at the time of writing, but have potential for significantly expanding the possibilities and horizons of the field of databases and data management. The submissions will be evaluated on their originality, significance, potential impact, and interest to the community, with less emphasis on the current level of maturity, technical depth, and evaluation.

Submission Guidelines:

If “visionary” means not yet widely implemented, I think topic maps would easily qualify for this issue. From HDFS to CSV, I haven’t seen another solution for documenting the identity of subjects in data sets. Thoughts? (Modulo the CSV work I mentioned from the W3C quite recently. CSV on the Web:… [ .csv 5,250,000, .rdf 72,700].)

Conference on Innovative Data Systems Research (CIDR) 2015 Program + Papers!

Friday, January 2nd, 2015

Conference on Innovative Data Systems Research (CIDR) 2015

From the homepage:

The biennial Conference on Innovative Data Systems Research (CIDR) is a systems-oriented conference, complementary in its mission to the mainstream database conferences like SIGMOD and VLDB, emphasizing the systems architecture perspective. CIDR gathers researchers and practitioners from both academia and industry to discuss the latest innovative and visionary ideas in the field.

Papers are invited on novel approaches to data systems architecture and usage. CIDR mainly encourages papers about innovative and risky data management system architecture ideas, systems-building experience and insight, resourceful experimental studies, provocative position statements. CIDR especially values innovation, experience-based insight, and vision.

As usual, the conference will be held at the Asilomar Conference Grounds on the Pacific Ocean just south of Monterey, CA. The program will include: keynotes, paper presentations, panels, a gong-show and plenty of time for interaction.

The conference runs January 4 – 7, 2015 (starts next Monday). If you aren’t lucky enough to attend, the program has links to fifty-four (54) papers for your reading pleasure.

The program was exported from a “no-sense-of-abstraction” OOXML application. Conversion to re-usable form will take a few minutes. I will produce an author-sorted version this weekend.

In the meantime, enjoy the papers!

Data Blog Aggregation – Coffeehouse

Thursday, October 2nd, 2014


From the about page:

Coffeehouse aggregates posts about data management from around the internet.

The idea for this site draws inspiration from other aggregators such as Ecobloggers and R-Bloggers.

Coffeehouse is a project of DataONE, the Data Observation Network for Earth.

Posts are lightly curated. That is, all posts are brought in, but if we see posts that aren’t on topic, we take them down from this blog. They are not of course taken down from the original poster, just this blog.

Recently added data blogs:

Archive and Data Management Training Center

We believe that the character and structure of the social science research environment determines attitudes to re-use.

We also believe a healthy research environment gives researchers incentives to confidently create re-usable data, and for data archives and repositories to commit to supporting data discovery and re-use through data enhancement and long-term preservation.

The purpose of our center is to ensure excellence in the creation, management, and long-term preservation of research data. We promote the adoption of standards in research data management and archiving to support data availability, re-use, and the repurposing of archived data.

Our desire is to see the European research area producing quality data with wide and multipurpose re-use value. By supporting multipurpose re-use, we want to help researchers, archives and repositories realize the intellectual value of public investment in academic research. (From the “about” page for the Archive and Data Management Training Center website but representative of the blog as well)

Data Ab Initio

My name is Kristin Briney and I am interested in all things relating to scientific research data.

I have been in love with research data since working on my PhD in Physical Chemistry, when I preferred modeling and manipulating my data to actually collecting it in the lab (or, heaven forbid, doing actual chemistry). This interest in research data led me to a Master’s degree in Information Studies where I focused on the management of digital data.

This blog is something I wish I had when I was a practicing scientist: a resource to help me manage my data and navigate the changing landscape of research dissemination.

Digital Library Blog (Stanford)

The latest news and milestones in the development of Stanford’s digital library–including content, new services, and infrastructure development.

Dryad News and Views

Welcome to Dryad news and views, a blog about news and events related to the Dryad digital repository. Subscribe, comment, contribute– and be sure to Publish Your Data!

Dryad is a curated general-purpose repository that makes the data underlying scientific publications discoverable, freely reusable, and citable. Any journal or publisher that wishes to encourage data archiving may refer authors to Dryad. Dryad welcomes data submissions related to any published, or accepted, peer reviewed scientific and medical literature, particularly data for which no specialized repository exists.

Journals can support and facilitate their authors’ data archiving by implementing “submission integration,” by which the journal manuscript submission system interfaces with Dryad. In a nutshell: the journal sends automated notifications to Dryad of new manuscripts, which enables Dryad to create a provisional record for the article’s data, thereby streamlining the author’s data upload process. The published article includes a link to the data in Dryad, and Dryad links to the published article.

The Dryad documentation site provides complete information about Dryad and the submission integration process.

Dryad staff welcome all inquiries. Thank you.


The data deluge refers to the increasingly large and complex data sets generated by researchers that must be managed by their creators with “industrial-scale data centres and cutting-edge networking technology” (Nature 455) in order to provide for use and re-use of the data.

The lack of standards and infrastructure to appropriately manage this (often tax-payer funded) data requires data creators, data scientists, data managers, and data librarians to collaborate in order to create and acquire the technology required to provide for data use and re-use.

This blog is my way of sorting through the technology, management, research and development that have come together to successfully solve the data deluge. I will post and discuss both current and past R&D in this area. I welcome any comments.

There are fourteen (14) data blogs to date feeding into Coffeehouse. Unlike some data blog aggregations, ads do not overwhelm content at Coffeehouse.

If you have a data blog, please consider adding it to Coffeehouse. Suggest that other data bloggers do the same.

Open source datacenter computing with Apache Mesos

Monday, September 15th, 2014

Open source datacenter computing with Apache Mesos by Sachin P. Bappalige.

From the post:

Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications or frameworks. Mesos is open source software originally developed at the University of California at Berkeley. It sits between the application layer and the operating system and makes it easier to deploy and manage applications in large-scale clustered environments more efficiently. It can run many applications on a dynamically shared pool of nodes. Prominent users of Mesos include Twitter, Airbnb, MediaCrossing, Xogito and Categorize.

Mesos leverages features of the modern kernel—“cgroups” in Linux, “zones” in Solaris—to provide isolation for CPU, memory, I/O, file system, rack locality, etc. The big idea is to make a large collection of heterogeneous resources. Mesos introduces a distributed two-level scheduling mechanism called resource offers. Mesos decides how many resources to offer each framework, while frameworks decide which resources to accept and which computations to run on them. It is a thin resource sharing layer that enables fine-grained sharing across diverse cluster computing frameworks, by giving frameworks a common interface for accessing cluster resources. The idea is to deploy multiple distributed systems to a shared pool of nodes in order to increase resource utilization. A lot of modern workloads and frameworks can run on Mesos, including Hadoop, Memcached, Ruby on Rails, Storm, JBoss Data Grid, MPI, Spark and Node.js, as well as various web servers, databases and application servers.
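
The two-level scheduling idea in the quoted passage can be sketched as a toy simulation in Python. This is not the Mesos API: the Offer, Framework and Master classes, the agent names and the "spark" and "web" frameworks are all hypothetical, and the round-robin choice of which framework receives an offer is only a stand-in for a real allocation policy.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    agent: str
    cpus: float
    mem_mb: int


@dataclass
class Framework:
    """Second level: decides which offered resources to accept."""
    name: str
    task_cpus: float
    task_mem_mb: int
    pending_tasks: int
    launched: list = field(default_factory=list)

    def consider(self, offer):
        """Return how many tasks to launch on this offer (0 declines it)."""
        fit = min(int(offer.cpus // self.task_cpus),
                  offer.mem_mb // self.task_mem_mb)
        count = min(fit, self.pending_tasks)
        self.pending_tasks -= count
        self.launched.extend([offer.agent] * count)
        return count


class Master:
    """First level: decides which framework is offered which free resources."""

    def __init__(self, agents):
        self.free = dict(agents)  # agent name -> (cpus, mem_mb)

    def offer_round(self, frameworks):
        for i, (agent, (cpus, mem)) in enumerate(sorted(self.free.items())):
            # Round-robin stand-in for a real allocation policy (assumption).
            fw = frameworks[i % len(frameworks)]
            used = fw.consider(Offer(agent, cpus, mem))
            self.free[agent] = (cpus - used * fw.task_cpus,
                                mem - used * fw.task_mem_mb)


# Hypothetical cluster and frameworks, for illustration only.
master = Master({"agent-1": (4.0, 8192), "agent-2": (2.0, 4096)})
spark = Framework("spark", task_cpus=1.0, task_mem_mb=2048, pending_tasks=3)
web = Framework("web", task_cpus=0.5, task_mem_mb=512, pending_tasks=10)
master.offer_round([spark, web])
print(spark.launched, web.pending_tasks, master.free)
```

The point of the split is visible in the two classes: the master only knows what is free and who gets the next offer, while each framework alone knows what its tasks need and how many to place.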

This introduction to Apache Mesos will give you a quick overview of what Mesos has to offer without getting bogged down in details. Details will come later, either if you want to run a datacenter using Mesos or to map a datacenter being run with Mesos.

US Government Content Processing: A Case Study

Monday, March 24th, 2014

US Government Content Processing: A Case Study by Stephen E Arnold.

From the post:

I know that the article “Sinkhole of Bureaucracy” is an example of a single case example. Nevertheless, the write up tickled my funny bone. With fancy technology, and the hyper modern content processing systems used in many Federal agencies, reality is stranger than science fiction.

This passage snagged my attention:

inside the caverns of an old Pennsylvania limestone mine, there are 600 employees of the Office of Personnel Management. Their task is nothing top-secret. It is to process the retirement papers of the government’s own workers. But that system has a spectacular flaw. It still must be done entirely by hand, and almost entirely on paper.

One of President Obama’s advisors is quoted as describing the manual operation as “that crazy cave.”

Further in the post Stephen makes a good point when he suggests that in order to replace this operation you would first have to understand it.

But having said that, holding IT contractors accountable for failure would go a long way towards encouraging such understanding.

So far as I know, there have been no consequences for the IT contractors responsible for the meltdown.

Perhaps that is the first sign of IT management incompetence, no consequences for IT failures.



Saturday, September 28th, 2013

MANTRA: Free, online course on how to manage digital data by Sarah Dister.

From the post:

Research Data MANTRA is a free, online course with guidelines on how to manage the data you collect throughout your research. The course is particularly appropriate for those who work or are planning to work with digital data.

Once you have finalized the course, you will:

  • Be aware of the risk of data loss and data protection requirements.
  • Know how to store and transport your data safely and securely (backup and encryption).
  • Have experience in using data in software packages such as R, SPSS, NVivo, or ArcGIS.
  • Recognise the importance of good research data management practice in your own context.
  • Be able to devise a research data management plan and apply it throughout the project’s life.
  • Be able to organise and document your data efficiently during the course of your project.
  • Understand the benefits of sharing data and how to do it legally and ethically.

Data management may not be as sexy as “big data” but without it, there would be no “big data” to make some of us sexy. 😉

Poderopedia Plug & Play Platform

Wednesday, July 17th, 2013

Poderopedia Plug & Play Platform

From the post:

Poderopedia Plug & Play Platform is a Data Intelligence Management System that allows you to create and manage large semantic datasets of information about entities, map and visualize entity connections, include entity related documents, add and show sources of information and news mentions of entities, displaying all the information in a public or private website, that can work as a standalone product or as a public searchable database that can interoperate with a Newsroom website, for example, providing rich contextual information for news content using its archive.

Poderopedia Plug & Play Platform is a free open source software developed by the Poderomedia Foundation, thanks to the generous support of a Knight News Challenge 2011 grant by the Knight Foundation, a Startup Chile 2012 grant and a 2013 Knight fellowship grant by the International Center for Journalists (ICFJ).


For anything that involves mapping entities and connections.

A few real examples:

  • NewsStack, an Africa News Challenge Winner, will use it for a pan-African investigation by 10 media organizations into the continent’s extractive industries.
  • Newsrooms from Europe and Latin America want to use it to make their own public searchable databases of entities, reuse their archive to develop new information products, provide context to new stories and make data visualizations, something like making their own Crunchbase.

Other ideas:

  • Use existing data to make searchable databases and visualizations of congresspeople, bills passed, what they own, who funds them, etc.
  • Map lobbyists and who they lobby and for whom
  • Create a NBApedia, Baseballpedia or Soccerpedia. Show data and connections about team owners, team managers, players, all their stats, salaries and related business
  • Map links between NSA, Prism and Silicon Valley
  • Keep track of foundation grants, projects that received funding, etc.
  • Anything related to data intelligence


Plug & Play allows you to create and manage entity profile pages that include: short bio or summary, sheet of connections, long newsworthy profiles, maps of connections of an entity, documents related to the entity, sources of all the information and news river with external news about the entity.

Among several features (please see full list here) it includes:

  • Entity pages
  • Connections data sheet
  • Data visualizations without coding
  • Annotated documents repository
  • Add sources of information
  • News river
  • Faceted Search (using Solr)
  • Semantic ontology to express connections
  • Republish options and metrics record
  • View entity history
  • Report errors and inappropriate content
  • Suggest connections and new entities to add
  • Needs updating alerts
  • Send anonymous tips

Hmmm, when they say:

For anything that involves mapping entities and connections.

Topic maps would say:

For anything that involves mapping subjects and associations.

What Poderopedia does lack is a notion of subject identity that would support “merging.”
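
As a rough illustration of what identity-based “merging” means, here is a minimal Python sketch that folds together records sharing at least one subject identifier. It says nothing about Poderopedia's actual data model: the record layout (an "ids" set and a "names" set), the function name merge_by_identity and the example.org identifiers are all assumptions made for the example.

```python
def merge_by_identity(records):
    """Merge records that share at least one subject identifier.

    Each record is a dict with an "ids" set and a "names" set
    (a hypothetical layout chosen for this sketch).
    """
    merged = []
    for record in records:
        current = {"ids": set(record["ids"]), "names": set(record["names"])}
        remaining = []
        for existing in merged:
            if existing["ids"] & current["ids"]:
                # Same subject: fold the existing entry into the current one.
                current["ids"] |= existing["ids"]
                current["names"] |= existing["names"]
            else:
                remaining.append(existing)
        merged = remaining + [current]
    return merged


# Made-up records: the first two share an identifier and describe one subject.
records = [
    {"ids": {"http://example.org/id/acme"}, "names": {"Acme Corp"}},
    {"ids": {"http://example.org/id/acme", "http://example.com/company/42"},
     "names": {"ACME Corporation"}},
    {"ids": {"http://example.org/id/other"}, "names": {"Other Ltd"}},
]
result = merge_by_identity(records)
print(len(result))
```

Without some explicit identifier of this kind, two entity pages for the same person or company stay separate no matter how many connections they have in common.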

I am going to install Poderopedia locally and see what the UI is like.

Appreciate your comments and reports if you do the same.

Plus suggestions about adding topic map capabilities to Poderopedia.

I first saw this in Nat Torkington’s Four Short Links: 5 July 2013.

Information Management – Gartner 2013 “Predictions”

Wednesday, April 3rd, 2013

I hesitate to call Gartner reports “predictions.”

The public ones I have seen are c-suite summaries of information already known to the semi-informed.

Are Gartner “predictions” about what c-suite types may become informed about in the coming year?

That qualifies for the dictionary sense of “prediction.”

More importantly, what c-suite types may become informed about are clues on how to promote topic maps.

If you don’t have access to the real Gartner reports, Andy Price has summarized information management predictions in: IT trends: Gartner’s 2013 predictions for information management.

The ones primarily relevant to topic maps are:

  • Big data
  • Semantic technologies
  • The logical data warehouse
  • Information stewardship applications
  • Information valuation/infonomics

One possible way to capitalize on these “predictions” would be to create a word cloud from the articles reporting on these “predictions.”

Every article will use slightly different language and the most popular terms are the ones to use for marketing.

Thinking they will be repeated often enough to resonate with potential customers.

Capturing the business needs answered by those terms would be a separate step.
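
The counting step behind such a word cloud can be sketched in a few lines of Python. The sample sentences are made up for illustration and are not taken from any article, the stop-word list is deliberately tiny, and counting the number of articles a term appears in (rather than raw occurrences) is just one reasonable choice.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "is", "on", "will"}


def term_counts(articles):
    """Count in how many articles each term appears (document frequency)."""
    counts = Counter()
    for text in articles:
        words = set(re.findall(r"[a-z]+", text.lower()))
        counts.update(words - STOP_WORDS)
    return counts


# Made-up sample text standing in for the articles reporting on the predictions.
articles = [
    "Big data will dominate information management.",
    "Semantic technologies and big data are on the rise.",
    "The logical data warehouse complements big data.",
]
top = term_counts(articles).most_common(2)
print(top)
```

The resulting counts can be handed to any word cloud renderer, or simply read off as a ranked list of candidate marketing terms.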

Project Falcon…

Wednesday, April 3rd, 2013

Project Falcon: Tackling Hadoop Data Lifecycle Management via Community Driven Open Source by Venkatesh Seetharam.

From the post:

Today we are excited to see another example of the power of community at work as we highlight the newly approved Apache Software Foundation incubator project named Falcon. This incubation project was initiated by the team at InMobi together with engineers from Hortonworks. Falcon is useful to anyone building apps on Hadoop as it simplifies data management through the introduction of a data lifecycle management framework.

All About Falcon and Data Lifecycle Management

Falcon is a data lifecycle management framework for Apache Hadoop that enables users to configure, manage and orchestrate data motion, disaster recovery, and data retention workflows in support of business continuity and data governance use cases.

Falcon workflow

I am certain a topic map based workflow solution could be created.

However, using a solution being promoted by others removes one thing from the topic map “to do” list.

Not to mention giving topic maps an introduction to other communities.

Research Data Symposium – Columbia

Saturday, March 9th, 2013

Research Data Symposium – Columbia.

Posters from the Research Data Symposium, held at Columbia University, February 27, 2013.

Subject to the limitations of the poster genre but useful as a quick overview of current projects and directions.

Data Governance needs Searchers, not Planners

Wednesday, March 6th, 2013

Data Governance needs Searchers, not Planners by Jim Harris.

From the post:

In his book Everything Is Obvious: How Common Sense Fails Us, Duncan Watts explained that “plans fail, not because planners ignore common sense, but rather because they rely on their own common sense to reason about the behavior of people who are different from them.”

As development economist William Easterly explained, “A Planner thinks he already knows the answer; A Searcher admits he doesn’t know the answers in advance. A Planner believes outsiders know enough to impose solutions; A Searcher believes only insiders have enough knowledge to find solutions, and that most solutions must be homegrown.”

I made a similar point in my post Data Governance and the Adjacent Possible. Change management efforts are resisted when they impose new methods by emphasizing bad business and technical processes, as well as bad data-related employee behaviors, while ignoring unheralded processes and employees whose existing methods are preventing other problems from happening.

If you don’t remember any line from any post you read here or elsewhere, remember this one:

“…they rely on their own common sense to reason about the behavior of people who are different from them.”

Whenever you encounter a situation where that description fits, you will find failed projects, waste and bad morale.

Why Most BI Programs Under-Deliver Value

Sunday, February 10th, 2013

Why Most BI Programs Under-Deliver Value by Steve Dine.

From the post:

Business intelligence initiatives have been undertaken by organizations across the globe for more than 25 years, yet according to industry experts between 60 and 65 percent of BI projects and programs fail to deliver on the requirements of their customers.

The impact of this failure reaches far beyond the project investment, from unrealized revenue to increased operating costs. While the exact reasons for failure are often debated, most agree that a lack of business involvement, long delivery cycles and poor data quality lead the list. After all this time, why do organizations continue to struggle with delivering successful BI? The answer lies in the fact that they do a poor job at defining value to the customer and how that value will be delivered given the resource constraints and political complexities in nearly all organizations.

BI is widely considered an umbrella term for data integration, data warehousing, performance management, reporting and analytics. For the vast majority of BI projects, the road to value definition starts with a program or project charter, which is a document that defines the high level requirements and capital justification for the endeavor. In most cases, the capital justification centers on cost savings rather than value generation. This is due to the level of effort required to gather and integrate data across disparate source systems and user developed data stores.

As organizations mature, the number of applications that collect and store data increase. These systems usually contain few common unique identifiers to help identify related records and are often referred to as data silos. They also can capture overlapping data attributes for common organizational entities, such as product and customer. In addition, the data models of these systems are usually highly normalized, which can make them challenging to understand and difficult for data extraction. These factors make cost savings, in the form of reduced labor for data collection, easy targets. Unfortunately, most organizations don’t eliminate employees when a BI solution is implemented; they simply work on different, hopefully more value added, activities. From the start, the road to value is based on a flawed assumption and is destined to under deliver on its proposition.

This post merits a close read, several times.

In particular I like the focus on delivery of value to the customer.

Err, that would be the person paying you to do the work.

Steve promises a follow-up on “lean BI” that focuses on delivering more value than it costs to deliver.

I am inherently suspicious of “lean” or “agile” approaches. I sat on a committee that was assured by three programmers they had improved upon IBM’s programming methodology but declined to share the details.

Their requirements document for a content management system, to be constructed on top of subversion, was a paragraph in an email.

Fortunately the committee prevailed upon management to tank the project. The programmers persist, management being unable or unwilling to correct past mistakes.

I am sure there are many agile/lean programming projects that deliver well documented, high quality results.

But I don’t start with the assumption that agile/lean or other methodology projects are well documented.

That is a question of fact. One that can be answered.

Refusal to answer due to time or resource constraints is a very bad sign.

I first saw this in a top ten tweets list from KDNuggets.

Mule ESB 3.3.1

Thursday, September 13th, 2012

Mule ESB 3.3.1 by Ramiro Rinaudo.

I got the “memo” on 4 September 2012 but it got lost in my inbox. Sorry.

From the post:

Mule ESB 3.3.1 represents a significant amount of effort on the back of Mule ESB 3.3 and our happiness with the result is multiplied by the number of products that are part of this release. We are releasing new versions with multiple enhancements and bug fixes to all of the major stack components in our Enterprise Edition. This includes:

Are You An IT Hostage?

Monday, August 13th, 2012

As I promised last week in From Overload to Impact: An Industry Scorecard on Big Data Business Challenges [Oracle Report], the key finding that is missing from Oracle’s summary:

Executives’ Biggest Data Management Gripes:*

#1 Don’t have the right systems in place to gather the information we need (38%)

#2 Can’t give our business managers access to the information they need; need to rely on IT (36%)

Ask your business managers: Do they feel like IT hostages?

You are likely to be surprised at the answers you get.

IT’s vocabulary acts as an information clog.

A clog that impedes the flow of information in your organization.

Information that can improve the speed and quality of business decision making.

The critical point is: Information clogs are bad for business.

Do you want to borrow my plunger?

From Overload to Impact: An Industry Scorecard on Big Data Business Challenges [Oracle Report]

Friday, August 10th, 2012

From Overload to Impact: An Industry Scorecard on Big Data Business Challenges [Oracle Report]


IT powers today’s enterprises, which is particularly true for the world’s most data-intensive industries. Organizations in these highly specialized industries increasingly require focused IT solutions, including those developed specifically for their industry, to meet their most pressing business challenges, manage and extract insight from ever-growing data volumes, improve customer service, and, most importantly, capitalize on new business opportunities.

The need for better data management is all too acute, but how are enterprises doing? Oracle surveyed 333 C-level executives from U.S. and Canadian enterprises spanning 11 industries to determine the pain points they face regarding managing the deluge of data coming into their organizations and how well they are able to use information to drive profit and growth.

Key Findings:

  • 94% of C-level executives say their organization is collecting and managing more business information today than two years ago, by an average of 86% more
  • 29% of executives give their organization a “D” or “F” in preparedness to manage the data deluge
  • 93% of executives believe their organization is losing revenue – on average, 14% annually – as a result of not being able to fully leverage the information they collect
  • Nearly all surveyed (97%) say their organization must make a change to improve information optimization over the next two years
  • Industry-specific applications are an important part of the mix; 77% of organizations surveyed use them today to run their enterprise—and they are looking for more tailored options

What key finding did they miss?

They cover it in the forty-two (42) page report but it doesn’t appear here.

Care to guess what it is?

Forgotten key finding post coming Monday, 13 August 2012. Watch for it!

I first saw this at Beyond Search.

Data citation initiatives and issues

Monday, June 25th, 2012

Data citation initiatives and issues by Matthew S. Mayernik (Bulletin of the American Society for Information Science and Technology Volume 38, Issue 5, pages 23–28, June/July 2012)


The importance of formally citing scientific research data has been recognized for decades but is only recently gaining momentum. Several federal government agencies urge data citation by researchers, DataCite and its digital object identifier registration services promote the practice of citing data, international citation guidelines are in development and a panel at the 2012 ASIS&T Research Data Access and Preservation Summit focused on data citation. Despite strong reasons to support data citation, the lack of individual user incentives and a pervasive cultural inertia in research communities slow progress toward broad acceptance. But the growing demand for data transparency and linked data along with pressure from a variety of stakeholders combine to fuel effective data citation. Efforts promoting data citation must come from recognized institutions, appreciate the special characteristics of data sets and initially emphasize simplicity and manageability.

This is an important and eye-opening article on the state of data citations and issues related to it.

I found it surprising, in part because citation of data in radio and optical astronomy has long been commonplace, and in part because for decades now the astronomical community has placed a high value on public archiving of research data as it is acquired, both in raw and processed formats.

As pointed out in this paper, without public archiving, there can be no effective form of data citation. Sad to say, the majority of data never makes it to public archives.

Given the reliance on private and public sources of funding for research, public archiving and access should be guaranteed as a condition of funding. Researchers would be free to continue to not make their data publicly accessible, should they choose to fund their own work.

If that sounds harsh, consider the well deserved amazement at the antics over access to the Dead Sea Scrolls.

If the only way for your opinion/analysis to prevail is to deny others access to the underlying data, that is all the commentary the community needs on your work.

Cascading 2.0

Thursday, June 7th, 2012

Cascading 2.0

From the post:

We are happy to announce that Cascading 2.0 is now publicly available for download.

This release includes a number of new features. Specifically:

  • Apache 2.0 Licensing
  • Support for Hadoop 1.0.2
  • Local and Hadoop planner modes, where local runs in memory without Hadoop dependencies
  • HashJoin pipe for “map side joins”
  • Merge pipe for “map side merges”
  • Simple Checkpointing for capturing intermediate data as a file
  • Improved Tap and Scheme APIs

We have also created a new top-level project on GitHub for all community sponsored Cascading projects:

From the documentation:

What is Cascading?

Cascading is a data processing API and processing query planner used for defining, sharing, and executing data-processing workflows on a single computing node or distributed computing cluster. On a single node, Cascading’s “local mode” can be used to efficiently test code and process local files before being deployed on a cluster. On a distributed computing cluster using Apache Hadoop platform, Cascading adds an abstraction layer over the Hadoop API, greatly simplifying Hadoop application development, job creation, and job scheduling.

Cascading homepage.

Don’t miss the extensions to Cascading: Cascading Extensions. Any summary would be unfair. Take a look for yourself. Coverage of any of these you would like to point out?

I first spotted Cascading 2.0 at Alex Popescu’s myNoSQL.

Data Management is Based on Philosophy, Not Science

Tuesday, May 1st, 2012

Data Management is Based on Philosophy, Not Science by Malcolm Chisholm.

From the post:

There’s a joke running around on Twitter that the definition of a data scientist is “a data analyst who lives in California.” I’m sure the good natured folks of the Golden State will not object to me bringing this up to make a point. The point is: Thinking purely in terms of marketing, which is a better title — data scientist or data philosopher?

My instincts tell me there is no contest. The term data scientist conjures up an image of a tense, driven individual, surrounded by complex technology in a laboratory somewhere, wrestling valuable secrets out of the strange substance called data. By contrast, the term data philosopher brings to mind a pipe-smoking elderly gentleman sitting in a winged chair in some dusty recess of academia where he occasionally engages in meaningless word games with like-minded individuals.

These stereotypes are obviously crude, but they are probably what would come into the minds of most executive managers. Yet how true are they? I submit that there is a strong case that data management is much more like applied philosophy than it is like applied science.

Applied philosophy. I like that!

You know where I am going to come out on this issue so I won’t belabor it.

Enjoy reading Malcolm’s post!

When It Comes to Data Quality Delivery, the Soft Stuff is the Hard Stuff (Part 1 of 6)

Saturday, March 10th, 2012

When It Comes to Data Quality Delivery, the Soft Stuff is the Hard Stuff (Part 1 of 6) by Richard Trapp.

From the post:

I regularly receive questions regarding the types of skills data quality analysts should have in order to be effective. In my experience, regardless of scope, high performing data quality analysts need to possess a well-rounded, balanced skill set – one that marries technical “know how” and aptitude with a solid business understanding and acumen. But, far too often, it seems that undue importance is placed on what I call the data quality “hard skills”, which include; a firm grasp of database concepts, hands on data analysis experience using standard analytical tool sets, expertise with commercial data quality technologies, knowledge of data management best practices and an understanding of the software development life cycle.

Read Richard’s post to get the listing of “soft skills” and evaluate yourself.

I am going to track this series and will post updates here.

Being successful with “big data,” semantic integration, whatever the next buzz words are, will require a mix of hard and soft skills.

Success has always required both hard and soft skills, but it doesn’t hurt to repeat the lesson.

Selling Data Mining to Management

Sunday, February 19th, 2012

Selling Data Mining to Management by Sandro Saitta.

From the post:

Preparing data and building data mining models are two very well documented steps of analytics projects. However, whatever interesting your results are, they are useless if no action is taken. Thus, the step from analytics to action is a crucial one in any analytics project. Imagine you have the best data and found the best model of all time. You need to industrialize the data mining solution to make your company benefits from them. Often, you will first need to sell your project to the management.

Sandro references three very good articles on pitching data management/mining/analytics to management.

I would rephrase Sandro’s opening line to read: “Preparing data [for a topic map] and building [a topic map] are two very well documented steps of [topic map projects]. However, whatever interesting your results are, [there is no revenue if no one buys the map].”

OK, maybe I am being generous on the preparing data and building a topic map points but you can see where the argument is going.

And there are successful topic map merchants with active clients, just not enough of either one.

These papers may be the push in the right direction to get more of them.

First Look — Talend

Saturday, January 7th, 2012

First Look — Talend

From the post:

Talend has been around for about 6 years and the original focus was on “democratizing” data integration – making it cheaper, easier, quicker and less maintenance-heavy. They originally wanted to build an open source alternative for data integration. In particular they wanted to make sure that there was a product that worked for smaller companies and smaller projects, not just for large data warehouse efforts.

Talend has 400 employees in 8 countries and 2,500 paying customers for their Enterprise product. Talend uses an “open core” philosophy where the core product is open source and the enterprise version wraps around this as a paid product. They have expanded from pure data integration into a broader platform with data quality and MDM and a year ago they acquired an open source ESB vendor and earlier this year released a Talend branded version of this ESB.

I have the Talend software but need to spend some time working through the tutorials, etc.

I plan a review from the perspective of subject identity and re-use of subject identification.

It may help me to simply start posting as I work through the software rather than waiting to create an edited review of the whole. Which I could always fashion from the pieces if it looked useful.

Watch for the start of my review of Talend this next week.

What the Sumerians can teach us about data

Tuesday, January 3rd, 2012

What the Sumerians can teach us about data

Pete Warden writes:

I spent this afternoon wandering the British Museum’s Mesopotamian collection, and I was struck by what the humanities graduates in charge of the displays missed. The way they told the story, the Sumerian’s biggest contribution to the world was written language, but I think their greatest achievement was the invention of data.

Writing grew out of pictograms that were used to tally up objects or animals. Historians and other people who write for a living treat that as a primitive transitional use, a boring stepping-stone to the final goal of transcribing speech and transmitting stories. As a data guy, I’m fascinated by the power that being able to capture and transfer descriptions of the world must have given the Sumerians. Why did they invent data, and what can we learn from them?

Although Pete uses the term “Sumerians” to cover a very wide span of peoples, languages and history, I think his comment:

Gathering data is not a neutral act, it will alter the power balance, usually in favor of the people collecting the information.

is right on the mark.

There is an aspect of data management that we can learn from the Ancient Near East (not just the Sumerians).

Preservation of access.

It isn’t enough to simply preserve data. You can ask NASA about preservation of data. (Houston, We Erased The Apollo 11 Tapes)

Particularly with this attitude:

“We’re all saddened that they’re not there. We all wish we had 20-20 hindsight,” says Dick Nafzger, a TV specialist at NASA’s Goddard Space Flight Center in Maryland, who helped lead the search team.

“I don’t think anyone in the NASA organization did anything wrong,” Nafzger says. “I think it slipped through the cracks, and nobody’s happy about it.”

Didn’t do anything wrong?

You do know the leading cause for firing of sysadmins is failure to maintain proper backups? I would hold everyone standing near a crack responsible. Would not bring the missing tapes back but it would make future generations more careful.

Considering that was only a few decades ago, how do we read ancient texts for which we have no key in English?

The ancients preserved access to their data by way of trilingual inscriptions. Inscriptions in three different languages but all saying the same thing. If you know only one of the languages you can work towards understanding the other two.

A couple of examples:

Van Fortress, with an inscription of Xerxes the Great.

Behistun Inscription, with an inscription in Old Persian, Elamite, and Babylonian.

BTW, the final image in Pete’s post is much later than the Sumerians and is one of the first cuneiform artifacts to be found. (Taylor’s Prism) It describes King Sennacherib’s military victories and dates from about 691 B.C. It is written in Neo-Assyrian cuneiform script. That script is used in primers and introductions to Akkadian.

Can I guess how many mappings you have of your ontologies or database schemas? I suppose the first question should be whether they are documented at all. Then follow up with the question about mapping to other ontologies or schemas, such as an industry standard schema or set of terms.

If that sounds costly, consider the cost of migration/integration without documentation/mapping. Topic maps can help with the mapping aspects of such a project.
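To make the point about documented mappings concrete, here is a minimal sketch in Python. It is not a topic map and not anyone's actual schema: every field name, the FieldMapping class and both helper functions are hypothetical, invented only to show what "documented" might look like. Each mapping records the local field, the standard term it corresponds to, and a note saying why.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldMapping:
    """One documented correspondence between a local field and a standard term."""
    local_field: str      # hypothetical name in the in-house schema
    standard_field: str   # hypothetical name in an industry standard schema
    note: str             # the human-readable reason the two are the same subject


# Hypothetical mappings; real ones would come from your own documentation.
MAPPINGS = [
    FieldMapping("cust_no", "customerId", "Same identifier, different label."),
    FieldMapping("prod_cd", "productCode", "Local codes are zero-padded."),
]

# Hypothetical in-house schema with one field nobody has documented yet.
LOCAL_SCHEMA = ["cust_no", "prod_cd", "rgn"]


def unmapped_fields(schema, mappings):
    """Return the local fields that have no documented mapping, in schema order."""
    mapped = {m.local_field for m in mappings}
    return [field for field in schema if field not in mapped]


def translate(record, mappings):
    """Rename the documented fields of a record; undocumented fields are dropped."""
    lookup = {m.local_field: m.standard_field for m in mappings}
    return {lookup[key]: value for key, value in record.items() if key in lookup}


if __name__ == "__main__":
    print(unmapped_fields(LOCAL_SCHEMA, MAPPINGS))
    print(translate({"cust_no": 7, "rgn": "EU"}, MAPPINGS))
```

The list returned by unmapped_fields is the cheap part of the answer to "are they documented at all?" The note on each mapping is the expensive part, and the part that is missing when migration time comes.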

Webdam Project: Foundations of Web Data Management

Saturday, December 31st, 2011

Webdam Project: Foundations of Web Data Management

From the homepage:

The goal of the Webdam project is to develop a formal model for Web data management. This model will open new horizons for the development of the Web in a well-principled way, enhancing its functionality, performance, and reliability. Specifically, the goal is to develop a universally accepted formal framework for describing complex and flexible interacting Web applications featuring notably data exchange, sharing, integration, querying and updating. We also propose to develop formal foundations that will enable peers to concurrently reason about global data management activities, cooperate in solving specific tasks and support services with desired quality of service. Although the proposal addresses fundamental issues, its goal is to serve as the basis for future software development for Web data management.

Books from the project:

  • Foundations of Databases, Serge Abiteboul, Rick Hull, Victor Vianu, open access online edition
  • Web Data Management and Distribution, Serge Abiteboul, Ioana Manolescu, Philippe Rigaux, Marie-Christine Rousset, Pierre Senellart, open access online edition
  • Modeling, Querying and Mining Uncertain XML Data, Evgeny Kharlamov and Pierre Senellart. In A. Tagarelli, editor, XML Data Mining: Models, Methods, and Applications. IGI Global, 2011. open access online edition

I discovered this project via a link to “Web Data Management and Distribution” in Christophe Lalanne’s A bag of tweets / Dec 2011, that pointed to the PDF file, some 400 pages. I went looking for the HTML page with the link and discovered this project along with these titles.

There are a number of other publications associated with the project that you may find useful. The “Querying and Mining Uncertain XML” is only a chapter out of a larger publication by IGI Global. About what one expects from IGI Global. Cambridge University Press published the title just preceding this chapter and allows download for personal use of the entire book.

I think there is a lot to be learned from this project, even if it has not resulted in a universal framework for web applications that exchange data. I don’t think we are in any danger of universal frameworks on or off the web. And we are better for it.