Archive for the ‘Files’ Category

Making Statistics Files Accessible

Sunday, March 8th, 2015

Making Statistics Files Accessible by Evan Miller.

From the post:

There’s little in modern society more frustrating than receiving a file from someone and realizing you’ll need to buy a jillion-dollar piece of software in order to open it. It’s like, someone just gave you a piece of birthday cake, but you’re only allowed to eat that cake with a platinum fork encrusted with diamonds, and also the fork requires you to enter a serial number before you can use it.

Wizard often receives praise for its clean statistics interface and beautiful design, but I’m just as proud of another part of the software that doesn’t receive much attention, ironically for the very reason that it works so smoothly: the data importers. Over the last couple of years I’ve put a lot of effort into understanding and picking apart various popular file formats; and as a result, Wizard can slurp down Excel, Numbers, SPSS, Stata, and SAS files like it was a bowl of spaghetti at a Shoney’s restaurant.

Of course, there are a lot of edge cases and idiosyncrasies in binary files, and it takes a lot of mental effort to keep track of all the peculiarities; and to be honest I’d rather spend that effort making a better interface instead of bashing my head against a wall over some binary flag field that I really, honestly have no interest in learning more about. So today I’m happy to announce that the file importers are about to get even smoother, and at the same time, I’ll be able to put more of my attention on the core product rather than worrying about file format issues.

The astute reader will ask: how will a feature that starts receiving less attention from me get better? It’s simple: I’ve open-sourced Wizard’s core routines for reading SAS, Stata, and SPSS files, and as of today, these routines are available to anyone who uses R — quite a big audience, which means that many more people will be available to help me diagnose and fix issues with the file importers.

In case you don’t recognize the Wizard software, there’s a reason the site has “mac” in its name: 😉
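Evan Miller's open-sourced routines target R, but the same kind of statistics-file import can be sketched in Python. The example below is my illustration, not his code: it uses pandas' built-in Stata reader/writer to round-trip a small dataset through the binary .dta format.

```python
import pandas as pd

# Hedged sketch: round-trip a small dataset through Stata's binary .dta
# format using pandas (not Evan Miller's routines). The same idea applies
# to SPSS (.sav) and SAS (.sas7bdat) files via other pandas readers.
df = pd.DataFrame({"age": [25, 40], "income": [30000.0, 52000.0]})
df.to_stata("demo.dta", write_index=False)

# Any .dta-aware tool can now read the file back without the original app.
back = pd.read_stata("demo.dta")
print(back)
```

The point mirrors the post: once the binary format is understood (or a library understands it for you), no "platinum fork" of proprietary software is needed to eat the cake.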

Why Extended Attributes are Coming to HDFS

Saturday, June 28th, 2014

Why Extended Attributes are Coming to HDFS by Charles Lamb.

From the post:

Extended attributes in HDFS will facilitate at-rest encryption for Project Rhino, but they have many other uses, too.

Many mainstream Linux filesystems implement extended attributes, which let you associate metadata with a file or directory beyond common “fixed” attributes like filesize, permissions, modification dates, and so on. Extended attributes are key/value pairs in which the values are optional; generally, the key and value sizes are limited to some implementation-specific limit. A filesystem that implements extended attributes also provides system calls and shell commands to get, list, set, and remove attributes (and values) to/from a file or directory.

Recently, my Intel colleague Yi Liu led the implementation of extended attributes for HDFS (HDFS-2006). This work is largely motivated by Cloudera and Intel contributions to bringing at-rest encryption to Apache Hadoop (HDFS-6134; also see this post) under Project Rhino – extended attributes will be the mechanism for associating encryption key metadata with files and encryption zones – but it’s easy to imagine lots of other places where they could be useful.

For instance, you might want to store a document’s author and subject in attributes like user.author and user.subject=HDFS. You could store a file checksum in an attribute called user.checksum. Even plain comments about a particular file or directory can be saved in an extended attribute.

In this post, you’ll learn some of the details of this feature from an HDFS user’s point of view.

Extended attributes sound like an interesting place to tuck away additional information about a file.

Such as the legend to be used to interpret it?
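The Linux feature the post describes can be tried directly. Below is a minimal sketch of my own (not from the post), assuming Linux and a filesystem with user-xattr support such as ext4 or XFS; the attribute name echoes the post's example.

```python
import os

# Illustrative sketch of Linux extended attributes, the local-filesystem
# feature HDFS borrows: attach a user-namespace key/value pair to a file.
# On filesystems without xattr support, os.setxattr raises OSError.
path = "xattr_demo.txt"
open(path, "w").close()
try:
    os.setxattr(path, "user.subject", b"HDFS")
    attrs = os.listxattr(path)               # ['user.subject']
    value = os.getxattr(path, "user.subject")
    os.removexattr(path, "user.subject")
except OSError:
    attrs, value = [], None                  # xattrs not supported here
os.unlink(path)
print(attrs, value)
```

HDFS exposes the same get/set/list/remove operations through its shell, so metadata like an interpretive legend could ride along with the file itself.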

ROOT Files

Friday, March 21st, 2014

ROOT Files

From the webpage:

Today, a huge amount of data is stored in files on our PCs and on the Internet. To achieve maximum compression, binary formats are used, so the files cannot simply be opened with a text editor to fetch their content. Rather, one needs a program to decode the binary files. Quite often, the very same program is used both to save and to fetch the data from those files, but it is also possible (and advisable) for other programs to be able to do the same. This happens when the binary format is public and well documented, but it may also happen with proprietary formats that became de facto standards. One of the most important problems of the information era is that programs evolve very rapidly, and may also disappear, so that it is not always trivial to correctly decode a binary file. This is often the case for old files written in binary formats that are not publicly documented, and it is a really serious risk for formats implemented in custom applications.

As a solution to these issues ROOT provides a machine-independent compressed binary file format, including both the data and its description, along with an open-source automated tool to generate the data description (or “dictionary”) when saving data, and to generate C++ classes corresponding to this description when reading the data back. The dictionary is used to build and load the C++ code that reads the binary objects saved in the ROOT file and stores them in instances of the automatically generated C++ classes.

ROOT files can be structured into “directories”, exactly in the same way as your operating system organizes files into folders. ROOT directories may contain other directories, so a ROOT file is more similar to a file system than to an ordinary file.

Amit Kapadia mentions ROOT files in his presentation at CERN on citizen science.

I have only just begun to read the documentation but wanted to pass this starting place along to you.

I don’t find the “machine-independent compressed binary format” argument all that convincing but apparently it has in fact worked for quite some time.

Of particular interest will be the data dictionary aspects of ROOT.

Other data and description capturing file formats?
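The “data plus its description” idea is worth seeing concretely. The sketch below is my toy illustration of the concept, not ROOT's actual format: a JSON “dictionary” describing the record layout is written into the file ahead of the packed data, so a reader needs no prior knowledge of the layout.

```python
import json
import struct

# Toy self-describing binary file (concept sketch only; ROOT's real
# dictionary/streamer machinery is far richer). The schema travels
# inside the file, so any reader can decode the records.
schema = {"fields": [["energy", "d"], ["charge", "i"]]}
rows = [(13.6, -1), (2.2, 1)]

fmt = "".join(t for _, t in schema["fields"])
header = json.dumps(schema).encode()
with open("demo.bin", "wb") as f:
    f.write(struct.pack("<I", len(header)))   # length-prefixed dictionary
    f.write(header)
    for row in rows:
        f.write(struct.pack(fmt, *row))

# The reader reconstructs the layout from the embedded dictionary alone.
with open("demo.bin", "rb") as f:
    (hlen,) = struct.unpack("<I", f.read(4))
    desc = json.loads(f.read(hlen))
    rfmt = "".join(t for _, t in desc["fields"])
    size = struct.calcsize(rfmt)
    out = []
    chunk = f.read(size)
    while chunk:
        out.append(struct.unpack(rfmt, chunk))
        chunk = f.read(size)
print(out)
```

A file that carries its own description survives the disappearance of the program that wrote it, which is exactly the preservation argument the ROOT page makes.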

ZFS disciples form one true open source database

Sunday, September 22nd, 2013

ZFS disciples form one true open source database by Lucy Carey.

From the post:

The acronym ‘ZFS’ may no longer actually stand for anything, but the “world’s most advanced file system” is in no way redundant. Yesterday, it emerged that corporate advocates of the Sun Microsystems file system and logical volume manager have joined together to offer a new “truly open source” incarnation of the file system, called, fittingly enough, OpenZFS.

Along with the launch of the website – which is, incidentally, a domain owned by ZFS co-founder Matt Ahrens – the group of ZFS lovers, which includes developers from the illumos, FreeBSD, Linux, and OS X platforms, as well as an assortment of other parties who are building products on top of OpenZFS, has set out a clear set of objectives.

Speaking of scaling, Wikipedia reports:

A ZFS file system can store up to 256 quadrillion zebibytes (ZiB).

Just in case anyone mentions scalable storage as an issue. 😉

Ceph: A Scalable, High-Performance Distributed File System

Saturday, May 4th, 2013

Ceph: A Scalable, High-Performance Distributed File System by Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Carlos Maltzahn.

From the abstract:

We have developed Ceph, a distributed file system that provides excellent performance, reliability, and scalability. Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for heterogeneous and dynamic clusters of unreliable object storage devices (OSDs). We leverage device intelligence by distributing data replication, failure detection and recovery to semi-autonomous OSDs running a specialized local object file system. A dynamic distributed metadata cluster provides extremely efficient metadata management and seamlessly adapts to a wide range of general purpose and scientific computing file system workloads. Performance measurements under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations per second.

I have just started reading this paper but it strikes me as deeply important.

Further from the paper:

Ceph decouples data and metadata operations by eliminating file allocation tables and replacing them with generating functions. This allows Ceph to leverage the intelligence present in OSDs to distribute the complexity surrounding data access, update serialization, replication and reliability, failure detection, and recovery. Ceph utilizes a highly adaptive distributed metadata cluster architecture that dramatically improves the scalability of metadata access, and with it, the scalability of the entire system. We discuss the goals and workload assumptions motivating our choices in the design of the architecture, analyze their impact on system scalability and performance, and relate our experiences in implementing a functional system prototype.

The ability to scale “metadata,” in this case inodes and directory entries (file names), bodes well for scaling topic map based information about files.

Not to mention that experience with generating functions may free us from the overhead of URI based addressing.

For some purposes, I may wish to act as though only files exist but in a separate operation, I may wish to address discrete tokens or even characters in one such file.

Interesting work and worth a deep read.

The source code for Ceph is also available.
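The generating-function idea can be sketched with rendezvous (highest-random-weight) hashing, a simpler cousin of Ceph's CRUSH. This is my illustration, not Ceph's code: replica locations are computed from a deterministic pseudo-random score of (object, device), so no central allocation table is needed and every client agrees on the placement.

```python
import hashlib

# Placement by function rather than by lookup table (rendezvous hashing,
# a simpler cousin of CRUSH): score each (object, device) pair with a
# deterministic hash and keep the top-scoring devices as replica homes.
def place(obj, devices, replicas=2):
    def score(dev):
        digest = hashlib.sha256(f"{obj}:{dev}".encode()).hexdigest()
        return int(digest, 16)
    return sorted(devices, key=score, reverse=True)[:replicas]

devices = ["osd0", "osd1", "osd2", "osd3"]
print(place("inode-42", devices))
```

Because the mapping is a pure function of its inputs, adding or removing a device reshuffles only the objects whose top scores change, which is the property that lets such systems scale without a metadata bottleneck.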

National Archives Digitization Tools Now on GitHub

Saturday, October 22nd, 2011

National Archives Digitization Tools Now on GitHub

From the post:

As part of our open government initiatives, the National Archives has begun to share applications developed in-house on GitHub, a social coding platform used by software developers to share and collaborate on software development projects, including many open source projects.

Over the last year and a half, our Digitization Services Branch has developed a number of software applications to facilitate digitization workflows. These applications have significantly increased our productivity and improved the accuracy and completeness of our digitization work.

We shared our experiences with these applications with colleagues at other institutions such as the Library of Congress and the Smithsonian Institution, and they expressed interest in trying these applications within their own digitization workflows. We have made two digitization applications, “File Analyzer and Metadata Harvester” and “Video Frame Analyzer” available on GitHub, and they are now available for use by other institutions and the public.

I suspect many government departments (U.S. and otherwise) have similar digitization workflow efforts underway. Perhaps greater publicity about these efforts will cause other departments to step forward.