Archive for the ‘Files’ Category

Ceph: A Scalable, High-Performance Distributed File System

Saturday, May 4th, 2013

Ceph: A Scalable, High-Performance Distributed File System by Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Carlos Maltzahn.

Abstract:

We have developed Ceph, a distributed file system that provides excellent performance, reliability, and scalability. Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for heterogeneous and dynamic clusters of unreliable object storage devices (OSDs). We leverage device intelligence by distributing data replication, failure detection and recovery to semi-autonomous OSDs running a specialized local object file system. A dynamic distributed metadata cluster provides extremely efficient metadata management and seamlessly adapts to a wide range of general purpose and scientific computing file system workloads. Performance measurements under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations per second.

I have just started reading this paper but it strikes me as deeply important.

Consider:

Ceph decouples data and metadata operations by eliminating file allocation tables and replacing them with generating functions. This allows Ceph to leverage the intelligence present in OSDs to distribute the complexity surrounding data access, update serialization, replication and reliability, failure detection, and recovery. Ceph utilizes a highly adaptive distributed metadata cluster architecture that dramatically improves the scalability of metadata access, and with it, the scalability of the entire system. We discuss the goals and workload assumptions motivating our choices in the design of the architecture, analyze their impact on system scalability and performance, and relate our experiences in implementing a functional system prototype.

The ability to scale “metadata,” in this case inodes and directory entries (file names), bodes well for scaling topic map based information about files.

Not to mention that experience with generating functions may free us from the overhead of URI based addressing.

For some purposes, I may wish to act as though only files exist but in a separate operation, I may wish to address discrete tokens or even characters in one such file.

Interesting work and worth a deep read.

The source code for Ceph: http://ceph.sourceforge.net/.

National Archives Digitization Tools Now on GitHub

Saturday, October 22nd, 2011

National Archives Digitization Tools Now on GitHub

From the post:

As part of our open government initiatives, the National Archives has begun to share applications developed in-house on GitHub, a social coding platform. GitHub is a service used by software developers to share and collaborate on software development projects and many open source development projects.

Over the last year and a half, our Digitization Services Branch has developed a number of software applications to facilitate digitization workflows. These applications have significantly increased our productivity and improved the accuracy and completeness of our digitization work.

We shared our experiences with these applications with colleagues at other institutions such as the Library of Congress and the Smithsonian Institution, and they expressed interest in trying these applications within their own digitization workflows. We have made two digitization applications, “File Analyzer and Metadata Harvester” and “Video Frame Analyzer” available on GitHub, and they are now available for use by other institutions and the public.

I suspect many government departments (U.S. and otherwise) have similar digitization workflow efforts underway. Perhaps greater publicity about these efforts will cause other departments to step forward.