Archive for the ‘Storage’ Category

How to spot first stories on Twitter using Storm

Wednesday, November 27th, 2013

How to spot first stories on Twitter using Storm by Michael Vogiatzis.

From the post:

As a first blog post, I decided to describe a way to detect first stories (a.k.a. new events) on Twitter as they happen. This work is part of the Thesis I wrote last year for my MSc in Computer Science in the University of Edinburgh. You can find the document here.

Every day, thousands of posts share information about news, events, automatic updates (weather, songs) and personal information. The information published can be retrieved and analyzed in a news detection approach. The immediate spread of events on Twitter combined with the large number of Twitter users prove it suitable for first stories extraction. Towards this direction, this project deals with a distributed real-time first story detection (FSD) using Twitter on top of Storm. Specifically, I try to identify the first document in a stream of documents, which discusses about a specific event. Let’s have a look into the implementation of the methods used.
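To make the idea of "the first document in a stream that discusses a specific event" concrete, here is a minimal single-machine sketch. It is not the author's Storm implementation: it compares each post against every earlier post by brute force, and the tokenizer, the 0.5 similarity threshold, and the sample posts are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a post and split it into word tokens."""
    return re.findall(r"[a-z0-9#@']+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def first_stories(stream, threshold=0.5):
    """Yield posts whose closest earlier post falls below the threshold."""
    seen = []  # term-frequency vectors of every post processed so far
    for text in stream:
        vector = Counter(tokenize(text))
        nearest = max((cosine(vector, old) for old in seen), default=0.0)
        if nearest < threshold:
            yield text  # nothing similar enough came before: a first story
        seen.append(vector)


# Hypothetical sample stream, invented for this example.
posts = [
    "Earthquake hits the city centre",
    "Big earthquake hits the city centre right now",
    "New phone released today",
    "Earthquake in the city centre, stay safe",
]
flagged = list(first_stories(posts))
print(flagged)
```

Comparing against every earlier post grows linearly with the stream, which is presumably why a distributed, real-time setting needs something smarter than this.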

Other resources of interest:

Slide deck by the same name.

Code on Github.

The slides were interesting and were what prompted me to search for and find the blog and Github materials.

An interesting extension to this technique would be to discover “new” ideas in papers.

Or particular classes of “new” ideas in news streams.

ZFS disciples form one true open source database

Sunday, September 22nd, 2013

ZFS disciples form one true open source database by Lucy Carey.

From the post:

The acronym ‘ZFS’ may no longer actually stand for anything, but the “world’s most advanced file sharing system” is in no way redundant. Yesterday, it emerged that corporate advocates of the Sun Microsystems file system and logical volume manager have joined together to offer a new “truly open source” incarnation of the file system, called, fittingly enough, OpenZFS.

Along with the launch of the website – which is, incidentally, a domain owned by ZFS co-founder Matt Ahrens – the group of ZFS lovers, which includes developers from the illumos, FreeBSD, Linux, and OS X platforms, as well as an assortment of other parties who are building products on top of OpenZFS, have set out a clear set of objectives.

Speaking of scaling, Wikipedia reports:

A ZFS file system can store up to 256 quadrillion Zebibytes (ZiB).

Just in case anyone mentions scalable storage as an issue. 😉
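For anyone who wants to check the figure, a quick bit of arithmetic. This sketch assumes the Wikipedia number reflects a 128-bit limit of 2^128 bytes and that "quadrillion" is being read in its binary sense (1024^5); both are my assumptions, not statements from the quoted post.

```python
ZIB = 2 ** 70                        # bytes in one zebibyte
ASSUMED_ZFS_LIMIT_BYTES = 2 ** 128   # assumption: 128-bit addressing limit

limit_in_zib = ASSUMED_ZFS_LIMIT_BYTES // ZIB
binary_quadrillion = 2 ** 50         # 1024 ** 5, the binary reading of 10 ** 15

print(limit_in_zib)                              # 288230376151711744
print(limit_in_zib == 256 * binary_quadrillion)  # True
```

Read in decimal, that is roughly 288 quadrillion ZiB, so the "256" only works out under the binary reading.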

Ceph: A Scalable, High-Performance Distributed File System

Saturday, May 4th, 2013

Ceph: A Scalable, High-Performance Distributed File System by Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Carlos Maltzahn.

Abstract:


We have developed Ceph, a distributed file system that provides excellent performance, reliability, and scalability. Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for heterogeneous and dynamic clusters of unreliable object storage devices (OSDs). We leverage device intelligence by distributing data replication, failure detection and recovery to semi-autonomous OSDs running a specialized local object file system. A dynamic distributed metadata cluster provides extremely efficient metadata management and seamlessly adapts to a wide range of general purpose and scientific computing file system workloads. Performance measurements under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations per second.

I have just started reading this paper but it strikes me as deeply important.


Ceph decouples data and metadata operations by eliminating file allocation tables and replacing them with generating functions. This allows Ceph to leverage the intelligence present in OSDs to distribute the complexity surrounding data access, update serialization, replication and reliability, failure detection, and recovery. Ceph utilizes a highly adaptive distributed metadata cluster architecture that dramatically improves the scalability of metadata access, and with it, the scalability of the entire system. We discuss the goals and workload assumptions motivating our choices in the design of the architecture, analyze their impact on system scalability and performance, and relate our experiences in implementing a functional system prototype.
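To see why a "generating function" can stand in for an allocation table, consider the sketch below. It is not CRUSH, which also handles device weights and cluster hierarchy; it swaps in plain rendezvous (highest-random-weight) hashing, and the OSD names and object name are hypothetical. The point is only that any client can compute where an object lives from its name and the list of devices, with no lookup table.

```python
import hashlib


def score(object_name, osd):
    """Deterministic pseudo-random score for an (object, device) pair."""
    digest = hashlib.sha256(f"{object_name}:{osd}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def place(object_name, osds, replicas=3):
    """Pick the devices with the highest scores for this object."""
    ranked = sorted(osds, key=lambda osd: score(object_name, osd), reverse=True)
    return ranked[:replicas]


# Hypothetical cluster of eight devices and a made-up object name.
osds = [f"osd.{i}" for i in range(8)]
placement = place("inode-1234.chunk-0", osds)
print(placement)
```

A useful property of this style of function: removing a device that does not hold an object leaves that object's placement untouched, so only the data on the lost device has to move.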

The ability to scale “metadata,” in this case inodes and directory entries (file names), bodes well for scaling topic map based information about files.

Not to mention that experience with generating functions may free us from the overhead of URI based addressing.

For some purposes, I may wish to act as though only files exist, but in a separate operation I may wish to address discrete tokens or even characters in one such file.

Interesting work and worth a deep read.

The source code for Ceph:

Tachyon


Wednesday, April 17th, 2013

Tachyon by UC Berkeley AMP Lab.

From the webpage:

Tachyon is a fault tolerant distributed file system enabling reliable file sharing at memory-speed across cluster frameworks, such as Spark and MapReduce. It offers up to 300 times higher throughput than HDFS, by leveraging lineage information and using memory aggressively. Tachyon caches working set files in memory, and enables different jobs/queries and frameworks to access cached files at memory speed. Thus, Tachyon avoids going to disk to load datasets that are frequently read.

Since we haven’t quite arrived at in-memory computing just yet, you may want to review Tachyon.

The numbers are very impressive.

Storing Topic Map Data at $136/TB

Friday, October 5th, 2012

Steve Streza describes his storage system in My Giant Hard Drive: Building a Storage Box with FreeNAS.

At his prices, that comes to about $136/TB for 11 TB of storage.

Large enough for realistic simulations of data mining or topic mapping. When you want to step up to production, spin up services on one of the clouds.

Not sure it will last you several years as Steve projects but it should last long enough to be worth the effort.

From the post:

For many years, I’ve had a lot of hard drives being used for data storage. Movies, TV shows, music, apps, games, backups, documents, and other data have been moved between hard drives and stored in inconsistent places. This has always been the cheap and easy approach, but it has never been really satisfying. And with little to no redundancy, I’ve suffered a non-trivial amount of data loss as drives die and files get lost. Now, I’m not alone to have this problem, and others have figured out ways of solving it. One of the most interesting has been in the form of a computer dedicated to one thing: storing data, and lots of it. These computers are called network-attached storage, or NAS, computers. A NAS is a specialized computer that has lots of hard drives, a fast connection to the local network, and…that’s about it. It doesn’t need a high-end graphics card, or a 20-inch monitor, or other things we typically associate with computers. It just sits on the network and quietly serves and stores files. There are off-the-shelf boxes you can buy to do this, such as machines made by Synology or Drobo, and you can assemble one yourself for the job.

I’ve been considering making a NAS for myself for over a year, but kept putting it off due to expense and difficulty. But a short time ago, I finally pulled the trigger on a custom assembled machine for storing data. Lots of it; almost 11 terabytes of storage, in fact. This machine is made up of 6 hard drives, and is capable of withstanding a failure on two of them without losing a single file. If any drives do fail, I can replace them and keep on working. And these 11 terabytes act as one giant hard drive, not as 6 independent ones that have to be organized separately. It’s an investment in my storage needs that should grow as I need it to, and last several years.
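A rough check on how 6 drives that survive two failures come to "almost 11 terabytes." The sketch assumes a double-parity layout (something like RAID-Z2) and 3 TB drives; neither detail is stated in the excerpt, so treat both as hypothetical. It also ignores filesystem overhead.

```python
def usable_bytes(drive_count, drive_bytes, parity_drives):
    """Capacity left for data once parity drives are set aside."""
    return (drive_count - parity_drives) * drive_bytes


ASSUMED_DRIVE_BYTES = 3 * 10 ** 12   # assumption: 3 TB (decimal) drives

usable = usable_bytes(drive_count=6,
                      drive_bytes=ASSUMED_DRIVE_BYTES,
                      parity_drives=2)
usable_tib = usable / 2 ** 40        # what most operating systems report

print(f"{usable / 10 ** 12:.0f} TB raw data capacity, about {usable_tib:.2f} TiB")
```

Under those assumptions the gap between 12 decimal terabytes and just under 11 binary ones would account for the number Steve reports.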

Amazon Glacier: Archival Storage for One Penny Per GB Per Month

Tuesday, August 21st, 2012

Amazon Glacier: Archival Storage for One Penny Per GB Per Month by Jeff Barr.

From the post:

I’m going to bet that you (or your organization) spend a lot of time and a lot of money archiving mission-critical data. No matter whether you’re currently using disk, optical media or tape-based storage, it’s probably a more complicated and expensive process than you’d like which has you spending time maintaining hardware, planning capacity, negotiating with vendors and managing facilities.


If so, then you are going to find our newest service, Amazon Glacier, very interesting. With Glacier, you can store any amount of data with high durability at a cost that will allow you to get rid of your tape libraries and robots and all the operational complexity and overhead that have been part and parcel of data archiving for decades.

Glacier provides – at a cost as low as $0.01 (one US penny, one one-hundredth of a dollar) per Gigabyte, per month – extremely low cost archive storage. You can store a little bit, or you can store a lot (Terabytes, Petabytes, and beyond). There’s no upfront fee and you pay only for the storage that you use. You don’t have to worry about capacity planning and you will never run out of storage space. Glacier removes the problems associated with under or over-provisioning archival storage, maintaining geographically distinct facilities and verifying hardware or data integrity, irrespective of the length of your retention periods.

The caveat is that you don’t have immediate access to your data (it is called “Glacier” for a reason), but it is still an impressive price.

Unless you are monitoring nuclear missile launch signatures or are a day trader, do you really need arbitrary and random access to all your data?

Or is that a requirement because you read that some other department or agency was getting “real time” big data?
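For a sense of scale, here is the penny-per-GB price set against the $136/TB figure from the FreeNAS post in this archive. The sketch assumes 1 TB = 1000 GB and counts storage charges only: retrieval fees on one side, and electricity or replacement drives on the other, are left out.

```python
GLACIER_CENTS_PER_GB_MONTH = 1   # "one US penny" per GB per month, from the post
NAS_DOLLARS_PER_TB = 136         # one-time hardware cost from the FreeNAS post
ASSUMED_GB_PER_TB = 1000         # assumption: decimal terabytes


def glacier_monthly_dollars(terabytes):
    """Monthly Glacier storage charge in dollars, storage only."""
    cents = terabytes * ASSUMED_GB_PER_TB * GLACIER_CENTS_PER_GB_MONTH
    return cents / 100


monthly_for_11_tb = glacier_monthly_dollars(11)
breakeven_months = NAS_DOLLARS_PER_TB / glacier_monthly_dollars(1)

print(f"11 TB in Glacier: ${monthly_for_11_tb:.2f} per month")
print(f"Months until Glacier fees match the NAS price per TB: {breakeven_months:.1f}")
```

On those numbers the home-built box pays for itself in a bit over a year, provided you are willing to be your own operations staff.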

Polyglot Persistence?

Sunday, February 12th, 2012

The Future is Polyglot Persistence by Martin Fowler and Pramod Sadalage. (PDF file)

The crux is slide 7 where the authors observe in part:

Polyglot Persistence using multiple data storage technologies, chosen based on the way data is being used by individual applications. Why store binary images in relational database, when there are better storage systems.

Bringing Alex Popescu to observe:

There are over 2 years since I’ve begun evangelizing polyglot persistence. By now, most thought leaders agree it is the future. Next on my agenda is having the top relational vendors sign off too. Actually, I’m almost there: Oracle is promoting an Oracle NoSQL Database and Microsoft is offering both relational and non-relational solutions with Azure. They just need to say it. (The future is polyglot persistence)

I am very puzzled.

I am not sure how Alex could be “evangelizing polyglot persistence” or Martin and Pramod could be announcing its “discovery.”

Just in case you haven’t noticed, while SQL databases are very popular, there are video storage/delivery systems, custom databases for scientific data, SGML/XML databases (for at least the last 20-some-odd years) and others.

In other words, polyglot persistence has been a fact of the IT world from the beginning of persisted data.

Ask yourself: Who gains from confusing IT decision makers with fictional discoveries?

Tiered Storage Approaches to Big Data:…

Tuesday, December 13th, 2011

Tiered Storage Approaches to Big Data: Why look to the Cloud when you’re working with Galaxies?

Event Date: 12/15/2011 02:00 PM Eastern Standard Time

From the email:

The ability for organizations to keep up with the growth of Big Data in industries like satellite imagery, genomics, oil and gas, and media and entertainment has strained many storage environments. Though storage device costs continue to be driven down, corporations and research institutions have to look to setting up tiered storage environments to deal with increasing power and cooling costs and shrinking data center footprint of storing all this big data.

NASA’s Earth Observing System Data and Information System (EOSDIS) is arguably a poster child when looking at large image file ingest and archive. Responsible for processing, archiving, and distributing Earth science satellite data (e.g., land, ocean and atmosphere data products), NASA EOSDIS handles hundreds of millions of satellite image data files averaging roughly from 7 MB to 40 MB in size and totaling over 3PB of data.

Discover long-term data tiering, archival, and data protection strategies for handling large files using a product like Quantum’s StorNext data management solution and similar solutions from a panel of three experts. Hear how NASA EOSDIS handles its data workflow and long term archival across four sites in North America and makes this data freely available to scientists.

Think of this as a starting point to learn some of the “lingo” in this area and perhaps hear some good stories about data and NASA.

Some questions to think about during the presentation/discussion:

How do you effectively access information after not only the terminology but the world view of a discipline has changed?

What do you have to know about the data and its storage?

How do the products discussed address those questions?