The Data-Scope Project – 6PB storage, 500GBytes/sec sequential IO, 20M IOPS, 130TFlops
From the post:
While Galileo played life and death doctrinal games over the mysteries revealed by the telescope, another revolution went unnoticed, the microscope gave up mystery after mystery and nobody yet understood how subversive would be what it revealed. For the first time these new tools of perceptual augmentation allowed humans to peek behind the veil of appearance. A new new eye driving human invention and discovery for hundreds of years.
Data is another material that hides, revealing itself only when we look at different scales and investigate its underlying patterns. If the universe is truly made of information, then we are looking into truly primal stuff. A new eye is needed for Data and an ambitious project called Data-scope aims to be the lens.
A detailed paper on the Data-Scope tells more about what it is:
The Data-Scope is a new scientific instrument, capable of ‘observing’ immense volumes of data from various scientific domains such as astronomy, fluid mechanics, and bioinformatics. The system will have over 6PB of storage, about 500GBytes per sec aggregate sequential IO, about 20M IOPS, and about 130TFlops. The Data-Scope is not a traditional multi-user computing cluster, but a new kind of instrument, that enables people to do science with datasets ranging between 100TB and 1000TB. There is a vacuum today in data-intensive scientific computations, similar to the one that led to the development of the BeoWulf cluster: an inexpensive yet efficient template for data intensive computing in academic environments based on commodity components. The proposed Data-Scope aims to fill this gap.
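To put those figures in perspective, here is a back-of-envelope sketch of how long one sequential pass over a data set would take at the quoted aggregate rate. It assumes decimal units (1TB = 10^12 bytes) and that the full 500GBytes per sec is actually sustained, neither of which the paper's abstract states; the function name is mine.

```python
# Back-of-envelope scan times from the figures quoted in the abstract.
# Assumptions (not from the paper): decimal units, fully sustained rate.
TB = 10**12
GB = 10**9
AGGREGATE_BYTES_PER_SEC = 500 * GB  # "about 500GBytes per sec"


def scan_seconds(dataset_bytes, rate=AGGREGATE_BYTES_PER_SEC):
    """Seconds needed to read dataset_bytes once at the given rate."""
    return dataset_bytes / rate


for label, size in [("100TB", 100 * TB), ("1000TB", 1000 * TB), ("6PB", 6000 * TB)]:
    secs = scan_seconds(size)
    print(f"{label}: {secs:,.0f} seconds, about {secs / 60:.0f} minutes")
```

Under those assumptions a 100TB data set can be read end to end in about 200 seconds, and the entire 6PB store in a little over three hours.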
A very accessible interview by Nicole Hemsoth with Dr. Alexander Szalay, Data-Scope team lead, is available at The New Era of Computing: An Interview with “Dr. Data”. Roberto Zicari also has a good interview with Dr. Szalay in Objects in Space vs. Friends in Facebook.
I am not altogether convinced that the data/computing center model is the best one, but the lessons learned here may hasten the development of more sophisticated architectures.
Subject identity issues abound in any environment, but some are easier to see in a complex one.
For example, what if researchers' choices were captured as subject identifications, and associations were created to other data sets (or to data within those sets) based on those choices?
Such associations could power recommendations of additional data, or notices when additional data becomes available.
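A minimal sketch of that idea, to make it concrete. Everything here is hypothetical: the `ChoiceLog` class, the researcher names, and the data set names are mine, and the scoring is plain co-selection counting rather than anything the Data-Scope project describes. Each recorded choice acts as an association between a researcher and a data set, and recommendations come from researchers whose choices overlap.

```python
from collections import Counter, defaultdict


class ChoiceLog:
    """Records which data sets each researcher selected (hypothetical model)."""

    def __init__(self):
        self._by_researcher = defaultdict(set)

    def record(self, researcher, dataset):
        """Capture one choice as a researcher-to-data-set association."""
        self._by_researcher[researcher].add(dataset)

    def recommend(self, researcher, limit=3):
        """Suggest data sets chosen by researchers with overlapping choices."""
        mine = self._by_researcher.get(researcher, set())
        scores = Counter()
        for other, theirs in self._by_researcher.items():
            if other == researcher:
                continue
            overlap = len(mine & theirs)
            if overlap == 0:
                continue  # no shared choices, so no basis for an association
            for dataset in theirs - mine:
                scores[dataset] += overlap
        # Highest score first; ties broken by name so results are stable.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [dataset for dataset, _ in ranked[:limit]]


log = ChoiceLog()
for who, what in [
    ("alice", "sky-survey"), ("alice", "turbulence"),
    ("bob", "sky-survey"), ("bob", "genome-reads"),
    ("carol", "sky-survey"), ("carol", "turbulence"), ("carol", "galaxy-catalog"),
]:
    log.record(who, what)

print(log.recommend("alice"))
```

The same association table could drive notices as well: when a new data set is recorded against one researcher, everyone whose choices overlap with theirs is a candidate for notification.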