Personal computers now ship with terabytes of disk storage. A terabyte of RAM isn’t far behind, and multi-terabyte configurations of both are available in high-end appliances.
One solution when v = volume is to pump up the storage volume. But you can always find data sets that are “big data” for your current storage.
Fact is, “big data” has always outrun current storage. The question of how to store more data than is convenient has been asked, and answered, before. I encountered one of those answers last night.
The abstract to the paper reads:
The Hubble Space Telescope (HST) generates on the order of 7,000 telemetry values, many of which are sampled at 1Hz, and with several hundred parameters being sampled at 40Hz. Such data volumes would quickly tax even the largest of processing facilities. Yet the ability to access the telemetry data in a variety of ways, and in particular, using ad hoc (i.e., no a priori fixed) queries, is essential to assuring the long term viability and usefulness of this instrument. As part of the recent NASA initiative to re-engineer HST’s ground control systems, a concept arose to apply newly available data warehousing technologies to this problem. The Space Telescope Science Institute was engaged to develop a pilot to investigate the technology and to create a proof-of-concept testbed that could be demonstrated and evaluated for operational use. This paper describes this effort and its results.
The authors framed their v = volume problem as:
Then there’s the shear volume of the telemetry data. At its nominal format and rate, the HST generates over 3,000 monitored samples per second. Tracking each sample as a separate record would generate over 95 giga-records/year, or assuming a 16 year Life-of-Mission (LOM), 1.5 tera-records/LOM. Assuming a minimal 20 byte record per transaction yields 1.9 terabytes/year or 30 terabytes/LOM. Such volumes are supported by only the most exotic and expensive custom database systems made.
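The arithmetic is easy to re-derive. Here is a quick back-of-the-envelope check in Python, using only the figures quoted above (the 3,000 samples/second, 20-byte records, and 16-year mission are the paper’s numbers, not mine):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600   # ~31.5 million seconds

samples_per_second = 3_000           # "over 3,000 monitored samples per second"
bytes_per_record = 20                # "a minimal 20 byte record per transaction"
lom_years = 16                       # 16 year Life-of-Mission (LOM)

records_per_year = samples_per_second * SECONDS_PER_YEAR
records_per_lom = records_per_year * lom_years
tb_per_year = records_per_year * bytes_per_record / 1e12
tb_per_lom = tb_per_year * lom_years

print(f"{records_per_year / 1e9:.0f} giga-records/year")      # ~95 giga-records/year
print(f"{records_per_lom / 1e12:.1f} tera-records/LOM")       # ~1.5 tera-records/LOM
print(f"{tb_per_year:.1f} TB/year, {tb_per_lom:.0f} TB/LOM")  # ~1.9 TB/year, ~30 TB/LOM
```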
We may smile at the numbers now but this was 1998. As always, solutions were needed in the near term, not in a decade or two.
The authors did find a solution. Their v = 30 terabytes/LOM was reduced to v = 2.5 terabytes/LOM.
In the authors’ words:
By careful study of the data, we discovered two properties that could significantly reduce this volume. First, instead of capturing each telemetry measurement, by only capturing when the measurement changed value – we could reduce the volume by almost 3-to-1. Second, we recognized that roughly 100 parameters changed most often (i.e., high frequency parameters) and caused the largest volume of the “change” records. By averaging these parameters over some time period, we could still achieve the necessary engineering accuracy while again reducing the volume of records. In total, we reduced the volume of data down to a reasonable 250 records/sec or approximately 2.5 terabytes/LOM.
Two obvious lessons for v = volume cases:
- Capture only changes in values
- Capture averages of rapidly changing values over a time window (if that meets your accuracy requirements; see the sketch below)
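To make those two lessons concrete, here is a minimal sketch of both techniques. The (timestamp, parameter, value) tuple shape and the function names are my own illustrative assumptions; the paper doesn’t give a schema:

```python
from statistics import mean

def changes_only(samples):
    """Lesson 1: keep a sample only when its value differs from the
    last value seen for that parameter."""
    last_seen = {}
    for timestamp, parameter, value in samples:
        if last_seen.get(parameter) != value:
            last_seen[parameter] = value
            yield (timestamp, parameter, value)

def windowed_averages(samples, window_seconds=60):
    """Lesson 2: for high-frequency parameters, store one averaged
    record per time window instead of every raw sample."""
    buckets = {}
    for timestamp, parameter, value in samples:
        window_start = int(timestamp // window_seconds) * window_seconds
        buckets.setdefault((window_start, parameter), []).append(value)
    for (window_start, parameter), values in sorted(buckets.items()):
        yield (window_start, parameter, mean(values))
```

Per the quote, the averaging was applied to the roughly 100 high-frequency parameters that generated most of the “change” records, with everything else captured change-only; that split is what brought the volume down to about 250 records/second.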
Less obvious lesson:
- Study data carefully to understand its properties relative to your requirements.
Studying your data, and recording what you learn about it, will benefit both you and subsequent researchers working with the same data.
Whether your v = volume is the same as mine or not.
Quotes are from: “A Queriable Repository for HST Telemetry Data, a Case Study in using Data Warehousing for Science and Engineering” by Joseph A. Pollizzi, III and Karen Lezon, Astronomical Data Analysis Software and Systems VII, ASP Conference Series, Vol. 145, 1998, Editors: R. Albrecht, R. N. Hook and H. A. Bushouse, pp. 367-370.
There are other insights and techniques of interest in this article but I leave them for another post.