Sketching and streaming algorithms for processing massive data, by Jelani Nelson.
From the introduction:
Several modern applications require handling data so massive that traditional algorithmic models do not provide accurate means to design and evaluate efficient algorithms. Such models typically assume that all data fits in memory, and that running time is accurately modeled as the number of basic instructions the algorithm performs. However, in applications such as online social networks, large-scale modern scientific experiments, search engines, online content delivery, and product and consumer tracking for large retailers such as Amazon and Walmart, data too large to fit in memory must be analyzed. This consideration has led to the development of several models for processing such large amounts of data: the external memory model [1, 2] and cache-obliviousness [3, 4], where one aims to minimize the number of blocks fetched from disk; property testing [5], where it is assumed the data is so massive that we do not wish to even look at it all and thus aim to minimize the number of probes made into the data; and massively parallel algorithms operating in systems such as MapReduce and Hadoop [6, 7]. Also, in some applications data arrives in a streaming fashion and must be processed on the fly. Such cases arise, for example, with packet streams in network traffic monitoring, or query streams arriving at a Web-based service such as a search engine.
In this article we focus on this latter streaming model of computation, where a given algorithm must make a single pass over a data set and compute some function of it. We pursue such streaming algorithms, which use memory that is sublinear in the amount of data, since we assume the data is too large to fit in memory. Sometimes it becomes useful to consider algorithms that are allowed not just one but a few passes over the data, in cases where the data set lives on disk and the number of passes may dominate the overall running time. We also occasionally discuss sketches. A sketch is defined with respect to some function f: a sketch of a data set x is a compressed representation of x from which one can compute f(x). Of course, under this definition f(x) is itself a valid sketch of x, but we often require more of our sketch than just being able to compute f(x). For example, we typically require that the sketch can be updated as more data arrives, and sometimes we also require that sketches of two different data sets, prepared independently, can be combined to compute some function of the aggregate data, or similarity or difference measures across the data sets.
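To make the idea of an updatable, mergeable sketch concrete, here is a minimal Count-Min sketch in Python. The quoted introduction does not single out this particular sketch, and the width, depth, and hashing scheme below are arbitrary choices for illustration; the point is only that such a structure can absorb one stream item at a time and that two sketches built independently can be combined into a sketch of the aggregate data.

```python
# Illustrative Count-Min sketch: supports streaming updates and merging.
# Parameters (width, depth) and the hashing scheme are example choices,
# not taken from the article.
import hashlib


class CountMinSketch:
    def __init__(self, width=1024, depth=5):
        self.width = width          # number of counters per row
        self.depth = depth          # number of independent hash rows
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # Deterministic per-row hash, so sketches built in separate
        # processes (with the same parameters) map items consistently.
        h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8)
        return int.from_bytes(h.digest(), "big") % self.width

    def update(self, item, count=1):
        # Streaming update: fold one new item into the sketch.
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Never underestimates; taking the minimum over rows limits
        # the over-counting caused by hash collisions.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))

    def merge(self, other):
        # Combine two sketches of different data sets (same width/depth)
        # into a sketch of the aggregate data.
        assert self.width == other.width and self.depth == other.depth
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]


# Example: two sketches built independently, then merged.
a, b = CountMinSketch(), CountMinSketch()
for word in ["cat", "dog", "cat"]:
    a.update(word)
for word in ["dog", "fish"]:
    b.update(word)
a.merge(b)
print(a.estimate("dog"))  # about 2, up to over-counting from collisions
```

The per-row hashing is deterministic so that sketches prepared separately agree on where each item's counts land; that agreement is what makes the merge meaningful.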
Our goal in this article is not to be comprehensive in our coverage of streaming algorithms. Rather, we discuss in some detail a few surprising results in order to convince the reader that it is possible to obtain some non-trivial algorithms within this model. Those interested in learning more about this area are encouraged to read the surveys [8, 9], or view the notes online for streaming courses taught by Chakrabarti at Dartmouth [10], Indyk at MIT [11], and McGregor at UMass Amherst [12].
Projections of data growth are outdated nearly as soon as they are uttered.
Suffice it to say that whatever data we are called upon to process today will be larger next year. How much larger depends on the domain, the questions to be answered, and a host of other factors. But it will be larger.
We need to develop methods of subject recognition when, like Heraclitus, we cannot ever step in the same stream twice.
If we miss it on the first pass, there isn’t going to be a second one. Next stop for some data streams is going to be /dev/null.
What approaches are you working on?