Full-Text Indexing for Optimizing Selection Operations in Large-Scale Data Analytics by Jimmy Lin, Dmitriy Ryaboy, and Kevin Weil.
Abstract:
MapReduce, especially the Hadoop open-source implementation, has recently emerged as a popular framework for large-scale data analytics. Given the explosion of unstructured data begotten by social media and other web-based applications, we take the position that any modern analytics platform must support operations on free-text fields as first-class citizens. Toward this end, this paper addresses one ineffcient aspect of Hadoop-based processing: the need to perform a full scan of the entire dataset, even in cases where it is clearly not necessary to do so. We show that it is possible to leverage a full-text index to optimize selection operations on text fields within records. The idea is simple and intuitive: the full-text index informs the Hadoop execution engine which compressed data blocks contain query terms of interest, and only those data blocks are decompressed and scanned. Experiments with a proof of concept show moderate improvements in end-to-end query running times and substantial savings in terms of cumulative processing time at the worker nodes. We present an analytical model and discuss a number of interesting challenges: some operational, others research in nature.
I always hope when I see first-class citizen(s) in CS papers that it is going to be talking about data structures and/or metadata (hopefully both).
Alas, I was disappointed once again but the paper is an interesting one and will repay close study.
Oh, the reason I mention treating data structures and metadata as first class citizens is then I can avoid the my way, your way or the highway sort of choices when it comes to metadata and formats.
Granted some formats maybe easier to use on some contexts, such as HDF5 (for space data), FITS (astronomical images), XML (for data and documents) or COBOL (for financial transactions), but if I can see formats as first class citizens, then I can map between them.
Not in a conversion sense, I can see them as though they are the same format as I prefer. Extract data from them, write data to them, etc.