Facebook unveils Presto engine for querying 250 PB data warehouse, by Jordan Novet.
From the post:
At a conference for developers at Facebook headquarters on Thursday, engineers working for the social networking giant revealed that it’s using a new homemade query engine called Presto to do fast interactive analysis on its already enormous 250-petabyte-and-growing data warehouse.
More than 850 Facebook employees use Presto every day, scanning 320 TB each day, engineer Martin Traverso said.
“Historically, our data scientists and analysts have relied on Hive for data analysis,” Traverso said. “The problem with Hive is it’s designed for batch processing. We have other tools that are faster than Hive, but they’re either too limited in functionality or too simple to operate against our huge data warehouse. Over the past few months, we’ve been working on Presto to basically fill this gap.”
Facebook created Hive several years ago to give Hadoop some data warehouse and SQL-like capabilities, but it is showing its age in terms of speed because it relies on MapReduce. Scanning over an entire dataset could take many minutes to hours, which isn’t ideal if you’re trying to ask and answer questions in a hurry.
With Presto, however, simple queries can run in a few hundred milliseconds, while more complex ones will run in a few minutes, Traverso said. It runs in memory and never writes to disk, Traverso said.
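The post doesn't show what such an interactive query looks like in practice, so here is a minimal sketch of the kind of ad hoc aggregation Presto targets. Everything specific in it is an assumption rather than something from the article: a Presto coordinator at presto.example.com:8080, the open-source PyHive client, and a hypothetical hive.web.page_views table partitioned by ds.

    # Minimal sketch of an interactive Presto query (assumptions noted above).
    from pyhive import presto

    # Connect to the Presto coordinator. Presto reads existing Hive tables,
    # so nothing has to be copied out of the warehouse first.
    conn = presto.connect(
        host="presto.example.com",  # hypothetical coordinator host
        port=8080,
        catalog="hive",
        schema="web",
    )
    cur = conn.cursor()

    # An aggregation over one day's partition -- the sort of question an
    # analyst would previously have waited on a Hive MapReduce job to answer.
    cur.execute("""
        SELECT country, count(*) AS views
        FROM page_views
        WHERE ds = '2013-06-06'
        GROUP BY country
        ORDER BY views DESC
        LIMIT 10
    """)

    for country, views in cur.fetchall():
        print(country, views)

The query itself is ordinary SQL; the difference Traverso describes is in the execution path, which stays in memory instead of going through MapReduce stages.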
Traverso goes on to say that Facebook will open-source Presto this coming fall.
See my prior post for a more technical description of Presto: Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices.
Bear in mind that getting an answer from 250 PB of data quickly isn’t the same thing as getting a useful answer quickly.