Archive for the ‘Facebook’ Category

Presto is Coming!

Sunday, June 9th, 2013

Facebook unveils Presto engine for querying 250 PB data warehouse by Jordan Novet.

From the post:

At a conference for developers at Facebook headquarters on Thursday, engineers working for the social networking giant revealed that it’s using a new homemade query engine called Presto to do fast interactive analysis on its already enormous 250-petabyte-and-growing data warehouse.

More than 850 Facebook employees use Presto every day, scanning 320 TB each day, engineer Martin Traverso said.

“Historically, our data scientists and analysts have relied on Hive for data analysis,” Traverso said. “The problem with Hive is it’s designed for batch processing. We have other tools that are faster than Hive, but they’re either too limited in functionality or too simple to operate against our huge data warehouse. Over the past few months, we’ve been working on Presto to basically fill this gap.”

Facebook created Hive several years ago to give Hadoop some data warehouse and SQL-like capabilities, but it is showing its age in terms of speed because it relies on MapReduce. Scanning over an entire dataset could take many minutes to hours, which isn’t ideal if you’re trying to ask and answer questions in a hurry.

With Presto, however, simple queries can run in a few hundred milliseconds, while more complex ones will run in a few minutes, Traverso said. It runs in memory and never writes to disk, Traverso said.

Traverso goes on to say that Facebook will open source Presto this coming fall.

See my prior post for a more technical description of Presto: Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices.

Bear in mind that getting an answer from 250 PB of data quickly isn’t the same thing as getting a useful answer quickly.

Under the Hood: The entities graph

Sunday, June 9th, 2013

Under the Hood: The entities graph (Eric Sun is a tech lead on the entities team, and Venky Iyer is an engineering manager on the entities team.)

From the post:

Facebook’s social graph now comprises over 1 billion monthly active users, 600 million of whom log in every day. What unites each of these people is their social connections, and one way we map them is by traversing the graph of their friendships.


But this is only a small portion of the connections on Facebook. People don’t just have connections to other people—they may use Facebook to check in to restaurants and other points of interest, they might show their favorite books and movies on their timeline, and they may also list their high school, college, and workplace. These 100+ billion connections form the entity graph.

There are even connections between entities themselves: a book has an author, a song has an artist, and movies have actors. All of these are represented by different kinds of edges in the graph, and the entities engineering team at Facebook is charged with building, cleaning, and understanding this graph.

Instructive read on building an entity graph.

Differs from NSA data churning in several important ways:

  1. The participants want their data to be found with like data. Participants generally have no motive to lie or hide.
  2. The participants seek out similar users and data.
  3. The participants correct bad data for the benefit of others.

None of those characteristics can be attributed to the victims of NSA data collection efforts.

LinkBench [Graph Benchmark]

Tuesday, April 2nd, 2013

LinkBench

From the webpage:

LinkBench Overview

LinkBench is a database benchmark developed to evaluate database performance for workloads similar to those of Facebook’s production MySQL deployment. LinkBench is highly configurable and extensible. It can be reconfigured to simulate a variety of workloads and plugins can be written for benchmarking additional database systems.

LinkBench is released under the Apache License, Version 2.0.

Background

One way of modeling social network data is as a social graph, where entities or nodes such as people, posts, comments and pages are connected by links which model different relationships between the nodes. Different types of links can represent friendship between two users, a user liking another object, ownership of a post, or any relationship you like. These nodes and links carry metadata such as their type, timestamps and version numbers, along with arbitrary payload data.

Facebook represents much of its data in this way, with the data stored in MySQL databases. The goal of LinkBench is to emulate the social graph database workload and provide a realistic benchmark for database performance on social workloads. LinkBench’s data model is based on the social graph, and LinkBench has the ability to generate a large synthetic social graph with key properties similar to the real graph. The workload of database operations is based on Facebook’s production workload, and is also generated in such a way that key properties of the workload match the production workload.
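To make the data model in that description concrete, here is a minimal Python sketch of typed nodes and links carrying a version, a timestamp and a payload. It is not LinkBench's code or schema: the class names, field names, type constants and the `get_link_list` helper are all assumptions of mine for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Node:
    id: int
    type: int          # e.g. user, post, comment, page
    version: int
    time: int          # timestamp, seconds since the epoch
    data: bytes = b""  # arbitrary payload


@dataclass
class Link:
    id1: int           # source node
    link_type: int     # e.g. friendship, like, ownership
    id2: int           # target node
    version: int
    time: int
    data: bytes = b""


class SocialGraph:
    """In-memory stand-in for the store a benchmark would exercise."""

    def __init__(self) -> None:
        self.nodes: Dict[int, Node] = {}
        self.links: Dict[Tuple[int, int, int], Link] = {}

    def add_node(self, node: Node) -> None:
        self.nodes[node.id] = node

    def add_link(self, link: Link) -> None:
        # One link per (source, type, target); re-adding overwrites.
        self.links[(link.id1, link.link_type, link.id2)] = link

    def get_link_list(self, id1: int, link_type: int) -> List[Link]:
        # All links of one type leaving a node, newest first.
        found = [l for l in self.links.values()
                 if l.id1 == id1 and l.link_type == link_type]
        return sorted(found, key=lambda l: l.time, reverse=True)


USER, PAGE = 1, 2          # hypothetical node types
FRIEND, LIKES = 10, 11     # hypothetical link types

g = SocialGraph()
g.add_node(Node(id=1, type=USER, version=1, time=1000))
g.add_node(Node(id=2, type=USER, version=1, time=1001))
g.add_node(Node(id=3, type=PAGE, version=1, time=1002))
g.add_node(Node(id=4, type=PAGE, version=1, time=1003))
g.add_link(Link(1, FRIEND, 2, version=1, time=2000))
g.add_link(Link(1, LIKES, 3, version=1, time=2001))
g.add_link(Link(1, LIKES, 4, version=1, time=2002))

print([l.id2 for l in g.get_link_list(1, LIKES)])  # [4, 3]
```

Fetching "all links of type T leaving node N, newest first" is the kind of operation a social workload repeats constantly, which is why the sketch keys links by (source, type, target).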

A benchmark for testing your graph database performance!

Additional details at: LinkBench: A database benchmark for the social graph by Tim Armstrong.

I first saw this in a tweet by Stefano Bertolo.

Under the Hood: Building out the infrastructure for Graph Search

Monday, March 25th, 2013

Under the Hood: Building out the infrastructure for Graph Search by Sriram Sankar, Soren Lassen, and Mike Curtiss.

From the post:

In the early days, Facebook was as much about meeting new people as keeping in touch with people you already knew at your college. Over time, Facebook became more about maintaining connections. Graph Search takes us back to our roots and helps people make new connections–this time with people, places, and interests.

With this history comes several old search systems that we had to unify in order to build Graph Search. At first, the old search on Facebook (called PPS) was keyword based–the searcher entered keywords and the search engine produced a results page that was personalized and could be filtered to focus on specific kinds of entities such as people, pages, places, groups, etc.

Entertaining overview of the development of the graph solution for Facebook.

Moreover, reassurance if you are worried about “scaling” for your graph application. ;-)

I first saw this at: This Week’s Links by Trevor Landau.

Facebook Graph Search with Cypher and Neo4j

Thursday, January 31st, 2013

Facebook Graph Search with Cypher and Neo4j by Max De Marzi.

From the post:

Facebook Graph Search has given the Graph Database community a simpler way to explain what it is we do and why it matters. I wanted to drive the point home by building a proof of concept of how you could do this with Neo4j. However, I don’t have six months or much experience with NLP (natural language processing). What I do have is Cypher. Cypher is Neo4j’s graph language and it makes it easy to express what we are looking for in the graph. I needed a way to take “natural language” and create Cypher from it. This was going to be a problem.

If you think about “likes” as an association type with role players….

Of course, “like” paints with a broad brush but it is a place to start.
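A rough sketch of what I mean, in Python rather than Cypher or any topic map syntax. The `Association` class, the `liker` and `liked` role names and the sample data are all my own assumptions for illustration, not anything taken from Max's post.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Association:
    type: str
    roles: Tuple[Tuple[str, str], ...]  # (role, player) pairs


def make_association(assoc_type: str, **roles: str) -> Association:
    # Store roles sorted so equal associations compare equal.
    return Association(assoc_type, tuple(sorted(roles.items())))


def players(assocs: List[Association], assoc_type: str,
            wanted_role: str, **fixed: str) -> List[str]:
    """Players of `wanted_role` in associations of `assoc_type`
    whose other roles match the `fixed` role=player pairs."""
    result = []
    for a in assocs:
        roles = dict(a.roles)
        if a.type != assoc_type:
            continue
        if all(roles.get(r) == p for r, p in fixed.items()):
            if wanted_role in roles:
                result.append(roles[wanted_role])
    return result


likes = [
    make_association("likes", liker="alice", liked="Neo4j"),
    make_association("likes", liker="bob", liked="Neo4j"),
    make_association("likes", liker="alice", liked="Topic Maps"),
]

# "What does alice like?" and "Who likes Neo4j?" are the same
# association read from different roles.
print(players(likes, "likes", "liked", liker="alice"))
print(players(likes, "likes", "liker", liked="Neo4j"))
```

Naming the roles is what lets one association answer both questions without a separate "liked by" edge.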

Facebook for Topic Maps

Thursday, November 15th, 2012

Did you know there is a Facebook page for topic maps?

Sad thing is, it is missing a lot of people who are interested in topic maps!

When you are on Facebook, take a look at: Topic Maps.

Inge Eivind Henriksen started the group.

Don’t be shy about posting your thoughts, questions, suggestions, etc.

WolframAlpha Launches Personal Analytics for Facebook

Saturday, September 1st, 2012

WolframAlpha Launches Personal Analytics for Facebook by Kim Rees.

From the post:

WolframAlpha has launched its Personal Analytics for Facebook [wolframalpha.com] functionality. Simply type “facebook report” into the query box, authorize the app, and view the extensive analysis of your social network. The report shows you details about when you post, what types of things you post, the apps you use, who comments the most on your posts, your most popular images, and the structure of your friend network. You can easily share or embed sections of your report.

The report is incredibly detailed. You can drill down further into most sections. Any item of significance such as names and dates can be clicked to search for more information. It was interesting to find out that I was born under a waning crescent moon (is there anything Stephen Wolfram doesn’t know?!). I don’t use Facebook much, but this service makes Facebook fun again.

How would you contrast the ease of use factor of visual drill down with the ASCII art style of Cypher in Neo4j?

What user communities would prefer one over the other?

NeoSocial: Connecting to Facebook with Neo4j

Friday, August 17th, 2012

NeoSocial: Connecting to Facebook with Neo4j by Max De Marzi.

From the post:

(Really cool graphic omitted – see below)

Social applications and Graph Databases go together like peanut butter and jelly. I’m going to walk you through the steps of building an application that connects to Facebook, pulls your friends and likes data and visualizes it. I plan on making a video of me coding it one line at a time, but for now let’s just focus on the main elements.

The application will have two major components:

  1. A web service that handles authentication and displaying of friends, likes, and so-on.
  2. A background job service that imports data from Facebook.

We will be deploying this application on Heroku and making use of the RedisToGo and Neo4j Add-ons.

A very good weekend project for Facebook and Neo4j.

I have a different solution when you have too many friends to make a nice graphic (> 50):

Get a bigger monitor. ;-)

Facebook Search- The fall of the machines

Sunday, April 15th, 2012

Facebook Search- The fall of the machines by Ajay Ohri.

Ajay gives five numbered reasons and then one more for preferring Facebook searching.

I hardly ever visit Facebook (I do have an account) and certainly don’t search using it.

But we could trade stories, rumors, etc. all day.

How would we test Facebook versus other search engines?

Or for that matter, how would we test search engines in general?

When we say search A got a “better” result using search engine Z, by what measure do we mean “better?”
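One conventional answer, offered only as a starting point, is precision at k: of the first k results, what fraction did a human judge mark as relevant? A minimal Python sketch follows. The document ids, the relevance judgments and the two result lists are made up for illustration.

```python
from typing import List, Set


def precision_at_k(results: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that appear in the judged-relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in results[:k] if doc in relevant)
    # Divide by k, not by the number returned, so short result
    # lists are penalized for the slots they leave empty.
    return hits / k


# Hypothetical judgments for a single query.
relevant = {"d1", "d3", "d4"}

engine_a = ["d1", "d7", "d3", "d9"]   # made-up ranking from engine A
engine_z = ["d7", "d9", "d1", "d3"]   # made-up ranking from engine Z

print(precision_at_k(engine_a, relevant, 3))  # 2 of the top 3 are relevant
print(precision_at_k(engine_z, relevant, 3))  # 1 of the top 3 is relevant
```

Note what the measure assumes: someone has already decided which documents are relevant, for which query, for which user. That judgment is where "better" really gets defined.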

Everyone has a Graph Store

Monday, February 27th, 2012

Everyone has a Graph Store by Danny Ayers.

Try this thought experiment.

For practical purposes we often assume that everyone has a computer, a reasonable Internet connection and a modern Web browser. We know it’s an inaccurate assumption, but it provides conceptual targets for technology in terms of people and environment.

Ok, now add to that list a Graph Store: a flexible database to which information can easily be added, and which can be easily queried. The data can also be easily shared over the Cloud. The data is available for any applications that might want to use it. The database is schemaless, agnostic about what you put in it: the data could be about contacts, descriptions of people & their relationships (i.e. a Social Graph), it could be about places or events, products, technical information, whatever. It can contain private information, it can contain information that you’re happy to share. You control your own store and can let other people access as much or as little of its contents as you like (which they can do easily over the cloud). You can access other people’s store in the same way, according to their preferences. It’s both a Personal Knowledgebase and a Federated Public Knowledgebase.

So, make the assumption: everyone has a Graph Store. Now what do you want to do with yours? What can your friends and colleagues do with theirs? How can you use other peoples information to improve your quality of life, and vice versa? What new tools can be developed to help them take advantage of their stores? How can you get rich quick on this? What other questions are there..?

When I do this thought experiment, all I come up with is Facebook. So I am not very encouraged.

Perhaps Danny is expecting a natural clumping of useful comments and insights. Certainly is possible but then clumpings around Jim Jones and Jimmy Swaggart are also possible.

Or that a process of collective commenting and consideration will lead to useful results. American Idol isn’t strong evidence that mass participation produces good results. Or American election results.

Your thought experiment results may vary so feel free to report them.

Graphs are a great idea. Asking everyone to write down their thoughts in a graph store, not so great.

EMC Greenplum puts a social spin on big data

Thursday, December 15th, 2011

EMC Greenplum puts a social spin on big data

From the post:

Greenplum, the analytics division of EMC, has announced new software that lets data analysts explore all their organization’s data and share interesting findings and data sets Facebook-style among their colleagues. The product is called Chorus, and it wraps around EMC’s Greenplum Database and Hadoop distribution, making all that data available for the data team work with.

The pitch here is about unifying the analytic database and Hadoop environments and making it as easy and collaborative as possible to work with data, since EMC thinks a larger percentage of employees will have to figure out how to analyze business data. Plus, because EMC doesn’t have any legacy database or business intelligence products to protect, the entire focus of the Greenplum division is on providing the best big-data experience possible.

From the Chorus product page:

Greenplum Chorus enables Big Data agility for your data science team. The first solution of its kind, Greenplum Chorus provides an analytic productivity platform that enables the team to search, explore, visualize, and import data from anywhere in the organization. It provides rich social network features that revolve around datasets, insights, methods, and workflows, allowing data analysts, data scientists, IT staff, DBAs, executives, and other stakeholders to participate and collaborate on Big Data. Customers deploy Chorus to create a self-service agile analytic infrastructure; teams can create workspaces on the fly with self-service provisioning, and then instantly start creating and sharing insights.

Chorus breaks down the walls between all of the individuals involved in the data science team and empowers everyone who works with your data to more easily collaborate and derive insight from that data.

Note to EMC Greenplum: If you want people to at least consider products, don’t hide them so that searching is necessary to find them. Just an FYI.

The Resources page is pretty thin but better than the blah-blah “more information” page. Could have more details, perhaps a demo version?

A button that says “Contact Sales” makes me lose interest real quick. I don’t need some software sales person pinging me during an editing cycle to know if I have installed the “free” software yet and am I ready to order? Buying software really should be on my schedule, not his/hers. Yes?

Extracting data from the Facebook social graph with expressor, a Tutorial

Monday, December 12th, 2011

Extracting data from the Facebook social graph with expressor, a Tutorial by Michael Tarallo.

From the post:

In my last article, Enterprise Application Integration with Social Networking Data, I describe how social networking sites, such as Facebook and Twitter, provide APIs to communicate with the various components available in these applications. One in particular, is their “social graph” API which enables software developers to create programs that can interface with the many “objects” stored within these graphs.

In this article, I will briefly review the Facebook social graph and provide a simple tutorial with an expressor downloadable project. I will cover how expressor can extract data using the Facebook graph API and flatten it by using the provided reusable Datascript Module. I will also demonstrate how to add new user defined attributes to the expressor Dataflow so one can customize the output needed.
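For readers without expressor, the "flatten" step can be sketched in plain Python. This is not the tutorial's Datascript Module and it makes no call to Facebook. The `flatten` function, the dotted-key naming and the sample object are my assumptions about what a nested graph object might look like.

```python
from typing import Any, Dict


def flatten(obj: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Turn a nested dict into one flat record with dotted keys."""
    flat: Dict[str, Any] = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Nested object: recurse, carrying the path so far.
            flat.update(flatten(value, name))
        elif isinstance(value, list):
            # List: index each element into the key.
            for i, item in enumerate(value):
                if isinstance(item, dict):
                    flat.update(flatten(item, f"{name}.{i}"))
                else:
                    flat[f"{name}.{i}"] = item
        else:
            flat[name] = value
    return flat


# Hypothetical nested object, loosely shaped like a graph API response.
user = {
    "id": "1",
    "name": "Example User",
    "hometown": {"id": "2", "name": "Springfield"},
    "likes": {"data": [{"name": "Neo4j"}, {"name": "Topic Maps"}]},
}

record = flatten(user)
for key, value in record.items():
    print(key, "=", value)
```

Adding a user-defined attribute is then just another key on the flat record, for example `record["source"] = "facebook"`.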

Looks interesting.

Seems appropriate after starting today’s posts with work on the ODP files.

As you know, I am not a big fan of ETL but it has been a survivor. And if the folks who are signing off on the design want ETL, maybe it isn’t all that weird. ;-)