Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

May 9, 2015

Exposure to Diverse Information on Facebook [Skepticism]

Filed under: Facebook,News,Opinions,Social Media,Social Networks,Social Sciences — Patrick Durusau @ 3:06 pm

Exposure to Diverse Information on Facebook by Eytan Bakshy, Solomon Messing, and Lada Adamic.

From the post:

As people increasingly turn to social networks for news and civic information, questions have been raised about whether this practice leads to the creation of “echo chambers,” in which people are exposed only to information from like-minded individuals [2]. Other speculation has focused on whether algorithms used to rank search results and social media posts could create “filter bubbles,” in which only ideologically appealing content is surfaced [3].

Research we have conducted to date, however, runs counter to this picture. A previous 2012 research paper concluded that much of the information we are exposed to and share comes from weak ties: those friends we interact with less often and are more likely to be dissimilar to us than our close friends [4]. Separate research suggests that individuals are more likely to engage with content contrary to their own views when it is presented along with social information [5].

Our latest research, released today in Science, quantifies, for the first time, exactly how much individuals could be and are exposed to ideologically diverse news and information in social media [1].

We found that people have friends who claim an opposing political ideology, and that the content in peoples’ News Feeds reflect those diverse views. While News Feed surfaces content that is slightly more aligned with an individual’s own ideology (based on that person’s actions on Facebook), who they friend and what content they click on are more consequential than the News Feed ranking in terms of how much diverse content they encounter.

The Science paper: Exposure to Ideologically Diverse News and Opinion

The definition of an “echo chamber” is implied in the authors’ conclusion:


By showing that people are exposed to a substantial amount of content from friends with opposing viewpoints, our findings contrast concerns that people might “list and speak only to the like-minded” while online [2].

The racism of the Deep South existed in spite of interaction between whites and blacks. So “echo chamber” should not be defined as association of like with like, at least not entirely. The Deep South was an echo chamber of racism, but not for lack of diversity in its social networks.

Besides lacking a useful definition of “echo chamber,” the authors ignore the role of confirmation bias (aka the “backfire effect”) when readers are confronted with contrary thoughts or evidence. For some readers, seeing a New York Times editorial disagree with their position can make them feel better about being on the “right side.”

That people are exposed to diverse information on Facebook is interesting, but until there is a meaningful definition of “echo chambers,” the role Facebook plays in the maintenance of “echo chambers” remains unknown.

March 16, 2015

Bias? What Bias?

Filed under: Bias,Facebook,Social Media,Social Sciences,Twitter — Patrick Durusau @ 6:09 pm

Scientists Warn About Bias In The Facebook And Twitter Data Used In Millions Of Studies by Brid-Aine Parnell.

From the post:

Social media like Facebook and Twitter are far too biased to be used blindly by social science researchers, two computer scientists have warned.

Writing in today’s issue of Science, Carnegie Mellon’s Juergen Pfeffer and McGill’s Derek Ruths have warned that scientists are treating the wealth of data gathered by social networks as a goldmine of what people are thinking – but frequently they aren’t correcting for inherent biases in the dataset.

If folks didn’t already know that scientists were turning to social media for easy access to the pat statistics on thousands of people, they found out about it when Facebook allowed researchers to adjust users’ news feeds to manipulate their emotions.

Both Facebook and Twitter are such rich sources for heart pounding headlines that I’m shocked, shocked that anyone would suggest there is bias in the data! 😉

Not surprisingly, people participate in social media for reasons entirely of their own and quite unrelated to the interests or needs of researchers. Particular types of social media attract different demographics than other types. I’m not sure how you could “correct” for those biases, unless you wanted to collect better data for yourself.

Not that there are any bias-free data sets, but some biases are so obvious that they hardly warrant mentioning. Except that institutions like the Brookings Institution bump and grind on Twitter data until they can prove the significance of terrorist social media. Brookings knows better, but terrorism is a popular topic.

Not to make data carry all the blame: the test most often applied to data is:

Will this data produce a result that merits more funding and/or will please my supervisor?

I first saw this in a tweet by Persontyle.

March 6, 2015

Airbnb open sources SQL tool built on Facebook’s Presto database

Filed under: Facebook,Presto,SQL — Patrick Durusau @ 4:25 pm

Airbnb open sources SQL tool built on Facebook’s Presto database by Derrick Harris.

From the post:

Apartment-sharing startup Airbnb has open sourced a tool called Airpal that the company built to give more of its employees access to the data they need for their jobs. Airpal is built atop the Presto SQL engine that Facebook created in order to speed access to data stored in Hadoop.

Airbnb built Airpal about a year ago so that employees across divisions and roles could get fast access to data rather than having to wait for a data analyst or data scientist to run a query for them. According to product manager James Mayfield, it’s designed to make it easier for novices to write SQL queries by giving them access to a visual interface, previews of the data they’re accessing, and the ability to share and reuse queries.

It sounds a little like the types of tools we often hear about inside data-driven companies like Facebook, as well as the new SQL platform from a startup called Mode.

At this point, Mayfield said, “Over a third of all the people working at Airbnb have issued a query through Airpal.” He added, “The learning curve for SQL doesn’t have to be that high.”

From the GitHub page:

Airpal is a web-based, query execution tool which leverages Facebook’s PrestoDB to make authoring queries and retrieving results simple for users. Airpal provides the ability to find tables, see metadata, browse sample rows, write and edit queries, then submit queries all in a web interface. Once queries are running, users can track query progress and when finished, get the results back through the browser as a CSV (download it or share it with friends). The results of a query can be used to generate a new Hive table for subsequent analysis, and Airpal maintains a searchable history of all queries run within the tool.

Features

  • Optional Access Control
  • Syntax highlighting
  • Results exported to a CSV for download or a Hive table
  • Query history for self and others
  • Saved queries
  • Table finder to search for appropriate tables
  • Table explorer to visualize schema of table and first 1000 rows

Requirements

  • Java 7 or higher
  • MySQL database
  • Presto 0.77 or higher
  • S3 bucket (to store CSVs)
  • Gradle 2.2 or higher

I understand to some degree the need to make SQL “simpler” but fail to see how simpler controls translate into a solution. The controls may be obvious enough but if I don’t know the semantics of the column headers, the simplicity of the interface won’t be terribly helpful.

Or to put it another way, users are assumed to already know the semantics of the tables they encounter. True/False?
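To make that concrete, here is a minimal sketch of the kind of query an Airpal user might author. The table and column names are hypothetical, invented for illustration:

```sql
-- Hypothetical table and column names, for illustration only.
-- The query is easy to write in a visual interface, but nothing in
-- the interface says whether status = 3 means booked, cancelled,
-- or under review. That knowledge lives in someone's head.
SELECT city, COUNT(*) AS reservations
FROM reservations_daily
WHERE status = 3
GROUP BY city
ORDER BY reservations DESC
LIMIT 10;
```

A simpler interface lowers the cost of writing the query; it does nothing to lower the cost of interpreting the answer.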

January 17, 2015

Facebook open sources tools for bigger, faster deep learning models

Filed under: Artificial Intelligence,Deep Learning,Facebook,Machine Learning — Patrick Durusau @ 6:55 pm

Facebook open sources tools for bigger, faster deep learning models by Derrick Harris.

From the post:

Facebook on Friday open sourced a handful of software libraries that it claims will help users build bigger, faster deep learning models than existing tools allow.

The libraries, which Facebook is calling modules, are alternatives for the default ones in a popular machine learning development environment called Torch, and are optimized to run on Nvidia graphics processing units. Among the modules are those designed to rapidly speed up training for large computer vision systems (nearly 24 times, in some cases), to train systems on potentially millions of different classes (e.g., predicting whether a word will appear across a large number of documents, or whether a picture was taken in any city anywhere), and an optimized method for building language models and word embeddings (e.g., knowing how different words are related to each other).

“[T]here is no way you can use anything existing” to achieve some of these results, said Soumith Chintala, an engineer with Facebook Artificial Intelligence Research.

How very awesome! Keeping abreast of the latest releases and papers on deep learning is turning out to be a real chore. Enjoyable, but a time sink nonetheless.

Derrick’s post and the release from Facebook have more details.

Apologies for the “lite” posting today, but I have been proofing related specifications where one defines a term and the other uses the term, but doesn’t cite the other specification’s definition or give its own. Do those mean the same thing? Probably, but users outside the process may or may not realize that. Particularly in translation.

I first saw this in a tweet by Kirk Borne.

December 14, 2014

Everything You Need To Know About Social Media Search

Filed under: Facebook,Instagram,Social Media,Twitter — Patrick Durusau @ 7:07 pm

Everything You Need To Know About Social Media Search by Olsy Sorokina.

From the post:

For the past decade, social networks have been the most universally consistent way for us to document our lives. We travel, build relationships, accomplish new goals, discuss current events and welcome new lives—and all of these events can be traced on social media. We have created hashtags like #ThrowbackThursday and apps like Timehop to reminisce on all the past moments forever etched in the social web in form of status updates, photos, and 140-character phrases.

Major networks demonstrate their awareness of the role they play in their users’ lives by creating year-end summaries such as Facebook’s Year in Review, and Twitter’s #YearOnTwitter. However, much of the emphasis on social media has been traditionally placed on real-time interactions, which often made it difficult to browse for past posts without scrolling down for hours on end.

The bias towards real-time messaging has changed in a matter of a few days. Over the past month, three major social networks announced changes to their search functions, which made finding old posts as easy as a Google search. If you missed out on the news or need a refresher, here’s everything you need to know.

I suppose Olsy means in addition to search in general sucking.

Interesting tidbit on Facebook:


This isn’t Facebook’s first attempt at building a search engine. The earlier version of Graph Search gave users search results in response to longer-form queries, such as “my friends who like Game of Thrones.” However, the semantic search never made it to the mobile platforms; many supposed that using complex phrases as search queries was too confusing for an average user.

Does anyone have any user research on the ability of users to use complex phrases as search queries?

I ask because if users have difficulty authoring “complex” semantics and difficulty querying with “complex” semantics, it stands to reason they may have difficulty interpreting “complex” semantic results. Yes?

If all three of those are the case, then how do we impart the value-add of “complex” semantics without tripping over one of those limitations?

Olsy also covers Instagram and Twitter. Twitter’s advanced search looks like the standard include/exclude, etc. type of “advanced” search. “Advanced” maybe forty years ago in the early OPACs, but not really “advanced” now.

Catch up on these new search features. They will provide at least a minimum of grist for your topic map mill.

October 29, 2014

Introducing osquery

Filed under: Facebook,osquery — Patrick Durusau @ 4:03 pm

Introducing osquery by Mike Arpaia.

From the post:

Maintaining real-time insight into the current state of your infrastructure is important. At Facebook, we’ve been working on a framework called osquery which attempts to approach the concept of low-level operating system monitoring a little differently.

Osquery exposes an operating system as a high-performance relational database. This design allows you to write SQL-based queries efficiently and easily to explore operating systems. With osquery, SQL tables represent the current state of operating system attributes, such as:

  • running processes
  • loaded kernel modules
  • open network connections

SQL tables are implemented via an easily extendable API. Several tables already exist and more are being written. To best understand the expressiveness that is afforded to you by osquery, consider the following examples….

I haven’t installed osquery yet, but I suspect that most of the data it collects is available now through a variety of admin tools. But not through a single tool that enables you to query across tables to combine that data. That is the part that intrigues me.
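For a taste of that cross-table querying, here is a sketch of a query joining two of the tables the osquery documentation describes (processes and listening_ports). Treat the column names as my best recollection of the docs, not gospel:

```sql
-- Which running processes are listening on which network ports?
-- Joins osquery's processes and listening_ports tables on pid.
SELECT p.name, p.pid, lp.address, lp.port
FROM processes AS p
JOIN listening_ports AS lp ON p.pid = lp.pid
ORDER BY lp.port;
```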

Code and documentation on Github.

May 12, 2014

Facebook teaches you exploratory data analysis with R

Filed under: Data Analysis,Exploratory Data Analysis,Facebook,R — Patrick Durusau @ 6:44 pm

Facebook teaches you exploratory data analysis with R by David Smith.

From the post:

Facebook is a company that deals with a lot of data — more than 500 terabytes a day — and R is widely used at Facebook to visualize and analyze that data. Applications of R at Facebook include user behaviour, content trends, human resources and even graphics for the IPO prospectus. Now, four R users at Facebook (Moira Burke, Chris Saden, Dean Eckles and Solomon Messing) share their experiences using R at Facebook in a new Udacity on-line course, Exploratory Data Analysis.

The more data you explore, the better data explorer you will be!

Enjoy!

I first saw this in a post by David Smith.

May 4, 2014

“Credibility” As “Google Killer”?

Filed under: Facebook,Relevance,Search Algorithms,Search Engines,Searching — Patrick Durusau @ 6:26 pm

Nancy Baym tweets: “Nice article on flaws of ‘it’s not our fault, it’s the algorithm’ logic from Facebook with quotes from @TarletonG” pointing to: Facebook draws fire on ‘related articles’ push.

From the post:

A surprise awaited Facebook users who recently clicked on a link to read a story about Michelle Obama’s encounter with a 10-year-old girl whose father was jobless.

Facebook responded to the click by offering what it called “related articles.” These included one that alleged a Secret Service officer had found the president and his wife having “S*X in Oval Office,” and another that said “Barack has lost all control of Michelle” and was considering divorce.

A Facebook spokeswoman did not try to defend the content, much of which was clearly false, but instead said there was a simple explanation for why such stories are pushed on readers. In a word: algorithms.

The stories, in other words, apparently are selected by Facebook based on mathematical calculations that rely on word association and the popularity of an article. No effort is made to vet or verify the content.

Facebook’s explanation, however, is drawing sharp criticism from experts who said the company should immediately suspend its practice of pushing so-called related articles to unsuspecting users unless it can come up with a system to ensure that they are credible. (emphasis added)

Just imagine the hue and outcry had that last line read:

Imaginary Quote Google’s explanation of search results, however, is drawing sharp criticism from experts who said the company should immediately suspend its practice of pushing so-called related articles to unsuspecting users unless it can come up with a system to ensure that they are credible. End Imaginary Quote

Is demanding “credibility” of search results the long sought after “Google Killer?”

“Credibility” is closely related to the “search” problem but I think it should be treated separately from search.

In part because answering a “credibility” question can require multiple searches: searches on the author of the search result content, searches for reviews and comments on that content, searches of other sources of data on it, and then a collation of that additional content into a credibility judgment. The procedure isn’t always that elaborate, but the main point is that even beginning to answer a credibility question requires additional searching and evaluation of content.

Not to mention that why the information is being sought has a bearing on credibility. If I want to find examples of nutty things said about President Obama to cite, then finding the cases mentioned above is not only relevant (the search question) but also “credible” in the sense that Facebook did not make them up. They are published nutty statements about the current President.

What if a user wanted to search for “coffee and bagels?” The top hit on one popular search engine today is: Coffee Meets Bagel: Free Online Dating Sites, along with numerous other links to information on the first link. Was this relevant to my search? No, but search results aren’t always predictable. They are relevant to someone’s search using “coffee and bagels.”

It is the responsibility of every reader to decide for themselves what is relevant, credible, useful, etc. in terms of content, whether it is hard copy or digital.

Any other solution takes us to Plato’s Republic, which was great to read about, but I would not want to live there.

April 12, 2014

Facebook Gets Smarter with Graph Engine Optimization

Filed under: Facebook,Giraph,Graphs — Patrick Durusau @ 7:07 pm

Facebook Gets Smarter with Graph Engine Optimization by Alex Woodie.

From the post:

Last fall, the folks in Facebook’s engineering team talked about how they employed the Apache Giraph engine to build a graph on its Hadoop platform that can host more than a trillion edges. While the Graph Search engine is capable of massive graphing tasks, there were some workloads that remained outside the company’s technical capabilities–until now.

Facebook turned to the Giraph engine to power its new Graph Search offering, which it unveiled in January 2013 as a way to let users perform searches on other users to determine, for example, what kind of music their Facebook friends like, what kinds of food they’re into, or what activities they’ve done recently. An API for Graph Search also provides advertisers with a new revenue source for Facebook. It’s likely the world’s largest graph implementation, and a showcase of what graph engines can do.

The company picked Giraph because it worked on their existing Hadoop implementation, including HDFS and its MapReduce infrastructure stack (known as Corona). Compared to running the computation workload on Hive, an internal Facebook test of a 400-billion edge graph ran 126x faster on Giraph, and had a 26x performance advantage, as we explained in a Datanami story last year.

When Facebook scaled its internal test graph up to 1 trillion edges, they were able to keep the processing of each iteration of the graph under four minutes on a 200-server cluster. That amazing feat was done without any optimization, the company claimed. “We didn’t cheat,” Facebook developer Avery Ching declared in a video. “This is a random hashing algorithm, so we’re randomly assigning the vertices to different machines in the system. Obviously, if we do some separation and locality optimization, we can get this number down quite a bit.”

High level view with technical references on how Facebook is optimizing its Apache Giraph engine.

If you are interested in graphs, this is much more of a real world scenario than building “big” graphs out of uniform time slices.

March 27, 2014

WebScaleSQL

Filed under: Facebook,MySQL — Patrick Durusau @ 3:58 pm

WebScaleSQL

From the webpage:

What is WebScaleSQL?

WebScaleSQL is a collaboration among engineers from several companies that face similar challenges in running MySQL at scale, and seek greater performance from a database technology tailored for their needs.

Our goal in launching WebScaleSQL is to enable the scale-oriented members of the MySQL community to work more closely together in order to prioritize the aspects that are most important to us. We aim to create a more integrated system of knowledge-sharing to help companies leverage the great features already found in MySQL 5.6, while building and adding more features that are specific to deployments in large scale environments. In the last few months, engineers from all four companies have contributed code and provided feedback to each other to develop a new, more unified, and more collaborative branch of MySQL.

But as effective as this collaboration has been so far, we know we’re not the only ones who are trying to solve these particular challenges. So we will keep WebScaleSQL open as we go, to encourage others who have the scale and resources to customize MySQL to join in our efforts. And of course we will welcome input from anyone who wants to contribute, regardless of what they are currently working on.

Who is behind WebScaleSQL?

WebScaleSQL currently includes contributions from MySQL engineering teams at Facebook, Google, LinkedIn, and Twitter. Together, we are working to share a common base of code changes to the upstream MySQL branch that we can all use and that will be made available via open source. This collaboration will expand on existing work by the MySQL community, and we will continue to track the upstream branch that is the latest, production-ready release (currently MySQL 5.6).

Correct me if I’m wrong but don’t teams from Facebook, Google, LinkedIn and Twitter know a graph when they see one? 😉

Even people who recognize graphs may need an SQL solution every now and again. Besides, solutions should not drive IT policy.

Requirements and meeting those requirements should drive IT policy. You are less likely to own very popular, expensive and ineffectual solutions when requirements rule. (Even iterative requirements in the agile approach are requirements.)

A reminder that MySQL/WebScaleSQL compiles from source with:

A working ANSI C++ compiler. GCC 4.2.1 or later, Sun Studio 10 or later, Visual Studio 2008 or later, and many current vendor-supplied compilers are known to work. (INSTALL-SOURCE)

Which makes it a target, sorry, subject for analysis of any vulnerabilities with joern.

I first saw this in a post by Derrick Harris, Facebook — with help from Google, LinkedIn, Twitter — releases MySQL built to scale.

November 24, 2013

Under the Hood: [of RocksDB]

Filed under: Facebook,Key-Value Stores,leveldb,RocksDB — Patrick Durusau @ 2:33 pm

Under the Hood: Building and open-sourcing RocksDB by Dhruba Borthakur.

From the post:

Every time one of the 1.2 billion people who use Facebook visits the site, they see a completely unique, dynamically generated home page. There are several different applications powering this experience–and others across the site–that require global, real-time data fetching.

Storing and accessing hundreds of petabytes of data is a huge challenge, and we’re constantly improving and overhauling our tools to make this as fast and efficient as possible. Today, we are open-sourcing RocksDB, an embeddable, persistent key-value store for fast storage that we built and use here at Facebook.

Why build an embedded database?

Applications traditionally access their data via remote procedure calls over a network connection, but that can be slow–especially when we need to power user-facing products in real time. With the advent of flash storage, we are starting to see newer applications that can access data quickly by managing their own dataset on flash instead of accessing data over a network. These new applications are using what we call an embedded database.

There are several reasons for choosing an embedded database. When database requests are frequently served from memory or from very fast flash storage, network latency can slow the query response time. Accessing the network within a data center can take about 50 microseconds, as can fast-flash access latency. This means that accessing data over a network could potentially be twice as slow as an application accessing data locally.

Secondly, we are starting to see servers with an increasing number of cores and with storage-IOPS reaching millions of requests per second. Lock contention and a high number of context switches in traditional database software prevents it from being able to saturate the storage-IOPS. We’re finding we need new database software that is flexible enough to be customized for many of these emerging hardware trends.

Like most of you, I don’t have 1.2 billion people visiting my site. 😉

However, understanding today’s “high-end” solutions will prepare you for tomorrow’s “middle-tier” solution and day after tomorrow’s desktop solution.

A high level overview of RocksDB.

Other resources to consider:

RocksDB Facebook page.

RocksDB on Github.


Update: Igor Canadi has posted to the Facebook page a proposal to add the concept of ColumnFamilies to RocksDB. https://github.com/facebook/rocksdb/wiki/Column-Families-proposal Comments? (Direct comments on that proposal to the Facebook page for RocksDB.)

November 10, 2013

Are You A Facebook Slacker? (Or, “Don’t “Like” Me, Support Me!”)

Filed under: Facebook,Marketing,Psychology,Social Media — Patrick Durusau @ 8:09 pm

Their title reads: The Nature of Slacktivism: How the Social Observability of an Initial Act of Token Support Affects Subsequent Prosocial Action by Kirk Kristofferson, Katherine White, and John Peloza. (Journal of Consumer Research, 2013. DOI: 10.1086/674137)

Abstract:

Prior research offers competing predictions regarding whether an initial token display of support for a cause (such as wearing a ribbon, signing a petition, or joining a Facebook group) subsequently leads to increased and otherwise more meaningful contributions to the cause. The present research proposes a conceptual framework elucidating two primary motivations that underlie subsequent helping behavior: a desire to present a positive image to others and a desire to be consistent with one’s own values. Importantly, the socially observable nature (public vs. private) of initial token support is identified as a key moderator that influences when and why token support does or does not lead to meaningful support for the cause. Consumers exhibit greater helping on a subsequent, more meaningful task after providing an initial private (vs. public) display of token support for a cause. Finally, the authors demonstrate how value alignment and connection to the cause moderate the observed effects.

From the introduction:

We define slacktivism as a willingness to perform a relatively costless, token display of support for a social cause, with an accompanying lack of willingness to devote significant effort to enact meaningful change (Davis 2011; Morozov 2009a).

From the section: The Moderating Role of Social Observability: The Public versus Private Nature of Support:

…we anticipate that consumers who make an initial act of token support in public will be no more likely to provide meaningful support than those who engaged in no initial act of support.

Four (4) detailed studies and an extensive review of the literature are offered to support the authors’ conclusions.

The only source that I noticed missing was:

10 Two men went up into the temple to pray; the one a Pharisee, and the other a publican.

11 The Pharisee stood and prayed thus with himself, God, I thank thee, that I am not as other men are, extortioners, unjust, adulterers, or even as this publican.

12 I fast twice in the week, I give tithes of all that I possess.

13 And the publican, standing afar off, would not lift up so much as his eyes unto heaven, but smote upon his breast, saying, God be merciful to me a sinner.

14 I tell you, this man went down to his house justified rather than the other: for every one that exalteth himself shall be abased; and he that humbleth himself shall be exalted.

King James Version, Luke 18: 10-14.

The authors would reverse the roles of the Pharisee and the publican, finding that the Pharisee contributes “meaningful support” and the publican does not.

We contrast token support with meaningful support, which we define as consumer contributions that require a significant cost, effort, or behavior change in ways that make tangible contributions to the cause. Examples of meaningful support include donating money and volunteering time and skills.

If you are trying to attract “meaningful support” for your cause or organization, i.e., avoid slackers, there is much to learn here.

If you are trying to move beyond the “cheap grace” (Bonhoeffer)* of “meaningful support” and towards “meaningful change,” there is much to be learned here as well.

Governments, corporations, ad agencies and even your competitors are manipulating the public understanding of “meaningful support” and “meaningful change.” And acceptable means for both.

You can play on their terms and lose, or you can define your own terms and roll the dice.

Questions?


* I know the phrase “cheap grace” from Bonhoeffer but in running a reference to ground, I saw a statement in Wikipedia that Bonhoeffer learned that phrase from Adam Clayton Powell, Sr. Homiletics has never been a strong interest of mine, but I will try to run down some sources on sermons by Adam Clayton Powell, Sr.

November 8, 2013

Facebook’s Presto 10X Hive Speed (mostly)

Filed under: Facebook,Hive,Presto — Patrick Durusau @ 5:59 pm

Facebook open sources its SQL-on-Hadoop engine, and the web rejoices by Derrick Harris.

From the post:

Facebook has open sourced Presto, the interactive SQL-on-Hadoop engine the company first discussed in June. Presto is Facebook’s take on Cloudera’s Impala or Google’s Dremel, and it already has some big-name fans in Dropbox and Airbnb.

Technologically, Presto and other query engines of its ilk can be viewed as faster versions of Hive, the data warehouse framework for Hadoop that Facebook created several years ago. Facebook and many other Hadoop users still rely heavily on Hive for batch-processing jobs such as regular reporting, but there has been a demand for something letting users perform ad hoc, exploratory queries on Hadoop data similar to how they might do them using a massively parallel relational database.

Presto is 10 times faster than Hive for most queries, according to Facebook software engineer Martin Traverso in a blog post detailing today’s news.

I think my headline is the more effective one. 😉

You won’t know anything until you download Presto, read the documentation, etc.

Presto homepage.

The first job is to get your attention, then you have to get the information necessary to be informed.

From Derrick’s post, which points to other SQL-on-Hadoop options, interesting times are ahead!

September 13, 2013

Scaling Apache Giraph to a trillion edges

Filed under: Facebook,Giraph,GraphLab,Hive — Patrick Durusau @ 6:03 pm

Scaling Apache Giraph to a trillion edges by Avery Ching.

From the post:

Graph structures are ubiquitous: they provide a basic model of entities with connections between them that can represent almost anything. Flight routes connect airports, computers communicate to one another via the Internet, webpages have hypertext links to navigate to other webpages, and so on. Facebook manages a social graph that is composed of people, their friendships, subscriptions, and other connections. Open graph allows application developers to connect objects in their applications with real-world actions (such as user X is listening to song Y).

Analyzing these real world graphs at the scale of hundreds of billions or even a trillion (10^12) edges with available software was impossible last year. We needed a programming framework to express a wide range of graph algorithms in a simple way and scale them to massive datasets. After the improvements described in this article, Apache Giraph provided the solution to our requirements.

In the summer of 2012, we began exploring a diverse set of graph algorithms across many different Facebook products as well as academic literature. We selected a few representative use cases that cut across the problem space with different system bottlenecks and programming complexity. Our diverse use cases and the desired features of the programming framework drove the requirements for our system infrastructure. We required an iterative computing model, graph-based API, and fast access to Facebook data. Based on these requirements, we selected a few promising graph-processing platforms including Apache Hive, GraphLab, and Apache Giraph for evaluation.

For your convenience:

Apache Giraph

Apache Hive

GraphLab

Your appropriate scale is probably less than a trillion edges but everybody likes a great scaling story.

This is a great scaling story.

June 23, 2013

Fun with Facebook in Neo4j [Separation from Edward Snowden?]

Filed under: Facebook,Graphs,Neo4j — Patrick Durusau @ 1:13 pm

Fun with Facebook in Neo4j by Rik Van Bruggen.

From the post:

Ever since Facebook promoted its “graph search” methodology, lots of people in our industry have been waking up to the fact that graphs are über-cool. Thanks to the powerful query possibilities, people like Facebook, Twitter, LinkedIn, and let us not forget, Google have been providing us with some of the most amazing technologies. Specifically, the power of the “social network” is tempting many people to get their feet wet, and to start using graph technology. And they should: graphs are fantastic at storing, querying and exploiting social structures, stored in a graph database.

So how would that really work? I am a curious, “want to know” but “not very technical” kind of guy, and I decided to get my hands dirty (again), and try some of this out by storing my own little part of Facebook – in neo4j. Without programming any kind of production-ready system – because I don’t know how – but with enough real world data to make us see what it would be like.

Rik walks you through obtaining data from Facebook, munging it in a spreadsheet and loading it into Neo4j.

Can’t wait for Facebook graph to support degrees of separation from named individuals, like Edward Snowden.

Complete with the intervening people of course.

What’s privacy compared to a media-driven witch hunt for anyone “connected” to the latest “face” on the TV?

If Facebook does that for Snowden, they should do it for NSA chief, Keith Alexander as well.

June 9, 2013

Presto is Coming!

Filed under: Facebook,Hive,Presto — Patrick Durusau @ 5:34 pm

Facebook unveils Presto engine for querying 250 PB data warehouse by Jordan Novet.

From the post:

At a conference for developers at Facebook headquarters on Thursday, engineers working for the social networking giant revealed that it’s using a new homemade query engine called Presto to do fast interactive analysis on its already enormous 250-petabyte-and-growing data warehouse.

More than 850 Facebook employees use Presto every day, scanning 320 TB each day, engineer Martin Traverso said.

“Historically, our data scientists and analysts have relied on Hive for data analysis,” Traverso said. “The problem with Hive is it’s designed for batch processing. We have other tools that are faster than Hive, but they’re either too limited in functionality or too simple to operate against our huge data warehouse. Over the past few months, we’ve been working on Presto to basically fill this gap.”

Facebook created Hive several years ago to give Hadoop some data warehouse and SQL-like capabilities, but it is showing its age in terms of speed because it relies on MapReduce. Scanning over an entire dataset could take many minutes to hours, which isn’t ideal if you’re trying to ask and answer questions in a hurry.

With Presto, however, simple queries can run in a few hundred milliseconds, while more complex ones will run in a few minutes, Traverso said. It runs in memory and never writes to disk, Traverso said.

Traverso goes on to say that Facebook will open source Presto this coming fall.

See my prior post for a more technical description of a different system that shares the name: Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices.

Bear in mind that getting an answer from 250 PB of data quickly isn’t the same thing as getting a useful answer quickly.

Under the Hood: The entities graph

Filed under: Entities,Facebook,Graphs — Patrick Durusau @ 1:23 pm

Under the Hood: The entities graph (Eric Sun is a tech lead on the entities team, and Venky Iyer is an engineering manager on the entities team.)

From the post:

Facebook’s social graph now comprises over 1 billion monthly active users, 600 million of whom log in every day. What unites each of these people is their social connections, and one way we map them is by traversing the graph of their friendships.

[Image: entity graph]

But this is only a small portion of the connections on Facebook. People don’t just have connections to other people—they may use Facebook to check in to restaurants and other points of interest, they might show their favorite books and movies on their timeline, and they may also list their high school, college, and workplace. These 100+ billion connections form the entity graph.

There are even connections between entities themselves: a book has an author, a song has an artist, and movies have actors. All of these are represented by different kinds of edges in the graph, and the entities engineering team at Facebook is charged with building, cleaning, and understanding this graph.

Instructive read on building an entity graph.

Differs from NSA data churning in several important ways:

  1. The participants want their data to be found with like data. Participants generally have no motive to lie or hide.
  2. The participants seek out similar users and data.
  3. The participants correct bad data for the benefit of others.

None of those characteristics can be attributed to the victims of NSA data collection efforts.

April 2, 2013

LinkBench [Graph Benchmark]

Filed under: Benchmarks,Facebook,Graphs — Patrick Durusau @ 10:14 am

LinkBench

From the webpage:

LinkBench Overview

LinkBench is a database benchmark developed to evaluate database performance for workloads similar to those of Facebook’s production MySQL deployment. LinkBench is highly configurable and extensible. It can be reconfigured to simulate a variety of workloads and plugins can be written for benchmarking additional database systems.

LinkBench is released under the Apache License, Version 2.0.

Background

One way of modeling social network data is as a social graph, where entities or nodes such as people, posts, comments and pages are connected by links which model different relationships between the nodes. Different types of links can represent friendship between two users, a user liking another object, ownership of a post, or any relationship you like. These nodes and links carry metadata such as their type, timestamps and version numbers, along with arbitrary payload data.

Facebook represents much of its data in this way, with the data stored in MySQL databases. The goal of LinkBench is to emulate the social graph database workload and provide a realistic benchmark for database performance on social workloads. LinkBench’s data model is based on the social graph, and LinkBench has the ability to generate a large synthetic social graph with key properties similar to the real graph. The workload of database operations is based on Facebook’s production workload, and is also generated in such a way that key properties of the workload match the production workload.

A benchmark for testing your graph database performance!
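To see how the social graph model lands in MySQL, sketch the two tables it implies. This is my approximation of the idea, not a copy of LinkBench’s actual DDL:

```sql
-- Typed nodes carrying a version, a timestamp, and arbitrary payload.
CREATE TABLE nodetable (
  id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  type    INT UNSIGNED    NOT NULL,
  version BIGINT UNSIGNED NOT NULL,
  time    INT UNSIGNED    NOT NULL,
  data    MEDIUMTEXT      NOT NULL,
  PRIMARY KEY (id)
);

-- Typed, timestamped links between nodes, also carrying payload.
CREATE TABLE linktable (
  id1        BIGINT UNSIGNED NOT NULL,  -- source node
  id2        BIGINT UNSIGNED NOT NULL,  -- destination node
  link_type  BIGINT UNSIGNED NOT NULL,  -- friendship, like, ownership, ...
  visibility TINYINT         NOT NULL,
  data       VARCHAR(255)    NOT NULL,
  time       BIGINT UNSIGNED NOT NULL,
  version    INT UNSIGNED    NOT NULL,
  PRIMARY KEY (id1, link_type, id2)
);
```

A “friends of X” lookup then becomes a range scan on (id1, link_type), which is presumably the access pattern a social workload hammers hardest.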

Additional details at: LinkBench: A database benchmark for the social graph by Tim Armstrong.

I first saw this in a tweet by Stefano Bertolo.

March 25, 2013

Under the Hood: Building out the infrastructure for Graph Search

Filed under: Facebook,Graphs,Networks — Patrick Durusau @ 10:32 am

Under the Hood: Building out the infrastructure for Graph Search by Sriram Sankar, Soren Lassen, and Mike Curtiss.

From the post:

In the early days, Facebook was as much about meeting new people as keeping in touch with people you already knew at your college. Over time, Facebook became more about maintaining connections. Graph Search takes us back to our roots and helps people make new connections–this time with people, places, and interests.

With this history comes several old search systems that we had to unify in order to build Graph Search. At first, the old search on Facebook (called PPS) was keyword based–the searcher entered keywords and the search engine produced a results page that was personalized and could be filtered to focus on specific kinds of entities such as people, pages, places, groups, etc.

Entertaining overview of the development of the graph solution for Facebook.

Moreover, reassurance if you are worried about “scaling” for your graph application. 😉

I first saw this at: This Week’s Links by Trevor Landau.

January 31, 2013

Facebook Graph Search with Cypher and Neo4j

Filed under: Cypher,Facebook,Graphs,Neo4j — Patrick Durusau @ 7:24 pm

Facebook Graph Search with Cypher and Neo4j by Max De Marzi.

From the post:

Facebook Graph Search has given the Graph Database community a simpler way to explain what it is we do and why it matters. I wanted to drive the point home by building a proof of concept of how you could do this with Neo4j. However, I don’t have six months or much experience with NLP (natural language processing). What I do have is Cypher. Cypher is Neo4j’s graph language and it makes it easy to express what we are looking for in the graph. I needed a way to take “natural language” and create Cypher from it. This was going to be a problem.

If you think about “likes” as an association type with role players….

Of course, “like” paints with a broad brush but it is a place to start.

November 15, 2012

Facebook for Topic Maps

Filed under: Facebook,Topic Maps — Patrick Durusau @ 3:48 pm

Did you know there is a Facebook page for topic maps?

Sad thing is, it is missing a lot of people who are interested in topic maps!

When you are on Facebook, take a look at: Topic Maps.

Inge Eivind Henriksen started the group.

Don’t be shy about posting your thoughts, questions, suggestions, etc.

September 1, 2012

WolframAlpha Launches Personal Analytics for Facebook

Filed under: Facebook,Graphs,WolframAlpha — Patrick Durusau @ 1:23 pm

WolframAlpha Launches Personal Analytics for Facebook by Kim Rees.

From the post:

WolframAlpha has launched its Personal Analytics for Facebook [wolframalpha.com] functionality. Simply type “facebook report” into the query box, authorize the app, and view the extensive analysis of your social network. The report shows you details about when you post, what types of things you post, the apps you use, who comments the most on your posts, your most popular images, and the structure of your friend network. You can easily share or embed sections of your report.

The report is incredibly detailed. You can drill down further into most sections. Any item of significance such as names and dates can be clicked to search for more information. It was interesting to find out that I was born under a waning crescent moon (is there anything Stephen Wolfram doesn’t know?!). I don’t use Facebook much, but this service makes Facebook fun again.

How would you contrast the ease of use factor of visual drill down with the ASCII art style of Cypher in Neo4j?

What user communities would prefer one over the other?

August 17, 2012

NeoSocial: Connecting to Facebook with Neo4j

Filed under: Facebook,Neo4j,Social Graphs,Social Networks — Patrick Durusau @ 12:29 pm

NeoSocial: Connecting to Facebook with Neo4j by Max De Marzi.

From the post:

(Really cool graphic omitted – see below)

Social applications and Graph Databases go together like peanut butter and jelly. I’m going to walk you through the steps of building an application that connects to Facebook, pulls your friends and likes data and visualizes it. I plan on making a video of me coding it one line at a time, but for now let’s just focus on the main elements.

The application will have two major components:

  1. A web service that handles authentication and displaying of friends, likes, and so-on.
  2. A background job service that imports data from Facebook.

We will be deploying this application on Heroku and making use of the RedisToGo and Neo4j Add-ons.

A very good weekend project for Facebook and Neo4j.

I have a different solution when you have too many friends to make a nice graphic (> 50):

Get a bigger monitor. 😉

April 15, 2012

Facebook Search- The fall of the machines

Filed under: Facebook,Search Engines,Searching — Patrick Durusau @ 7:15 pm

Facebook Search- The fall of the machines by Ajay Ohri.

Ajay gives five numbered reasons and then one more for preferring Facebook searching.

I hardly ever visit Facebook (I do have an account) and certainly don’t search using it.

But we could trade stories, rumors, etc. all day.

How would we test Facebook versus other search engines?

Or for that matter, how would we test search engines in general?

When we say search A got a “better” result using search engine Z, by what measure do we mean “better?”

February 27, 2012

Everyone has a Graph Store

Filed under: Facebook,Graphs — Patrick Durusau @ 8:23 pm

Everyone has a Graph Store by Danny Ayers.

Try this thought experiment.

For practical purposes we often assume that everyone has a computer, a reasonable Internet connection and a modern Web browser. We know it’s an inaccurate assumption, but it provides conceptual targets for technology in terms of people and environment.

Ok, now add to that list a Graph Store: a flexible database to which information can easily be added, and which can be easily queried. The data can also be easily shared over the Cloud. The data is available for any applications that might want to use it. The database is schemaless, agnostic about what you put in it: the data could be about contacts, descriptions of people & their relationships (i.e. a Social Graph), it could be about places or events, products, technical information, whatever. It can contain private information, it can contain information that you’re happy to share. You control your own store and can let other people access as much or as little of its contents as you like (which they can do easily over the cloud). You can access other people’s store in the same way, according to their preferences. It’s both a Personal Knowledgebase and a Federated Public Knowledgebase.

So, make the assumption: everyone has a Graph Store. Now what do you want to do with yours? What can your friends and colleagues do with theirs? How can you use other peoples information to improve your quality of life, and vice versa? What new tools can be developed to help them take advantage of their stores? How can you get rich quick on this? What other questions are there..?

When I do this thought experiment, all I come up with is Facebook. So I am not very encouraged.

Perhaps Danny is expecting a natural clumping of useful comments and insights. That is certainly possible, but then clumpings around Jim Jones and Jimmy Swaggart are also possible.

Or that a process of collective commenting and consideration will lead to useful results. American Idol isn’t strong evidence that mass participation produces good results. Or American election results.

Your thought experiment results may vary so feel free to report them.

Graphs are a great idea. Asking everyone to write down their thoughts in a graph store, not so great.

December 15, 2011

EMC Greenplum puts a social spin on big data

Filed under: BigData,Chorus,Facebook,Hadoop — Patrick Durusau @ 7:53 pm

EMC Greenplum puts a social spin on big data

From the post:

Greenplum, the analytics division of EMC, has announced new software that lets data analysts explore all their organization’s data and share interesting findings and data sets Facebook-style among their colleagues. The product is called Chorus, and it wraps around EMC’s Greenplum Database and Hadoop distribution, making all that data available for the data team work with.

The pitch here is about unifying the analytic database and Hadoop environments and making it as easy and collaborative as possible to work with data, since EMC thinks a larger percentage of employees will have to figure out how to analyze business data. Plus, because EMC doesn’t have any legacy database or business intelligence products to protect, the entire focus of the Greenplum division is on providing the best big-data experience possible.

From the Chorus product page:

Greenplum Chorus enables Big Data agility for your data science team. The first solution of its kind, Greenplum Chorus provides an analytic productivity platform that enables the team to search, explore, visualize, and import data from anywhere in the organization. It provides rich social network features that revolve around datasets, insights, methods, and workflows, allowing data analysts, data scientists, IT staff, DBAs, executives, and other stakeholders to participate and collaborate on Big Data. Customers deploy Chorus to create a self-service agile analytic infrastructure; teams can create workspaces on the fly with self-service provisioning, and then instantly start creating and sharing insights.

Chorus breaks down the walls between all of the individuals involved in the data science team and empowers everyone who works with your data to more easily collaborate and derive insight from that data.

Note to EMC Greenplum: If you want people to at least consider products, don’t hide them so that searching is necessary to find them. Just an FYI.

The Resources page is pretty thin, but better than the blah-blah “more information” page. It could have more details, perhaps a demo version?

A button that says “Contact Sales” makes me lose interest real quick. I don’t need some software salesperson pinging me during an editing cycle to ask if I have installed the “free” software yet and am I ready to order. Buying software really should be on my schedule, not his/hers. Yes?

December 12, 2011

Extracting data from the Facebook social graph with expressor, a Tutorial

Filed under: Expressor,Facebook,Social Graphs — Patrick Durusau @ 10:21 pm

Extracting data from the Facebook social graph with expressor, a Tutorial by Michael Tarallo.

From the post:

In my last article, Enterprise Application Integration with Social Networking Data, I describe how social networking sites, such as Facebook and Twitter, provide APIs to communicate with the various components available in these applications. One in particular is their “social graph” API, which enables software developers to create programs that can interface with the many “objects” stored within these graphs.

In this article, I will briefly review the Facebook social graph and provide a simple tutorial with an expressor downloadable project. I will cover how expressor can extract data using the Facebook graph API and flatten it by using the provided reusable Datascript Module. I will also demonstrate how to add new user defined attributes to the expressor Dataflow so one can customize the output needed.

Looks interesting.

Seems appropriate after starting today’s posts with work on the ODP files.

As you know, I am not a big fan of ETL but it has been a survivor. And if the folks who are signing off on the design want ETL, maybe it isn’t all that weird. 😉
