Archive for the ‘Graph Database Benchmark’ Category

Trillion-Edge Graphs – Dodging Cost and the NSA

Tuesday, May 30th, 2017

Mosaic: processing a trillion-edge graph on a single machine by Adrian Colyer.

From the post:

Mosaic: Processing a trillion-edge graph on a single machine Maass et al., EuroSys’17

Unless your graph is bigger than Facebook’s, you can process it on a single machine.

With the inception of the internet, large-scale graphs comprising web graphs or social networks have become common. For example, Facebook recently reported their largest social graph comprises 1.4 billion vertices and 1 trillion edges. To process such graphs, they ran a distributed graph processing engine, Giraph, on 200 machines. But, with Mosaic, we are able to process large graphs, even proportional to Facebook’s graph, on a single machine.

In this case it’s quite a special machine – with Intel Xeon Phi coprocessors and NVMe storage. But it’s really not that expensive – the Xeon Phi used in the paper costs around $549, and a 1.2TB Intel SSD 750 costs around $750. How much do large distributed clusters cost in comparison? Especially when using expensive interconnects and large amounts of RAM.

So Mosaic costs less, but it also consistently outperforms other state-of-the-art out-of-core (secondary storage) engines by 3.2x-58.6x, and shows comparable performance to distributed graph engines. At one-trillion-edge scale, Mosaic can run an iteration of PageRank in 21 minutes (after paying a fairly hefty one-off set-up cost).

(And remember, if you have a less-than-a-trillion edges scale problem, say just a few billion edges, you can do an awful lot with just a single thread too!).
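To make the single-machine point concrete, here is a minimal single-threaded PageRank over an in-memory edge list. This is an illustrative sketch on a toy graph, not Mosaic's out-of-core implementation, and it assumes every vertex has at least one out-edge:

```python
# Minimal single-threaded PageRank over an edge list (illustrative sketch,
# not Mosaic's out-of-core engine). Assumes every vertex has an out-edge.
def pagerank(edges, num_vertices, damping=0.85, iterations=20):
    rank = [1.0 / num_vertices] * num_vertices
    out_degree = [0] * num_vertices
    for src, _ in edges:
        out_degree[src] += 1
    for _ in range(iterations):
        contrib = [0.0] * num_vertices
        for src, dst in edges:
            # Each vertex spreads its rank evenly over its out-edges.
            contrib[dst] += rank[src] / out_degree[src]
        rank = [(1 - damping) / num_vertices + damping * c for c in contrib]
    return rank

# A 4-vertex toy graph: 0 and 1 point at 2, 2 points at 3, 3 points back at 0.
edges = [(0, 2), (1, 2), (2, 3), (3, 0)]
ranks = pagerank(edges, 4)
```

A real implementation would also have to handle dangling vertices (no out-edges); the sketch sidesteps that by construction.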

Another advantage of the single machine design is a much simpler approach to fault tolerance:

… handling fault tolerance is as simple as checkpointing the intermediate stale data (i.e., vertex array). Further, the read-only vertex array for the current iteration can be written to disk parallel to the graph processing; it only requires a barrier on each superstep. Recovery is also trivial; processing can resume with the last checkpoint of the vertex array.
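The checkpoint-per-superstep idea is simple enough to sketch. The file naming and pickle format below are invented for illustration, not Mosaic's actual layout:

```python
import os
import pickle
import tempfile

def run_supersteps(vertex_array, step_fn, total_steps, ckpt_dir):
    """Run supersteps, checkpointing the vertex array after each one.

    On restart, resumes from the latest checkpoint found in ckpt_dir.
    """
    # Find the last completed superstep, if any.
    done = sorted(int(f.split(".")[1]) for f in os.listdir(ckpt_dir)
                  if f.startswith("ckpt."))
    start = 0
    if done:
        start = done[-1] + 1
        with open(os.path.join(ckpt_dir, f"ckpt.{done[-1]}"), "rb") as fh:
            vertex_array = pickle.load(fh)
    for step in range(start, total_steps):
        vertex_array = step_fn(vertex_array)
        # Barrier: the checkpoint is written before the next superstep begins.
        with open(os.path.join(ckpt_dir, f"ckpt.{step}"), "wb") as fh:
            pickle.dump(vertex_array, fh)
    return vertex_array

# Demo: double each value per superstep; "crash" after two steps, then resume.
def double(v):
    return [x * 2 for x in v]

ckpt_dir = tempfile.mkdtemp()
partial = run_supersteps([1, 1], double, 2, ckpt_dir)  # completes steps 0-1
resumed = run_supersteps([1, 1], double, 4, ckpt_dir)  # reloads ckpt.1, runs 2-3
```

The recovery path is exactly the quote's claim: load the last vertex array, resume at the next superstep.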

There’s a lot to this paper. Perhaps the two most central aspects are design sympathy for modern hardware, and the Hilbert-ordered tiling scheme used to divide up the work. So I’m going to concentrate mostly on those in the space available.
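The Hilbert-ordered tiling is easier to appreciate with code in hand: Mosaic processes tiles of the adjacency matrix in the order a Hilbert curve visits them, so consecutive tiles touch overlapping vertex ranges. The standard index computation (a textbook sketch on a toy 4x4 grid; Mosaic's actual tile layout is more elaborate) looks like this:

```python
def hilbert_index(n, x, y):
    """Map cell (x, y) on an n x n grid (n a power of two) to its
    position along the Hilbert curve."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/reflect the quadrant so the curve stays continuous.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Sorting tiles by Hilbert index makes each tile a grid neighbour of the
# previous one, which is what gives the locality benefit.
tiles = sorted(((tx, ty) for tx in range(4) for ty in range(4)),
               key=lambda t: hilbert_index(4, *t))
```

The payoff is that the vertex state needed for tile k+1 largely overlaps the state just used for tile k, which is friendly to caches and to the Xeon Phi's local memory.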

A publicly accessible version of the paper: Mosaic: Processing a trillion-edge graph on a single machine. Presentation slides.

Definitely a paper for near the top of my reading list!

Shallow but broad graphs (think telephone surveillance data) are all the rage, but how would relatively narrow but deep graphs fare under Mosaic?

Using top-end but not uncommon hardware may enable your processing requirements to escape the notice of the NSA. Another benefit of commodity hardware.

Enjoy!

S3G2: A Scalable Structure-Correlated Social Graph Generator

Sunday, February 24th, 2013

S3G2: A Scalable Structure-Correlated Social Graph Generator by Minh-Duc Pham, Peter Boncz, Orri Erling. (The same text you will find at: Selected Topics in Performance Evaluation and Benchmarking, Lecture Notes in Computer Science, Volume 7755, 2013, pp. 156-172. DOI: 10.1007/978-3-642-36727-4_11)

Abstract:

Benchmarking graph-oriented database workloads and graph-oriented database systems is increasingly becoming relevant in analytical Big Data tasks, such as social network analysis. In graph data, structure is not mainly found inside the nodes, but especially in the way nodes happen to be connected, i.e. structural correlations. Because such structural correlations determine join fan-outs experienced by graph analysis algorithms and graph query executors, they are an essential, yet typically neglected, ingredient of synthetic graph generators. To address this, we present S3G2: a Scalable Structure-correlated Social Graph Generator. This graph generator creates a synthetic social graph, containing non-uniform value distributions and structural correlations, which is intended as test data for scalable graph analysis algorithms and graph database systems. We generalize the problem by decomposing correlated graph generation in multiple passes that each focus on one so-called correlation dimension; each of which can be mapped to a MapReduce task. We show that S3G2 can generate social graphs that (i) share well-known graph connectivity characteristics typically found in real social graphs (ii) contain certain plausible structural correlations that influence the performance of graph analysis algorithms and queries, and (iii) can be quickly generated at huge sizes on common cluster hardware.
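The central idea, that attribute values correlate with structure, is easy to illustrate in miniature. In the toy generator below (purely illustrative; S3G2 itself decomposes generation into MapReduce passes over correlation dimensions), nodes sharing a location attribute get a much higher connection probability:

```python
import random

def generate_correlated_graph(num_nodes, locations,
                              p_same=0.3, p_diff=0.01, seed=42):
    """Toy structure-correlated generator: an edge is far more likely
    between nodes that share a 'location' attribute (homophily)."""
    rng = random.Random(seed)
    loc = [rng.choice(locations) for _ in range(num_nodes)]
    edges = []
    for i in range(num_nodes):
        for j in range(i + 1, num_nodes):
            p = p_same if loc[i] == loc[j] else p_diff
            if rng.random() < p:
                edges.append((i, j))
    return loc, edges

loc, edges = generate_correlated_graph(200, ["NL", "US", "VN"])
same = sum(1 for i, j in edges if loc[i] == loc[j])
```

Even though only about a third of node pairs share a location, the vast majority of edges end up within a location. That skew is exactly what drives the join fan-outs the abstract talks about, and what uniform random generators fail to reproduce.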

You may also want to see the slides.

What a nice way to start the week!

Enjoy!

I first saw this at Datanami.

GRADES: Graph Data-management Experiences & Systems

Saturday, December 29th, 2012

GRADES: Graph Data-management Experiences & Systems

Workshop: Sunday June 23, 2013

Papers Due: March 31, 2013

Notification: April 22, 2013

Camera-ready: May 19, 2013

Workshop Scope:

Application Areas

A new data economy is emerging, based on the analysis of distributed, heterogeneous, and complexly structured data sets. GRADES focuses on the problem of managing such data, specifically when data takes the form of graphs that connect many millions of nodes, and the worth of the data and its analysis is not only in the attribute values of these nodes, but in the way these nodes are connected. Specific application areas that exhibit the growing need for management of such graph shaped data include:

  • life science analytics, e.g., tracking relationships between illnesses, genes, and molecular compounds.
  • social network marketing, e.g., identifying influential speakers and trends propagating through a community.
  • digital forensics, e.g., analyzing the relationships between persons and entities for law enforcement purposes.
  • telecommunication network analysis, e.g., directed at fixing network bottlenecks and costing of network traffic.
  • digital publishing, e.g., enriching entities occurring in digital content with external data sources, and finding relationships among the entities.

Perspectives

The GRADES workshop solicits contributions from two perspectives:

  • Experiences. This includes topics that describe use case scenarios, datasets, and analysis opportunities occurring in real-life graph-shaped data, as well as benchmark descriptions and benchmark results.
  • Systems. This includes topics that describe data management system architectures for processing of Graph and RDF data, and specific techniques and algorithms employed inside such systems.

The combination of the two (Experiences with Systems), as well as benchmarking of RDF and graph database systems, is of special interest.

Topics Of Interest

The following is a non-exhaustive list describing the scope of GRADES:

  • vision papers describing potential applications and benefits of graph data management.
  • descriptions of graph data management use cases and query workloads.
  • experiences with applying data management technologies in such situations.
  • experiences or techniques for specific operations such as traversals or RDF reasoning.
  • proposals for benchmarks for data integration tasks (instance matching and ETL techniques).
  • proposals for benchmarks for RDF and graph database workloads.
  • evaluation of benchmark performance results on RDF or graph database systems.
  • system descriptions covering RDF or graph database technology.
  • data and index structures that can be used in RDF and graph database systems.
  • query processing and optimization algorithms for RDF and graph database systems.
  • methods and techniques for measuring graph characteristics.
  • methods and techniques for visualizing graphs and graph query results.
  • proposals and experiences with graph query languages.

The GRADES workshop is co-located with and sponsored by SIGMOD, in recognition that these problems are only interesting at large scale, and that the SIGMOD community's contribution to handling data comprising many millions or even billions of nodes and edges is of critical importance.

That sounds promising, doesn’t it? (Please email, copy, post, etc.)

Graph Mining: Laws, Generators, and Algorithms

Friday, February 24th, 2012

Graph Mining: Laws, Generators, and Algorithms by Deepayan Chakrabarti and Christos Faloutsos.

Abstract:

How does the Web look? How could we tell an abnormal social network from a normal one? These and similar questions are important in many fields where the data can intuitively be cast as a graph; examples range from computer networks to sociology to biology and many more. Indeed, any M : N relation in database terminology can be represented as a graph. A lot of these questions boil down to the following: “How can we generate synthetic but realistic graphs?” To answer this, we must first understand what patterns are common in real-world graphs and can thus be considered a mark of normality/realism. This survey gives an overview of the incredible variety of work that has been done on these problems. One of our main contributions is the integration of points of view from physics, mathematics, sociology, and computer science. Further, we briefly describe recent advances on some related and interesting graph problems.
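One of the recurring “marks of realism” in this literature is a heavy-tailed degree distribution, as produced by preferential-attachment generators. A minimal Barabási–Albert-style sketch (my own illustration, not code from the survey):

```python
import random

def preferential_attachment(num_nodes, seed=1):
    """Each new node attaches to one existing node chosen with probability
    proportional to its degree (Barabasi-Albert style, m = 1)."""
    rng = random.Random(seed)
    # Listing each edge endpoint makes a uniform pick over this list
    # equivalent to degree-proportional sampling.
    endpoints = [0, 1]
    edges = [(0, 1)]
    for new in range(2, num_nodes):
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints += [new, target]
    return edges

edges = preferential_attachment(1000)
degree = {}
for a, b in edges:
    degree[a] = degree.get(a, 0) + 1
    degree[b] = degree.get(b, 0) + 1
```

With 1,000 nodes the mean degree sits near 2, while the biggest hub collects an order of magnitude more edges; that hub-versus-mean gap is the pattern a realistic generator has to hit and a uniform random graph misses.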

If any readers of this blog have doubts about the need for mappings between terminology in fields of research (topic maps), consider the authors’ remark:

…we need to detect patterns in graphs and then generate synthetic graphs matching such patterns automatically.

This is a hard problem. What patterns should we look for? What do such patterns mean? How can we generate them? A lot of research ink has been spent on this problem, not only by computer scientists but also physicists, mathematicians, sociologists, and others. However, there is little interaction among these fields with the result that they often use different terminology and do not benefit from each other’s advances. In this survey, we attempt to give an overview of the main ideas. Our focus is on combining sources from all the different fields to gain a coherent picture of the current state-of-the-art. The interested reader is also referred to some excellent and entertaining books on the topic, namely, Barabási [2002], Watts [2003], and Dorogovtsev and Mendes [2003]. (emphasis added)

Extremely detailed survey with copious references. Dates from 2006.

Do you know of a later cross-field survey that updates this article?

This would make a good article for the Reading club on Graph databases and distributed systems.

A Discussion on the Design of Graph Database Benchmarks

Friday, February 24th, 2012

A Discussion on the Design of Graph Database Benchmarks by David Dominguez-Sal, Norbert Martinez-Bazan, Victor Muntes-Mulero, Pere Baleta, and Josep Lluis Larriba-Pey.

Abstract:

Graph Database Management systems (GDBs) are gaining popularity. They are used to analyze huge graph datasets that are naturally appearing in many application areas to model interrelated data. The objective of this paper is to raise a new topic of discussion in the benchmarking community and to give practitioners a set of basic guidelines for GDB benchmarking. We strongly believe that GDBs will become an important player in the market field of data analysis, and with that, their performance and capabilities will also become important. For this reason, we discuss those aspects that are important from our perspective, i.e. the characteristics of the graphs to be included in the benchmark, the characteristics of the queries that are important in graph analysis applications and the evaluation workbench.

An in-depth discussion of graph benchmarks with pointers to additional literature. I found Table 1, “Graph Operations, Areas of Interest and Categorization”, particularly useful as a quick reference when exploring graph benchmark literature.

16 Degrees of the WWW (Graph Database Benchmark)

Friday, February 24th, 2012

I was surprised to learn from Challenges in the Design of a Graph Database Benchmark by Marcus Paradies that 16 is the minimum number of hops between 97% of the node pairs on the WWW. (Slide 20)

Kevin Bacon is only 6 degrees away. That should make you curious about the WWW as a graph if nothing else does.
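Hop counts like these come from breadth-first search over the graph. A minimal BFS distance sketch on a toy graph (nothing like WWW scale, but the same computation):

```python
from collections import deque

def hops(adj, source):
    """Breadth-first search: number of hops from source to every
    reachable node in an adjacency-list graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

# A small chain 0-1-2-3-4 with a shortcut edge 0-3.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 4, 0], 4: [3]}
d = hops(adj, 0)
```

At web scale the interesting (and expensive) part is running this from many sources, or estimating the distance distribution by sampling, which is presumably how numbers like the 16-hop figure are obtained.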

You will also need: Slides – Challenges in the Design of a Graph Database Benchmark (The video is good but not good enough to capture the slide details.)

From the description:

Graph databases are one of the leading drivers in the emerging, highly heterogeneous landscape of database management systems for non-relational data management and processing. The recent interest and success of graph databases arises mainly from the growing interest in social media analysis and the exploration and mining of relationships in social media data. However, with a graph-based model as a very flexible underlying data model, a graph database can serve a large variety of scenarios from different domains such as travel planning, supply chain management and package routing.

During the past months, many vendors have designed and implemented solutions to satisfy the need to efficiently store, manage and query graph data. However, the solutions are very diverse in terms of the supported graph data model, supported query languages, and APIs. With a growing number of vendors offering graph processing and graph management functionality, there is also an increased need to compare the solutions on a functional level as well as on a performance level with the help of benchmarks.

Graph database benchmarking is a challenging task. Already existing graph database benchmarks are limited in their functionality and portability to different graph-based data models and different application domains. Existing benchmarks and the supported workloads are typically based on a proprietary query language and on a specific graph-based data model derived from the mathematical notion of a graph. The variety and lack of standardization with respect to the logical representation of graph data and the retrieval of graph data make it hard to define a portable graph database benchmark.

In this talk, we present a proposal and design guideline for a graph database benchmark. Typically, a database benchmark consists of a synthetically generated data set of varying size and varying characteristics and a workload driver. In order to generate graph data sets, we present parameters from graph theory, which influence the characteristics of the generated graph data set. Following, the workload driver issues a set of queries against a well-defined interface of the graph database and gathers relevant performance numbers. We propose a set of performance measures to determine the response time behavior on different workloads and also initial suggestions for typical workloads in graph data scenarios.

Our main objective of this session is to open the discussion on graph database benchmarking. We believe that there is a need for a common understanding of different workloads for graph processing from different domains and the definition of a common subset of core graph functionality in order to provide a general-purpose graph database benchmark. We encourage vendors to participate and to contribute with their domain-dependent knowledge and to define a graph database benchmark proposal.
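The proposed structure, a generated dataset plus a workload driver firing queries at a well-defined interface and collecting response times, can be sketched in a few lines. The interface and query names below are invented for illustration; the talk does not prescribe an API:

```python
import time

class WorkloadDriver:
    """Toy benchmark driver: runs each named query against a graph
    database behind a minimal interface and records response times."""

    def __init__(self, db):
        self.db = db
        self.timings = {}

    def run(self, workload):
        for name, query in workload.items():
            start = time.perf_counter()
            self.db.execute(query)
            self.timings[name] = time.perf_counter() - start
        return self.timings

# A stand-in backend; a real run would target an actual graph database
# through whatever query language it supports.
class DummyGraphDB:
    def execute(self, query):
        return []

driver = WorkloadDriver(DummyGraphDB())
timings = driver.run({"2-hop-neighbourhood": "...", "shortest-path": "..."})
```

The hard part the talk identifies is not this loop but standardizing what goes into `workload` across data models and query languages.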

What do you think of focusing benchmark efforts on a simple property graph model? (Slide 27)

Perhaps not a bad starting place, but I would prefer a roadmap that includes multi-graphs and hypergraphs.