Partitioning Graph Databases by ALEX AVERBUCH & MARTIN NEUMANN.
Focusing on Neo4j, reports that compared to random partitioning, use of algorithms herein result in a reduction of inter-partition traffic by 40 to 90%, depending on the dataset.
Abstract:
The amount of globally stored, electronic data is growing at an increasing rate. This growth is both in size and connectivity, where connectivity refers to the increasing presence of, and interest in, relationships between data [12]. An example of such data is the social network graph created and stored by Twitter [2].
Due to this growth, demand is increasing for technologies that can process such data. Currently relational databases are the predominant data storage technology, but they are poorly suited to processing connected data as they are optimized for index-intensive operations. Conversely, the storage engines of graph databases are optimized for graph computation as they store records adjacent to one another, linked by direct references. This enables retrieval of adjacent elements in constant time, regardless of graph size, and allows for relationships to be followed without performing costly index lookups. However, as data volume increases these databases outgrow the resources available on a single computer, and partitioning the data becomes necessary. At present, few graph databases are capable of doing this [6].
In this work we evaluate the viability of using graph partitioning algorithms as a means of partitioning graph databases, with focus on the Neo4j graph database [4]. For this purpose, a prototype partitioned database was developed. Then, three partitioning algorithms were explored and one implemented. During evaluation, three
graph datasets were used: two from production applications, and one synthetically generated. These were partitioned in various ways and the impact on database
performance was measured. To gauge this impact, we defined one synthetic access pattern per dataset and executed each one on the partitioned datasets. Evaluation took place in a simulation environment, which ensured repeatability and made it possible to measure certain metrics, such as network traffic and load balance.Simulation results show that, compared to random partitioning, use of a graph partitioning algorithm reduced inter-partition traffic by 40{90 %, depending on
dataset. Executing the algorithm intermittently during database usage was shown to maintain partition quality, while requiring only 1% the computation time of
initially partitioning the datasets. Finally, a strong correlation was found between theoretic graph partitioning quality metrics and the generated inter-partition traffic
under non-uniform access patterns. Our results suggest that use of such algorithms to partition graph databases can result in signlficant performance benefits, and
warrants further investigation.