Archive for the ‘Graph Databases’ Category

Database Landscape Map – February 2013

Wednesday, March 27th, 2013

Database Landscape Map – February 2013 by 451 Research.

Database map

A truly awesome map of available databases.

Originated from Neither fish nor fowl: the rise of multi-model databases by Matthew Aslett.

Matthew writes:

One of the most complicated aspects of putting together our database landscape map was dealing with the growing number of (particularly NoSQL) databases that refuse to be pigeon-holed in any of the primary databases categories.

I have begun to refer to these as “multi-model databases” in recognition of the fact that they are able to take on the characteristics of multiple databases. In truth though there are probably two different groups of products that could be considered “multi-model”:

I think I understand the grouping from the key to the map but the ordering within groups, if meaningful, escapes me.

I am sure you will recognize most of the names but equally sure there will be some you can’t quite describe.

Enjoy!

Dydra

Tuesday, March 26th, 2013

Dydra

From the webpage:

Dydra

Dydra is a cloud-based graph database. Whether you’re using existing social network APIs or want to build your own, Dydra treats your customers’ social graph as exactly that.

With Dydra, your data is natively stored as a property graph, directly representing the relationships in the underlying data.

Expressive

With Dydra, you access and update your data via an industry-standard query language specifically designed for graph processing, SPARQL. It’s easy to use and we provide a handy in-browser query editor to help you learn.

Despite my misgivings about RDF (Simple Web Semantics), if you want to investigate RDF and SPARQL, Dydra would be a good way to get your feet wet.

You can get an idea of the skill level required by RDF/SPARQL.

Currently in beta, free with some resource limitations.

I particularly liked the line:

We manage every piece of the data store, including versioning, disaster recovery, performance, and more. You just use it.

RDF/SPARQL skills will remain a barrier but Dydra as does its best to make those the only barriers you will face. (And have reduced some of those.)

Definitely worth your attention, whether you simply want to practice on RDF/SPARQL as a data source or have other uses for it.

I first saw this in a tweet by Stian Danenbarger.

LDBC – Second Technical User Community (TUC) Meeting

Monday, March 18th, 2013

LDBC: Linked Data Benchmark Council – Second Technical User Community (TUC) Meeting – 22/23rd April 2013.

From the post:

The LDBC consortium are pleased to announce the second Technical User Community (TUC) meeting.

This will be a two day event in Munich on the 22/23rd April 2013.

The event will include:

  • Introduction to the objectives and progress of the LDBC project.
  • Description of the progress of the benchmarks being evolved through Task Forces.
  • Users explaining their use-cases and describing the limitations they have found in current technology.
  • Industry discussions on the contents of the benchmarks.

All users of RDF and graph databases are welcome to attend. If you are interested, please contact: ldbc AT ac DOT upc DOT edu.

Further meeting details at the post.

Distributed Graph Computing with Gremlin

Thursday, March 7th, 2013

Distributed Graph Computing with Gremlin by Marko A. Rodriguez.

From the post:

The script-step in Faunus’ Gremlin allows for the arbitrary execution of a Gremlin script against all vertices in the Faunus graph. This simple idea has interesting ramifications for Gremlin-based distributed graph computing. For instance, it is possible evaluate a Gremlin script on every vertex in the source graph (e.g. Titan) in parallel while maintaining data/process locality. This section will discuss the following two use cases.

  • Global graph mutations: parallel update vertices/edges in a Titan cluster given some arbitrary computation.
  • Global graph algorithms: propagate information to arbitrary depths in a Titan cluster in order to compute some algorithm in a parallel fashion.

Another must read post from Marko A. Rodriguez!

Also a reminder that I need to pull out my Oxford Classical Dictionary to add some material to the mythology graph.

Pattern Based Graph Generator

Monday, March 4th, 2013

Pattern Based Graph Generator by Hong-Han Shuai, De-Nian Yang, Philip S. Yu, Chih-Ya Shen, Ming-Syan Chen.

Abstract:

The importance of graph mining has been widely recognized thanks to a large variety of applications in many areas, while real datasets always play important roles to examine the solution quality and efficiency of a graph mining algorithm. Nevertheless, the size of a real dataset is usually fixed and constrained according to the available resources, such as the efforts to crawl an on-line social network. In this case, employing a synthetic graph generator is a possible way to generate a massive graph (e.g., billions nodes) for evaluating the scalability of an algorithm, and current popular statistical graph generators are properly designed to maintain statistical metrics such as total node degree, degree distribution, diameter, and clustering coefficient of the original social graphs. Nevertheless, in addition to the above metrics, recent studies on graph mining point out that graph frequent patterns are also important to provide useful implications for the corresponding social networking applications, but this crucial criterion has not been noticed in the existing graph generators. This paper first manifests that numerous graph patterns, especially large patterns that are crucial with important domain-specific semantic, unfortunately disappear in the graphs created by popular statistic graph generators, even though those graphs enjoy the same statistical metrics with the original real dataset. To address this important need, we make the first attempt to propose a pattern based graph generator (PBGG) to generate a graph including all patterns and satisfy the user-specified parameters on supports, network size, degree distribution, and clustering coefficient. Experimental results show that PBGG is efficient and able to generate a billion-node graph with about 10 minutes, and PBGG is released for free download.

Download at: http://arbor.ee.ntu.edu.tw/~hhshuai/PBGG.html.

OK, this is not a “moderate size database” program.

A New Representation of WordNet® using Graph Databases

Monday, March 4th, 2013

A New Representation of WordNet® using Graph Databases by Khaled Nagi.

Abstract:

WordNet® is one of the most important resources in computation linguistics. The semantically related database of English terms is widely used in text analysis and retrieval domains, which constitute typical features, employed by social networks and other modern Web 2.0 applications. Under the hood, WordNet® can be seen as a sort of read-only social network relating its language terms. In our work, we implement a new storage technique for WordNet® based on graph databases. Graph databases are a major pillar of the NoSQL movement with lots of emerging products, such as Neo4j. In this paper, we present two Neo4j graph storage representations for the WordNet® dictionary. We analyze their performance and compare them to other traditional storage models. With this contribution, we also validate the applicability of modern graph databases in new areas beside the typical large-scale social networks with several hundreds of millions of nodes.

Finally, a paper that covers “moderate size databases!”

Think about the average graph database you see on this blog. Not really in the “moderate” range, even though a majority of users work in the moderate range.

Compare the number of Facebook size enterprises with the number of enterprises generally.

Not dissing super-sized graph databases or research on same. I enjoy both a lot.

But for your average customer, experience with “moderate size databases” may be more immediately relevant.

I first saw this in a tweet from Peter Neubauer.

GRADES: Graph Data-management Experiences & Systems

Saturday, December 29th, 2012

GRADES: Graph Data-management Experiences & Systems

Workshop: Sunday June 23, 2013

Papers Due: March 31, 2013

Notification: April 22, 2013

Camera-ready: May 19, 2013

Workshop Scope:

Application Areas

A new data economy is emerging, based on the analysis of distributed, heterogeneous, and complexly structured data sets. GRADES focuses on the problem of managing such data, specifically when data takes the form of graphs that connect many millions of nodes, and the worth of the data and its analysis is not only in the attribute values of these nodes, but in the way these nodes are connected. Specific application areas that exhibit the growing need for management of such graph shaped data include:

  • life science analytics, e.g., tracking relationships between illnesses, genes, and molecular compounds.
  • social network marketing, e.g., identifying influential speakers and trends propagating through a community.
  • digital forensics, e.g., analyzing the relationships between persons and entities for law enforcement purposes.
  • telecommunication network analysis, e.g., directed at fixing network bottlenecks and costing of network traffic.
  • digital publishing, e.g., enriching entities occurring in digital content with external data sources, and finding relationships among the entities.

Perspectives

The GRADES workshop solicits contributions from two perspectives:

  • Experiences. This includes topics that describe use case scenarios, datasets, and analysis opportunities occurring in real-life graph-shaped, ans well as benchmark descriptions and benchmark results.
  • Systems. This includes topics that describe data management system architectures for processing of Graph and RDF data, and specific techniques and algorithms employed inside such systems.

The combination of the two (Experiences with Systems) and benchmarking RDF and graph database systems, is of special interest.

Topics Of Interest

The following is a non-exhaustive list describing the scope of GRADES:

  • vision papers describing potential applications and benefits of graph data management.
  • descriptions of graph data management use cases and query workloads.
  • experiences with applying data management technologies in such situations.
  • experiences or techniques for specific operations such as traversals or RDF reasoning.
  • proposals for benchmarks for data integration tasks (instance matching and ETL techniques).
  • proposals for benchmarks for RDF and graph database workloads.
  • evaluation of benchmark performance results on RDF or graph database systems.
  • system descriptions covering RDF or graph database technology.
  • data and index structures that can be used in RDF and graph database systems.
  • query processing and optimization algorithms for RDF and graph database systems.
  • methods and technique for measuring graph characteristics.
  • methods and techniques for visualizing graphs and graph query results.
  • proposals and experiences with graph query languages.

The GRADES workshop is co-located and sponsored by SIGMOD in recognition that these problems are only interesting at large-scale and the contribution of the SIGMOD community to handle such topics on large amounts of data of many millions or even billions of nodes and edges is of critical importance.

That sounds promising doesn’t it? (Please email, copy, post, etc.)

How To Use A Graph Database to Integrate And Analyze Relational Exports

Sunday, July 15th, 2012

How To Use A Graph Database to Integrate And Analyze Relational Exports by Todd Stavish.

From the post:

Graph databases can be used to analyze data from disparate datasources. In this use-case, three relational databases have been exported to CSV. Each relational export is ingested into its own sharded sub-graph to increase performance and avoid lock contention when merging the datasets. Unique keys overlap the datasources to provide the mechanism to link the subgraphs produced from parsing the CSV. A REST server is used to send the merged graph to a visualization application for analysis.

Cleaning out my pending posts file when I ran this one.

Would be a good comparison case for my topic maps class.

Although I would have to do in installation work on a public facing server and leave the class members to do the analysis/uploading.

Hmmm, perhaps split the class into teams, some of which using this method, some using more traditional record linkage and some using topic maps, all on the same data.

Suggestions on data sets that would highlight the differences? Or result in few differences at all? (I suspect both to be true, depending upon the data sets.)

Search Algorithms for Conceptual Graph Databases

Friday, July 13th, 2012

Search Algorithms for Conceptual Graph Databases by Abdurashid Mamadolimov.

Abstract:

We consider a database composed of a set of conceptual graphs. Using conceptual graphs and graph homomorphism it is possible to build a basic query-answering mechanism based on semantic search. Graph homomorphism defines a partial order over conceptual graphs. Since graph homomorphism checking is an NP-Complete problem, the main requirement for database organizing and managing algorithms is to reduce the number of homomorphism checks. Searching is a basic operation for database manipulating problems. We consider the problem of searching for an element in a partially ordered set. The goal is to minimize the number of queries required to find a target element in the worst case. First we analyse conceptual graph database operations. Then we propose a new algorithm for a subclass of lattices. Finally, we suggest a parallel search algorithm for a general poset.

While I have no objection to efficient solutions for particular cases, as a general rule those solutions are valid for some known set of cases.

Here we appear to have an efficient solution for some unknown number of cases. I mention it to keep in mind while watching the search literature on graph databases develop.

Titan

Wednesday, May 30th, 2012

Titan

Alpha Release Coming June 5, 2012

From the homepage:

Titan is a distributed graph database optimized for storing and processing large-scale graphs within a multi-machine cluster. The primary features of Titan are itemized below.

If the names Marko A. Rodriguez or Matthias Broecheler mean anything to you, June 5th can’t come soon enough!

Data Locality in Graph Databases through N-Body Simulation

Thursday, April 5th, 2012

Data Locality in Graph Databases through N-Body Simulation by Dominic Pacher, Robert Binna, and Günther Specht.

Abstract:

Data locality poses a major performance requirement in graph databases, since it forms a basis for efficient caching and distribution. This vision paper presents a new approach to satisfy this requirement through n-body simulation. We describe our solution in detail and provide a theoretically complexity estimation of our method. To prove our concept, we conducted an evaluation using the DBpedia dataset data. The results are promising and show that n-body simulation is capable to improve data locality in graph databases significantly.

My first reaction was why clustering of nodes wasn’t compared to n-body simulation? That seems like an equally “natural” method to achieve data locality.

My second reaction was that the citation of “…Simulations of the formation, evolution and clustering of galaxies and quasars. nature, 435(7042):629–636, jun 2005. (citation 16 in the article) was reaching in terms of support for scaling. That type of simulation involves a number of simplifying assumptions that aren’t likely to be true for most graphs.

Imaginative work but it needs a little less imagination and a bit rigor in terms of its argument/analysis.

Graph Databases Make Apps Social

Saturday, March 31st, 2012

Graph Databases Make Apps Social by Adrian Bridgwater.

Adrian writes:

Neo Technology has suggested that social graph database technology will become a key trend in the data science arena throughout 2012 and beyond. On the back of vindicating comments made by Forrester analyst James Kobielus, the company contends that social graph complexities are needed to meet the high query performance levels now required inside Internet scale cloud applications.

Unsurprisingly a vendor of graph database technology itself (although Neo4j is open source at heart before its commercially supported equivalent), Neo Technology points to social graph capabilities, which take information across a range of networks to understand the relationships between individuals.

Sounds like applications of interest to DoD/DARPA doesn’t it?

Related

Thursday, March 29th, 2012

Related

From the webpage:

Related

Related is a Redis-backed high performance distributed graph database.

Raison d’être

Related is meant to be a simple graph database that is fun, free and easy to use. The intention is not to compete with “real” graph databases like Neo4j, but rather to be a replacement for a relational database when your data is better described as a graph. For example when building social software. Related is very similar in scope and functionality to Twitters FlockDB, but is among other things designed to be easier to setup and use. Related also has better documentation and is easier to hack on. The intention is to be web scale, but we ultimately rely on the ability of Redis to scale (using Redis Cluster for example). Read more about the philosophy behind Related in the Wiki.

Well, which is it?

A “Redis-backed high performance distributed graph database,”

or

“…not to compete with “real” graph databases like Neo4j….?”

If the intent is to have a “web scale” distributed graph database, then it will be competing with other graph database products.

If you are building a graph database, keep an eye on René Pickhardt’s blog for notices about the next meeting of his graph reading club.

PhD proposal on distributed graph data bases

Tuesday, March 27th, 2012

PhD proposal on distributed graph data bases by René Pickhardt.

From the post:

Over the last week we had our off campus meeting with a lot of communication training (very good and fruitful) as well as a special treatment for some PhD students called “massage your diss”. I was one of the lucky students who were able to discuss our research ideas with a post doc and other PhD candidates for more than 6 hours. This lead to the structure, todos and time table of my PhD proposal. This has to be finalized over the next couple days but I already want to share the structure in order to make it more real. You might also want to follow my article on a wish list of distributed graph data base technology.

If you have the time, please read René’s proposal and comment on it.

Although I am no stranger to multi-year research projects, ;-) , I must admit to pausing when I read:

Here I will name the at least the related work in the following fields:

  • graph processing (Signal Collect, Pregel,…)
  • graph theory (especially data structures and algorithms)
  • (dynamic/adaptive) graph partitioning
  • distributed computing / systems (MPI, Bulk Synchronous Parallel Programming, Map Reduce, P2P, distributed hash tables, distributed file systems…)
  • redundancy vs fault tolerance
  • network programming (protocols, latency vs bandwidth)
  • data bases (ACID, multiple user access, …)
  • graph data base query languages (SPARQL, Gremlin, Cypher,…)
  • Social Network and graph analysis and modelling.

Unless René is planning on taking the most recent citations in each area, describing related work and establishing how it is related to “distributed graph data bases,” will consume the projected time period for his dissertation work.

Each of the areas listed is a complete field unto itself and has many PhD sized research problems related to “distributed graph data bases.”

Almost all PhD proposals start with breath taking scope but the ones that make a real contribution (and are completed), identify specific problems that admit to finite research programs.

I think René should revise his proposal to focus on some particular aspect of “distributed graph data bases.” I suspect even the history of one aspect of such databases will expand fairly rapidly upon detailed consideration.

The need for a larger, global perspective on “distributed graph data bases” will still be around after René finishes a less encompassing dissertation. I promise.

What is your advice?

FDB: A Query Engine for Factorised Relational Databases

Tuesday, March 20th, 2012

FDB: A Query Engine for Factorised Relational Databases by Nurzhan Bakibayev, Dan Olteanu, and Jakub Závodný.

Abstract:

Factorised databases are relational databases that use compact factorised representations at the physical layer to reduce data redundancy and boost query performance. This paper introduces FDB, an in-memory query engine for select-project-join queries on factorised databases. Key components of FDB are novel algorithms for query optimisation and evaluation that exploit the succinctness brought by data factorisation. Experiments show that for data sets with many-to-many relationships FDB can outperform relational engines by orders of magnitude.

It is twelve pages of dense slogging but I wonder if you have a reaction to:

Finally, factorised representations are relational algebra expressions with well-understood semantics. Their relational nature sets them apart from XML documents, object-oriented databases, and nested objects [2], where the goal is to avoid the rigidity of the relational model. (on the second page)

Where [2] is: S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.

Online version of Foundations of Databases

DBLP has a nice listing of the references (with links) in Foundations of Databases

Abiteboul and company are cited without a page reference (my printed edition is 685 pages long) and the only comparison that I can uncover between the relational model and any of those mentioned here is that an object-oriented database has oids, which aren’t members of a “printable class” as are keys.

I am not sure what sort of oid isn’t a member of a “printable” class but am willing to leave that to one side for the moment.

My problem is with the characterization “…to avoid the rigidity of the relational model.”

The relational model has been implemented in any number of rigid ways, but is that the fault of a model based on operations on tuples, which can be singletons?

What if factorisation were applied to a graph database, composed of singletons, enabling the use of “…relational algebra expressions with well-understood semantics.”?

It sounds like factorisation could speed up classes of “expected” queries across graph databases. I don’t think anyone creates a database, graph or otherwise, without some classes of queries in mind. The user would be no worse off when they create an unexpected query.

Social networks in the database: using a graph database

Sunday, March 4th, 2012

Social networks in the database: using a graph database

The Neo4j response to Lorenzo Alberton’s post on social networks in a relational database.

From the post:

Recently Lorenzo Alberton gave a talk on Trees In The Database where he showed the most used approaches to storing trees in a relational database. Now he has moved on to an even more interesting topic with his article Graphs in the database: SQL meets social networks. Right from the beginning of his excellent article Alberton puts this technical challenge in a proper context:

Graphs are ubiquitous. Social or P2P networks, thesauri, route planning systems, recommendation systems, collaborative filtering, even the World Wide Web itself is ultimately a graph! Given their importance, it’s surely worth spending some time in studying some algorithms and models to represent and work with them effectively.

After a brief explanation of what a graph data structure is, the article goes on to show how graphs can be represented in a table-based database. The rest of the article shows in detail how an adjacency list model can be used to represent a graph in a relational database. Different examples are used to illustrate what can be done in this way.

Graph databases and Neo4j in particular offer advantages when used with graphs but the Neo4j post overlooks several points.

Unlike graph databases, SQL databases are nearly, if not always, ubiquitous. It may well be that the first “taste” of graph processing may come via a SQL database and lead users to expect more graph capabilities than a SQL solution can offer.

As Lorenzo points out in his posting, performance will vary depending upon the graph operations you need to perform. True for SQL databases and graph databases as well. Having a graph database doesn’t mean all graph algorithms run efficiently on your data set.

Finally:

A table-based system makes a good fit for static and simple data structures, ….

Isn’t going to ring true for anyone familiar with Oracle, PostgreSQL, MySQL, SQL Server, Informix, DB2 or any number of other “table-based systems.”

Solving Problems on Recursively Constructed Graphs

Wednesday, February 29th, 2012

Solving Problems on Recursively Constructed Graphs by Richard B. Borie , R. Gary Parker , Craig A. Tovey.

Abstract:

Fast algorithms can be created for many graph problems when instances are confined to classes of graphs that are recursively constructed. This article first describes some basic conceptual notions regarding the design of such fast algorithms, and then the coverage proceeds through several recursive graph classes. Specific classes include trees, series-parallel graphs, k-terminal graphs, treewidth-k graphs, k-trees, partial k-trees, k-jackknife graphs, pathwidth-k graphs, bandwidth-k graphs, cutwidth-k graphs, branchwidth-k graphs, Halin graphs, cographs, cliquewidth-k graphs, k-NLC graphs, k-HB graphs, and rankwidth-k graphs. The definition of each class is provided. Typical algorithms are applied to solve problems on instances of most classes. Relationships between the classes are also discussed.

Part survey and part tutorial, this article (at 51 pages) will take some time to read in detail.

I wanted to mention it because one of the topics being discussed in the graph reading club will be the partitioning of graph databases.

I suspect, but obviously don’t know for certain, that the graph databases that are constructed in enterprise settings are not going to be random graphs. That is to say some (all?) have repetitive (if not recursive) structures that can be exploited to solve particular graph operations on those databases.

Suggestions of other resources on recursively constructed graphs?

Graph Databases: Information Silo Busters

Wednesday, February 29th, 2012

In a post about InfiniteGraph 2.1 I found the following:

Other big data solutions all lack one thing, Clark contends. There is no easy way to represent the connection information, the relationships across the different silos of data or different data stores, he says. “That is where Objectivity can provide the enhanced storage for actually helping extract and persist those relationships so you can then ask queries about how things are connected.”

(Brian Clark, vice president, Data Management, Objectivity)

It was the last line of the post but I would have sharpened it and made it the lead slug.

Think about what Clark is saying: Not only can we persist relationship information within a datastore but also generate and persist relationship information between datastores. With no restriction on the nature of the datastores.

Try doing that with a relational database and SQL.

What I find particularly attractive is that persisting relationships across datastores means that we can jump the hurdle of making everyone use a common data model. It can be as common (in the graph) as it needs to be and no more.

Of course I think about this as being particularly suited for topic maps as we can document why we have mapped components of diverse data models to particular points in the graph but what did you expect?

But used robustly, graph databases are going to allow you to perform integration across whatever datastores are available to you, using whatever data models they use, and mapped to whatever data model you like. As others may map your graph database to models they prefer as well.

I think the need for documenting those mappings is one that needs attention sooner rather than later.

BTW, feel free to use the phrase “Graph Databases: Information Silo Busters.” (with or without attribution – I want information silos to fall more than I want personal recognition.)