## Archive for the ‘Graph Databases’ Category

### JanusGraph (Linux Foundation Graph Player Rides Into Town)

Wednesday, February 22nd, 2017

JanusGraph

From the homepage:

JanusGraph is a scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster.
JanusGraph is a transactional database that can support thousands of concurrent users executing complex graph traversals in real time.

In addition, JanusGraph provides the following features:

• Elastic and linear scalability for a growing data and user base.
• Data distribution and replication for performance and fault tolerance.
• Multi-datacenter high availability and hot backups.
• Support for ACID and eventual consistency.
• Support for various storage backends:
• Support for global graph data analytics, reporting, and ETL through integration with big data platforms:
• Support for geo, numeric range, and full-text search via:
• Native integration with the TinkerPop graph stack:
• Open source under the Apache 2 license.

You can clone JanusGraph from GitHub.
Read the JanusGraph documentation and join the users or developers mailing lists.

Follow the Getting Started with JanusGraph guide for a step-by-step introduction.

Supported by Google, IBM and Hortonworks, among others.

Three good reasons to pay attention to JanusGraph early and often.

Enjoy!

### OrientDB 2.2 Beta – But Is It FBI Upgrade Secure?

Monday, February 22nd, 2016

OrientDB 2.2 Beta

From the post:

Our engineers have been working hard to make OrientDB even better and we’re now ready to share it with you! We’re pleased to announce OrientDB 2.2 Beta. We’re really proud of this release as it includes multiple enhancements in both the Community Edition and the Enterprise Edition. Please note that this is still a beta release, so the software and documentation still have some rough edges to iron out.

We’ve focused on five main themes for this release: Security, Operations, Performance, APIs and Teleporter.

Security for us is paramount as customers are storing their valuable, critical and confidential data in OrientDB. With improved auditing and authentication, password salt and data-at-rest encryption, OrientDB is now one of the most secure DBMSs ever.

Operations

Our users have been asking for incremental backup, so we delivered. Now it’s ready to be tested! That’s not the only addition to version 2.2, as the new OrientDB Studio now replaces the workbench, adding a new architecture and new modules.

Performance

We’ve done multiple optimizations to the core engine and, in many cases, performance increased tenfold. Our distributed capabilities are also constantly improving and we’ve introduced fast-resync of nodes. This release supports the new configurable Graph Consistency to speed-up change operations against graphs.

APIs

“SQL is the English of databases” and we’re constantly improving our SQL access layer to simplify graph operations. New additions include Pattern Matching, Command Cache, Automated Parallel Queries and Live Query graduating from experimental to stable. OrientJS, the official Node driver, now supports native unmarshalling.

Teleporter

You may have heard rumors about a new, easy way to convert your relational databases to OrientDB. It’s time to formally release Teleporter: a new tool to sync with other databases and simplify migrations to OrientDB.

With the FBI attempting to enslave Apple to breach the security of all iPhones, I have to ask if the security features of OrientDB are FBI upgrade secure?

Same question applies to other software as well. OrientDB happens to be the first software release I have seen since the FBI decided it can press gang software vendors to subvert their own software security.

Thoughts on how to protect systems from upgrade removal of security measures?

Unlike the fictional terrorists being pursued by the FBI (ask the TSA, not one terrorist has been seen at a USA airport in the 5277 days since 9/11), the FBI poses a clear and present danger to the American public.

Tim Cook (Apple) and others should stop conceding the terrorist issue to the FBI. It just encourages their delusion to the detriment of us all.

### DegDB (Open Source Distributed Graph Database) [Tackling Who Pays For This Data?]

Tuesday, November 17th, 2015

The Design Doc/Ramble reads in part:

Problems With Existing Graph Databases

• Owned by private companies with no incentive to share.
• Public databases are used by few people with no incentive to contribute.
• Large databases can’t fit on one machine and are expensive to traverse.
• Existing distributed graph databases require all nodes to be trusted.

Incentivizing Hosting of Data

Every request will have either a debit (with attached bitcoin) or credit (with bitcoin promised on delivery) payment system. The server nodes will attempt to estimate how much it will cost to serve the data and if there isn’t enough bitcoin attached, will drop the request. This makes large nodes want to serve as much popular data as possible, because it allows for faster responses as well as not having to pay other nodes for their data. At the same time, little used data will cost more to access due to requiring more hops to find the data and “cold storage” servers can inflate the prices thus making it profitable for them.

Incentivizing Creation of Data

Data Creation on Demand

A system for requesting certain data to be curated can be employed. The requestor would place a bid for a certain piece of data to be curated, and after n-sources add the data to the graph and verify its correctness the money would be split between them.
This system could be ripe for abuse by having bots automatically fulfilling every request with random data.

Creators Paid on Usage

This method involves the consumers of the data keeping track of their data sources and upon usage paying them. This is a trust based model and may end up not paying creators anything.
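To make the two payment ideas in the design doc concrete, here is a minimal sketch of the debit case (serve only when the attached payment covers the estimated cost) and of splitting a curation bid among verifying sources. None of this is DegDB code: the `Request` type, the function names and the per-hop and per-kilobyte prices are hypothetical, and amounts are whole satoshis to avoid floating point.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """A data request with payment attached (the "debit" case)."""
    key: str
    attached_satoshis: int


# Hypothetical prices; the design doc does not specify any.
PRICE_PER_HOP = 50       # cost of asking another node for the data
PRICE_PER_KILOBYTE = 2   # cost of serving the bytes themselves


def estimate_cost(hops: int, size_kb: int) -> int:
    """Estimate the cost of serving a request, in satoshis."""
    return hops * PRICE_PER_HOP + size_kb * PRICE_PER_KILOBYTE


def should_serve(request: Request, hops: int, size_kb: int) -> bool:
    """Serve only if the attached payment covers the estimated cost."""
    return request.attached_satoshis >= estimate_cost(hops, size_kb)


def split_bid(bid_satoshis: int, sources: list[str]) -> dict[str, int]:
    """Split a curation bid evenly among distinct verifying sources.

    Leftover satoshis go to the earliest sources so the total is preserved.
    """
    if not sources:
        raise ValueError("at least one source is required")
    share, remainder = divmod(bid_satoshis, len(sources))
    return {
        source: share + (1 if index < remainder else 0)
        for index, source in enumerate(sources)
    }


# Popular data held locally needs no hops, so it is cheap to serve.
popular = Request(key="popular", attached_satoshis=100)
print(should_serve(popular, hops=0, size_kb=10))

# Little-used data several hops away costs more than was attached.
rare = Request(key="rare", attached_satoshis=100)
print(should_serve(rare, hops=3, size_kb=10))

payouts = split_bid(1000, ["alice", "bob", "carol"])
print(payouts)
```

Note that nothing in `split_bid` checks whether the data is any good, which is exactly the bot problem the design doc worries about.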

The one “wow” factor of this project is the forethought to put the discussion of “who pays for this data?” up front and center.

We have all seen the failing model that starts with:

For only $35.00 (U.S.) you can view this article for 24 hours.

That makes you feel like you are almost robbing the publisher at that price. (NOT!) Right. I’m tracking down a citation to make sure a quote or data is correct and I am going to pay $35.00 (U.S.) to have access for 24 hours.

Considering that the publishers with those pricing models have already made back their costs of production and publication plus a profit from institutional subscribers (challenge them for the evidence if they deny), a very low micro-payment would be more suitable. Say $0.01 per paragraph or something on that order. Payable out of a deposit with the publisher.

I would amend the Creators Paid on Usage section to have created content unlocked only upon payment (set by the creator). Over time, creators would develop reputations for the value of their data and if you choose to buy from a private seller with no online history, that’s just your bad.

Imagine that for the Paris incident (hypothetical, none of the following is true), I had the school records for half of the people carrying out that attack. Not only do I have the originals but I also have them translated into English, assuming some or all of them are in some other language. I could cast that data (I’m not fond of the poverty of triples) into a graph format and make it known as part of a distributed graph system.

Some of the data, such as the identities of the people for whom I had records, would appear in the graphs of others as “new” data. Up to the readers of the graph to decide if the data and the conditions for seeing it are acceptable to them.

Data could even carry a public price tag. That is if you want to pay a large enough sum, then the data in question will be opened up for everyone to have access to it.

I don’t know of any micropayment systems that are eating at the foundations of traditional publishers now but there will be many attempts before one eviscerates them one and all.

The choices we face now of “free” (read unpaid for research, writing and publication, which excludes many) versus the “pay-per-view” model that supports early 20th century models of sloth, cronyism and gate-keeping, aren’t the only ones. We need to actively seek out better and more nuanced choices.

### 20+ Free Graph Databases

Monday, June 1st, 2015

20+ Free Graph Databases

The original post has prose descriptions of each database but for my purposes a list of links is more than sufficient:

Like any other software, you should evaluate these graph databases against your requirements and data. Do not assume that the terminology used in graph database documentation (such as it is) matches your understanding of graph terminology or indeed that used anywhere in published graph literature.

Having said that, some of the graph software listed above is truly extraordinary.

### Graph data management

Thursday, February 19th, 2015

Graph data management by Amol Deshpande.

From the post:

Graph data management has seen a resurgence in recent years, because of an increasing realization that querying and reasoning about the structure of the interconnections between entities can lead to interesting and deep insights into a variety of phenomena. The application domains where graph or network analytics are regularly applied include social media, finance, communication networks, biological networks, and many others. Despite much work on the topic, graph data management is still a nascent topic with many open questions. At the same time, I feel that the research in the database community is fragmented and somewhat disconnected from application domains, and many important questions are not being investigated in our community. This blog post is an attempt to summarize some of my thoughts on this topic, and what exciting and important research problems I think are still open.

At its simplest, graph data management is about managing, querying, and analyzing a set of entities (nodes) and interconnections (edges) between them, both of which may have attributes associated with them. Although much of the research has focused on homogeneous graphs, most real-world graphs are heterogeneous, and the entities and the edges can usually be grouped into a small number of well-defined classes.

Graph processing tasks can be broadly divided into a few categories. (1) First, we may want to execute standard SQL queries, especially aggregations, by treating the node and edge classes as relations. (2) Second, we may have queries focused on the interconnection structure and its properties; examples include subgraph pattern matching (and variants), keyword proximity search, reachability queries, counting or aggregating over patterns (e.g., triangle/motif counting), grouping nodes based on their interconnection structures, path queries, and others. (3) Third, there is usually a need to execute basic or advanced graph algorithms on the graphs or their subgraphs, e.g., bipartite matching, spanning trees, network flow, shortest paths, traversals, finding cliques or dense subgraphs, graph bisection/partitioning, etc. (4) Fourth, there are “network science” or “graph mining” tasks where the goal is to understand the interconnection network, build predictive models for it, and/or identify interesting events or different types of structures; examples of such tasks include community detection, centrality analysis, influence propagation, ego-centric analysis, modeling evolution over time, link prediction, frequent subgraph mining, and many others [New10]. There is much research still being done on developing new such techniques; however, there is also increasing interest in applying the more mature techniques to very large graphs and doing so in real-time. (5) Finally, many general-purpose machine learning and optimization algorithms (e.g., logistic regression, stochastic gradient descent, ADMM) can be cast as graph processing tasks in appropriately constructed graphs, allowing us to solve problems like topic modeling, recommendations, matrix factorization, etc., on very large inputs [Low12].

Prior work on graph data management could itself be roughly divided into work on specialized graph databases and on large-scale graph analytics, which have largely evolved separately from each other; the former has considered end-to-end data management issues including storage representations, transactions, and query languages, whereas the latter work has typically focused on processing specific tasks or types of tasks over large volumes of data. I will discuss those separately, focusing on whether we need “new” systems for graph data management and on open problems.
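Category (2) in the quoted passage mentions triangle counting. As a rough illustration of what such a structural query involves, here is a minimal in-memory sketch. It assumes a small, simple, undirected graph (no self-loops) given as an edge list; the function name and the sample edges are mine, not from the post, and this naive approach would not scale to the very large graphs the post has in mind.

```python
from itertools import combinations


def count_triangles(edges):
    """Count triangles in a simple undirected graph given as (u, v) pairs."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    corner_count = 0
    for neighbors in adjacency.values():
        # Two neighbors of the same node that are also connected to each
        # other close a triangle with that node as one corner.
        for a, b in combinations(neighbors, 2):
            if b in adjacency[a]:
                corner_count += 1

    # Each triangle was found once from each of its three corners.
    return corner_count // 3


sample_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(count_triangles(sample_edges))
```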

Very much worth a deep, slow read. Despite marketing claims about graph databases, fundamental issues remain to be solved.

Enjoy!

I first saw this in a tweet by Kirk Borne.

### OrientDB Manual – version 1.7.8

Tuesday, August 26th, 2014

OrientDB Manual – version 1.7.8

From the post:

Welcome to OrientDB – the first Multi-Model Open Source NoSQL DBMS that brings together the power of graphs and the flexibility of documents into one scalable, high-performance operational database. OrientDB is sponsored by Orient Technologies, LTD.

OrientDB has features of both Document and Graph DBMSs. Written in Java and designed to be exceptionally fast: it can store up to 150,000 records per second on common hardware. Not only can it embed documents like any other Document database, but it manages relationships like Graph Databases with direct connections among records. You can traverse parts of or entire trees and graphs of records in a few milliseconds.

OrientDB supports schema-less, schema-full and schema-mixed modes and it has a strong security profiling system based on users and roles. Thanks to the SQL layer, OrientDB query language is straightforward and easy to use, especially for those skilled in the relational DBMS world.

Take a look at some OrientDB Presentations.

A new version of the documentation for OrientDB. I saw this last week but forgot to mention it.

### Cayley

Wednesday, June 25th, 2014

Cayley – An open-source graph database

From the webpage:

Cayley is an open-source graph inspired by the graph database behind Freebase and Google’s Knowledge Graph.

Its goal is to be a part of the developer’s toolbox where Linked Data and graph-shaped data (semantic webs, social networks, etc) in general are concerned.

Features

• Written in Go
• Easy to get running (3 or 4 commands, below)
• RESTful API
• or a REPL if you prefer
• Built-in query editor and visualizer
• Multiple query languages:
• Javascript, with a Gremlin-inspired* graph object.
• Plays well with multiple backend stores:
• Modular design; easy to extend with new languages and backends
• Good test coverage
• Speed, where possible.

Rough performance testing shows that, on consumer hardware and an average disk, 134m triples in LevelDB is no problem and a multi-hop intersection query — films starring X and Y — takes ~150ms.
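For readers new to triple stores, the “films starring X and Y” query boils down to intersecting two sets of subjects. The sketch below shows that idea over an in-memory list of triples. It is not Cayley’s API or query language; the triples, the `starring` predicate and the function name are made up for illustration.

```python
# Hypothetical (subject, predicate, object) triples.
triples = [
    ("film:1", "starring", "actor:x"),
    ("film:1", "starring", "actor:y"),
    ("film:2", "starring", "actor:x"),
    ("film:3", "starring", "actor:y"),
    ("film:4", "starring", "actor:x"),
    ("film:4", "starring", "actor:y"),
]


def films_starring(actor):
    """Return the set of films that have a 'starring' edge to the actor."""
    return {
        subject
        for subject, predicate, obj in triples
        if predicate == "starring" and obj == actor
    }


# Films starring both X and Y: the intersection of the two result sets.
both = films_starring("actor:x") & films_starring("actor:y")
print(sorted(both))
```

A real store answers the same question from indexes rather than a full scan, which is where the ~150ms figure on 134m triples comes from.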

If you are seriously thinking about a graph database, see also these comments. Not everything you need to know but useful comments nonetheless.

I first saw this in a tweet from Hacker News.

### Titan: Scalable Graph Database

Tuesday, April 15th, 2014

Titan: Scalable Graph Database by Matthias Broecheler.

Conference presentation so long on imagery but short on detail. 😉

However, useful to walk your manager through as a pitch for support to investigate further.

When that support is given, check out: http://thinkaurelius.github.io/titan/. Links to source code, other resources, etc.

### Graph Databases – 250% Spike in Popularity – Really?

Saturday, January 25th, 2014

I prefer graph databases for a number of reasons but the rhetoric about them has gotten completely out of hand.

The most recent Internet rumor is that graph databases had a 250% spike in popularity.

Really?

Care to guess how that “measurement” was taken? It was more intellectually honest than Office of Management and Budget’s sequestration numbers, but only just.

Here are the parameters for the 250% increase:

• Number of mentions of the system on websites, measured as number of results in search engines queries. At the moment, we use Google and Bing for this measurement. In order to count only relevant results, we are searching for “[system name] database”, e.g. “Oracle database”.
• General interest in the system. For this measurement, we use the frequency of searches in Google Trends.
• Frequency of technical discussions about the system. We use the number of related questions and the number of interested users on the well-known IT-related Q&A sites Stack Overflow and DBA Stack Exchange.
• Number of job offers, in which the system is mentioned. We use the number of offers on the leading job search engines Indeed and Simply Hired.
• Number of profiles in professional networks, in which the system is mentioned. We use the internationally most popular professional network LinkedIn.

We calculate the popularity value of a system by standardizing and averaging of the individual parameters. These mathematical transformations are made in a way so that the distance of the individual systems is preserved. That means, when system A has twice as large a value in the DB-Engines Ranking as system B, then it is twice as popular when averaged over the individual evaluation criteria.

The DB-Engines Ranking does not measure the number of installations of the systems, or their use within IT systems. It can be expected, that an increase of the popularity of a system as measured by the DB-Engines Ranking (e.g. in discussions or job offers) precedes a corresponding broad use of the system by a certain time factor. Because of this, the DB-Engines Ranking can act as an early indicator. (emphasis added) (Source: DB-Engines)
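DB-Engines does not spell out its exact formula in the text quoted above, so the following is only a guess at the general shape of “standardizing and averaging”: scale each parameter by its maximum across systems, then average. The function names and every number below are invented for illustration. The last lines show how a small base produces a large percentage.

```python
def popularity_scores(raw_counts):
    """Map {system: {parameter: count}} to {system: averaged score}.

    Assumption: each parameter is scaled by its maximum across systems.
    This is one plausible standardization, not DB-Engines' published method.
    """
    parameters = sorted({p for counts in raw_counts.values() for p in counts})
    maxima = {
        p: max(counts.get(p, 0) for counts in raw_counts.values())
        for p in parameters
    }
    scores = {}
    for system, counts in raw_counts.items():
        # Dividing by the maximum rescales a parameter without changing
        # the ratios between systems on that parameter.
        scaled = [
            counts.get(p, 0) / maxima[p] for p in parameters if maxima[p] > 0
        ]
        scores[system] = sum(scaled) / len(scaled) if scaled else 0.0
    return scores


def percent_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100


# Invented counts: system A has twice the mentions and job offers of B.
raw = {
    "A": {"mentions": 1000, "job_offers": 200},
    "B": {"mentions": 500, "job_offers": 100},
}
scores = popularity_scores(raw)
print(scores)

# A category score moving from 2.0 to 7.0 is a 250% increase,
# yet both values are tiny next to a market leader.
print(percent_change(2.0, 7.0))
```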

So, this 250% increase in popularity is like a high school cheerleader election. Yes?

Oracle may have signed several nation-level contracts in the past year but is outdistanced in the rankings by Twitter traffic?

Not what I would call reliable intelligence.

PS: the rumor apparently originates with: Tables turning? Graph databases see 250% spike in popularity by Lucy Carey.

Personally I can’t see how Lucy got 250% out of the reported numbers. There is a story about repeating something so often that it is believed. Do you remember it?

### Why Relationships are cool…

Sunday, December 8th, 2013

Why Relationships are cool but the “JOIN” sucks by Luca Garulli.

I have been trying to avoid graph “intro” slides and presentations.

There are only so many times you can stand to hear “…all the world is a graph…” as though that’s news. To anyone.

This presentation by Luca is different from the usual introduction to graphs presentation.

Most of my readers won’t learn anything new but it may bump them into thinking of new ways to advocate the use of graphs.

By the way, Luca is responsible for OrientDB.

OrientDB version 1.6.1 was released on November 20, 2013, so if you haven’t looked at OrientDB in a while, now might be the time.

### How to Use Graph Databases… [Topic Maps as Graph++?]

Tuesday, November 26th, 2013

You have a choice of titles:

How to Use Graph Databases to Analyze Relationships, Risks and Business Opportunities (YouTube)

Graph Databases, Triple Stores and their uses… (slides of Jans Aasman at Enterprise Data World 2012)

From the description:

Graph databases are one of the new technologies encouraging a rapid re-thinking of the analytics landscape. By tracking relationships – in a network of people, organizations, events and data – and applying reasoning (inference) to the data and connections, powerful new answers and insights are enabled.

This presentation will explain how graph databases work, and how graphs can be used for a number of important functions, including risk management, relationship analysis and the identification of new business opportunities. It will use a case study in the manufacturing sector to demonstrate how complex relationships can be discovered and integrated into analytical systems. For example, what are the repercussions for the supply chain of a major flood in China? Which products are affected by political unrest in Thailand? Has a sub-subcontractor started selling to our competition and what does that mean for us? What happened historically to the price of an important sub-component when the prices for crude oil or any other raw material went up? Lots of answers can be provided by graph (network) analysis that cannot be answered any other way, so it is crucial that business and BI executives learn how to use this important new tool.

At time marks 18:30 to 19:09, major customers who are interested in graph databases.
An impressive list of potential customers.

How to Use Graph Databases to Analyze Relationships, Risks and Business Opportunities (YouTube) (9,530 “hits”)

Graph Databases, Triple Stores and their uses… (slides of Jans Aasman at Enterprise Data World 2012) (7 “hits”)

If you pick the wrong title as your search string, you will miss 9,523 mentions of this video on the WWW.

The same danger comes up when you rely on normalized data, the sort of data you saw in this video.

If the data you are searching includes data that was missed when normalizing, well, you just don’t find it.

With a topic map based system, normalization isn’t necessary so long as there is mapping in the topic map.

Think of it this way, you can normalize data over and over again, making it unusable by its source, or you can create a mapping rule into a topic map once.

And the data remains findable by its original creator or source.

I would say yes, topic maps are graphs++, they don’t require normalization.
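A minimal sketch of the “map once” idea, assuming nothing more than a dictionary from a subject identifier to every name the subject goes by. This is not a topic map engine or any standard API; the subject identifier and the function name are hypothetical.

```python
# One subject, two titles. Neither title is rewritten or normalized.
subject_names = {
    "video:graph-databases-talk": {
        "How to Use Graph Databases to Analyze Relationships, "
        "Risks and Business Opportunities",
        "Graph Databases, Triple Stores and their uses…",
    },
}

# The mapping rule, built once: every known name points at its subject.
name_index = {
    name.casefold(): subject
    for subject, names in subject_names.items()
    for name in names
}


def find_subject(title):
    """Look up a subject by any of its names, ignoring letter case."""
    return name_index.get(title.casefold())


print(find_subject("Graph Databases, Triple Stores and their uses…"))
```

Either title finds the same subject, and each source still sees its data under the name it chose.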

### Massive Query Expansion by Exploiting Graph Knowledge Bases

Friday, October 25th, 2013

Massive Query Expansion by Exploiting Graph Knowledge Bases by Joan Guisado-Gámez, David Dominguez-Sal, Josep-Lluis Larriba-Pey.

Abstract:

Keyword based search engines have problems with term ambiguity and vocabulary mismatch. In this paper, we propose a query expansion technique that enriches queries expressed as keywords and short natural language descriptions. We present a new massive query expansion strategy that enriches queries using a knowledge base by identifying the query concepts, and adding relevant synonyms and semantically related terms. We propose two approaches: (i) lexical expansion that locates the relevant concepts in the knowledge base; and, (ii) topological expansion that analyzes the network of relations among the concepts, and suggests semantically related terms by path and community analysis of the knowledge graph. We perform our expansions by using two versions of the Wikipedia as knowledge base, concluding that the combination of both lexical and topological expansion provides improvements of the system’s precision up to more than 27%.

Heavy reading for the weekend but this paragraph caught my eye:

In this paper, we propose a novel query expansion which uses a hybrid input in the form of a query phrase and a context, which are a set of keywords and a short natural language description of the query phrase, respectively. Our method is based on the combination of both a lexical and topological analysis of the concepts related to the input in the knowledge base. We differ from previous works because we are not considering the links of each article individually, but we are mining the global link structure of the knowledge base to find related terms using graph mining techniques. With our technique we are able to identify: (i) the most relevant concepts and their synonyms, and (ii) a set of semantically related concepts. Most relevant concepts provide equivalent reformulations of the query that reduce the vocabulary mismatch. Semantically related concepts introduce many different terms that are likely to appear in a relevant document, which is useful to solve the lack of topic expertise and also disambiguate the keywords.
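To get a feel for the two kinds of expansion, here is a toy sketch. It is not the authors’ method: lexical expansion is reduced to a synonym lookup, and topological expansion is a plain hop-limited walk over links rather than the path and community analysis the paper describes. The tiny knowledge base and all function names are invented.

```python
from collections import deque

# Invented knowledge base: synonyms per concept and links between concepts.
SYNONYMS = {"car": {"automobile", "motor car"}}
LINKS = {
    "car": {"engine", "road"},
    "engine": {"car", "fuel"},
    "road": {"car"},
    "fuel": {"engine"},
}


def lexical_expansion(concept):
    """Return known synonyms of the concept."""
    return set(SYNONYMS.get(concept, set()))


def topological_expansion(concept, max_hops=1):
    """Return concepts reachable within max_hops links of the concept."""
    seen = {concept}
    frontier = deque([(concept, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for neighbor in LINKS.get(node, set()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen - {concept}


def expand_query(keywords, max_hops=1):
    """Combine the original keywords with both kinds of expansion."""
    expanded = set(keywords)
    for keyword in keywords:
        expanded |= lexical_expansion(keyword)
        expanded |= topological_expansion(keyword, max_hops)
    return expanded


print(sorted(expand_query(["car"])))
```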

I wonder: since it works with Wikipedia, should the same be true for the references (not hyperlinks) of traditional publishing?

Say documents in Citeseer for example?

Nothing against Wikipedia but general knowledge doesn’t have a very high retail value.

### Are Graph Databases Ready for Bioinformatics?

Friday, October 18th, 2013

Are Graph Databases Ready for Bioinformatics? by Christian Theil Have and Lars Juhl Jensen.

From the editorial:

Graphs are ubiquitous in bioinformatics and frequently consist of too many nodes and edges to represent in RAM. These graphs are thus stored in databases to allow for efficient queries using declarative query languages such as SQL. Traditional relational databases (e.g. MySQL and PostgreSQL) have long been used for this purpose and are based on decades of research into query optimization. Recently, NoSQL databases have caught a lot of attention due to their advantages in scalability. The term NoSQL is used to refer to schemaless databases such as key/value stores (e.g. Apache Cassandra), document stores (e.g. MongoDB) and graph databases (e.g. AllegroGraph, Neo4J, OpenLink Virtuoso), which do not fit within the traditional relational paradigm. Most NoSQL databases do not have a declarative query language. The widely used Neo4J graph database is an exception. Its query language Cypher is designed for expressing graph queries, but is still evolving.

Graph databases have so far seen only limited use within bioinformatics [Schriml et al., 2013]. To illustrate the pros and cons of using a graph database (exemplified by Neo4J v1.8.1) instead of a relational database (PostgreSQL v9.1) we imported into both the human interaction network from STRING v9.05 [Franceschini et al., 2013], which is an approximately scale-free network with 20,140 proteins and 2.2 million interactions. As all graph databases, Neo4J stores edges as direct pointers between nodes, which can thus be traversed in constant time. Because Neo4j uses the property graph model, nodes and edges can have properties associated with them; we use this for storing the protein names and the confidence scores associated with the interactions. In PostgreSQL, we stored the graph as an indexed table of node pairs, which can be traversed with either logarithmic or constant lookup complexity depending on the type of index used. On these databases we benchmarked the speed of Cypher and SQL queries for solving three bioinformatics graph processing problems: finding immediate neighbors and their interactions, finding the best scoring path between two proteins, and finding the shortest path between them. We have selected these three tasks because they illustrate well the strengths and weaknesses of graph databases compared to traditional relational databases.
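The relational side of that comparison, an indexed table of node pairs, is easy to try out. The sketch below uses SQLite from the Python standard library in place of PostgreSQL, and the table name, column names, protein identifiers and scores are all invented. It covers the first benchmark task only: immediate neighbors and the interactions among them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE interactions (protein_a TEXT, protein_b TEXT, score REAL)"
)
conn.execute("CREATE INDEX idx_protein_a ON interactions (protein_a)")

# Invented interactions; each undirected edge is stored in both
# directions so a single index on protein_a serves every lookup.
edges = [("P1", "P2", 0.9), ("P1", "P3", 0.7), ("P2", "P3", 0.8), ("P3", "P4", 0.4)]
conn.executemany(
    "INSERT INTO interactions VALUES (?, ?, ?)",
    edges + [(b, a, score) for a, b, score in edges],
)


def neighbors(protein):
    """Immediate interaction partners of a protein."""
    cursor = conn.execute(
        "SELECT protein_b FROM interactions WHERE protein_a = ? ORDER BY protein_b",
        (protein,),
    )
    return [row[0] for row in cursor]


def neighbor_interactions(protein):
    """Interactions among the immediate neighbors of a protein."""
    cursor = conn.execute(
        """
        SELECT i.protein_a, i.protein_b, i.score
        FROM interactions AS i
        WHERE i.protein_a IN (SELECT protein_b FROM interactions WHERE protein_a = ?)
          AND i.protein_b IN (SELECT protein_b FROM interactions WHERE protein_a = ?)
          AND i.protein_a < i.protein_b  -- report each undirected edge once
        ORDER BY i.protein_a, i.protein_b
        """,
        (protein, protein),
    )
    return cursor.fetchall()


print(neighbors("P1"))
print(neighbor_interactions("P1"))
```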

Encouraging but also makes the case for improvements in graph database software.

### NoSQL: Data Grids and Graph Databases

Monday, August 26th, 2013

NoSQL: Data Grids and Graph Databases by Al Rubinger.

Chapter Six of Continuous Enterprise Development in Java by Andrew Lee Rubinger and Aslak Knutsen. Accompanying website.

From chapter six:

Until relatively recently, the RDBMS reigned over data in enterprise applications by a wide margin when contrasted with other approaches. Commercial offerings from Oracle and established open-source projects like MySQL (reborn MariaDB) and PostgreSQL became de facto choices when it came to storing, querying, archiving, accessing, and obtaining data. In retrospect, it’s shocking that given the varying requirements from those operations, one solution was so heavily lauded for so long.

In the late 2000s, a trend away from the strict ACID transactional properties could be clearly observed given the emergence of data stores that organized information differently from the traditional table model:

• Document-oriented
• Object-oriented
• Key/Value stores
• Graph models

In addition, many programmers were beginning to advocate for a release from strict transactions; in many use cases it appeared that this level of isolation wasn’t enough of a priority to warrant the computational expense necessary to provide ACID guarantees.

No, what’s shocking is the degree of historical ignorance among people who criticize RDBMS systems. Either that or they are simply parroting what other ignorant people are saying about RDBMS systems.

Don’t get me wrong, I strongly prefer NoSQL solutions in some cases. But it is a question of requirements and not making up tales about RDBMS systems.

For example, in A transient hypergraph-based model for data access Carolyn Watters and Michael A. Shepherd write:

Two major methods of accessing data in current database systems are querying and browsing. The more traditional query method returns an answer set that may consist of data values (DBMS), items containing the answer (full text), or items referring the user to items containing the answer (bibliographic). Browsing within a database, as best exemplified by hypertext systems, consists of viewing a database item and linking to related items on the basis of some attribute or attribute value. A model of data access has been developed that supports both query and browse access methods. The model is based on hypergraph representation of data instances. The hyperedges and nodes are manipulated through a set of operators to compose new nodes and to instantiate new links dynamically, resulting in transient hypergraphs. These transient hypergraphs are virtual structures created in response to user queries, and lasting only as long as the query session. The model provides a framework for general data access that accommodates user-directed browsing and querying, as well as traditional models of information and data retrieval, such as the Boolean, vector space, and probabilistic models. Finally, the relational database model is shown to provide a reasonable platform for the implementation of this transient hypergraph-based model of data access. (Emphasis added.)

Oh, did I say that paper was written in 1990, some twenty-three years ago?

So twenty-three (23) years ago that bad old RDBMS model was capable of implementing a hypergraph.

A hypergraph that had, wait for it, true hyperedges, not the faux hyperedges claimed by some graph databases.
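If the claim sounds surprising, consider how little it takes to hold a true hyperedge (one edge joining any number of nodes) in relational form: an incidence table of (hyperedge, node) rows. The sketch below uses SQLite from the Python standard library. It is my own illustration of the representation, not the model from the Watters and Shepherd paper, and the table, hyperedge and node names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE incidence ("
    "hyperedge TEXT, node TEXT, PRIMARY KEY (hyperedge, node))"
)

# One hyperedge joins three nodes at once; no pairwise edges are needed.
conn.executemany(
    "INSERT INTO incidence VALUES (?, ?)",
    [
        ("coauthors:paper1", "alice"),
        ("coauthors:paper1", "bob"),
        ("coauthors:paper1", "carol"),
        ("coauthors:paper2", "carol"),
        ("coauthors:paper2", "dave"),
    ],
)


def members(hyperedge):
    """All nodes joined by a hyperedge."""
    cursor = conn.execute(
        "SELECT node FROM incidence WHERE hyperedge = ? ORDER BY node",
        (hyperedge,),
    )
    return [row[0] for row in cursor]


def hyperedges_of(node):
    """All hyperedges a node takes part in."""
    cursor = conn.execute(
        "SELECT hyperedge FROM incidence WHERE node = ? ORDER BY hyperedge",
        (node,),
    )
    return [row[0] for row in cursor]


print(members("coauthors:paper1"))
print(hyperedges_of("carol"))
```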

It’s that lack of accuracy that makes me wonder about what else has been missed?

### OrientDB Graph Database 1.5!

Thursday, August 1st, 2013

From the post:

This release could actually be named 2.0 for all the various new features:

• New PLOCAL (Paginated Local) storage engine.  In comparison with LOCAL, it’s more durable (no usage of MMAP) and supports better concurrency on parallel transactions.
• New Hash Index type with better performance on lookups. It does not support ranges.
• New “transactional” SQL command to execute commands inside a transaction. This is useful for the “create edge” SQL command, in order to avoid corruption of the graph
• Import now migrates RIDs, allowing the ability to import database objects in a different one from the original
• “Breadth first” strategy added on traversing (Java and SQL APIs)
• Server can limit maximum live connections (to prevent DOS)
• Fetch plan support in SQL statements and in binary protocol for synchronous commands too
• Various bug fixes
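The note that the new hash index “does not support ranges” reflects a general trade-off, which can be sketched without any OrientDB code: a hash table gives fast exact lookups but keeps no key order, while range queries need sorted keys. The record values and function name below are invented, and this says nothing about how OrientDB implements either index.

```python
import bisect

# Invented records keyed by an integer field.
records = {17: "r17", 3: "r3", 42: "r42", 8: "r8"}

# Hash-style index: exact-match lookup only, no useful key order.
hash_index = dict(records)
print(hash_index.get(42))

# Ordered index: sorted keys make range scans possible.
sorted_keys = sorted(records)


def range_lookup(low, high):
    """Return records whose key is between low and high, inclusive."""
    start = bisect.bisect_left(sorted_keys, low)
    end = bisect.bisect_right(sorted_keys, high)
    return [records[key] for key in sorted_keys[start:end]]


print(range_lookup(5, 20))
```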

Certainly does sound like a 2.0 release!

### Database Landscape Map – February 2013

Wednesday, March 27th, 2013

Database Landscape Map – February 2013 by 451 Research.

A truly awesome map of available databases.

Originated from Neither fish nor fowl: the rise of multi-model databases by Matthew Aslett.

Matthew writes:

One of the most complicated aspects of putting together our database landscape map was dealing with the growing number of (particularly NoSQL) databases that refuse to be pigeon-holed in any of the primary databases categories.

I have begun to refer to these as “multi-model databases” in recognition of the fact that they are able to take on the characteristics of multiple databases. In truth though there are probably two different groups of products that could be considered “multi-model”:

I think I understand the grouping from the key to the map but the ordering within groups, if meaningful, escapes me.

I am sure you will recognize most of the names but equally sure there will be some you can’t quite describe.

Enjoy!

### Dydra

Tuesday, March 26th, 2013

Dydra

From the webpage:

Dydra

Dydra is a cloud-based graph database. Whether you’re using existing social network APIs or want to build your own, Dydra treats your customers’ social graph as exactly that.

With Dydra, your data is natively stored as a property graph, directly representing the relationships in the underlying data.

Expressive

With Dydra, you access and update your data via an industry-standard query language specifically designed for graph processing, SPARQL. It’s easy to use and we provide a handy in-browser query editor to help you learn.

Despite my misgivings about RDF (Simple Web Semantics), if you want to investigate RDF and SPARQL, Dydra would be a good way to get your feet wet.

You can get an idea of the skill level required by RDF/SPARQL.

Currently in beta, free with some resource limitations.

I particularly liked the line:

We manage every piece of the data store, including versioning, disaster recovery, performance, and more. You just use it.

RDF/SPARQL skills will remain a barrier but Dydra does its best to make those the only barriers you will face. (And has reduced some of those.)

Definitely worth your attention, whether you simply want to practice on RDF/SPARQL as a data source or have other uses for it.

I first saw this in a tweet by Stian Danenbarger.

### LDBC – Second Technical User Community (TUC) Meeting

Monday, March 18th, 2013

LDBC: Linked Data Benchmark Council – Second Technical User Community (TUC) Meeting – 22/23rd April 2013.

From the post:

The LDBC consortium are pleased to announce the second Technical User Community (TUC) meeting.

This will be a two day event in Munich on the 22/23rd April 2013.

The event will include:

• Introduction to the objectives and progress of the LDBC project.
• Description of the progress of the benchmarks being evolved through Task Forces.
• Users explaining their use-cases and describing the limitations they have found in current technology.
• Industry discussions on the contents of the benchmarks.

All users of RDF and graph databases are welcome to attend. If you are interested, please contact: ldbc AT ac DOT upc DOT edu.

Further meeting details at the post.

### Distributed Graph Computing with Gremlin

Thursday, March 7th, 2013

Distributed Graph Computing with Gremlin by Marko A. Rodriguez.

From the post:

The script-step in Faunus’ Gremlin allows for the arbitrary execution of a Gremlin script against all vertices in the Faunus graph. This simple idea has interesting ramifications for Gremlin-based distributed graph computing. For instance, it is possible evaluate a Gremlin script on every vertex in the source graph (e.g. Titan) in parallel while maintaining data/process locality. This section will discuss the following two use cases.

• Global graph mutations: parallel update vertices/edges in a Titan cluster given some arbitrary computation.
• Global graph algorithms: propagate information to arbitrary depths in a Titan cluster in order to compute some algorithm in a parallel fashion.
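The two use cases can be sketched in plain Python. This is not Gremlin, Faunus or Titan code: the toy graph, the thread pool standing in for a cluster, and the function names are all assumptions made for illustration. The first function runs a "script" on every vertex independently; the second propagates information one hop per round, which is the general shape of a parallel graph algorithm.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy graph: vertex -> outgoing neighbours.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": ["a"]}
properties = {v: {} for v in graph}


def run_on_every_vertex(script):
    """Apply script(vertex) to each vertex independently and collect the results."""
    with ThreadPoolExecutor() as pool:
        return dict(zip(graph, pool.map(script, graph)))


def set_out_degree(v):
    """Use case 1, global mutation: each vertex is updated from local data only."""
    properties[v]["out_degree"] = len(graph[v])
    return properties[v]["out_degree"]


def hops_from(source, max_rounds):
    """Use case 2, global algorithm: information spreads one hop per round."""
    dist = {source: 0}
    frontier = [source]
    for round_no in range(1, max_rounds + 1):
        # Every frontier vertex produces its outgoing messages in parallel.
        with ThreadPoolExecutor() as pool:
            outboxes = list(pool.map(lambda v: graph[v], frontier))
        frontier = []
        for outbox in outboxes:
            for target in outbox:
                if target not in dist:
                    dist[target] = round_no
                    frontier.append(target)
        if not frontier:
            break
    return dist
```

The `max_rounds` argument plays the role of "arbitrary depth": `hops_from("e", 1)` stops after one hop, while a larger value lets the information reach every vertex reachable from the source.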

Another must read post from Marko A. Rodriguez!

Also a reminder that I need to pull out my Oxford Classical Dictionary to add some material to the mythology graph.

### Pattern Based Graph Generator

Monday, March 4th, 2013

Pattern Based Graph Generator by Hong-Han Shuai, De-Nian Yang, Philip S. Yu, Chih-Ya Shen, Ming-Syan Chen.

Abstract:

The importance of graph mining has been widely recognized thanks to a large variety of applications in many areas, while real datasets always play important roles to examine the solution quality and efficiency of a graph mining algorithm. Nevertheless, the size of a real dataset is usually fixed and constrained according to the available resources, such as the efforts to crawl an on-line social network. In this case, employing a synthetic graph generator is a possible way to generate a massive graph (e.g., billions nodes) for evaluating the scalability of an algorithm, and current popular statistical graph generators are properly designed to maintain statistical metrics such as total node degree, degree distribution, diameter, and clustering coefficient of the original social graphs. Nevertheless, in addition to the above metrics, recent studies on graph mining point out that graph frequent patterns are also important to provide useful implications for the corresponding social networking applications, but this crucial criterion has not been noticed in the existing graph generators. This paper first manifests that numerous graph patterns, especially large patterns that are crucial with important domain-specific semantic, unfortunately disappear in the graphs created by popular statistic graph generators, even though those graphs enjoy the same statistical metrics with the original real dataset. To address this important need, we make the first attempt to propose a pattern based graph generator (PBGG) to generate a graph including all patterns and satisfy the user-specified parameters on supports, network size, degree distribution, and clustering coefficient. Experimental results show that PBGG is efficient and able to generate a billion-node graph with about 10 minutes, and PBGG is released for free download.

OK, this is not a “moderate size database” program.
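The abstract's central claim, that a generator can match statistical metrics and still lose patterns, can be tried on a small scale. The sketch below is not PBGG and not one of the generators evaluated in the paper; it uses degree-preserving double edge swaps as a stand-in, on a hypothetical chain of twenty triangles. The degree of every node is unchanged after rewiring, while the triangle count is free to change (and on a sparse graph like this one it typically drops).

```python
import random


def degrees(edges):
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


def count_triangles(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Each triangle is seen once from each of its three edges.
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3


def rewire(edges, swaps, seed=0):
    """Degree-preserving randomisation: (a,b),(c,d) becomes (a,d),(c,b)."""
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    present = set(edge_list)
    for _ in range(swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        # Skip swaps that would create a self-loop or a duplicate edge.
        if a == d or c == b or new1 == new2 or new1 in present or new2 in present:
            continue
        present -= {edge_list[i], edge_list[j]}
        present |= {new1, new2}
        edge_list[i], edge_list[j] = new1, new2
    return edge_list


# Hypothetical "real" graph: twenty triangles joined in a chain.
original = []
for k in range(20):
    a, b, c = 3 * k, 3 * k + 1, 3 * k + 2
    original += [(a, b), (b, c), (a, c)]
    if k < 19:
        original.append((c, c + 1))

shuffled = rewire(original, swaps=2000)
```

Comparing `count_triangles(original)` with `count_triangles(shuffled)` while `degrees(original) == degrees(shuffled)` holds shows the gap between matching a metric and keeping a pattern.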

### A New Representation of WordNet® using Graph Databases

Monday, March 4th, 2013

A New Representation of WordNet® using Graph Databases by Khaled Nagi.

Abstract:

WordNet® is one of the most important resources in computation linguistics. The semantically related database of English terms is widely used in text analysis and retrieval domains, which constitute typical features, employed by social networks and other modern Web 2.0 applications. Under the hood, WordNet® can be seen as a sort of read-only social network relating its language terms. In our work, we implement a new storage technique for WordNet® based on graph databases. Graph databases are a major pillar of the NoSQL movement with lots of emerging products, such as Neo4j. In this paper, we present two Neo4j graph storage representations for the WordNet® dictionary. We analyze their performance and compare them to other traditional storage models. With this contribution, we also validate the applicability of modern graph databases in new areas beside the typical large-scale social networks with several hundreds of millions of nodes.

Finally, a paper that covers “moderate size databases!”

Think about the average graph database you see on this blog. Not really in the “moderate” range, even though a majority of users work in the moderate range.

Compare the number of Facebook size enterprises with the number of enterprises generally.

Not dissing super-sized graph databases or research on same. I enjoy both a lot.

But for your average customer, experience with “moderate size databases” may be more immediately relevant.
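For a sense of what a graph representation of a lexical database looks like at the small end, here is a minimal sketch in plain Python. It does not reproduce either of the two Neo4j representations in the paper, and the handful of terms and relations are toy data I made up, not extracted from WordNet®. Terms are nodes, typed relations are edges, and a lookup is a traversal restricted to one relation type.

```python
from collections import deque

# Toy data, not taken from WordNet: (source, relation, target) triples.
edges = [
    ("dog", "hypernym", "canine"),
    ("canine", "hypernym", "carnivore"),
    ("carnivore", "hypernym", "mammal"),
    ("cat", "hypernym", "feline"),
    ("feline", "hypernym", "carnivore"),
    ("dog", "synonym", "domestic dog"),
]

# Adjacency list keyed by source term.
out_edges = {}
for source, relation, target in edges:
    out_edges.setdefault(source, []).append((relation, target))


def follow(start, relation):
    """All terms reachable from start using only edges of one relation type."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        term = queue.popleft()
        for rel, target in out_edges.get(term, []):
            if rel == relation and target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order
```

`follow("dog", "hypernym")` walks up the toy hierarchy and ignores the synonym edge.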

I first saw this in a tweet from Peter Neubauer.

### GRADES: Graph Data-management Experiences & Systems

Saturday, December 29th, 2012

GRADES: Graph Data-management Experiences & Systems

Workshop: Sunday June 23, 2013

Papers Due: March 31, 2013

Notification: April 22, 2013

Camera-ready: May 19, 2013

Workshop Scope:

Application Areas

A new data economy is emerging, based on the analysis of distributed, heterogeneous, and complexly structured data sets. GRADES focuses on the problem of managing such data, specifically when data takes the form of graphs that connect many millions of nodes, and the worth of the data and its analysis is not only in the attribute values of these nodes, but in the way these nodes are connected. Specific application areas that exhibit the growing need for management of such graph shaped data include:

• life science analytics, e.g., tracking relationships between illnesses, genes, and molecular compounds.
• social network marketing, e.g., identifying influential speakers and trends propagating through a community.
• digital forensics, e.g., analyzing the relationships between persons and entities for law enforcement purposes.
• telecommunication network analysis, e.g., directed at fixing network bottlenecks and costing of network traffic.
• digital publishing, e.g., enriching entities occurring in digital content with external data sources, and finding relationships among the entities.

Perspectives

The GRADES workshop solicits contributions from two perspectives:

• Experiences. This includes topics that describe use case scenarios, datasets, and analysis opportunities occurring in real-life graph-shaped, as well as benchmark descriptions and benchmark results.
• Systems. This includes topics that describe data management system architectures for processing of Graph and RDF data, and specific techniques and algorithms employed inside such systems.

The combination of the two (Experiences with Systems) and benchmarking RDF and graph database systems, is of special interest.

Topics Of Interest

The following is a non-exhaustive list describing the scope of GRADES:

• vision papers describing potential applications and benefits of graph data management.
• descriptions of graph data management use cases and query workloads.
• experiences with applying data management technologies in such situations.
• experiences or techniques for specific operations such as traversals or RDF reasoning.
• proposals for benchmarks for data integration tasks (instance matching and ETL techniques).
• proposals for benchmarks for RDF and graph database workloads.
• evaluation of benchmark performance results on RDF or graph database systems.
• system descriptions covering RDF or graph database technology.
• data and index structures that can be used in RDF and graph database systems.
• query processing and optimization algorithms for RDF and graph database systems.
• methods and techniques for measuring graph characteristics.
• methods and techniques for visualizing graphs and graph query results.
• proposals and experiences with graph query languages.

The GRADES workshop is co-located and sponsored by SIGMOD in recognition that these problems are only interesting at large-scale and the contribution of the SIGMOD community to handle such topics on large amounts of data of many millions or even billions of nodes and edges is of critical importance.

That sounds promising doesn’t it? (Please email, copy, post, etc.)

### How To Use A Graph Database to Integrate And Analyze Relational Exports

Sunday, July 15th, 2012

From the post:

Graph databases can be used to analyze data from disparate datasources. In this use-case, three relational databases have been exported to CSV. Each relational export is ingested into its own sharded sub-graph to increase performance and avoid lock contention when merging the datasets. Unique keys overlap the datasources to provide the mechanism to link the subgraphs produced from parsing the CSV. A REST server is used to send the merged graph to a visualization application for analysis.
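The ingest-and-link step described above can be sketched with the standard `csv` module. Everything specific here is an assumption for illustration: the three exports, their column names and the choice of `customer_id` as the overlapping key are hypothetical, and the sketch leaves out the sharding, lock handling and REST server that the post describes. Each export becomes its own sub-graph, and edges are added wherever a row carries a key that exists in another sub-graph.

```python
import csv
import io

# Hypothetical CSV exports from three relational databases.
exports = {
    "crm": "customer_id,name\nC1,Ada\nC2,Grace\n",
    "orders": "order_id,customer_id,total\nO1,C1,30\nO2,C1,12\nO3,C2,7\n",
    "support": "ticket_id,customer_id,status\nT1,C2,open\n",
}
primary_key = {"crm": "customer_id", "orders": "order_id", "support": "ticket_id"}


def ingest(source, text):
    """Build one sub-graph per export: each row becomes a vertex."""
    vertices = {}
    for row in csv.DictReader(io.StringIO(text)):
        vertices[(source, row[primary_key[source]])] = dict(row)
    return vertices


subgraphs = {source: ingest(source, text) for source, text in exports.items()}


def link(subgraphs, shared_key, hub):
    """Add an edge wherever a row elsewhere carries a key of the hub sub-graph."""
    edges = []
    for source, vertices in subgraphs.items():
        if source == hub:
            continue
        for vertex_id, props in vertices.items():
            target = (hub, props.get(shared_key))
            if target in subgraphs[hub]:
                edges.append((vertex_id, target))
    return edges


edges = link(subgraphs, "customer_id", "crm")
```

Keeping the source name in each vertex id means the sub-graphs never collide, so merging is only a matter of adding the cross-source edges.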

I was cleaning out my pending posts file when I ran across this one.

Would be a good comparison case for my topic maps class.

Although I would have to do the installation work on a public facing server and leave the class members to do the analysis/uploading.

Hmmm, perhaps split the class into teams, some using this method, some using more traditional record linkage and some using topic maps, all on the same data.

Suggestions on data sets that would highlight the differences? Or result in few differences at all? (I suspect both to be true, depending upon the data sets.)

### Search Algorithms for Conceptual Graph Databases

Friday, July 13th, 2012

Search Algorithms for Conceptual Graph Databases by Abdurashid Mamadolimov.

Abstract:

We consider a database composed of a set of conceptual graphs. Using conceptual graphs and graph homomorphism it is possible to build a basic query-answering mechanism based on semantic search. Graph homomorphism defines a partial order over conceptual graphs. Since graph homomorphism checking is an NP-Complete problem, the main requirement for database organizing and managing algorithms is to reduce the number of homomorphism checks. Searching is a basic operation for database manipulating problems. We consider the problem of searching for an element in a partially ordered set. The goal is to minimize the number of queries required to find a target element in the worst case. First we analyse conceptual graph database operations. Then we propose a new algorithm for a subclass of lattices. Finally, we suggest a parallel search algorithm for a general poset.

While I have no objection to efficient solutions for particular cases, as a general rule those solutions are valid for some known set of cases.

Here we appear to have an efficient solution for some unknown number of cases. I mention it to keep in mind while watching the search literature on graph databases develop.
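To see why the partial order helps reduce the number of expensive checks, consider this minimal sketch. It is not the algorithm proposed in the paper. Divisibility on integers stands in for graph homomorphism, purely as an assumption to keep the example small, and the stored elements are hypothetical. The idea shown is pruning by transitivity: if the query is not below a stored element, it cannot be below anything more specific than that element, so the whole branch is skipped.

```python
def is_specialisation(x, y):
    """Stand-in for an expensive homomorphism check: x <= y iff y divides x."""
    return x % y == 0


def hasse_children(elements):
    """Map each element to its immediate specialisations (the covering relation)."""
    children = {e: [] for e in elements}
    for parent in elements:
        below = [e for e in elements if e != parent and is_specialisation(e, parent)]
        for child in below:
            # child is immediate if nothing else lies strictly between the two.
            if not any(m != child and is_specialisation(child, m) for m in below):
                children[parent].append(child)
    return children


def generalisations(query, root, children):
    """Collect every stored element that generalises query, pruning failed branches."""
    checked = {}
    stack = [root]
    while stack:
        node = stack.pop()
        if node in checked:
            continue
        checked[node] = is_specialisation(query, node)
        if checked[node]:
            stack.extend(children[node])
        # On failure nothing below node can succeed, so its children are not pushed.
    matches = sorted(n for n, ok in checked.items() if ok)
    return matches, len(checked)


# Hypothetical database; 1 is the most general element.
database = [1, 2, 3, 5, 4, 6, 12, 25, 125, 625]
children = hasse_children(database)
matches, checks = generalisations(12, 1, children)
```

For the query 12, the branch under 5 fails at its first element, so 25, 125 and 625 are never checked. Building the diagram costs checks too, but that cost is paid when elements are stored, not on every search.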

### Titan

Wednesday, May 30th, 2012

Titan

Alpha Release Coming June 5, 2012

From the homepage:

Titan is a distributed graph database optimized for storing and processing large-scale graphs within a multi-machine cluster. The primary features of Titan are itemized below.

If the names Marko A. Rodriguez or Matthias Broecheler mean anything to you, June 5th can’t come soon enough!

### Data Locality in Graph Databases through N-Body Simulation

Thursday, April 5th, 2012

Data Locality in Graph Databases through N-Body Simulation by Dominic Pacher, Robert Binna, and Günther Specht.

Abstract:

Data locality poses a major performance requirement in graph databases, since it forms a basis for efficient caching and distribution. This vision paper presents a new approach to satisfy this requirement through n-body simulation. We describe our solution in detail and provide a theoretically complexity estimation of our method. To prove our concept, we conducted an evaluation using the DBpedia dataset data. The results are promising and show that n-body simulation is capable to improve data locality in graph databases significantly.

My first reaction was to ask why clustering of nodes wasn’t compared to n-body simulation. That seems like an equally “natural” method to achieve data locality.

My second reaction was that the citation of “…Simulations of the formation, evolution and clustering of galaxies and quasars. nature, 435(7042):629–636, jun 2005.” (citation 16 in the article) was reaching in terms of support for scaling. That type of simulation involves a number of simplifying assumptions that aren’t likely to be true for most graphs.

Imaginative work but it needs a little less imagination and a bit more rigor in terms of its argument/analysis.

### Graph Databases Make Apps Social

Saturday, March 31st, 2012

Graph Databases Make Apps Social by Adrian Bridgwater.

Neo Technology has suggested that social graph database technology will become a key trend in the data science arena throughout 2012 and beyond. On the back of vindicating comments made by Forrester analyst James Kobielus, the company contends that social graph complexities are needed to meet the high query performance levels now required inside Internet scale cloud applications.

Unsurprisingly a vendor of graph database technology itself (although Neo4j is open source at heart before its commercially supported equivalent), Neo Technology points to social graph capabilities, which take information across a range of networks to understand the relationships between individuals.

Sounds like applications of interest to DoD/DARPA doesn’t it?

### Related

Thursday, March 29th, 2012

Related

From the webpage:

Related

Related is a Redis-backed high performance distributed graph database.

Raison d’être

Related is meant to be a simple graph database that is fun, free and easy to use. The intention is not to compete with “real” graph databases like Neo4j, but rather to be a replacement for a relational database when your data is better described as a graph. For example when building social software. Related is very similar in scope and functionality to Twitters FlockDB, but is among other things designed to be easier to setup and use. Related also has better documentation and is easier to hack on. The intention is to be web scale, but we ultimately rely on the ability of Redis to scale (using Redis Cluster for example). Read more about the philosophy behind Related in the Wiki.

Well, which is it?

A “Redis-backed high performance distributed graph database,”

or

“…not to compete with “real” graph databases like Neo4j….?”

If the intent is to have a “web scale” distributed graph database, then it will be competing with other graph database products.

If you are building a graph database, keep an eye on René Pickhardt’s blog for notices about the next meeting of his graph reading club.

### PhD proposal on distributed graph data bases

Tuesday, March 27th, 2012

PhD proposal on distributed graph data bases by René Pickhardt.

From the post:

Over the last week we had our off campus meeting with a lot of communication training (very good and fruitful) as well as a special treatment for some PhD students called “massage your diss”. I was one of the lucky students who were able to discuss our research ideas with a post doc and other PhD candidates for more than 6 hours. This lead to the structure, todos and time table of my PhD proposal. This has to be finalized over the next couple days but I already want to share the structure in order to make it more real. You might also want to follow my article on a wish list of distributed graph data base technology.

If you have the time, please read René’s proposal and comment on it.

Although I am no stranger to multi-year research projects, ;-), I must admit to pausing when I read:

Here I will name the at least the related work in the following fields:

• graph processing (Signal Collect, Pregel,…)
• graph theory (especially data structures and algorithms)
• (dynamic/adaptive) graph partitioning
• distributed computing / systems (MPI, Bulk Synchronous Parallel Programming, Map Reduce, P2P, distributed hash tables, distributed file systems…)
• redundancy vs fault tolerance
• network programming (protocols, latency vs bandwidth)
• data bases (ACID, multiple user access, …)
• graph data base query languages (SPARQL, Gremlin, Cypher,…)
• Social Network and graph analysis and modelling.

Unless René is planning on taking the most recent citations in each area, describing related work and establishing how it is related to “distributed graph data bases” will consume the projected time period for his dissertation work.

Each of the areas listed is a complete field unto itself and has many PhD sized research problems related to “distributed graph data bases.”

Almost all PhD proposals start with breathtaking scope but the ones that make a real contribution (and are completed) identify specific problems that admit to finite research programs.

I think René should revise his proposal to focus on some particular aspect of “distributed graph data bases.” I suspect even the history of one aspect of such databases will expand fairly rapidly upon detailed consideration.

The need for a larger, global perspective on “distributed graph data bases” will still be around after René finishes a less encompassing dissertation. I promise.

### FDB: A Query Engine for Factorised Relational Databases

Tuesday, March 20th, 2012

FDB: A Query Engine for Factorised Relational Databases by Nurzhan Bakibayev, Dan Olteanu, and Jakub Závodný.

Abstract:

Factorised databases are relational databases that use compact factorised representations at the physical layer to reduce data redundancy and boost query performance. This paper introduces FDB, an in-memory query engine for select-project-join queries on factorised databases. Key components of FDB are novel algorithms for query optimisation and evaluation that exploit the succinctness brought by data factorisation. Experiments show that for data sets with many-to-many relationships FDB can outperform relational engines by orders of magnitude.

It is twelve pages of dense slogging but I wonder if you have a reaction to:

Finally, factorised representations are relational algebra expressions with well-understood semantics. Their relational nature sets them apart from XML documents, object-oriented databases, and nested objects [2], where the goal is to avoid the rigidity of the relational model. (on the second page)

Where [2] is: S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.

Online version of Foundations of Databases

DBLP has a nice listing of the references (with links) in Foundations of Databases

Abiteboul and company are cited without a page reference (my printed edition is 685 pages long) and the only comparison that I can uncover between the relational model and any of those mentioned here is that an object-oriented database has oids, which aren’t members of a “printable class” as are keys.

I am not sure what sort of oid isn’t a member of a “printable” class but am willing to leave that to one side for the moment.

My problem is with the characterization “…to avoid the rigidity of the relational model.”

The relational model has been implemented in any number of rigid ways, but is that the fault of a model based on operations on tuples, which can be singletons?

What if factorisation were applied to a graph database, composed of singletons, enabling the use of “…relational algebra expressions with well-understood semantics.”?

It sounds like factorisation could speed up classes of “expected” queries across graph databases. I don’t think anyone creates a database, graph or otherwise, without some classes of queries in mind. The user would be no worse off when they create an unexpected query.
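A small worked example may help with the graph reading of factorisation. The sketch below is not FDB and does not use its algorithms; the relations, names and size measure are assumptions chosen for illustration. If R and S are read as edge lists, their join on B enumerates two-hop paths. The flat result repeats every combination, while the factorised form stores, for each middle value b, the A-values and C-values once and leaves the product implicit.

```python
from collections import defaultdict
from itertools import product

# Hypothetical relations with a many-to-many relationship through B.
R = [("a1", "b1"), ("a2", "b1"), ("a3", "b1"), ("a4", "b2")]  # R(A, B)
S = [("b1", "c1"), ("b1", "c2"), ("b1", "c3"), ("b2", "c4")]  # S(B, C)


def flat_join(R, S):
    """Ordinary join result: one tuple per matching (a, b, c)."""
    return sorted((a, b, c) for a, b in R for b2, c in S if b == b2)


def factorised_join(R, S):
    """Represent the join as a union over b of (A-values x {b} x C-values)."""
    a_by_b, c_by_b = defaultdict(list), defaultdict(list)
    for a, b in R:
        a_by_b[b].append(a)
    for b, c in S:
        c_by_b[b].append(c)
    return {b: (a_by_b[b], c_by_b[b]) for b in a_by_b if b in c_by_b}


def expand(factorised):
    """Enumerate the flat tuples that a factorised representation stands for."""
    return sorted((a, b, c)
                  for b, (a_values, c_values) in factorised.items()
                  for a, c in product(a_values, c_values))


def size(factorised):
    """Number of stored values: each A-value and C-value once, plus b itself."""
    return sum(len(a_values) + 1 + len(c_values)
               for a_values, c_values in factorised.values())


flat = flat_join(R, S)
fact = factorised_join(R, S)
```

Here the flat join holds 10 tuples (30 values) and the factorised form holds 10 values, yet `expand(fact)` gives back exactly the flat result. The gap widens as the many-to-many fan-out around each b grows.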