Archive for the ‘Titan’ Category

JanusGraph (Linux Foundation Graph Player Rides Into Town)

Wednesday, February 22nd, 2017

JanusGraph

From the homepage:

JanusGraph is a scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster.
JanusGraph is a transactional database that can support thousands of concurrent users executing complex graph traversals in real time.

In addition, JanusGraph provides the following features:

You can clone JanusGraph from GitHub.
Read the JanusGraph documentation and join the users or developers mailing lists.

Follow the Getting Started with JanusGraph guide for a step-by-step introduction.

Supported by Google, IBM and Hortonworks, among others.

Three good reasons to pay attention to JanusGraph early and often.

Enjoy!

Visualizing your Titan graph database:…

Friday, June 17th, 2016

Visualizing your Titan graph database: An update by Marco Liberati.

From the post:

Last summer, we wrote a blog with our five simple steps to visualizing your Titan graph database with KeyLines. Since then TinkerPop has emerged from the Apache Incubator program with TinkerPop3, and the Titan team have released v1.0 of their graph database:

  • TinkerPop3 is the latest major reincarnation of the graph project, pulling together the multiple ventures into a single united ecosystem.
  • Titan 1.0 is the first stable release of the Titan graph database, based on the TinkerPop3 stack.

We thought it was about time we updated our five-step process, so here’s:

Not exactly five (5) steps because you have to acquire a KeyLines trial key, etc.

A great endorsement of the much improved installation process for TinkerPop3 and Titan 1.0.

Enjoy!

Query the Northwind Database as a Graph Using Gremlin

Wednesday, October 21st, 2015

Query the Northwind Database as a Graph Using Gremlin by Mark Kromer.

From the post:

One of the most popular and interesting topics in the world of NoSQL databases is graph. At DataStax, we have invested in graph computing through the acquisition of Aurelius, the company behind TitanDB, and are especially committed to ensuring the success of the Gremlin graph traversal language. Gremlin is part of the open source Apache TinkerPop graph framework project and is a graph traversal language used by many different graph databases.

I wanted to introduce you to a superb web site that our own Daniel Kuppitz maintains called “SQL2Gremlin” (http://sql2gremlin.com) which I think is a great way to start learning how to query graph databases for those of us who come from the traditional relational database world. It is full of excellent sample SQL queries from the popular public domain RDBMS dataset Northwind and demonstrates how to produce the same results by using Gremlin. For me, learning by example has been a great way to get introduced to graph querying and I think that you’ll find it very useful as well.

I’m only going to walk through a couple of examples here as an intro to what you will find at the full site. But if you are new to graph databases and Gremlin, then I highly encourage you to visit the sql2gremlin site for the rest of the complete samples. There is also a nice example of an interactive visualization / filtering, search tool here that helps visualize the Northwind data set as it has been converted into a graph model.

I’ve worked with (and worked for) Microsoft SQL Server for a very long time. Since Daniel’s examples use T-SQL, we’ll stick with SQL Server for this blog post as an intro to Gremlin and we’ll use the Northwind samples for SQL Server 2014. You can download the entire Northwind sample database here. Load that database into your SQL Server if you wish to follow along.
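As a rough illustration of the join-versus-traversal idea (not taken from SQL2Gremlin, and not Gremlin itself), here is a minimal Python sketch. It runs one SQL join against an in-memory SQLite database and then produces the same answer by walking a small hand-built graph. The table and column names follow Northwind, but the three sample rows, the `inCategory` edge name, and both function names are assumptions made for this example.

```python
import sqlite3

# Hypothetical sample rows for illustration; not the full Northwind data set.
CATEGORIES = [(1, "Beverages"), (2, "Condiments")]
PRODUCTS = [(1, "Chai", 1), (2, "Chang", 1), (3, "Aniseed Syrup", 2)]


def products_via_sql(category_name):
    """Relational view: join Products to Categories and filter by category name."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Categories (CategoryID INTEGER, CategoryName TEXT)")
    conn.execute(
        "CREATE TABLE Products (ProductID INTEGER, ProductName TEXT, CategoryID INTEGER)"
    )
    conn.executemany("INSERT INTO Categories VALUES (?, ?)", CATEGORIES)
    conn.executemany("INSERT INTO Products VALUES (?, ?, ?)", PRODUCTS)
    rows = conn.execute(
        "SELECT p.ProductName FROM Products p "
        "JOIN Categories c ON c.CategoryID = p.CategoryID "
        "WHERE c.CategoryName = ? ORDER BY p.ProductName",
        (category_name,),
    ).fetchall()
    conn.close()
    return [name for (name,) in rows]


def products_via_graph(category_name):
    """Graph view: start at the category vertex and walk edges to its products."""
    category_vertices = {cid: {"name": name} for cid, name in CATEGORIES}
    product_vertices = {pid: {"name": name} for pid, name, _ in PRODUCTS}
    # "inCategory" edges (product -> category), indexed by category for the walk.
    in_category = {}
    for pid, _, cid in PRODUCTS:
        in_category.setdefault(cid, []).append(pid)
    names = [
        product_vertices[pid]["name"]
        for cid, props in category_vertices.items()
        if props["name"] == category_name
        for pid in in_category.get(cid, [])
    ]
    return sorted(names)
```

Calling `products_via_sql("Beverages")` and `products_via_graph("Beverages")` should give the same list. The foreign key in the relational version becomes an edge in the graph version, so the join turns into a hop.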

When I first saw the title to this post,

Query the Northwind Database as a Graph Using Gremlin (emphasis added)

I thought this was something else. A database about the Northwind album.

Little did I suspect that the Northwind Database is a test database for SQL Server 2005 and SQL Server 2008. Yikes!

Still, I thought some of you might have access to such legacy software and so I am pointing you to this post. 😉

PSA:

Support for SQL Server 2005 ends April 16, 2016 (that’s next April)

Support for SQL Server 2008 ended July 8, 2014 Ouch! You are more than a year into a dangerous place. Upgrade, migrate or get another job. Hard times are coming and blame will be assigned.

Titan Graph DB Performance Tips

Saturday, October 10th, 2015

Titan Graph DB Performance Tips

From the post:

In Hawkular Inventory, we use the Tinkerpop API (version 2 for the time being) to store our inventory model in a graph database. We chose Titan as the storage engine configured to store the data in the Cassandra cluster that is also backing Hawkular Metrics and Alerts. This blog post will guide you through some performance-related lessons with Titan that we learned so far.

Inventory is under heavy development with a lot of redesign and refactoring going on between releases so we took quite a naive approach to storing and querying data from the graph database. That is, we store entities from our model as vertices in the graph and the relationships between the entities as edges in the graph. Quite simple and a school book example of how it should look like.

We did declare a couple of indices in the database on the read-only aspects of the vertices (i.e. a “type” of the entity the vertex corresponds to) but we actually didn’t pay too much attention to the performance. We wanted to have the model right first.

Fast forward a couple of months and of course, the performance started to be a real problem. The Hawkular agent for Wildfly is inserting a non-trivial amount of entities and not only inserting them but also querying them has seen a huge performance degradation compared to the simple examples we were unit testing with (due to number of vertices and edges stored).

The time has come to think about how to squeeze some performance out of Titan as well as how to store the data and query it more intelligently.

Several performance tips but the one that caught my eye and resulted in an order of magnitude performance gain:

3. Mirror Properties on The Edges

This is the single most important optimization we’ve done so far. The rationale is this. To jump from a vertex to another vertex over an edge is a fairly expensive operation. Titan uses the adjacency lists to store the vertices and their edges in wide rows in Cassandra. It uses another adjacency list for edges and their target vertices.

So to go from vertex to vertex, Titan actually has to do 2 queries. It would be much easier if we could avoid that at least in some cases.

The solution here is to copy the values of some (frequently used and, in our case, immutable) properties from the “source” and “target” vertices directly to the edges. This helps especially in the cases where you do some kind of filtering on the target vertices that you instead can do directly on the edges. If there is a high number of edges to go through, this helps tremendously because you greatly reduce the number of times you have to do the “second hop” to the target vertex.
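The effect of mirroring can be sketched with a toy in-memory model. This is not the Titan or TinkerPop API; `ToyGraph`, its methods, and the `target_` prefix are hypothetical names invented for the illustration, and the counter only simulates the cost of the "second hop" to the target vertex.

```python
class ToyGraph:
    """Toy adjacency-list graph that counts simulated target-vertex lookups."""

    def __init__(self):
        self.vertices = {}      # vertex id -> property dict
        self.out_edges = {}     # source id -> list of (target id, edge property dict)
        self.vertex_reads = 0   # number of simulated "second hop" lookups

    def add_vertex(self, vid, **props):
        self.vertices[vid] = props
        self.out_edges.setdefault(vid, [])

    def add_edge(self, src, dst, mirror=()):
        # Copy the named (immutable) target-vertex properties onto the edge.
        edge_props = {"target_" + key: self.vertices[dst][key] for key in mirror}
        self.out_edges[src].append((dst, edge_props))

    def load_vertex(self, vid):
        self.vertex_reads += 1
        return self.vertices[vid]

    def targets_by_vertex_filter(self, src, key, value):
        # Naive: hop to every target vertex, then filter on its property.
        return [dst for dst, _ in self.out_edges[src]
                if self.load_vertex(dst).get(key) == value]

    def targets_by_edge_filter(self, src, key, value):
        # Mirrored: filter on the edge's copy; no target vertex is loaded.
        return [dst for dst, props in self.out_edges[src]
                if props.get("target_" + key) == value]


def demo():
    g = ToyGraph()
    g.add_vertex("feed", type="feed")
    for i in range(1000):
        vid = "v%d" % i
        g.add_vertex(vid, type="resource" if i % 10 == 0 else "metric")
        g.add_edge("feed", vid, mirror=("type",))
    naive = g.targets_by_vertex_filter("feed", "type", "resource")
    reads_naive = g.vertex_reads
    g.vertex_reads = 0
    mirrored = g.targets_by_edge_filter("feed", "type", "resource")
    reads_mirrored = g.vertex_reads
    return naive, mirrored, reads_naive, reads_mirrored
```

Both filters return the same targets, but the naive one touches every target vertex while the mirrored one touches none. The copies are only safe because the mirrored property never changes, which matches the "immutable" condition in the quoted post. A mutable property would need every copy updated on each write.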

I am curious what is being stored on the vertex that requires a second search to jump to the target vertex?

That is if you have moved “popular” vertex properties to the edge, why not move other properties of the node there?

Suggestions?

TinkerPop3

Wednesday, July 8th, 2015

TinkerPop3: Taking graph databases and graph analytics to the next level by Matthias Broecheler.

Abstract:

Apache TinkerPop is an open source graph computing framework which includes the graph traversal language Gremlin and a number of graph utilities that speed up the development of graph based applications. Apache TinkerPop provides an abstraction layer on top of popular graph databases like Titan, OrientDB, and Neo4j as well as scalable computation frameworks like Hadoop and Spark allowing developers to build graph applications that run on multiple platforms avoiding vendor lock-in.

This talk gives an overview of the new features introduced in TinkerPop3 with a deep-dive into query language design, query optimization, and the convergence of OLTP and OLAP in graph processing. A demonstration of TinkerPop3 with the scalable Titan graph database illustrates how these concepts work in practice.

It’s not all the information you will need about TinkerPop3 but should be enough to get you interested in learning more, a lot more.

I had a conversation recently on how to process topic maps with graphs, at least if you were willing to abandon the side-effects detailed in the Topic Maps Data Model (TMDM). More on that to follow.

Titan 0.9.0-M2 Release

Tuesday, June 9th, 2015

Titan 0.9.0-M2 Release.

From Dan LaRocque:

Aurelius is pleased to release Titan 0.9.0-M2. 0.9.0-M2 is an experimental release intended for development use.

This release uses TinkerPop 3.0.0.M9-incubating, compared with 3.0.0.M6 in Titan 0.9.0-M1. Source written against Titan 0.5.x and earlier will generally require modification to compile against Titan 0.9.0-M2. As TinkerPop 3 requires a Java 8 runtime, so too does Titan 0.9.0-M2.

While 0.9.0-M1 came out with a separate console and server zip archive, 0.9.0-M2 is a single zipfile with both components. The zipfile is still only packaged with Hadoop 1 to match TP3’s Hadoop support.

http://s3.thinkaurelius.com/downloads/titan/titan-0.9.0-M2-hadoop1.zip

Documentation:

Manual: http://s3.thinkaurelius.com/docs/titan/0.9.0-M2/
Javadoc: http://titan.thinkaurelius.com/javadoc/0.9.0-M2/

The upgrade instructions and changelog for 0.9.0-M2 are in the usual places.

http://s3.thinkaurelius.com/docs/titan/0.9.0-M2/upgrade.html
http://s3.thinkaurelius.com/docs/titan/0.9.0-M2/changelog.html

I have to limit my reading of people who pretend that C/C+ level hacks (OPM) are “…the work of most sophisticated state-sponsored cyber intrusion entities.”

Enjoy!

Titan 0.5.4 Release!

Thursday, February 19th, 2015

Titan 0.5.4 Release! by Dan LaRocque.

From the post:

We’re pleased to announce the release of Titan 0.5.4.

This is mostly a bugfix release. It also includes property read optimization.

The zip archives:

http://s3.thinkaurelius.com/downloads/titan/titan-0.5.4-hadoop1.zip
http://s3.thinkaurelius.com/downloads/titan/titan-0.5.4-hadoop2.zip

The documentation:

Manual: http://s3.thinkaurelius.com/docs/titan/0.5.4/
Javadoc: http://titan.thinkaurelius.com/javadoc/0.5.4/

The 0.5.4 release is compatible with earlier releases in the 0.5 series. There are no user-facing API changes and no storage changes between 0.5.3 and this release. For upgrades from 0.5.2 and earlier, consider the upgrade notes about minor API changes:

http://s3.thinkaurelius.com/docs/titan/0.5.4/upgrade.html

The changelog contains a bit more information about what’s new in this
release:

http://s3.thinkaurelius.com/docs/titan/0.5.4/changelog.html

We are indebted to the community for valuable bug and pain point reports that shaped 0.5.4.

Bugfix only or not, users in the United States will welcome any distraction from the current cold wave! 😉

Weaver (Graph Store)

Sunday, December 21st, 2014

Weaver (Graph Store)

From the homepage:

A scalable, fast, consistent graph store

Weaver is a distributed graph store that provides horizontal scalability, high-performance, and strong consistency.

Weaver enables users to execute transactional graph updates and queries through a simple python API.

Alpha release but I did find some interesting statements in the FAQ:

Weaver is designed to store dynamic graphs. You can perform transactions on rapidly evolving graph-structured data with high throughput.

Examples of dynamic graphs?

Think online social networks, WWW, knowledge graphs, Bitcoin transaction graphs, biological interaction networks, etc. If your application manipulates graph-structured data similar to these examples, you should try Weaver out!

High throughput?

Our preliminary experiments show that Weaver achieves over 12x higher throughput than Titan on an online social network workload similar to that of Tao. In addition, Weaver also achieves 4x lower latency than GraphLab on an offline, graph traversal workload.

The alpha release has binaries for Ubuntu 14.04, there is a discussion list, and the source code is on GitHub. Weaver has a native C++ binding and a Python client.

Impressive enough statements to start following the discussion group and to compile for Ubuntu 12.04 (yeah, I need to upgrade in the new year).

PS: There are only two messages in the discussion group since this is its first release. Get in on the ground floor!

The Path Forward (Titan 1.0 and TinkerPop 3.0)

Tuesday, December 9th, 2014

The Path Forward by Marko Rodriguez.

A good overview of Titan 1.0 and TinkerPop 3.0. Marko always makes great slides.

I appreciate mythology as an example but it would be nice to see an example of Titan/TinkerPop used in anger.

With the limitation that the data be legally accessible (sorry) what would you suggest as a great example of using Titan/TinkerPop?

Since everyone likes mobile phone apps, I would suggest one that displays a street map and as you pass street addresses, it lights up the address as blue or red depending on their political contributions. Brighter colors for larger donations.

I think that would prove to be very popular.

Would that be a good example for Titan/TinkerPop?

What’s yours?

Titan 0.5 Released!

Saturday, August 16th, 2014

Titan 0.5 Released!

From the Titan documentation:

1.1. General Titan Benefits

  • Support for very large graphs. Titan graphs scale with the number of machines in the cluster.
  • Support for very many concurrent transactions and operational graph processing. Titan’s transactional capacity scales with the number of machines in the cluster and answers complex traversal queries on huge graphs in milliseconds.
  • Support for global graph analytics and batch graph processing through the Hadoop framework.
  • Support for geo, numeric range, and full text search for vertices and edges on very large graphs.
  • Native support for the popular property graph data model exposed by Blueprints.
  • Native support for the graph traversal language Gremlin.
  • Easy integration with the Rexster graph server for programming language agnostic connectivity.
  • Numerous graph-level configurations provide knobs for tuning performance.
  • Vertex-centric indices provide vertex-level querying to alleviate issues with the infamous super node problem.
  • Provides an optimized disk representation to allow for efficient use of storage and speed of access.
  • Open source under the liberal Apache 2 license.

A major milestone in the development of Titan!

If you are interested in serious graph processing, Titan is one of the systems that should be on your short list.

PS: Matthias Broecheler has posted Titan 0.5.0 GA Release, which has links to upgrade instructions and comments about a future Titan 1.0 release!

MusicGraph

Tuesday, July 29th, 2014

Senzari Unveils MusicGraph.ai At The GraphLab Conference 2014

From the post:

Senzari introduced MusicGraph.ai, the first web-based graph analytics and intelligence engine for the music industry at the GraphLab Conference 2014, the annual gathering of leading data scientists and machine learning experts. MusicGraph.ai will serve as the primary dashboard for MusicGraph, where API clients will be able to view detailed reports on their API usage and manage their account. More importantly, through this dashboard, they will also be able to access a comprehensive library of algorithms to extract even more value from the world’s most extensive repository of music data.

“We believe MusicGraph.ai will forever change the music intelligence industry, as it allows scientists to execute powerful analytics and machine learning algorithms at scale on a huge data-set without the need to write a single-line of code”

Free access to MusicGraph at: http://developer.musicgraph.com

I originally encountered MusicGraph because of its use of the Titan graph database. BTW, GraphLab and GraphX are also available for data analytics.

From the MusicGraph website:

MusicGraph is the world’s first “natural graph” for music, which represents the real-world structure of the musical universe. Information contained within it includes data related to the relationship between millions of artists, albums, and songs. Also included is detailed acoustical and lyrical features, as well as real-time statistics across artists and their music across many sources.

MusicGraph has over 600 million vertices and 1 billion edges, but more importantly it has over 7 billion properties, which allows for deep knowledge extraction through various machine learning approaches.

Sigh, why can’t people say: “…it represents a useful view of the musical universe…,” instead of “…which represents the real-world structure of the musical universe”? All representations are views of some observer. (full stop) If you think otherwise, please return your college and graduate degrees for a refund.

Yes, I know political leaders use “real world” all the time. But they are trying to deceive you into accepting their view as beyond question because it represents the “real world.” Don’t be deceived. Their views are no more “real world” based than yours are. Which is to say, not at all. Defend your view, but know that it is a view.

I first saw this in a tweet by Gregory Piatetsky.

Powers of Ten – Part II

Monday, June 2nd, 2014

Powers of Ten – Part II by Stephen Mallette.

From the post:

“‘Curiouser and curiouser!’ cried Alice (she was so much surprised, that for the moment she quite forgot how to speak good English); ‘now I’m opening out like the largest telescope that ever was!”
    — Lewis Carroll, Alice’s Adventures in Wonderland

It is sometimes surprising to see just how much data is available. Much like Alice and her sudden increase in height, in Lewis Carroll’s famous story, the upward growth of data can happen quite quickly and the opportunity to produce a multi-billion edge graph becomes immediately present. Luckily, Titan is capable of scaling to accommodate such size and with the right strategies for loading this data, the development efforts can more rapidly shift to the rewards of massive scale graph analytics.

This article represents the second installment in the two part Powers of Ten series that discusses bulk loading data into Titan at varying scales. For purposes of this series, the “scale” is determined by the number of edges to be loaded. As it so happens, the strategies for bulk loading tend to change as the scale increases over powers of ten, which creates a memorable way to categorize different strategies. “Part I” of this series, looked at strategies for loading millions and tens of millions of edges and focused on usage of Gremlin to do so. This part of the series will focus on hundreds of millions and billions of edges and will focus on the usage of Faunus as the loading tool.

Note: By Titan 0.5.0, Faunus will be pulled into the Titan project under the name Titan/Hadoop.

Scaling graph processing to hundreds of millions and billions of edges.

Deeply interesting work but I am left with multiple questions:

  • Hundreds of millions and billions of edges, to load. Any other graph metrics? Traversal for example?
  • Does loading performance scale with more servers? Instead of m2.4xlarge EC2 instances, what is the performance with 8x?
  • What kind of knob tuning was useful with a social network dataset?

I am sure there are other questions but those are the first ones that came to mind.

Powers of Ten – Part I

Saturday, May 31st, 2014

Powers of Ten – Part I by Stephen Mallette.

From the post:

“No, no! The adventures first,’ said the Gryphon in an impatient tone: ‘explanations take such a dreadful time.”
    — Lewis Carroll, Alice’s Adventures in Wonderland

It is often quite simple to envision the benefits of using Titan. Developing complex graph analytics over a multi-billion edge distributed graph represent the adventures that await. Like the Gryphon from Lewis Carroll’s tale, the desire to immediately dive into the adventures can be quite strong. Unfortunately and quite obviously, the benefits of Titan cannot be realized until there is some data present within it. Consider the explanations that follow; they are the strategies by which data is bulk loaded to Titan enabling the adventures to ensue.

There are a number of different variables that might influence the approach to loading data into a graph, but the attribute that provides the best guidance in making a decision is size. For purposes of this article, “size” refers to the estimated number of edges to be loaded into the graph. The strategy used for loading data tends to change in powers of ten, where the strategy for loading 1 million edges is different than the approach for 10 million edges.

Given this neat and memorable way to categorize batch loading strategies, this two-part article outlines each strategy starting with the smallest at 1 million edges or less and continuing in powers of ten up to 1 billion and more. This first part will focus on 1 million and 10 million edges, which generally involves common Gremlin operations. The second part will focus on 100 million and 1 billion edges, which generally involves the use of Faunus.
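The categorization described above can be written down as a small lookup. This is only a sketch of the series outline as quoted here: the function names are hypothetical, and rounding an edge count up to the next power-of-ten bucket is an assumption of the example, not something the series states.

```python
# Power-of-ten buckets named in the series: 1M, 10M, 100M, 1B (and more).
BUCKETS = [1_000_000, 10_000_000, 100_000_000, 1_000_000_000]


def bucket_for(edge_count):
    """Round an estimated edge count up to its power-of-ten bucket (assumed rule)."""
    if edge_count < 0:
        raise ValueError("edge_count must be non-negative")
    for bucket in BUCKETS:
        if edge_count <= bucket:
            return bucket
    return BUCKETS[-1]  # "1 billion and more" shares the largest bucket


def loading_tool(edge_count):
    """Part I (1M, 10M) generally uses Gremlin; Part II (100M, 1B) uses Faunus."""
    return "Gremlin" if bucket_for(edge_count) <= 10_000_000 else "Faunus"
```

For example, `loading_tool(2_000_000)` lands in the Gremlin range and `loading_tool(300_000_000)` in the Faunus range. Where exactly to switch between 10 million and 100 million edges is a judgment call that the posts themselves discuss.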

Great guidance on loading relatively small data sets using Gremlin. Looking forward to seeing the harder tests with 100 million and 1 billion edge sets.

Titan 0.4.4 / Faunus 0.4.4

Tuesday, April 22nd, 2014

I saw a tweet earlier today from aurelius that Titan 0.4.4 and Faunus 0.4.4 are available.

Grab your copy at:

Faunus Downloads

Titan Downloads

Enjoy!

Plato, Shiva and A Social Graph

Monday, April 21st, 2014

The Social Graph of the Los Alamos National Laboratory by Marko A. Rodriguez.

From the post:

The web is composed of numerous web sites tailored to meet the information, consumption, and social needs of its users. Within many of these sites, references are made to the same platonic “thing” though different facets of the thing are expressed. For example, in the movie industry, there is a movie called John Carter by Disney. While the movie is an abstract concept, it has numerous identities on the web (which are technically referenced by a URI).

Aurelius collaborated with the Digital Library Research and Prototyping Group of the Los Alamos National Laboratory (LANL) to develop EgoSystem atop the distributed graph database Titan. The purpose of this system is best described by the introductory paragraph of the April 2014 publication on EgoSystem.

I heavily commend Marko’s post and the EgoSystem publication for your reading. That is despite my cautions concerning some of the theoretical aspects of the project.

Statements like:

references are made to the same platonic “thing” though different facets of the thing are expressed.

have always troubled me. In part because it involves a claim, usually by the speaker, to have freed themselves from Plato’s cave such that they and they alone can see things aright. Which consigns the rest of us to be the pitiful lot still confined to the cave.

Which of course leads to Marko’s:

There are two categories of vertices in EgoSystem.

  1. Platonic: Denotes an abstract concept devoid of interpretation.
  2. Identity: Denotes a particular interpretation of a platonic.

Every platonic vertex is of a particular type: a person, institution, artifact, or concept. Next, every platonic has one or more identities as referenced by a URL on the web. The platonic types and the location of their web identities are itemized below. As of EgoSystem 1.0, these are the only sources from which data is aggregated, though extending it to support more services (e.g. Facebook, Quorum, etc.) is feasible given the system’s modular architecture.
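The two vertex categories can be sketched as a pair of Python dataclasses. This is an illustration of the description quoted above, not EgoSystem's actual schema: the class names, field names, and validation are assumptions, and only the four platonic types and the "one or more identities" rule come from the quoted text.

```python
from dataclasses import dataclass
from typing import List

# The four platonic types named in the quoted description.
PLATONIC_TYPES = {"person", "institution", "artifact", "concept"}


@dataclass
class Identity:
    """A particular interpretation of a platonic, referenced by a URL."""
    url: str


@dataclass
class Platonic:
    """An abstract concept with a type and one or more web identities."""
    kind: str
    identities: List[Identity]

    def __post_init__(self):
        if self.kind not in PLATONIC_TYPES:
            raise ValueError("unknown platonic type: %r" % self.kind)
        if not self.identities:
            raise ValueError("a platonic needs one or more identities")
```

A person with a single home page would be `Platonic("person", [Identity("http://example.org/jdoe")])`, where the URL is a placeholder. Note that the platonic itself carries nothing but its type; everything else hangs off the identities.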

A structure where English labels, remarkably enough, are placed on “Platonic” vertices. Not that we would attribute any identity or semantics to a “Platonic” vertex. 😉

Rather than “Platonic” vertices, they are better described as boundary vertices. That is they circumscribe what can be represented in a particular graph, without making claims on a “higher” reality.

I say that not to be pedantic but to illustrate how a “Platonic” vertex prevents us from meaningful merger with graphs with differing “Platonic” vertices.

No doubt Shiva’s1 other residence, Arzamas-16, could benefit from a similar “alumni” graph but I rather doubt it is going to use English labels for its “Platonic” vertices which:

Denote[…] an abstract concept devoid of interpretation.

If I have no “interpretation,” which I take to mean no properties (key/value pairs), how will I combine social graphs from Los Alamos and Arzamas-16?

I could cheat and secretly look up properties for the alleged “Platonic” nodes and combine them together but then how would you check my work? The end result would be opaque to anyone other than myself.
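A minimal Python sketch of the alternative, under the assumption that vertices are plain property dictionaries: merge on explicitly named keys, so anyone can rerun the merge and check it, and set aside any vertex that lacks those keys. The function names and the idea of a shared identifier property are hypothetical, chosen only to make the point concrete.

```python
def merge_key(vertex, keys):
    """Return the vertex's values for `keys` as a tuple, or None if any is missing."""
    try:
        return tuple(vertex[key] for key in keys)
    except KeyError:
        return None


def merge_graphs(left, right, keys):
    """Merge two lists of vertex property dicts on the given keys.

    Vertices that share the same key values are combined into one dict.
    Vertices missing any key cannot be matched and are returned separately.
    """
    merged = {}      # merge key -> combined property dict
    unmatched = []   # vertices with no usable merge key
    for vertex in list(left) + list(right):
        key = merge_key(vertex, keys)
        if key is None:
            unmatched.append(dict(vertex))
        else:
            merged.setdefault(key, {}).update(vertex)
    return merged, unmatched
```

A vertex with no properties at all always ends up in `unmatched`, which is the problem in miniature: with nothing recorded on the vertex, there is nothing to merge on that a second person could verify.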

That isn’t a criticism of using the EgoSystem. I am sure it meets the needs of Los Alamos quite nicely.

However, it can prevent us from capturing the information necessary to expand the boundary of our graph at some future date or merging it with other graphs.

From a philosophical standpoint, we should not claim access to Platonic ideals when we are actually recording our views of shadows on the cave wall. Of which, intersections between graphs/shadows are just a subset.

1. Those of you old enough to remember Robert Oppenheimer will recognize the reference.

Titan: Scalable Graph Database

Tuesday, April 15th, 2014

Titan: Scalable Graph Database by Matthias Broecheler.

Conference presentation so long on imagery but short on detail. 😉

However, useful to walk your manager through as a pitch for support to investigate further.

When that support is given, check out: http://thinkaurelius.github.io/titan/. Links to source code, other resources, etc.

Forbes on Graphs

Thursday, February 13th, 2014

Big Data Solutions Through The Combination Of Tools by Ben Lorica.

From the post:

As a user who tends to mix-and-match many different tools, not having to deal with configuring and assembling a suite of tools is a big win. So I’m really liking the recent trend towards more integrated and packaged solutions. A recent example is the relaunch of Cloudera’s Enterprise Data hub, to include Spark(1) and Spark Streaming. Users benefit by gaining automatic access to analytic engines that come with Spark(2). Besides simplifying things for data scientists and data engineers, easy access to analytic engines is critical for streamlining the creation of big data applications.

Another recent example is Dendrite(3) – an interesting new graph analysis solution from Lab41. It combines Titan (a distributed graph database), GraphLab (for graph analytics), and a front-end that leverages AngularJS, into a Graph exploration and analysis tool for business analysts:

Another contender in the graph space!

Interesting that Spark comes up a second time for today.

Having Forbes notice a technology gives it credence don’t you think?

I first saw this in a tweet by aurelius.

Faunus & Titan 0.4.2 Released

Friday, January 10th, 2014

Faunus & Titan 0.4.2 Released by Dan LaRocque.

From the post:

Aurelius is pleased to announce the release of Titan and Faunus 0.4.2.

This is mainly a bugfix release. Of particular note is a pair of Titan bugs involving deletion of edges with multiple properties and of edges labeled with reverse-ordered sort keys. Titan also gets a few new configuration options and expanded Metrics coverage in this release.

Downloads:

* Titan: https://github.com/thinkaurelius/titan/wiki/Downloads#titan-042
* Faunus: https://github.com/thinkaurelius/faunus/wiki/Downloads

Something for your weekend!

…Titan Cluster on Cassandra and ElasticSearch on AWS EC2

Saturday, December 21st, 2013

Setting up a Titan Cluster on Cassandra and ElasticSearch on AWS EC2 by Jenny Kim.

From the post:

The purpose of this post is to provide a walkthrough of a Titan cluster setup and highlight some key gotchas I’ve learned along the way. This walkthrough will utilize the following versions of each software package:

Versions

The cluster in this walkthrough will utilize 2 M1.Large instances, which mirrors our current Staging cluster setup. A typical production graph cluster utilizes 4 M1.XLarge instances.

NOTE: While the Datastax Community AMI requires at minimum, M1.Large instances, the exact instance-type and cluster size should depend on your expected graph size, concurrent requests, and replication and consistency needs.

Great post!

You will be gaining experience with cloud computing along with very high end graph software (Titan).

Tittel [Merry Christmas Marko!]

Thursday, December 19th, 2013

Intel Goes Graph with Hadoop Distro by Alex Woodie.

From the post:

Intel will be targeting big retail operations with a new graph database that it unveiled today as part of its Intel Distribution for Apache Hadoop version 3 announcement. The graph engine will enable customers to make product or customer recommendations in real time, a la Netflix or Amazon, based on existing data. The chip giant also fleshed out its Hadoop distro with a 20x speedup in encryption functions, a data tokenization option, and a handful of new machine learning algorithms aimed at solving common problems.

Intel got its feet wet with graph analytics a year ago when it released into the open source arena Graph Builder, a set of libraries designed to help developers create graphs based on real world models. Since that first alpha release, Intel developers have streamlined the software and made it easier for users to import, clean, and transform large amounts of data sitting in the graph database. These enhancements will ship in early 2014 as Intel Graph Builder for Apache Hadoop software version 2.

Intel Graph Builder is based on the open source Titan distributed graph database, and uses Pig scripts to trigger queries on top of the graph, says Ritu Kama, director of product management in Intel’s Big Data group. The graph engine adds another analytical option for Intel Hadoop customers, in addition to MapReduce, HBase, Hive, and Mahout, which are all bundled with the distribution.

Yes, Titan, whose development has been led by Marko A. Rodriguez.

I can’t think of a better Christmas present!

Will Tittel be the successor to Wintel?

When you tire of the shallow end of the graph pool, you can answer that question for yourself with Titan and/or the Intel® Distribution.

PS: The download page says:

Download the Intel® Distribution to experience the power of hardware assisted security & enterprise grade performance for Apache Hadoop* big data processing. This 100% Apache Hadoop* open source download delivers core project capabilities with value added Intel® Manager: auto-tuning for hadoop clusters, role based access control for HBase, multi-site scalability and adaptive replication in HBase, and many other features to ease deployment of Hadoop in the enterprise. After registration you will be presented to download TAR or Virtual Machine versions, gain access to online help documentation, and receive a link to Community Forums.

It’s 90 day unrestricted evaluation software.

I’m going to wait until after the holidays to grab a copy.

…Graph Analytics

Wednesday, December 11th, 2013

Big Data in Security – Part III: Graph Analytics by Levi Gundert.

In interview form with Michael Howe and Preetham Raghunanda.

You will find two parts of the exchange particularly interesting:

You mention very large technology companies, obviously Cisco falls into this category as well — how is TRAC using graph analytics to improve Cisco Security products?

Michael: How we currently use graph analytics is an extension of the work we have been doing for some time. We have been pulling data from different sources like telemetry and third-party feeds in order to look at the relationships between them, which previously required a lot of manual work. We would do analysis on one source and analysis on another one and then pull them together. Now because of the benefits of graph technology we can shift that work to a common view of the data and give people the ability to quickly access all the data types with minimal overhead using one tool. Rather than having to query multiple databases or different types of data stores, we have a polyglot store that pulls data in from multiple types of databases to give us a unified view. This allows us two avenues of investigation: one, security investigators now have the ability to rapidly analyze data as it arrives in an ad hoc way (typically used by security response teams) and the response times dramatically drop as they can easily view related information in the correlations. Second are the large-scale data analytics. Folks with traditional machine learning backgrounds can apply algorithms that did not work on previous data stores and now they can apply those algorithms across a well-defined data type – the graph.

For intelligence analysts, being able to pivot quickly across multiple disparate data sets from a visual perspective is crucial to accelerating the process of attribution.

Michael: Absolutely. Graph analytics is enabling a much more agile approach from our research and analysis teams. Previously when something of interest was identified there was an iterative process of query, analyze the results, refine the query, wash, rinse, and repeat. This process moves from taking days or hours down to minutes or seconds. We can quickly identify the known information, but more importantly, we can identify what we don’t know. We have a comprehensive view that enables us to identify data gaps to improve future use cases.

Did you catch the “…to a common view of the data…” caveat in the fourth sentence of Michael’s first reply?

Not to deny the usefulness of Titan (the graph solution being discussed) but to point out that current graphs require normalization of data.

For Cisco, that is a winning solution.

But then Cisco can use a closed solution based on normalized data.

Importing, analyzing and then returning results to heterogeneous clients could require a different approach.

Or if you have legacy data that spans centuries.

Or even agencies, departments, or work groups.
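The normalization step mentioned above can be sketched in a few lines of Python. The two feeds, their field names (`src_ip`, `ip_address`, `domain`, `hostname`) and the `normalize` helper are all hypothetical, made up for illustration; nothing here comes from Cisco's or Titan's tooling. The point is only that records from different sources must be mapped onto shared identifiers before they can meet in one graph.

```python
from collections import defaultdict

# Two hypothetical feeds describing the same kind of relationship
# (an IP address contacting a domain) under different field names.
telemetry = [
    {"src_ip": "10.0.0.5", "domain": "Example.COM"},
    {"src_ip": "10.0.0.7", "domain": "example.org"},
]
third_party = [
    {"ip_address": "10.0.0.5", "hostname": "example.org."},
]

def normalize(record):
    """Map a source-specific record onto common (ip, domain) identifiers."""
    ip = record.get("src_ip") or record.get("ip_address")
    domain = record.get("domain") or record.get("hostname")
    # Lower-case and strip a trailing dot so both feeds agree on one spelling.
    return ip, domain.lower().rstrip(".")

def build_graph(*feeds):
    """Merge all feeds into one undirected adjacency map."""
    graph = defaultdict(set)
    for feed in feeds:
        for record in feed:
            ip, domain = normalize(record)
            graph[ip].add(domain)
            graph[domain].add(ip)
    return graph

graph = build_graph(telemetry, third_party)
print(sorted(graph["example.org"]))  # pivot from a domain to every IP that touched it
print(sorted(graph["10.0.0.5"]))     # pivot from an IP to every domain it touched
```

Without the `normalize` step, "example.org." and "example.org" would become two unrelated nodes and the pivot from the domain would miss one of the addresses. That mapping is cheap for two feeds you control; it is the part that gets expensive with heterogeneous clients or legacy data.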

Instructions for deploying an Elasticsearch Cluster with Titan

Friday, December 6th, 2013

Instructions for deploying an Elasticsearch Cluster with Titan by Benjamin Bengfort.

From the post:

Elasticsearch is an open source distributed real-time search engine for the cloud. It allows you to deploy a scalable, auto-discovered cluster of nodes, and as search capacity grows, you simply need to add more nodes and the cluster will reorganize itself. Titan, a distributed graph engine by Aurelius supports elasticsearch as an option to index your vertices for fast lookup and retrieval. By default, Titan supports elasticsearch running in the same JVM and storing data locally on the client, which is fine for embedded mode. However, once your Titan cluster starts growing, you have to respond by growing an elasticsearch cluster side by side with the graph engine.

This tutorial is how to quickly get an elasticsearch cluster up and running on EC2, then configuring Titan to use it for indexing. It assumes you already have an EC2/Titan cluster deployed. Note, that these instructions were for a particular deployment, so please forward any questions about specifics in the comments!

A great tutorial. Short, on point and references other resources.

Enjoy!

MusicGraph

Wednesday, December 4th, 2013

Senzari releases a searchable MusicGraph service for making musical connections by Josh Ong.

From the post:

Music data company Senzari has launched MusicGraph, a new service for discovering music by searching through a graph of over a billion music-related data points.

MusicGraph includes a consumer-facing version and an API that can be used for commercial purposes. Senzari built the graph while working on the recommendation engine for its own streaming service, which has been rebranded as Wahwah.

Interestingly, MusicGraph is launching first on Firefox OS before coming to iOS, Android and Windows Phone in “the coming weeks.”

You know how much I try to avoid “practical” applications but when I saw aureliusgraphs tweet this as using the Titan database, I just had to mention it. 😉

I think this announcement underlines something a commenter said recently about promoting topic maps for what they do, not because they are topic maps.

Here, graphs are being promoted as the source of a great user experience, not because they are fun, powerful, etc. (all of which is also true).

Boutique Graph Data with Titan

Wednesday, November 27th, 2013

Boutique Graph Data with Titan by Marko A. Rodriguez.

From the post:

Titan is a distributed graph database capable of supporting graphs on the order of 100 billion edges and sustaining on the order of 1 billion transactions a day (see Educating the Planet with Pearson). Software architectures that leverage such Big Graph Data typically have 100s of application servers traversing a distributed graph represented across a multi-machine cluster. These architectures are not common in that perhaps only 1% of applications written today require that level of software/machine power to function. The other 99% of applications may only require a single machine to store and query their data (with a few extra nodes for high availability). Such boutique graph applications, which typically maintain on the order of 100 million edges, are more elegantly served by Titan 0.4.1+. In Titan 0.4.1, the in-memory caches have been advanced to support faster traversals which makes Titan’s single-machine performance comparable to other single machine-oriented graph databases. Moreover, as the application scales beyond the confines of a single machine, simply adding more nodes to the Titan cluster allows boutique graph applications to seamlessly grow to become Big Graph Data applications (see Single Server to Highly Available Cluster).

A short walk on the technical side of Titan.
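For a sense of scale, the figures quoted above can be turned into per-second and relative numbers. This is plain arithmetic on the post's "on the order of" figures, so treat the results as rough magnitudes and not as benchmarks; the variable names are mine.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

transactions_per_day = 1_000_000_000  # "on the order of 1 billion transactions a day"
big_graph_edges = 100_000_000_000     # "on the order of 100 billion edges"
boutique_edges = 100_000_000          # "on the order of 100 million edges"

# Average sustained load if the transactions were spread evenly over the day.
avg_tx_per_second = transactions_per_day / SECONDS_PER_DAY

# How much larger the "Big Graph Data" case is than the "boutique" case.
size_ratio = big_graph_edges // boutique_edges

print(f"average load: {avg_tx_per_second:,.0f} transactions/second")
print(f"big graph is {size_ratio:,}x the boutique graph")
```

That is roughly 11,574 transactions per second on average, and a thousandfold gap in edge count between the two classes of application.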

I would replace “boutique” with “big data” and say Titan allows customers to seamlessly transition from “big data” to “bigger data.”

Having “big data” is like having a large budget under your control.

What matters to the user is the status of claiming to possess it.

Let’s not disillusion them. 😉

Titan 0.4.1 Release

Monday, November 25th, 2013

Titan 0.4.1 Release

From the release notes:

Tested Compatibility:

  • Cassandra 1.2.2
  • HBase 0.94.12
  • BerkeleyJE 5.0.73
  • Elasticsearch 0.90.5
  • Lucene 4.4.0
  • Persistit 3.3.0
  • Java 1.7+ (partially compatible with Java 1.6)

Features:

  • Property pre-fetching to speed up multiple property lookups per vertex. Configurable through fast-property option.
  • Shortened HBase column-family names to reduce the HBase storage footprint. This feature is disabled by default for backwards-compatibility. Enable it via storage.short-cf-names
  • Metrics per Transaction: Gathering measurements on the transaction level and group them by transaction template name configurable through graph.buildTransaction().setMetricsPrefix(String)
  • Metrics Ganglia and Graphite support
  • Improvements to the internal memory structures and algorithms of a Titan transaction which lead to much improved traversal times (a lot of credit goes to Pavel for these optimizations!!)
  • Added database level cache for lower latency query answering against warm data. Enable via cache.db-cache. Learn more about Database Cache.
  • Better caching implementation for relations (RelationCache) to provide faster de-serialization performance
  • Addition of a new query optimizer that can significantly speed up a subset of traversals
  • Support for reverse ordering in vertex centric queries by defining: makeLabel(..).sortKey(..).sortOrder(Order.DESC).make()
  • Support for index configuration parameters passed into KeyMaker.indexed(String,Class,Parameter…) to change the default indexing behavior of an indexing backend.
  • Support for TEXT and STRING mapping of strings in both Lucene and ElasticSearch configurable as a parameter. Learn more about Full Text and String Search
  • Refactored Text.REGEX/PREFIX to Text.CONTAINS_REGEX/CONTAINS_PREFIX to accurately reflect their semantics. Added Text.REGEX/PREFIX for full string matching. See Indexing Backend Overview
  • Added support for scaling the id allocation to hundreds of parallel Titan instances through additional configuration options. See Bulk Loading.
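The Text.REGEX/PREFIX refactoring in the list above is easy to misread, so here is a small Python sketch of the distinction as I understand it: the CONTAINS_* predicates are satisfied by any single word of a tokenized text, while the plain predicates apply to the whole string. The function names, the tokenizer and the case handling are my own simplifications; real behavior depends on the indexing backend and its analyzer, and this is not Titan code.

```python
import re

def tokenize(text):
    """Split text into lower-cased word tokens, roughly as a full-text index might."""
    return re.findall(r"\w+", text.lower())

# Full-text style (TEXT mapping): true if at least one token satisfies the predicate.
def contains_prefix(text, prefix):
    return any(tok.startswith(prefix.lower()) for tok in tokenize(text))

def contains_regex(text, pattern):
    return any(re.fullmatch(pattern, tok) is not None for tok in tokenize(text))

# String style (STRING mapping): the predicate is applied to the whole, untokenized value.
def prefix(value, query):
    return value.startswith(query)

def regex(value, pattern):
    return re.fullmatch(pattern, value) is not None

name = "Titan Graph Database"
print(contains_prefix(name, "gra"))      # True: the token "graph" starts with "gra"
print(prefix(name, "gra"))               # False: the whole string starts with "Titan"
print(prefix(name, "Titan"))             # True
print(contains_regex(name, r"data.*"))   # True: the token "database" matches
print(regex(name, r"data.*"))            # False: the whole string does not match
```

If you upgrade code that used the old Text.PREFIX or Text.REGEX and expected word-level matching, the release note implies you now want the CONTAINS_ variants.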

Bugfixes:

  • Fixed multiQuery() for specific has() conditions. Added support for multiQuery(Collection).
  • Fixed limit adjustment issue for unconstrained IN/OUT queries
  • Fixed packaging issues
  • Fixed cache misses due to wrong limit interpretation

Looks like it is time for an upgrade!

BTW, this is an experimental version so NSFP (Not Safe For Production).

Titan 0.4.1 wikidoc.

Download Titan 0.4.1.

Using AWS to Build a Graph-based…

Friday, November 22nd, 2013

Using AWS to Build a Graph-based Product Recommendation System by Andre Fatala and Renato Pedigoni.

From the description:

Magazine Luiza, one of the largest retail chains in Brazil, developed an in-house product recommendation system, built on top of a large knowledge Graph. AWS resources like Amazon EC2, Amazon SQS, Amazon ElastiCache and others made it possible for them to scale from a very small dataset to a huge Cassandra cluster. By improving their big data processing algorithms on their in-house solution built on AWS, they improved their conversion rates on revenue by more than 25 percent compared to market solutions they had used in the past.

Not a lot of technical details but a good success story to repeat if you are pushing graph-based services.

I first saw this in a tweet by Marko A. Rodriguez.

Faunus & Titan 0.4.0 Released

Wednesday, October 16th, 2013

Faunus & Titan 0.4.0 Released by Dan LaRocque.

Dan’s post:

Aurelius is pleased to announce the release of Titan and Faunus 0.4.0.

This is a new major release which changes Titan’s client API, internal architecture, and storage format, and as such should be considered non-stable for now.

Downloads:

* https://github.com/thinkaurelius/titan/wiki/Downloads#titan-040-experimental-release

* https://github.com/thinkaurelius/faunus/wiki/Downloads

The artifacts have propagated to Maven Central, though they have yet to appear in the search index on search.maven.org.

New Titan features:

* MultiQuery, which speeds up traversal queries by an order of magnitude for common branching factors

* Initial Fulgora release with the introduction of an in-memory storage backend for Titan based on Hazelcast

* A new Persistit backend (special thanks to Blake Eggleston)

* Completely refactored query optimization and execution framework which makes query answering faster – in particular for GraphQuery

* Metrics integration for monitoring

* additional GraphQuery primitives and support in ElasticSearch and Lucene

* refactoring and deeper testing of the standard locking implementation

* redesigned type definition API

* much more
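Why would a multi-vertex query help so much with "common branching factors"? A toy Python model, under the assumption that the cost of a traversal is dominated by round trips to the storage backend. The `CountingBackend` class and both traversal functions are hypothetical and say nothing about how Titan implements MultiQuery; they only show how batching changes the number of requests.

```python
class CountingBackend:
    """Hypothetical storage backend that counts how many requests it receives."""

    def __init__(self, adjacency):
        self._adjacency = adjacency
        self.round_trips = 0

    def neighbors(self, vertex):
        # One request per vertex.
        self.round_trips += 1
        return list(self._adjacency.get(vertex, []))

    def neighbors_multi(self, vertices):
        # One request for the whole batch of vertices.
        self.round_trips += 1
        return {v: list(self._adjacency.get(v, [])) for v in vertices}

def two_hop_single(backend, start):
    """Two-hop traversal issuing one request per intermediate vertex."""
    result = set()
    for friend in backend.neighbors(start):
        result.update(backend.neighbors(friend))
    return result

def two_hop_batched(backend, start):
    """Same traversal, but the second hop is fetched in a single batched request."""
    friends = backend.neighbors(start)
    result = set()
    for nbrs in backend.neighbors_multi(friends).values():
        result.update(nbrs)
    return result

adjacency = {"a": ["b", "c", "d"], "b": ["e"], "c": ["e", "f"], "d": ["g"]}

single = CountingBackend(adjacency)
batched = CountingBackend(adjacency)
same_answer = two_hop_single(single, "a") == two_hop_batched(batched, "a")
print(same_answer, single.round_trips, batched.round_trips)
```

With a branching factor of 3 the per-vertex version makes 4 requests and the batched one makes 2; with a branching factor of 100 it would be 101 against 2, which is the kind of gap that makes an order-of-magnitude claim plausible.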

Titan 0.4.0 uses a new storage format which is incompatible with older versions of Titan. It also introduces backwards-incompatible API changes around type definition.

Titan release notes:

https://github.com/thinkaurelius/titan/wiki/Release-Notes#version-040-october-16-2013

Titan upgrade instructions:

https://github.com/thinkaurelius/titan/wiki/Upgrade-Instructions#version-040-october-16-2013

New Faunus features:

* Added FaunusRexsterExecutorExtension which allows remote execution of a Faunus script and tracking of its progress

* Global GremlinFaunus variables are now available in ScriptEngine use cases

* Simplified ResultHookClosure with new Gremlin 2.4.0 classes

* The variables hdfs and local are available to `gremlin.sh -e`

Faunus release notes:

https://github.com/thinkaurelius/faunus/wiki/Release-Notes

Both Faunus and Titan now support version 2.4.0 of the TinkerPop stack, including Blueprints.

Both Faunus and Titan now require Java 7.

Thanks to everybody who contributed code and reported bugs in the 0.3.x series and helped us improve this release.

Enjoy!

Titanium

Wednesday, October 16th, 2013

Titanium

From the homepage:

Clojure library for using the Titan graph database, built on top of Archimedes and Ogre.

The Get Started! page is slightly more verbose:

This guide is meant to provide a quick taste of Titanium and all the power it provides. It should take about 10 minutes to read and study the provided code examples. The contents include:

  • What Titanium is
  • What Titanium is not
  • Clojure and Titan version requirements
  • How to include Titanium in your project
  • A very brief introduction to graph databases
  • How to create vertices and edges
  • How to find vertices again
  • How to execute simple queries
  • How to remove objects
  • Graph theory for smug lisp weenies

You may also like:

Read doc guides

Join the Mailing List (Google group)

An empirical comparison of graph databases

Friday, September 13th, 2013

An empirical comparison of graph databases by Salim Jouili and Valentin Vansteenberghe.

Abstract:

In recent years, more and more companies provide services that can not be anymore achieved efficiently using relational databases. As such, these companies are forced to use alternative database models such as XML databases, object-oriented databases, document-oriented databases and, more recently graph databases. Graph databases only exist for a few years. Although there have been some comparison attempts, they are mostly focused on certain aspects only.

In this paper, we present a distributed graph database comparison framework and the results we obtained by comparing four important players in the graph databases market: Neo4j, OrientDB, Titan and DEX.

(Salim Jouili and Valentin Vansteenberghe, An empirical comparison of graph databases. To appear in Proceedings of the 2013 ASE/IEEE International Conference on Big Data, Washington D.C., USA, September 2013.)

For your convenience:

DEX

Neo4j

OrientDB

Titan

I won’t reproduce the comparison graphs here. The “winner” depends on your requirements.

Looking forward to seeing this graph benchmark develop!

STEFFI…

Monday, September 9th, 2013

STEFFI – Scalable Traversal Engine For Fast In-memory graphDB

From the webpage:

STEFFI is a distributed graph database fully in-memory and amazingly fast when it comes to querying large datasets.

As a scalable graph database, STEFFI’s performance can directly be compared to Neo4j and Titan. It provides its users with a clear competitive advantage when it comes to complicated traversal operations on large datasets. Speedups of up to 200 have been observed when comparing STEFFI with its alternatives.

More than an alternative to existing solutions, STEFFI opens up new possibilities for high-performance graph storage and manipulation.

Main features

  • in-memory storage for a fast random access
  • distributed parallel computing for high-speed graph queries
  • graph traversal engine for graph processing
  • scalability for a growing data
  • implementing the Blueprints API from TinkerPop for an enhanced accessibility

Recommended for

  • fast recommendation engines (e-commerce, telecommunications, finance, …)
  • large biological networks analysis (biopharma, healthcare, … )
  • security networks management & real-time fraud detection (bank, public institutions, …)
  • complex network & data center management (telecommunications, e-commerce, …)
  • and much more!

Availability

STEFFI is currently in its incubation phase within EURA NOVA. Once the code is mature and stable enough, STEFFI will be provided via this website under the Apache Licence Version 2. If you would like to know more about this project evolution, do not hesitate to subscribe to our mailing list or contact EURA NOVA.

I haven’t run the performance tests personally against Neo4j and Titan but the reported performance gains (200X and 150X, respectively) are impressive.

BTW, you probably want the paper that led to STEFFI, imGraph: A distributed in-memory graph database by Salim Jouili and Aldemar Reynaga.