Archive for the ‘Faunus’ Category

Powers of Ten – Part II

Monday, June 2nd, 2014

Powers of Ten – Part II by Stephen Mallette.

From the post:

“‘Curiouser and curiouser!’ cried Alice (she was so much surprised, that for the moment she quite forgot how to speak good English); ‘now I’m opening out like the largest telescope that ever was!’”
    — Lewis Carroll, Alice’s Adventures in Wonderland

It is sometimes surprising to see just how much data is available. Much like Alice and her sudden increase in height, in Lewis Carroll’s famous story, the upward growth of data can happen quite quickly and the opportunity to produce a multi-billion edge graph becomes immediately present. Luckily, Titan is capable of scaling to accommodate such size and with the right strategies for loading this data, the development efforts can more rapidly shift to the rewards of massive scale graph analytics.

This article represents the second installment in the two-part Powers of Ten series that discusses bulk loading data into Titan at varying scales. For purposes of this series, the “scale” is determined by the number of edges to be loaded. As it so happens, the strategies for bulk loading tend to change as the scale increases over powers of ten, which creates a memorable way to categorize different strategies. “Part I” of this series looked at strategies for loading millions and tens of millions of edges and focused on usage of Gremlin to do so. This part of the series will focus on hundreds of millions and billions of edges, using Faunus as the loading tool.

Note: By Titan 0.5.0, Faunus will be pulled into the Titan project under the name Titan/Hadoop.

Scaling graph processing to hundreds of millions and billions of edges.

Deeply interesting work but I am left with multiple questions:

  • Hundreds of millions and billions of edges to load. Any other graph metrics? Traversal performance, for example?
  • Does loading performance scale with more servers? Instead of m2.4xlarge EC2 instances, what is the performance with 8× as many?
  • What kind of knob tuning was useful with a social network dataset?

I am sure there are other questions but those are the first ones that came to mind.

Powers of Ten – Part I

Saturday, May 31st, 2014

Powers of Ten – Part I by Stephen Mallette.

From the post:

“‘No, no! The adventures first,’ said the Gryphon in an impatient tone: ‘explanations take such a dreadful time.’”
    — Lewis Carroll, Alice’s Adventures in Wonderland

It is often quite simple to envision the benefits of using Titan. Developing complex graph analytics over a multi-billion edge distributed graph represent the adventures that await. Like the Gryphon from Lewis Carroll’s tale, the desire to immediately dive into the adventures can be quite strong. Unfortunately and quite obviously, the benefits of Titan cannot be realized until there is some data present within it. Consider the explanations that follow; they are the strategies by which data is bulk loaded to Titan enabling the adventures to ensue.

There are a number of different variables that might influence the approach to loading data into a graph, but the attribute that provides the best guidance in making a decision is size. For purposes of this article, “size” refers to the estimated number of edges to be loaded into the graph. The strategy used for loading data tends to change in powers of ten, where the strategy for loading 1 million edges is different than the approach for 10 million edges.

Given this neat and memorable way to categorize batch loading strategies, this two-part article outlines each strategy starting with the smallest at 1 million edges or less and continuing in powers of ten up to 1 billion and more. This first part will focus on 1 million and 10 million edges, which generally involves common Gremlin operations. The second part will focus on 100 million and 1 billion edges, which generally involves the use of Faunus.
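The powers-of-ten rule of thumb can be captured in a few lines. A minimal sketch follows; the thresholds mirror the article's framing, but the function and strategy labels are illustrative, not part of any Titan API:

```python
import math

def loading_strategy(edge_count):
    """Pick a bulk-loading strategy by order of magnitude of the edge count.

    Thresholds follow the powers-of-ten framing of the article;
    the strategy names are illustrative labels only.
    """
    magnitude = math.floor(math.log10(max(edge_count, 1)))
    if magnitude < 7:        # up to ~10 million edges
        return "single-threaded Gremlin script"
    elif magnitude < 8:      # tens of millions of edges
        return "multi-threaded Gremlin with batch transactions"
    else:                    # hundreds of millions and beyond
        return "Faunus/Hadoop bulk load"

print(loading_strategy(1_000_000))
print(loading_strategy(250_000_000))
```

The point of the dispatcher is simply that the decision is driven by one variable, the edge count, taken in powers of ten.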

Great guidance on loading relatively small data sets using Gremlin. Looking forward to seeing the harder tests with 100 million and 1 billion edge sets.

Titan 0.4.4 / Faunus 0.4.4

Tuesday, April 22nd, 2014

I saw a tweet earlier today from aurelius that Titan 0.4.4 and Faunus 0.4.4 are available.

Grab your copy at:

Faunus Downloads

Titan Downloads

Enjoy!

Faunus & Titan 0.4.2 Released

Friday, January 10th, 2014

Faunus & Titan 0.4.2 Released by Dan LaRocque.

From the post:

Aurelius is pleased to announce the release of Titan and Faunus 0.4.2.

This is mainly a bugfix release. Of particular note is a pair of Titan bugs involving deletion of edges with multiple properties and of edges labeled with reverse-ordered sort keys. Titan also gets a few new configuration options and expanded Metrics coverage in this release.

Downloads:

* Titan: https://github.com/thinkaurelius/titan/wiki/Downloads#titan-042
* Faunus: https://github.com/thinkaurelius/faunus/wiki/Downloads

Something for your weekend!

Boutique Graph Data with Titan

Wednesday, November 27th, 2013

Boutique Graph Data with Titan by Marko A. Rodriguez.

From the post:

Titan is a distributed graph database capable of supporting graphs on the order of 100 billion edges and sustaining on the order of 1 billion transactions a day (see Educating the Planet with Pearson). Software architectures that leverage such Big Graph Data typically have 100s of application servers traversing a distributed graph represented across a multi-machine cluster. These architectures are not common in that perhaps only 1% of applications written today require that level of software/machine power to function. The other 99% of applications may only require a single machine to store and query their data (with a few extra nodes for high availability). Such boutique graph applications, which typically maintain on the order of 100 million edges, are more elegantly served by Titan 0.4.1+. In Titan 0.4.1, the in-memory caches have been advanced to support faster traversals which makes Titan’s single-machine performance comparable to other single machine-oriented graph databases. Moreover, as the application scales beyond the confines of a single machine, simply adding more nodes to the Titan cluster allows boutique graph applications to seamlessly grow to become Big Graph Data applications (see Single Server to Highly Available Cluster).

A short walk on the technical side of Titan.

I would replace “boutique” with “big data” and say Titan allows customers to seamlessly transition from “big data” to “bigger data.”

Having “big data” is like having a large budget under your control.

What matters to the user is the status of claiming to possess it.

Let’s not disillusion them. 😉

Faunus 0.4.1 Release

Monday, November 25th, 2013

Faunus 0.4.1 Release

I don’t find this change reflected in the 0.4.1 release notes but elsewhere Marko Rodriguez writes:

I tested the new code on a subset of the Friendster data (6 node Hadoop and 6 node Cassandra cluster).

  • vertices: 7 minutes to write 39 million vertices at ~100mb/second from the Hadoop to the Cassandra cluster.
  • edges: 15 minutes to write 245 million edges at ~40mb/second from the Hadoop to the Cassandra cluster.

This is the fastest bulk load time I’ve seen to date. This means, DBPedia can be written in ~20 minutes! I’ve attached an annotated version of the Ganglia monitor to the email that shows the outgoing throughput for the various stages of the MapReduce job. In the past, I was lucky to get 5-10mb/second out of the edge writing stage (this had to do with how I was being dumb about how reduce worked in Hadoop — wasn’t considering the copy/shuffle aspect of the stage).

At this rate, this means we can do billion edges graphs in a little over 1 hour. I bet though I can now speed this up more with some parameter tweaking as I was noticing that Cassandra was RED HOT and locking up a few times on transaction commits. Anywho, Faunus 0.4.1 is going to be gangbusters!

Approximately one billion edges an hour?

It’s not > /dev/null speed but still quite respectable. 😉
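The arithmetic behind “a little over 1 hour” checks out, using only the figures quoted above:

```python
# Sanity-check the quoted Friendster figures: 245 million edges in 15 minutes.
edges_per_minute = 245_000_000 / 15

# Extrapolate to a billion-edge graph at the same sustained rate.
minutes_for_a_billion = 1_000_000_000 / edges_per_minute
print(round(minutes_for_a_billion))  # ~61 minutes, i.e. "a little over 1 hour"
```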

Faunus 0.4.1 wikidoc.

Download Faunus 0.4.1.

Faunus & Titan 0.4.0 Released

Wednesday, October 16th, 2013

Faunus & Titan 0.4.0 Released by Dan LaRocque.

Dan’s post:

Aurelius is pleased to announce the release of Titan and Faunus 0.4.0.

This is a new major release which changes Titan’s client API, internal architecture, and storage format, and as such should be considered non-stable for now.

Downloads:

* https://github.com/thinkaurelius/titan/wiki/Downloads#titan-040-experimental-release

* https://github.com/thinkaurelius/faunus/wiki/Downloads

The artifacts have propagated to Maven Central, though they have yet to appear in the search index on search.maven.org.

New Titan features:

* MultiQuery, which speeds up traversal queries by an order of magnitude for common branching factors

* Initial Fulgora release with the introduction of an in-memory storage backend for Titan based on Hazelcast

* A new Persistit backend (special thanks to Blake Eggleston)

* Completely refactored query optimization and execution framework which makes query answering faster – in particular for GraphQuery

* Metrics integration for monitoring

* additional GraphQuery primitives and support in ElasticSearch and Lucene

* refactoring and deeper testing of the standard locking implementation

* redesigned type definition API

* much more

Titan 0.4.0 uses a new storage format which is incompatible with older versions of Titan. It also introduces backwards-incompatible API changes around type definition.

Titan release notes:

https://github.com/thinkaurelius/titan/wiki/Release-Notes#version-040-october-16-2013

Titan upgrade instructions:

https://github.com/thinkaurelius/titan/wiki/Upgrade-Instructions#version-040-october-16-2013

New Faunus features:

* Added FaunusRexsterExecutorExtension which allows remote execution of a Faunus script and tracking of its progress

* Global GremlinFaunus variables are now available in ScriptEngine use cases

* Simplified ResultHookClosure with new Gremlin 2.4.0 classes

* The variables hdfs and local are available to `gremlin.sh -e`

Faunus release notes:

https://github.com/thinkaurelius/faunus/wiki/Release-Notes

Both Faunus and Titan now support version 2.4.0 of the Tinkerpop stack, including Blueprints.

Both Faunus and Titan now require Java 7.

Thanks to everybody who contributed code and reported bugs in the 0.3.x series and helped us improve this release.

Enjoy!

Loopy Lattices Redux

Thursday, June 13th, 2013

Loopy Lattices Redux by Marko A. Rodriguez.

Comparison of Titan and Faunus counting the number of paths in a 20 x 20 lattice.

Interesting from a graph-theoretic perspective, but since the count can be determined analytically, I am not sure of the utility of being able to count the paths.
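For reference, the analytic count in question: assuming the directed lattice of the original post (monotonic paths only, moving right/down), the number of paths across an n × n lattice is the central binomial coefficient C(2n, n), so no traversal is needed to get the answer:

```python
from math import comb

def lattice_paths(n):
    """Monotonic (right/down only) paths across an n x n lattice: C(2n, n)."""
    return comb(2 * n, n)

print(lattice_paths(20))  # 137846528820 -- the 20 x 20 lattice from the post
```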

In some ways this reminds me of Counting complex disordered states by efficient pattern matching: chromatic polynomials and Potts partition functions by Marc Timme, Frank van Bussel, Denny Fliegner and Sebastian Stolzenberg, New Journal of Physics 11 (2009) 023001.

The question Timme and colleagues were investigating was the coloring of nodes in a graph which depended upon the coloring of other nodes. For a chess board sized graph, the calculation is estimated to take billions of years. The technique developed here takes less than seven (7) seconds for a chess board sized graph.

Traditionally, assigning a color to a vertex required knowledge of the entire graph. Here, instead of assigning a color, the color that should be assigned is represented by a formula stating the unknowns. Once all the nodes have such a formula:

The computation of the chromatic polynomial has been reduced to a process of alternating expansion of expressions and symbolically replacing terms in an appropriate order. In the language of computer science, these operations are represented as the expanding, matching and sorting of patterns, making the algorithm suitable for computer algebra programs optimized for pattern matching.
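As a baseline for what such a technique replaces: the textbook way to evaluate a chromatic polynomial is the deletion-contraction recurrence, which is exponential in the number of edges. This sketch is the naive method the paper improves on, not its pattern-matching algorithm:

```python
def chromatic(edges, n, k):
    """Number of proper k-colorings of a graph with n vertices (0..n-1),
    via deletion-contraction: P(G, k) = P(G - e, k) - P(G / e, k).
    Exponential in len(edges); suitable for tiny graphs only."""
    if not edges:
        return k ** n  # edgeless graph: every assignment is proper
    (u, v), rest = edges[0], edges[1:]
    deleted = chromatic(rest, n, k)  # P(G - e, k)
    # Build G / e: merge v into u, then relabel vertices above v down by one.
    contracted = set()
    for a, b in rest:
        a = u if a == v else a
        b = u if b == v else b
        if a == b:
            return deleted  # contraction made a loop, so P(G / e, k) = 0
        a = a - 1 if a > v else a
        b = b - 1 if b > v else b
        contracted.add((min(a, b), max(a, b)))  # parallel edges collapse
    return deleted - chromatic(sorted(contracted), n - 1, k)

# Triangle: P(K3, k) = k(k-1)(k-2), so 6 proper 3-colorings.
print(chromatic([(0, 1), (1, 2), (0, 2)], 3, 3))  # 6
```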

What isn’t clear is whether a similar technique could be applied to merging conditions where the merging state of a proxy depends upon, potentially, all other proxies.

Faunus: Graph Analytics Engine

Friday, May 17th, 2013

Faunus: Graph Analytics Engine by Marko Rodriguez.

From the description:

Faunus is a graph analytics engine built atop the Hadoop distributed computing platform. The graph representation is a distributed adjacency list, whereby a vertex and its incident edges are co-located on the same machine. Querying a Faunus graph is possible with a MapReduce-variant of the Gremlin graph traversal language. A Gremlin expression compiles down to a series of MapReduce-steps that are sequence optimized and then executed by Hadoop. Results are stored as transformations to the input graph (graph derivations) or computational side-effects such as aggregates (graph statistics). Beyond querying, a collection of input/output formats are supported which enable Faunus to load/store graphs in the distributed graph database Titan, various graph formats stored in HDFS, and via arbitrary user-defined functions. This presentation will focus primarily on Faunus, but will also review the satellite technologies that enable it.

I saw this slide deck after posting ConceptNet5 [Herein of Hypergraphs] and writing about the “id-less” nodes and edges of ConceptNet5.

So when I see nodes and edges with IDs, I have to wonder why?

What requirement is being met, or what advantage is obtained, by using IDs rather than addressing a node by its content?*

Remembering that we are no longer concerned with shaving bits off of identifiers for storage and/or processing concerns.


* I suspect that addressing by content presumes a level of granularity that may not be appropriate in all cases. Hard to say. But I do want to look at the issue more closely.

Adding Value through graph analysis…

Friday, March 8th, 2013

Adding Value through graph analysis using Titan and Faunus by Matthias Broecheler.

Alludes to Titan 0.3.0 release but the latest I saw at the Titan site was 0.2.0. Perhaps 0.3.0 will be along presently.

I don’t recall seeing Titan listed in the Literature Survey of Graph Databases so I have sent the author a note about including Titan in any updates to the survey.

BTW, I would not take the ages on slide 35 seriously. 😉

Distributed Graph Computing with Gremlin

Thursday, March 7th, 2013

Distributed Graph Computing with Gremlin by Marko A. Rodriguez.

From the post:

The script-step in Faunus’ Gremlin allows for the arbitrary execution of a Gremlin script against all vertices in the Faunus graph. This simple idea has interesting ramifications for Gremlin-based distributed graph computing. For instance, it is possible to evaluate a Gremlin script on every vertex in the source graph (e.g. Titan) in parallel while maintaining data/process locality. This section will discuss the following two use cases.

  • Global graph mutations: parallel update vertices/edges in a Titan cluster given some arbitrary computation.
  • Global graph algorithms: propagate information to arbitrary depths in a Titan cluster in order to compute some algorithm in a parallel fashion.
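Both use cases reduce to “apply a function to every vertex independently.” A toy Python stand-in for the idea follows; the real mechanism is a Gremlin script executed inside Hadoop map tasks, not this code, and the graph and property names here are made up:

```python
from concurrent.futures import ThreadPoolExecutor

# A tiny in-memory "graph": vertex id -> property map.
graph = {
    "hercules": {"battles": 0},
    "nemean":   {"battles": 0},
}

def script(item):
    """Per-vertex script: runs with no knowledge of other vertices,
    which is what makes it trivially parallelizable."""
    vid, props = item
    return vid, {**props, "battles": props["battles"] + 1}

# Evaluate the script against all vertices in parallel.
with ThreadPoolExecutor() as pool:
    updated = dict(pool.map(script, graph.items()))

print(updated["hercules"]["battles"])  # 1
```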

Another must read post from Marko A. Rodriguez!

Also a reminder that I need to pull out my Oxford Classical Dictionary to add some material to the mythology graph.

Importing RDF into Faunus

Tuesday, January 15th, 2013

RDF Format

Description of RDFInputFormat for Faunus to convert the edge list format of RDF into the adjacency list used by Faunus.

Currently supports:

  • rdf-xml
  • n-triples
  • turtle
  • n3
  • trix
  • trig
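Whatever the serialization, the core transformation is the same: group an edge list by source vertex to get the adjacency-list layout Faunus operates on. A minimal sketch, with made-up triples:

```python
from collections import defaultdict

# A tiny RDF-style edge list: (subject, predicate, object) triples.
triples = [
    ("titan",  "dependsOn", "cassandra"),
    ("faunus", "dependsOn", "hadoop"),
    ("faunus", "readsFrom", "titan"),
]

# Group by subject: each vertex is stored together with all of its
# outgoing edges, i.e. the adjacency-list representation.
adjacency = defaultdict(list)
for subj, pred, obj in triples:
    adjacency[subj].append((pred, obj))

print(dict(adjacency))
```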

The converter won’t help with the lack of specified identification properties.

But, format conversion can’t increase the amount of information stored in a format.

At best it can be lossless.

Big Graph Data on Hortonworks Data Platform

Thursday, December 13th, 2012

Big Graph Data on Hortonworks Data Platform by Marko Rodriguez.

The Hortonworks Data Platform (HDP) conveniently integrates numerous Big Data tools in the Hadoop ecosystem. As such, it provides cluster-oriented storage, processing, monitoring, and data integration services. HDP simplifies the deployment and management of a production Hadoop-based system.

In Hadoop, data is represented as key/value pairs. In HBase, data is represented as a collection of wide rows. These atomic structures make global data processing (via MapReduce) and row-specific reading/writing (via HBase) simple. However, writing queries is nontrivial if the data has a complex, interconnected structure that needs to be analyzed (see Hadoop joins and HBase joins). Without an appropriate abstraction layer, processing highly structured data is cumbersome. Indeed, choosing the right data representation and associated tools opens up otherwise unimaginable possibilities. One such data representation that naturally captures complex relationships is a graph (or network). This post presents Aurelius’ Big Graph Data technology suite in concert with Hortonworks Data Platform. Moreover, for a real-world grounding, a GitHub clone is described in this context to help the reader understand how to use these technologies for building scalable, distributed, graph-based systems.

If you like graphs at all or have been looking at graph solutions, you are going to like this post.

Solving Problems with Graphs

Monday, December 3rd, 2012

Solving Problems with Graphs by Marko A. Rodriguez.

Marko covers solving problems with graphs in general and then gives an overview of Titan (a distributed graph database), Faunus (graph analytic engine) and Fulgora (graph processor).

My only misgiving about graphs is that we know very little of the world’s data is stored in graph format. And that is unlikely to change in the foreseeable future. ETL will suffice to convert some data to obtain the advantages of graph processing, but what of data that isn’t converted?

Unlike the W3C, I have a high degree of confidence that the world is not going to adapt itself to any one solution or even a range of solutions.

The majority of data (from a current perspective), will be in “legacy” formats, the next largest portion in the successful formats just prior to the latest one, and the smallest portion, the latest proposed new format.

Big data should address the “not my format” problem in addition to running after large amounts of sensor data.

Faunus Provides Big Graph Data Analytics

Monday, November 12th, 2012

Faunus Provides Big Graph Data Analytics by Marko A. Rodriguez.

Marko walks through the processing of:

The DBpedia knowledge base currently describes 3.77 million things, out of which 2.35 million are classified in a consistent Ontology, including 764,000 persons, 573,000 places (including 387,000 populated places), 333,000 creative works (including 112,000 music albums, 72,000 films and 18,000 video games), 192,000 organizations (including 45,000 companies and 42,000 educational institutions), 202,000 species and 5,500 diseases.

In Titan with Faunus.

If you had any doubts about Faunus, walking through the processing of DBpedia should make you more confident.

Path Report: dbpedia graph

Saturday, November 10th, 2012

Marko A. Rodriguez tweets:

There are 251,818,304,970,074,185 (251 quadrillion) length 5 paths in the #dbpedia graph.

Just in case you are curious.

With a pointer to: Faunus.

One of the use cases for Faunus is graph derivation:

Given an input graph, derive a new graph based upon the input graph’s structure and semantics. Other terms include graph rewriting and graph transformations.

Sounds like merging would fit into “derivation,” “graph rewriting” and “graph transformation” doesn’t it?

Or even spawning content in one graph based on its structure or semantics, using structure and semantics from one or more other graphs as sources.

Much to be thought about here.

Faunus

Thursday, September 27th, 2012

Faunus

From the home page:

Faunus is a Hadoop based distributed computing framework for processing property graphs. A breadth-first version of the graph traversal language Gremlin operates on a vertex-centric property graph data structure. Faunus provides adaptors to the distributed graph database Titan, any Rexster fronted graph database, and to text and binary graphs stored in HDFS. The provided Gremlin operations and Hadoop graph tools can be extended using MapReduce and Blueprints.

Warning: Limitation on Vertices

Faunus Vertex

  • id: a vertex id is a positive long value and therefore, a graph in Faunus can not have more than 9,223,372,036,854,775,807 vertices.
  • properties: the size of the properties map is denoted by a positive short and therefore there can not exist more than 32,767 properties per vertex.
  • edges:

    • unique labels: edges are indexed by their label using a short and therefore, there can not be more than 32,767 unique labels for the incoming (or outgoing) edges of a vertex.
    • total edges: the edge size for any one label is represented by an int and therefore, for any direction and any label, there can not be more than 2,147,483,647 edges.

Warning: Limitation on Edges

Faunus Edge

  • id: an edge id is a positive long value and therefore, a graph in Faunus can not have more than 9,223,372,036,854,775,807 edges.
  • properties: the size of the properties map is denoted by a positive short and therefore there can not exist more than 32,767 properties per edge.
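The odd-looking numbers in both warnings are simply the maximum values of Java's signed primitive types, which is worth keeping in mind when estimating headroom:

```python
# The limits quoted above are exactly the max values of Java's
# signed primitives used in the Faunus data structures.
max_long  = 2 ** 63 - 1  # vertex and edge ids
max_short = 2 ** 15 - 1  # properties per element; unique edge labels
max_int   = 2 ** 31 - 1  # edges per label and direction

print(max_long)   # 9223372036854775807
print(max_short)  # 32767
print(max_int)    # 2147483647
```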

I don’t like putting limitation warnings in my first post on software but thought you needed to be forewarned. 😉