Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

August 16, 2013

imGraph: A distributed in-memory graph database

Filed under: Graphs,Neo4j,Titan — Patrick Durusau @ 3:51 pm

imGraph: A distributed in-memory graph database by Salim Jouili.

From the post:

Eura Nova contribution

Having these challenges in mind, we introduce a new graph database system called imGraph. We have considered the random access requirement for large graphs as a key factor on deciding the type of storage. Then, we have designed a graph database where all data is stored in memory so the speed of random access is maximized. However, as large graphs can not be completely loaded in the RAM of a single machine, we designed imGraph as distributed graph database. That is, the vertices and the edges are partitioned into subsets, and each subset is located in the memory of one machine belonging to the involved machines (see the following figure). Furthermore, we implemented on imGraph a graph traversal engine that takes advantage of distributed parallel computing and fast in-memory random access to gain performance.
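To make the partitioning idea in the quote concrete, here is a minimal sketch in plain Python. It is not imGraph's code: the modulo placement rule, the function names and the tiny edge list are all assumptions made for illustration. It only shows vertices and edges split across per-machine in-memory maps, and a traversal that counts how often it has to cross a machine boundary.

```python
from collections import deque


def partition_of(vertex_id, num_machines):
    """Place a vertex on a machine by modulo hashing (an assumed rule)."""
    return vertex_id % num_machines


def build_partitions(edges, num_machines):
    """Return one in-memory adjacency dict per (simulated) machine."""
    partitions = [dict() for _ in range(num_machines)]
    for src, dst in edges:
        partitions[partition_of(src, num_machines)].setdefault(src, []).append(dst)
        # Make sure the target vertex also exists on its own machine.
        partitions[partition_of(dst, num_machines)].setdefault(dst, [])
    return partitions


def traverse(partitions, start, max_depth):
    """Breadth-first traversal that counts hops crossing machine boundaries."""
    n = len(partitions)
    seen = {start}
    frontier = deque([(start, 0)])
    remote_hops = 0
    while frontier:
        vertex, depth = frontier.popleft()
        if depth == max_depth:
            continue
        home = partition_of(vertex, n)
        for neighbor in partitions[home].get(vertex, []):
            if partition_of(neighbor, n) != home:
                remote_hops += 1
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen, remote_hops


edges = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4)]
partitions = build_partitions(edges, num_machines=2)
reached, remote_hops = traverse(partitions, start=0, max_depth=2)
print(sorted(reached), remote_hops)
```

In a real distributed system each remote hop is a network message, which is why the placement of vertices matters as much as keeping them in memory.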

I haven’t verified the numbers, but imGraph is reported to have beaten Titan and Neo4j by factors of 150 and 200, respectively, on particular data sets.

Enough to justify reading the paper.

The test machines each had 7.5 GB of memory, which seems a little light to me.

Particularly since the IBM Power 770 server can expand to hold 4 TB of memory.

Imagine the performance on five (5) machines where each has 4 TB of memory.

True, it would be more expensive but at some point, there is only so much performance you can squeeze out of a commodity box.

BTW, the paper: imGraph: A distributed in-memory graph database.

June 13, 2013

Loopy Lattices Redux

Filed under: Faunus,Graphs,Networks,Titan — Patrick Durusau @ 4:45 pm

Loopy Lattices Redux by Marko A. Rodriguez.

Comparison of Titan and Faunus counting the number of paths in a 20 x 20 lattice.

Interesting from a graph-theoretic perspective, but since the count can be determined analytically, I am not sure of the utility of being able to count the paths.
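For the record, the analytic count is a one-liner. The sketch below assumes the usual reading of the problem: paths run from one corner of an n x n lattice to the opposite corner and only ever move right or down, so every path is a sequence of 2n steps of which n go right. The function names are mine, not from the post. A small dynamic-programming count is included as a cross-check.

```python
from math import comb


def lattice_paths_closed_form(n):
    """Choose which n of the 2n steps go right: C(2n, n)."""
    return comb(2 * n, n)


def lattice_paths_dp(n):
    """Count paths row by row; row[j] is the number of paths reaching column j."""
    row = [1] * (n + 1)  # only one way to reach any vertex on the first row
    for _ in range(n):
        for j in range(1, n + 1):
            # paths arriving from the left plus paths arriving from above
            row[j] += row[j - 1]
    return row[n]


print(lattice_paths_closed_form(20))  # 137846528820
print(lattice_paths_dp(20) == lattice_paths_closed_form(20))  # True
```

Neither function ever enumerates a path, which is the point: the graph engines in the post are being exercised on the traversal, not on the answer.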

In some ways this reminds me of Counting complex disordered states by efficient pattern matching: chromatic polynomials and Potts partition functions by Marc Timme, Frank van Bussel, Denny Fliegner and Sebastian Stolzenberg, New Journal of Physics 11 (2009) 023001.

The question Timme and colleagues were investigating was the coloring of nodes in a graph which depended upon the coloring of other nodes. For a chess board sized graph, the calculation is estimated to take billions of years. The technique developed here takes less than seven (7) seconds for a chess board sized graph.

Traditionally, assigning a color to a vertex required knowledge of the entire graph. Here, instead of assigning a color, the color that should be assigned is represented by a formula stating the unknowns. Once all the nodes have such a formula:

The computation of the chromatic polynomial has been reduced to a process of alternating expansion of expressions and symbolically replacing terms in an appropriate order. In the language of computer science, these operations are represented as the expanding, matching and sorting of patterns, making the algorithm suitable for computer algebra programs optimized for pattern matching.
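To see why the traditional calculation blows up, here is a brute-force sketch in Python. It is emphatically not the pattern-matching method of Timme and colleagues. It simply tries every assignment of q colors and keeps the proper ones, then checks the count against the known chromatic polynomial of a cycle, P(C_n, q) = (q-1)^n + (-1)^n (q-1). The function names and the four-vertex example are my own.

```python
from itertools import product


def count_proper_colorings(num_vertices, edges, q):
    """Try all q**num_vertices assignments and keep those with no clashing edge."""
    count = 0
    for coloring in product(range(q), repeat=num_vertices):
        if all(coloring[u] != coloring[v] for u, v in edges):
            count += 1
    return count


def cycle_chromatic_polynomial(n, q):
    """Known closed form for a cycle on n vertices."""
    return (q - 1) ** n + (-1) ** n * (q - 1)


# A four-vertex cycle: the smallest "board" with a loop in it.
square = [(0, 1), (1, 2), (2, 3), (3, 0)]
for q in (2, 3, 4):
    print(q, count_proper_colorings(4, square, q), cycle_chromatic_polynomial(4, q))
```

The brute-force loop visits q to the power of the vertex count, so a chess board sized graph with 64 vertices is hopeless this way.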

What isn’t clear is whether a similar technique could be applied to merging conditions where the merging state of a proxy depends upon, potentially, all other proxies.

May 14, 2013

HeadStart for Planet Earth [Titan]

Filed under: Education,Graphs,Networks,Titan — Patrick Durusau @ 8:45 am

Educating the Planet with Pearson by Marko A. Rodriguez.

From the post:

Pearson is striving to accomplish the ambitious goal of providing an education to anyone, anywhere on the planet. New data processing technologies and theories in education are moving much of the learning experience into the digital space — into massive open online courses (MOOCs). Two years ago Pearson contacted Aurelius about applying graph theory and network science to this burgeoning space. A prototype proved promising in that it added novel, automated intelligence to the online education experience. However, at the time, there did not exist scalable, open-source graph database technology in the market. It was then that Titan was forged in order to meet the requirement of representing all universities, students, their resources, courses, etc. within a single, unified graph. Moreover, beyond representation, the graph needed to be able to support sub-second, complex graph traversals (i.e. queries) while sustaining at least 1 billion transactions a day. Pearson asked Aurelius a simple question: “Can Titan be used to educate the planet?” This post is Aurelius’ answer.

Liking the graph approach in general and Titan in particular does not make me any more comfortable with some aspects of this posting.

You don’t need to spin up a very large Cassandra database on Amazon to see the problems.

Consider the number of concepts for educating the world, some 9,000 if the chart is to be credited.

The Suggested Upper Merged Ontology (SUMO) has “~25,000 terms and ~80,000 axioms when all domain ontologies are combined.”

And those SUMO totals come before you get into the weeds of any particular subject, discipline or course material.

Or the subset of concepts and facts represented in DBpedia:

The English version of the DBpedia knowledge base currently describes 3.77 million things, out of which 2.35 million are classified in a consistent Ontology, including 764,000 persons, 573,000 places (including 387,000 populated places), 333,000 creative works (including 112,000 music albums, 72,000 films and 18,000 video games), 192,000 organizations (including 45,000 companies and 42,000 educational institutions), 202,000 species and 5,500 diseases.

In addition, we provide localized versions of DBpedia in 111 languages. All these versions together describe 20.8 million things, out of which 10.5 million overlap (are interlinked) with concepts from the English DBpedia. The full DBpedia data set features labels and abstracts for 10.3 million unique things in up to 111 different languages; 8.0 million links to images and 24.4 million HTML links to external web pages; 27.2 million data links into external RDF data sets, 55.8 million links to Wikipedia categories, and 8.2 million YAGO categories. The dataset consists of 1.89 billion pieces of information (RDF triples) out of which 400 million were extracted from the English edition of Wikipedia, 1.46 billion were extracted from other language editions, and about 27 million are data links to external RDF data sets. The Datasets page provides more information about the overall structure of the dataset. Dataset Statistics provides detailed statistics about 22 of the 111 localized versions.

I don’t know if the 9,000 concepts cited in the post would be sufficient for a world wide HeadStart program in multiple languages.

Moreover, why would any sane person want a single unified graph to represent course delivery from Zaire to the United States?

How is a single unified graph going to deal with the diversity of educational institutions around the world? A diversity that I take as a good thing.

It sounds like Pearson is offering a unified view of education.

My suggestion is to consider the value of your own diversity before passing on that offer.

March 29, 2013

Titan 0.3.0 Released

Filed under: Graphs,Networks,Titan — Patrick Durusau @ 3:38 pm

Titan 0.3.0 Released

From the webpage:

Titan 0.3.0 has been released and is ready for download. This release provides a complete performance-driven redesign of many core components. Furthermore, the primary outward facing feature is advanced indexing. The new indexing features are itemized below:

  • Geo: Search for elements using shape primitives within a 2D plane.
  • Full-text: Search elements for matching string and text properties.
  • Numeric range: Search for elements with numeric property values using intervals.
  • Edge: Edges can be indexed as well as vertices.
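As a rough picture of what a numeric range index does, here is a minimal sketch in Python. It says nothing about how Titan implements the feature. The class name, method names and element ids are hypothetical. The idea is only that keeping values sorted lets an interval query become two binary searches.

```python
from bisect import bisect_left, bisect_right


class NumericRangeIndex:
    """Toy index: parallel sorted lists of property values and element ids."""

    def __init__(self):
        self._values = []
        self._ids = []

    def add(self, element_id, value):
        # Insert at the position that keeps the values sorted.
        pos = bisect_right(self._values, value)
        self._values.insert(pos, value)
        self._ids.insert(pos, element_id)

    def search(self, low, high):
        """Return ids of elements whose value lies in the closed interval [low, high]."""
        start = bisect_left(self._values, low)
        end = bisect_right(self._values, high)
        return self._ids[start:end]


index = NumericRangeIndex()
index.add("v1", 30)
index.add("v2", 45)
index.add("e1", 12)  # edges can be indexed the same way as vertices
index.add("v3", 45)
print(index.search(30, 45))
```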

The Titan tutorial demonstrates the new capabilities.

This should keep you busy over the weekend!

March 10, 2013

Titan 0.3.0

Filed under: Graphs,Titan — Patrick Durusau @ 3:15 pm

Titan 0.3.0 (roadmap) by Matthias Broecheler.

From the post:

just wanted to share with you an update on the Titan roadmap. We re-prioritized a bunch of features and decided that it was about time to remove some technical debt in the Titan core module. This turned out into a major rewrite Titan’s internals which opened the door to adding some great new features. With that many changes, Titan 0.3.0 will be backwards incompatible, so we decided to do a 0.2.1 release first, which includes a bunch of bugfixes, the multi-module refactoring and other changes that we have added to master over the last two months. Titan 0.2.1-SNAPSHOT has been deployed to sonatype and will be released in two weeks.

Titan 0.3.0-SNAPSHOT currently lives in the “indexing” branch which indicates one of the major new features that will be coming in Titan 0.3.0: full-text indexing, numeric range indexing, and geospatial indexing for both vertices and edges. These advanced indexing capabilities are provided by ElasticSearch (http://www.elasticsearch.org/) and Lucene (http://lucene.apache.org/) which are now integrated into Titan and available as Titan modules. Similarly to storage backends, Titan abstract external indexes which allows it to interface with arbitrary indexing solutions. We chose Lucene for this initial release because its the most popular and most mature indexing system in the open source domain. Like BerkeleyDB, it is designed for single machine use. ElasticSearch is a fairly young but quickly maturing open source project build on top of Lucene that scales to multiple servers and is robust against failure. Hence, it is an ideal partner for Cassandra or Hbase.

….

Since a lot of people have asked for this feature, I thought you might want to take a look at Titan 0.3.0-SNAPSHOT and play around with it to give us some feedback on this new feature. Note, that Titan 0.3.0 is not yet stable as we are still tinkering with the interface and sorting out some hyper threading issues.

Other things that are new in 0.3.0:

  • use “unique” in type definitions to mark labels and keys as functional (i.e. unique(Direction.OUT)). That allows us to remove that mathematical “functional”.
  • complete rewrite of the caching engine which is now much better about caching vertex centric query results
  • better byte representation and lazy de-serialization for better performance
  • better query optimization and query rewriting for both vertex centric queries and global graph queries
  • Edge now longer extends Vertex. Access to unidirectional edges through get/setProperty
  • Properties on vertices can have properties on them (mind boggling…) which is very useful for version, timestamping, etc

The “properties on vertices can have properties on them” item reminds me of scope in topic maps.
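A minimal sketch of the analogy, in Python rather than Titan's API. The class names, the `meta` field and the `values` helper are all hypothetical. A property carries its own key/value pairs, and a lookup can be restricted to properties whose metadata matches a context, much as a scoped name in a topic map is valid only in that scope.

```python
from dataclasses import dataclass, field


@dataclass
class Property:
    key: str
    value: object
    meta: dict = field(default_factory=dict)  # properties on the property


@dataclass
class Vertex:
    properties: list = field(default_factory=list)

    def values(self, key, **scope):
        """Return values for key whose metadata matches every scope constraint."""
        return [
            p.value
            for p in self.properties
            if p.key == key and all(p.meta.get(k) == v for k, v in scope.items())
        ]


hero = Vertex()
hero.properties.append(Property("name", "Hercules", {"language": "la"}))
hero.properties.append(Property("name", "Herakles", {"language": "grc"}))

print(hero.values("name"))
print(hero.values("name", language="la"))
```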

March 8, 2013

Adding Value through graph analysis…

Filed under: Faunus,Graphs,Titan — Patrick Durusau @ 6:17 am

Adding Value through graph analysis using Titan and Faunus by Matthias Broecheler.

Alludes to a Titan 0.3.0 release, but the latest I saw at the Titan site was 0.2.0. Perhaps 0.3.0 will be along presently.

I don’t recall seeing Titan listed in the Literature Survey of Graph Databases so I have sent the author a note about including Titan in any updates to the survey.

BTW, I would not take the ages on slide 35 seriously. 😉

March 7, 2013

Distributed Graph Computing with Gremlin

Filed under: Distributed Systems,Faunus,Graph Databases,Graphs,Gremlin,Titan — Patrick Durusau @ 2:53 pm

Distributed Graph Computing with Gremlin by Marko A. Rodriguez.

From the post:

The script-step in Faunus’ Gremlin allows for the arbitrary execution of a Gremlin script against all vertices in the Faunus graph. This simple idea has interesting ramifications for Gremlin-based distributed graph computing. For instance, it is possible evaluate a Gremlin script on every vertex in the source graph (e.g. Titan) in parallel while maintaining data/process locality. This section will discuss the following two use cases.

  • Global graph mutations: parallel update vertices/edges in a Titan cluster given some arbitrary computation.
  • Global graph algorithms: propagate information to arbitrary depths in a Titan cluster in order to compute some algorithm in a parallel fashion.
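The second use case can be pictured with a small sketch in plain Python. This is not Gremlin or Faunus code and involves no cluster. The function name and the sample graph are assumptions. Each round applies the same vertex-local rule to every vertex, reading only the previous round's values, so the rounds could in principle run in parallel. The rule here propagates the smallest vertex id, which labels connected components once enough rounds have run.

```python
def propagate_min_label(adjacency, rounds):
    """Run a vertex-local rule on every vertex for a fixed number of rounds."""
    labels = {vertex: vertex for vertex in adjacency}
    for _ in range(rounds):
        # Build the new labels from the old snapshot only, as a parallel
        # execution would: no vertex sees a value updated in this round.
        labels = {
            vertex: min([labels[vertex]] + [labels[n] for n in neighbors])
            for vertex, neighbors in adjacency.items()
        }
    return labels


# Two components: 1-2-3 and 4-5 (undirected, so edges are listed both ways).
adjacency = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
print(propagate_min_label(adjacency, rounds=1))
print(propagate_min_label(adjacency, rounds=2))
```

Information travels one hop per round, so the number of rounds is the depth to which it propagates.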

Another must read post from Marko A. Rodriguez!

Also a reminder that I need to pull out my Oxford Classical Dictionary to add some material to the mythology graph.

December 26, 2012

Titan-Android

Filed under: Graphs,Gremlin,Networks,TinkerPop,Titan — Patrick Durusau @ 3:34 pm

Titan-Android by David Wu.

From the webpage:

Titan-Android is a port/fork of Titan for the Android platform. It is meant to be a light-weight implementation of a graph database on mobile devices. The port removes HBase and Cassandra support as their usage make little sense on a mobile device (convince me otherwise!). Gremlin is only supported via the Java interface as I have not been able to port groovy successfully. Nevertheless, Titan-Android supports local storage backend via BerkeleyDB and supports the Tinkerpop stack natively.

Just in case there was an Android under the tree!

I first saw this in a tweet by Marko A. Rodriguez.

December 13, 2012

Big Graph Data on Hortonworks Data Platform

Filed under: Aurelius Graph Cluster,Faunus,Gremlin,Hadoop,Hortonworks,Titan — Patrick Durusau @ 5:24 pm

Big Graph Data on Hortonworks Data Platform by Marko Rodriguez.

The Hortonworks Data Platform (HDP) conveniently integrates numerous Big Data tools in the Hadoop ecosystem. As such, it provides cluster-oriented storage, processing, monitoring, and data integration services. HDP simplifies the deployment and management of a production Hadoop-based system.

In Hadoop, data is represented as key/value pairs. In HBase, data is represented as a collection of wide rows. These atomic structures makes global data processing (via MapReduce) and row-specific reading/writing (via HBase) simple. However, writing queries is nontrivial if the data has a complex, interconnected structure that needs to be analyzed (see Hadoop joins and HBase joins). Without an appropriate abstraction layer, processing highly structured data is cumbersome. Indeed, choosing the right data representation and associated tools opens up otherwise unimaginable possibilities. One such data representation that naturally captures complex relationships is a graph (or network). This post presents Aurelius‘ Big Graph Data technology suite in concert with Hortonworks Data Platform. Moreover, for a real-world grounding, a GitHub clone is described in this context to help the reader understand how to use these technologies for building scalable, distributed, graph-based systems.

If you like graphs at all or have been looking at graph solutions, you are going to like this post.

December 3, 2012

Solving Problems with Graphs

Filed under: Faunus,Fulgora,Graphs,Titan — Patrick Durusau @ 3:20 pm

Solving Problems with Graphs by Marko A. Rodriguez.

Marko covers solving problems with graphs in general and then gives an overview of Titan (a distributed graph database), Faunus (graph analytic engine) and Fulgora (graph processor).

My only misgiving about graphs is that we know very little of the world’s data is stored in graph format. And that is unlikely to change in the foreseeable future. ETL will suffice to convert some data to obtain the advantages of graph processing, but what of data that isn’t converted?

Unlike the W3C, I have a high degree of confidence that the world is not going to adapt itself to any one solution or even a range of solutions.

The majority of data (from a current perspective) will be in “legacy” formats, the next largest portion in the successful formats just prior to the latest one, and the smallest portion in the latest proposed new format.

Big data should address the “not my format” problem in addition to running after large amounts of sensor data.

November 12, 2012

Faunus Provides Big Graph Data Analytics

Filed under: Faunus,Graphs,Titan — Patrick Durusau @ 8:52 pm

Faunus Provides Big Graph Data Analytics by Marko A. Rodriguez.

Marko walks through the processing of:

The DBpedia knowledge base currently describes 3.77 million things, out of which 2.35 million are classified in a consistent Ontology, including 764,000 persons, 573,000 places (including 387,000 populated places), 333,000 creative works (including 112,000 music albums, 72,000 films and 18,000 video games), 192,000 organizations (including 45,000 companies and 42,000 educational institutions), 202,000 species and 5,500 diseases.

In Titan with Faunus.

If you had any doubts about Faunus, walking through the processing of DBpedia should make you more confident.

November 11, 2012

Hermes

Filed under: Graphs,Hermes,Titan — Patrick Durusau @ 1:38 pm

Hermes by Zack Maril.

From the webpage:

A Clojure library designed to make it easy to work with embedded Titan graphs.

Check clojars for the latest jar. The best thing to do is set up lein checkouts and clone the library directly.

This is very much a work in progress. Titan is a young project and has yet to even hit 0.2. Hermes will probably always be a work in progress until Titan hits 1.0. So, use this in production at your own peril.

The best bet right now is to read the source code. We are still writing the docs and example projects.

If you are interested in graph databases, you should know about Titan graphs. If you don’t, correct that before reading about Hermes.

Avoid use in production but do consider contributing examples and documentation.

August 7, 2012

Titan Provides Real-Time Big Graph Data

Filed under: Amazon Web Services AWS,Graphs,Titan — Patrick Durusau @ 10:50 am

Titan Provides Real-Time Big Graph Data

From the post:

Titan is an Apache 2 licensed, distributed graph database capable of supporting tens of thousands of concurrent users reading and writing to a single massive-scale graph. In order to substantiate the aforementioned statement, this post presents empirical results of Titan backing a simulated social networking site undergoing transactional loads estimated at 50,000–100,000 concurrent users. These users are interacting with 40 m1.small Amazon EC2 servers which are transacting with a 6 machine Amazon EC2 cc1.4xl Titan/Cassandra cluster.

The presentation to follow discusses the simulation’s social graph structure, the types of processes executed on that structure, and the various runtime analyses of those processes under normal and peak load. The presentation concludes with a discussion of the Amazon EC2 cluster architecture used and the associated costs of running that architecture in a production environment. In short summary, Titan performs well under substantial load with a relatively inexpensive cluster and as such, is capable of backing online services requiring real-time Big Graph Data.

Fuller version of the information you will find at: Titan Stress Poster [Government Comparison Shopping?].

BTW, Titan is reported to emerge as 0.1 (from 0.1 alpha) later this (2012) summer.

July 13, 2012

Titan Stress Poster [Government Comparison Shopping?]

Filed under: Amazon Web Services AWS,Titan — Patrick Durusau @ 4:45 pm

Titan Stress Poster from Marko A. Rodriguez.

Notice of a poster at GraphLab 2012 with Matthias Broecheler:

This poster presents an overview of Titan along with some excellent stress testing done by Matthias and Dan LaRoque. The stress test uses a 6 machine Titan cluster with 14 read/write servers slamming Titan with various read/writes. The results are presented in terms of the number of bytes being read/write from disk, the average runtime of the queries, the cost of a transaction on Amazon EC2, and a speculation of the number of concurrent users are concurrently interacting.

Being a poster you will have to pump up the size for legibility but I think you will like the poster.

Impressive numbers. Including the Amazon EC2 cost.

Makes me wonder when governments are going to start requiring cost comparisons for system bids versus use of Amazon EC2?

June 14, 2012

Titan: The Rise of Big Graph Data [SLF4J conflicts] + Solution to Transaction issue

Filed under: BigData,Graphs,Titan — Patrick Durusau @ 3:14 pm

Titan: The Rise of Big Graph Data by Marko A. Rodriguez and Matthias Broecheler.

Description:

A graph is a data structure composed of vertices/dots and edges/lines. A graph database is a software system used to persist and process graphs. The common conception in today’s database community is that there is a tradeoff between the scale of data and the complexity/interlinking of data. To challenge this understanding, Aurelius has developed Titan under the liberal Apache 2 license. Titan supports both the size of modern data and the modeling power of graphs to usher in the era of Big Graph Data. Novel techniques in edge compression, data layout, and vertex-centric indices that exploit significant orders are used to facilitate the representation and processing of a single atomic graph…

Some minor corrections:

Correct: http://thinkaurelius/titan.zip to: https://github.com/thinkaurelius/titan (slide 109)

Correct: titan/ to: titan-0.1-alpha/

Another clash with Ontopia: multiple SLF4J bindings (recall that I had to put slf4j-log4j12-1.5.11.jar explicitly in my class path; see Ontopia 5.2.1 and the CLASSPATH). It clashes with /home/patrick/working/titan-0.1-alpha/lib/slf4j-log4j12-1.6.1.jar.

Fixed that, need a solution to easily switch classpaths.
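Short of a proper classpath switcher, a small check can at least show which entries are fighting. The sketch below is in Python and only inspects a classpath string. The regular expression covers a handful of common SLF4J binding jar names and is not exhaustive. The Ontopia install path in the example is made up.

```python
import posixpath
import re

# Matches jar names such as slf4j-log4j12-1.6.1.jar; slf4j-api is not a binding.
BINDING_PATTERN = re.compile(r"^slf4j-(log4j12|jdk14|simple|nop|jcl)-[\d.]+\.jar$")


def find_slf4j_bindings(classpath, separator=":"):
    """Return classpath entries whose file name looks like an SLF4J binding."""
    return [
        entry
        for entry in classpath.split(separator)
        if entry and BINDING_PATTERN.match(posixpath.basename(entry))
    ]


classpath = ":".join([
    "/opt/ontopia-5.2.1/lib/slf4j-log4j12-1.5.11.jar",  # hypothetical location
    "/opt/ontopia-5.2.1/lib/slf4j-api-1.5.11.jar",      # hypothetical location
    "/home/patrick/working/titan-0.1-alpha/lib/slf4j-log4j12-1.6.1.jar",
])

bindings = find_slf4j_bindings(classpath)
if len(bindings) > 1:
    print("Multiple SLF4J bindings on the classpath:")
    for entry in bindings:
        print("  " + entry)
```

Running a check like this before starting each tool would flag the clash before SLF4J does.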

It did start up after that fix.

I got to slide 116, trying to load ‘data/graph-of-the-gods.xml’, when Titan started returning error messages. Sent messages + stack trace to Marko.

Will report back when I find where I went wrong or the software has that bug fixed.

This is a very exciting project so I suggest that you take a look at it sooner rather than later.


Update: (I should get this sort of response time from commercial vendors):

From Matthias:

there is a bit of a transactional hickup in Titan/Blueprints right now. In Titan, every operation on the graph occurs in the context of a transaction. In Blueprints, calling startTransaction requires that no other transaction is currently running for that thread. loadGraphML calls “startTransaction”. Putting all of those together you get the exception below.

So, to get around it, you would have to call “stopTransaction(SUCCESS)” before loading the data. We should add that to the slides.

However, we are hoping that this situation is temporary, meaning having to call stopTransaction explicitly.

One proposal is to have startTransaction automatically “attach” to the previous transaction since this is completely acceptable behavior in most if not all situations. This is currently in the pipeline.

So, the slides read:

[Incorrect]
gremlin> g.createKeyIndex('name', Vertex.class)
==>null
gremlin> g.loadGraphML('data/graph-of-the-gods.xml')
==>null

Should read:

[Correct]
gremlin> g.createKeyIndex('name', Vertex.class)
==>null
gremlin> g.stopTransaction(SUCCESS)
==>null
gremlin> g.loadGraphML('data/graph-of-the-gods.xml')
==>null
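The behavior Matthias describes can be modeled with a toy class. This is a sketch in Python, not the Titan or Blueprints API: `ToyGraph`, its method names and the `attach_to_open` flag are all hypothetical. It shows why the load fails while an implicit transaction is still open, why stopping the transaction first fixes it, and how the proposed "attach" behavior would make the explicit stop unnecessary.

```python
class ToyGraph:
    """Toy stand-in for a graph with one transaction per thread."""

    def __init__(self, attach_to_open=False):
        self.in_transaction = False
        self.attach_to_open = attach_to_open  # models the proposed behavior
        self.loaded = []

    def start_transaction(self):
        if self.in_transaction:
            if self.attach_to_open:
                return  # reuse the transaction that is already running
            raise RuntimeError("a transaction is already running on this thread")
        self.in_transaction = True

    def stop_transaction(self):
        self.in_transaction = False

    def create_key_index(self, key):
        # Every operation runs inside a transaction, opened implicitly.
        self.in_transaction = True

    def load_graphml(self, path):
        # The loader insists on starting its own transaction.
        self.start_transaction()
        self.loaded.append(path)
        self.stop_transaction()


g = ToyGraph()
g.create_key_index("name")
try:
    g.load_graphml("data/graph-of-the-gods.xml")
except RuntimeError as error:
    print("load failed:", error)

g.stop_transaction()  # the workaround
g.load_graphml("data/graph-of-the-gods.xml")
print(g.loaded)
```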

The Getting Started page for Titan appears to be more accurate than the slides (at least so far). 😉

May 30, 2012

Titan

Filed under: Graph Databases,Graphs,Titan — Patrick Durusau @ 1:01 pm

Titan

Alpha Release Coming June 5, 2012

From the homepage:

Titan is a distributed graph database optimized for storing and processing large-scale graphs within a multi-machine cluster. The primary features of Titan are itemized below.

If the names Marko A. Rodriguez or Matthias Broecheler mean anything to you, June 5th can’t come soon enough!
