Archive for July, 2011

4th International SWAT4LS Workshop

Sunday, July 31st, 2011

4th International SWAT4LS Workshop Semantic Web Applications and Tools for Life Sciences

December 9th, 2011 London, UK

Important Dates:

  • Expression of interest for tutorials: 10 June 2011
  • Submission opening: 12 September 2011
  • Paper submission deadline: 7 October 2011
  • Posters and demo submission deadline: 31 October 2011
  • Communication of acceptance: 7 November 2011
  • Camera ready: 21 November 2011

From the Call for Papers:

Since 2008, the SWAT4LS workshop has provided a platform for presenting and discussing the benefits and limits of applying web-based information systems and semantic technologies in Biomedical Informatics and Computational Biology.

Growing steadily each year as Semantic Web applications become more widespread, SWAT4LS has been held in Edinburgh (2008), Amsterdam (2009), and Berlin (2010), with London planned for 2011. The last edition was held in Berlin on December 10th, 2010, preceded by two days of tutorials and other associated events.

We are confident that the next edition of SWAT4LS will provide the same open and stimulating environment that brought together researchers, both developers and users, from the various fields of Biology, Bioinformatics and Computer Science, to discuss goals, current limits and real experiences in the use of Semantic Web technologies in Life Sciences.

Proceedings from earlier workshops:

1st International SWAT4LS Workshop (Edinburgh, 2008)

2nd International SWAT4LS Workshop (Amsterdam, 2009) Be aware that selected papers were revised and extended to appear in the Journal of Biomedical Semantics, Volume 2, Supplement 1.

3rd International SWAT4LS Workshop (Berlin, 2010)

Take it as fair warning: there is a lot of interesting material here. Come prepared to stay a while.

Riak and Python – 2nd August 2011

Sunday, July 31st, 2011

Riak and Python

A free webinar sponsored by Basho.

Dates and times:

Tuesday, August 2, 2011 2:00 pm, Eastern Daylight Time (New York, GMT-04:00)
Tuesday, August 2, 2011 11:00 am, Pacific Daylight Time (San Francisco, GMT-07:00)
Tuesday, August 2, 2011 8:00 pm, Europe Summer Time (Berlin, GMT+02:00)


Building and Deploying a Simple Clone Using Riak, Luwak, and Riak Search

You know where to get Riak, Riak Search, and Luwak:

Journal of Biomedical Semantics

Sunday, July 31st, 2011

Journal of Biomedical Semantics

From the webpage:

Journal of Biomedical Semantics addresses issues of semantic enrichment and semantic processing in the biomedical domain. The scope of the journal covers two main areas:

Infrastructure for biomedical semantics: focusing on semantic resources and repositories, meta-data management and resource description, knowledge representation and semantic frameworks, the Biomedical Semantic Web, and semantic interoperability.

Semantic mining, annotation, and analysis: focusing on approaches and applications of semantic resources; and tools for investigation, reasoning, prediction, and discoveries in biomedicine.

As of 31 July 2011, here are the titles of the “latest” articles:

A shortest-path graph kernel for estimating gene product semantic similarity, by Alvarez MA, Qi X and Yan C. Journal of Biomedical Semantics 2011, 2:3 (29 July 2011)

Semantic validation of the use of SNOMED CT in HL7 clinical documents, by Heymans S, McKennirey M and Phillips J. Journal of Biomedical Semantics 2011, 2:2 (15 July 2011)

Protein interaction sentence detection using multiple semantic kernels, by Polajnar T, Damoulas T and Girolami M. Journal of Biomedical Semantics 2011, 2:1 (14 May 2011)

Foundations for a realist ontology of mental disease, by Ceusters W and Smith B. Journal of Biomedical Semantics 2010, 1:10 (9 December 2010)

Simple tricks for improving pattern-based information extraction from the biomedical literature, by Nguyen QL, Tikk D and Leser U. Journal of Biomedical Semantics 2010, 1:9 (24 September 2010)

The DBCLS BioHackathon: standardization and interoperability for bioinformatics web services and workflows, by Katayama T, Arakawa K, Nakao M, Ono K, Aoki-Kinoshita KF, Yamamoto Y, Yamaguchi A, Kawashima S et al. Journal of Biomedical Semantics 2010, 1:8 (21 August 2010)

Oh, did I mention this is an open access journal?

NoSQL NOW! August 23-25 San Jose

Sunday, July 31st, 2011

NoSQL NOW! August 23-25 San Jose

OK, it’s August in San Jose, CA and not at the DoubleTree Hotel (the usual Unicode conference site).

I think those are the only two negatives you can find about this conference!

Take a look at the program if you don’t want to take my word for it.

It isn’t clear if conference presentations will be posted and maintained as informal proceedings. This would be a good opportunity to start collecting that sort of thing.

Stig Database

Sunday, July 31st, 2011

Stig Database

From the webpage:

It’s a graph. Stig stores data in nodes, and stores the connections between data as edges between nodes. This makes it easy to do Kevin Bacon computations and social networking queries that look for the connections between people.

Functional query language. Stig has its own native language that is flexible and powerful, but at the same time, it can emulate SQL and other paradigms.

Distributed and scalable. Stig is horizontally sharded, meaning that you can add as many machines to the system as you want. Stig works whether you have one machine, a thousand machines, or a million machines.

Points of View. Since isolation often comes at the expense of concurrency, Stig implements a kind of data isolation called a point of view. Private (one person) and shared (some, but not all people) points of view do propagate out to the global (everyone) database over time, but for fast communication, they only change the data for the people who need to see it.

Time Travel. Stig keeps a history for everything it stores, so you don’t need to keep track of your changing data.

Durable sessions. Clients can disconnect and reconnect at will, while Stig operations continue running in the background.
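Stig's query language had not been published when this was posted, but the "Kevin Bacon computation" the page mentions is just a shortest-path search between two nodes. A minimal sketch over a plain adjacency-list graph in Python (the graph data and names are invented for illustration, not from Stig):

```python
from collections import deque

def degrees_of_separation(graph, start, goal):
    """Breadth-first search over an adjacency-list graph; returns the
    shortest chain of nodes connecting start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

# A toy social graph: edges are "appeared with" connections.
graph = {
    "Kevin Bacon": ["Tom Hanks"],
    "Tom Hanks": ["Kevin Bacon", "Meg Ryan"],
    "Meg Ryan": ["Tom Hanks", "Billy Crystal"],
    "Billy Crystal": ["Meg Ryan"],
}
print(degrees_of_separation(graph, "Kevin Bacon", "Billy Crystal"))
```

A graph database earns its keep by running this kind of traversal without loading the whole graph into application memory, but the query shape is the same.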

If you are going to be at NoSQL NOW!, August 23-25, 2011, San Jose, CA, be sure to catch Jason Lucas’ presentation Stig: Social Graphs & Discovery at Scale. The software, source code, documentation, etc. will be released on or around that date.

XQuery As Semantic Lens?

Saturday, July 30th, 2011

With Michael Kay thinking about tuples in XQuery, I started to wonder: could XQuery serve as a semantic lens?

I say that because in discussions of Linked Data for example, there is always the question of getting data sources to release their data as Linked Data and/or complaints about the nature or quality of the Linked Data released.

While it may not be true in all cases, my operating assumption is that a user wants only some small portion of data from any particular data source. If that data can be obtained and viewed as linked data, or with whatever annotations or additional properties are desired, why pester the data owner?

Or to take the other side, why should we be limited by the data owner’s imagination or views about the data? Our “view” of data is probably more valuable to us than its source in most (all?) cases.

I am sure there are cases where conversion or annotation of an entire data set makes analytic, economic or performance sense, assuming you have the resources to make the conversion.

But that won’t be the case for small groups or individuals who want to access large data stores. Being able to query for subsets of data that they can use creatively will be a real advantage for them.

Of course, I am interested in using XQuery to produce input for topic map engines and representing declarations of semantic equivalence.

Suggestions for views to use as examples?

BTW, when I posted about the XQuery/XPath drafts, I foolishly used the dated URLs. I should have used the latest version URLs. Unless you are tracing comments back to drafts or the history of the evolution of XQuery, the latest version is the one you want.

XQuery 3.0 – Latest Version Links

XQuery 3.0: An XML Query Language

XQueryX 3.0

XSLT and XQuery Serialization 3.0

XQuery 3.0 Use Cases

XQuery 3.0 Requirements


Saturday, July 30th, 2011


It was interesting to see XQuery 3.0 introduce tuple operations.

It is important to see Michael Kay start to talk about implementing tuple operations in Saxon.

I wonder what it would take to create a profile of XQuery 3.0 that introduces a semantic equivalence operator?

Hypertable Binary Packages

Saturday, July 30th, 2011

Hypertable Binary Packages (download)

New release of Hypertable!

Change notes.

GDB for the Data Driven Age (STI Summit Position Paper)

Saturday, July 30th, 2011

GDB for the Data Driven Age (STI Summit Position Paper) by Orri Erling.

From the post:

The Semantic Technology Institute (STI) is organizing a meeting around the questions of making semantic technology deliver on its promise. We were asked to present a position paper (reproduced below). This is another recap of our position on making graph databasing come of age. While the database technology matters are getting tackled, we are drawing closer to the question of deciding actually what kind of inference will be needed close to the data. My personal wish is to use this summit for clarifying exactly what is needed from the database in order to extract value from the data explosion. We have a good idea of what to do with queries but what is the exact requirement for transformation and alignment of schema and identifiers? What is the actual use case of inference, OWL or other, in this? It is time to get very concrete in terms of applications. We expect a mixed requirement but it is time to look closely at the details.

Interesting post that includes the following observation:

Real-world problems are however harder than just bundling properties, classes, or instances into sets of interchangeable equivalents, which is all we have mentioned thus far. There are differences of modeling (“address as many columns in customer table” vs. “address normalized away under a contact entity”), normalization (“first name” and “last name” as one or more properties; national conventions on person names; tags as comma-separated in a string or as a one-to-many), incomplete data (one customer table has family income bracket, the other does not), diversity in units of measurement (Imperial vs. metric), variability in the definition of units (seven different things all called blood pressure), variability in unit conversions (currency exchange rates), to name a few. What a world!

Yes, quite.

Worth a very close read.

Couchbase Server 2.0

Saturday, July 30th, 2011

Couchbase Releases Flagship NoSQL Database, Couchbase Server 2.0

From the release:

SAN FRANCISCO, Calif. – CouchConf San Francisco – July 29, 2011 – Couchbase, the leading NoSQL database company, today released a developer preview of Couchbase Server 2.0, the company’s high-performance, highly scalable, document-oriented NoSQL database. Couchbase Server 2.0 combines the unmatched elastic data management capabilities of Membase Server with the distributed indexing, querying and mobile synchronization capabilities of Apache CouchDB, the most widely deployed open source document database, to deliver the industry’s most powerful, bullet-proof NoSQL database technology.

The database world just gets more interesting with each passing day!

MongoDB Schema Design Basics

Friday, July 29th, 2011

MongoDB Schema Design Basics

From Alex Popescu’s myNoSQL:

For NoSQL databases there are no clear rules like the Boyce-Codd Normal Form database normalization. Data modeling and analysis of data access patterns are two fundamental activities. While over the last 2 years we’ve gathered some recipes, it’s always a good idea to check the recommended ways to model your data with your choice of NoSQL database.

After the break, watch 10gen’s Richard Kreuter’s presentation on MongoDB schema design.
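One design choice that comes up in nearly every MongoDB schema talk is embedding related documents versus referencing them. A hypothetical sketch of both shapes, using plain Python dicts rather than a live MongoDB connection (the blog/comments example is mine, not necessarily Kreuter's):

```python
# Embedded design: comments live inside the post document.
# One read fetches everything; good when comments are always
# shown with the post and the array stays reasonably small.
post_embedded = {
    "_id": 1,
    "title": "Schema Design Basics",
    "comments": [
        {"author": "alice", "text": "Nice talk"},
        {"author": "bob", "text": "+1"},
    ],
}

# Referenced design: comments are separate documents pointing
# back at the post; better when comments are unbounded or
# queried independently of their post.
post_ref = {"_id": 1, "title": "Schema Design Basics"}
comments = [
    {"_id": 10, "post_id": 1, "author": "alice", "text": "Nice talk"},
    {"_id": 11, "post_id": 1, "author": "bob", "text": "+1"},
]

def comments_for(post, comment_collection):
    """Simulate the second query the referenced design requires."""
    return [c for c in comment_collection if c["post_id"] == post["_id"]]

print(len(post_embedded["comments"]), len(comments_for(post_ref, comments)))
```

The trade-off is the usual one: the embedded form optimizes reads of the whole aggregate, the referenced form keeps documents bounded and independently addressable.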

A must see video!

hackathon this Friday (July 29, 2011)

Friday, July 29th, 2011

hackathon this Friday (July 29, 2011)

Drew Conway reports:

On Friday, July 29, 2011, will host its first ever open data/hack day event. As I am a New Yorker, I am very excited to be participating at the NYC satellite event, but I wanted to pass along this information to those of you who may not have seen it yet, or wish to participate at one of the other locations. Here is the pertinent information from the official announcement:

Apologies for the late notice, but I assume the data is still going to be available:

In March, we announced a new URL shortening service called automatically creates .gov URLs whenever you use bitly to shorten a URL that ends in .gov or .mil. We created this service to make it easy for people to know when a short URL will lead to official, and trustworthy, government information.

Data is created every time someone clicks on a link, which happens about 56,000 times each day. Together, these clicks show what government information people are sharing with their friends and networks. No one has ever had such a broad view of how government information is viewed and shared online.

Today, we’re excited to announce that all of the data created by clicks is freely available through the Developers page on We want as many people as possible to benefit from the insights we get from

Doesn’t 56,000 times a day sound a little low? I don’t doubt the numbers but I am curious about the lack of uptake.

Does anyone have numbers on the other URL shortening services for comparison?

Word Cloud in R

Friday, July 29th, 2011

Word Cloud in R

From the post:

A word cloud (or tag cloud) can be a handy tool when you need to highlight the most commonly cited words in a text using a quick visualization. Of course, you can use one of the several on-line services, such as wordle or tagxedo, which are very feature rich and have a nice GUI. Being an R enthusiast, I always wanted to produce this kind of image within R and now, thanks to Ian Fellows’ recently released wordcloud package, finally I can!

In order to test the package I retrieved the titles of the XKCD web comics included in my RXKCD package and produced a word cloud based on the titles’ word frequencies calculated using the powerful tm package for text mining (I know, it is like killing a fly with a bazooka!).

I don’t care for word clouds but some people find them very useful. They certainly are an option to consider when offering your users views into texts.

Follow the pointers in this article to some of the on-line services or tweak your own in R.
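The post works in R, but the frequency table that drives any word cloud is simple to compute. A rough Python equivalent of the preprocessing step (the sample text and stopword list here are made up; the actual plotting is omitted):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}

def word_frequencies(text, stopwords=STOPWORDS):
    """Tokenize, lowercase, drop stopwords, and count occurrences,
    roughly what the tm package does before wordcloud plots."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)

titles = "Duty Calls. Exploits of a Mom. Duty First. Compiling. Duty Free."
freq = word_frequencies(titles)
print(freq.most_common(2))
```

Feed the resulting counts to any renderer (wordcloud in R, or a Python plotting library) and the visualization is the easy part.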

State of HBase

Friday, July 29th, 2011

State of HBase by Michael Stack (StumbleUpon).

From the abstract:

Attendees will learn about the current state of the HBase project. We’ll review what the community is contributing, some of the more interesting production installs, killer apps on HBase, the on-again, off-again HBase+HDFS love affair, and what the near-future promises. A familiarity with BigTable concepts and Hadoop is presumed.

Catch the latest news on HBase!

AZOrange – Machine Learning for QSAR Modeling

Friday, July 29th, 2011

AZOrange – High performance open source machine learning for QSAR modeling in a graphical programming environment by Jonna C Stalring, Lars A Carlsson, Pedro Almeida and Scott Boyer. Journal of Cheminformatics 2011, 3:28. doi:10.1186/1758-2946-3-28

From the abstract:

Machine learning has a vast range of applications. In particular, advanced machine learning methods are routinely and increasingly used in quantitative structure activity relationship (QSAR) modeling. QSAR data sets often encompass tens of thousands of compounds and the size of proprietary, as well as public data sets, is rapidly growing. Hence, there is a demand for computationally efficient machine learning algorithms, easily available to researchers without extensive machine learning knowledge. In granting the scientific principles of transparency and reproducibility, Open Source solutions are increasingly acknowledged by regulatory authorities. Thus, an Open Source state-of-the-art high performance machine learning platform, interfacing multiple, customized machine learning algorithms for both graphical programming and scripting, to be used for large scale development of QSAR models of regulatory quality, is of great value to the QSAR community.

Project homepage: AZOrange (Packaged for Ubuntu; I assume it compiles and runs on other *nix platforms. I run Ubuntu, so I would need to set up another *nix distribution just for test purposes.)

MATLAB GPU / CUDA experiences

Thursday, July 28th, 2011

MATLAB GPU / CUDA experiences and tutorials on my laptop – Introduction

From the post:

These days it seems that you can’t talk about scientific computing for more than 5 minutes without someone bringing up the topic of Graphics Processing Units (GPUs). Originally designed to make computer games look pretty, GPUs are massively parallel processors that promise to revolutionise the way we compute.

A brief glance at the specification of a typical laptop suggests why GPUs are the new hotness in numerical computing. Take my new one for instance, a Dell XPS L702X, which comes with a Quad-Core Intel i7 Sandybridge processor running at up to 2.9Ghz and an NVidia GT 555M with a whopping 144 CUDA cores. If you went back in time a few years and told a younger version of me that I’d soon own a 148 core laptop then young Mike would be stunned. He’d also be wondering ‘What’s the catch?’

Parallel computing has been around for years but in the form of GPUs it has reached the hands of hackers and innovators. Will your next topic map application take advantage of parallel processing?

Neo4j 1.4 and Cypher

Thursday, July 28th, 2011

Neo4j 1.4 and Cypher

From the description:

Neo4j 1.4 has just been released and it’s chock full of new features for seasoned users and novices alike. At this meetup we’ll cover new features like auto-indexing, paged graph traversal and batch-orientation in the REST API. We’ll also dive a little deeper, revisiting the Doctor Who dataset and use the new Neo4j query language called “Cypher” to explore the universe in a humane, DBA-friendly manner. As always there’ll be plenty of scope for war stories and graph chit-chat, and a round of beers on the Neo4j team.

A new cache for Neo4j, part 1: The hash function

Thursday, July 28th, 2011

A new cache for Neo4j, part 1 : The hash function.

From the post:

Since my last post a lot of things have happened in Neo4j, some of which have made some of the material in this blog outdated. Imagine that! Just half a year later, the kernel crackdown (specifically, the storage layer walkthrough) is pretty much out of date. How awesome is that? My wonderful colleagues at Neo Technology however have shown me that staying still is not an option, so here I am, trying to make another post outdated. This time is the cache’s turn, which is about time it received an overhaul. Challenging the current implementation will start at an unexpected place (or is it?), the hash function.

Sounds like more improvements are in Neo4j’s future!

Indexing in Cassandra

Thursday, July 28th, 2011

Indexing in Cassandra by Ed Anuff.

As if you haven’t noticed by now, I have a real weakness for indexing and indexing related material.

Interesting coverage of composite indexes.
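Anuff's post is Cassandra-specific, but the core idea of a composite index (keys stored as sorted tuples, so all entries sharing a prefix are contiguous and can be range-scanned) fits in a few lines of Python. The data here is invented for illustration:

```python
from bisect import bisect_left, bisect_right

# A composite index maps (group, timestamp) tuples to row keys and
# keeps them sorted, so every entry for one group is contiguous.
index = sorted([
    (("sensors", 3), "row-c"),
    (("sensors", 1), "row-a"),
    (("alerts", 2), "row-x"),
    (("sensors", 2), "row-b"),
])

def scan_prefix(index, prefix):
    """Return row keys whose composite key starts with `prefix`,
    using binary search instead of a full scan."""
    keys = [k for k, _ in index]
    lo = bisect_left(keys, (prefix,))
    hi = bisect_right(keys, (prefix, float("inf")))
    return [row for _, row in index[lo:hi]]

print(scan_prefix(index, "sensors"))
```

Cassandra does essentially this on disk: the sort order of the composite column names is what makes prefix queries cheap.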

Another Word For It at #2,000

Thursday, July 28th, 2011

According to my blogging software this is my 2,000th post!

During the search for content and ideas for this blog I have thought a lot about topic maps and how to explain them.

Or should I say how to explain topic maps without inventing new terminologies or notations? 😉

Topic maps deal with a familiar problem:

People use different words when talking about the same subject and the same word when talking about different subjects.

Happens in conversations, newspapers, magazines, movies, videos, tv/radio, texts, and alas, electronic data.

The confusion caused by using different words for the same subject and same word for different subjects is a source of humor. (What does “nothing” stand for in Shakespeare’s “Much Ado About Nothing”?)

In searching electronic data, that confusion causes us to miss some data we want to find (different word for the same subject) and to find some data we don’t want (same word but different subject).
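Both failure modes are easy to reproduce with a naive keyword search (toy documents for illustration):

```python
docs = {
    1: "tuning the Java virtual machine",
    2: "coffee plantations on Java",            # same word, different subject
    3: "optimizing the JVM garbage collector",  # different words, same subject
}

def keyword_search(docs, term):
    """Naive substring match: no notion of subject, only of words."""
    return [doc_id for doc_id, text in docs.items() if term.lower() in text.lower()]

hits = keyword_search(docs, "Java")
print(hits)  # includes the island (false positive), misses the JVM doc (false negative)
```

A search for "Java" returns the coffee-plantation document we did not want and misses the JVM document we did.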

When searching old newspaper archives this can be amusing and/or annoying.

Potential outcomes of failure elsewhere:

  • medical literature: injury/death/liability
  • financial records: civil/criminal liability
  • patents: lost opportunities/infringement
  • business records: civil/criminal liability

Solving the problem of different words for the same subject and the same word but different subjects is important.

But how?

Topic maps and other solutions have one thing in common:

They use words to solve the problem of different words for the same subject and the same word but different subjects.


The usual battle cry is “if everyone uses my words, we can end semantic confusion, have meaningful interchange for commerce, research, cultural enlightenment and so on and so forth.”

I hate to be the bearer of bad news, but what about all the petabytes of data we already have on hand, with zettabytes of previous interpretations? With more being added every day and no universal solution in sight? (If you don’t like any of the current solutions, wait a few months and new proposals, schemas, vocabularies, etc., will surface. Or you can take the most popular approach and start your own.)

Proposals to deal with semantic confusion are also frozen in time and place. Unlike the human semantics they propose to sort out, they do not change and evolve.

We have to use the source of semantic difficulty, words, in crafting a solution and our solution has to evolve over time even as our semantics do.

That’s a tall order.

Part of the solution, if you want to call it that, is to recognize when the benefits of solving semantic confusion outweigh the cost of the solution. We don’t need to solve semantic confusion everywhere and anywhere it occurs. In some cases, perhaps rather large cases, it isn’t worth the effort.

That triage of semantic confusion allows us to concentrate on cases where the investment of time and effort are worthwhile. In searching for the Hilton Hotel in Paris I may get “hits” for someone with underwear control issues but so what? Is that really a problem that needs a solution?

On the other hand, being able to resolve semantic confusion, such as underlies different accounting systems for businesses, could give investors a clearer picture of the potential risks and benefits of particular investments. Or doing the same for financial institutions so that regulators can “look down” into regulated systems with some semantic coherence (without requiring identical systems).

Having chosen some semantic confusion to resolve, we then have to choose a method to resolve it.

One method, probably the most popular one, is the “use my (insert vocabulary)” method for resolving semantic confusion. Works and for some cases, may be all that you need. Databases with gigabyte size tables (and larger) operate quite well using this approach. Can become problematic after acquisitions when migration to other database systems is required. Undocumented semantics can prove to be costly in many situations.

Semantic Web techniques, leaving aside the fanciful notion of unique identifiers, do offer the capability of recording additional properties about terms or rather the subjects that terms represent. Problematically though, they don’t offer the capacity to specify which properties are required to distinguish one term from another.

No, I am not about to launch into a screed about why “my” system works better than all the others.

Recognition that all solutions are composed of semantic ambiguity is the most important lesson of the Topic Maps Reference Model (TMRM).

Keys (of key/value pairs) are pointers to subject representatives (proxies) and values may be such references. Other keys and/or values may point to other proxies that represent the same subjects. Which replicates the current dilemma.

The second important lesson of the TMRM is the use of legends to define what key/value pairs occur in a subject representative (proxy) and how to determine two or more proxies represent the same subject (subject identity).
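A small, entirely hypothetical sketch of those two lessons: proxies as sets of key/value pairs, and a legend as a rule that decides when two proxies represent the same subject. Nothing below is mandated by the TMRM; the keys, the legend, and the data are all my own invention:

```python
# Proxies: key/value pairs standing in for subjects. Different
# sources use different keys for the same information.
proxy_a = {"name": "W. Shakespeare", "born": 1564, "occupation": "playwright"}
proxy_b = {"fullName": "William Shakespeare", "born": 1564}
proxy_c = {"name": "John Milton", "born": 1608}

def legend(p, q):
    """One possible legend: same subject if birth years match and the
    surnames agree. Like any legend, it is a partial mapping with its
    own costs and blind spots."""
    name_p = p.get("name") or p.get("fullName", "")
    name_q = q.get("name") or q.get("fullName", "")
    surname = lambda n: n.split()[-1].lower()
    return p.get("born") == q.get("born") and surname(name_p) == surname(name_q)

print(legend(proxy_a, proxy_b), legend(proxy_a, proxy_c))
```

A different legend over the same proxies would merge differently; that is the point, the identity rule is explicit and swappable rather than baked into the data.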

Neither lesson ends semantic ambiguity, nor do they mandate any particular technology or methodology.

They do enable the creation and analysis of solutions, including legends, with an awareness they are all partial mappings, with costs and benefits.

I will continue the broad coverage of this blog on semantic issues but in the next 1,000 posts I will make a particular effort to cover:

  • Ex Parte Declaration of Legends for Data Sources (even using existing Linked Data where available)
  • Suggestions for explicit subject identity mapping in open source data integration software
  • Advances in graph algorithms
  • Sample topic maps using existing and proposed legends

Other suggestions?

Open Source Search Engines (comparison)

Wednesday, July 27th, 2011

Open Source Search Engines (comparison)

A comparison of ten (10) open source search engines.

Appears as an appendix to Modern Information Retrieval, second edition.

I probably don’t need yet another IR book.

But the first edition was well written; the second edition website includes teaching slides for all chapters, a nice set of pointers to additional resources, and a problems-and-solutions section (“under construction” as of 27 July 2011), all of which are things I like to encourage in authors.

OK, I talked myself into it, I am ordering a copy today. 😉

More comments to follow.

NoSQL @ Netflix, Part 2

Wednesday, July 27th, 2011

NoSQL @ Netflix, Part 2 by Sid Anand.

OSCON 2011 presentation.

I think the RDBMS Concepts to Key-Value Store Concepts was the best part of the slide deck.

What do you think?

Learning SPARQL

Wednesday, July 27th, 2011

Learning SPARQL by Bob DuCharme.

From the author’s announcement (email):

It’s the only complete book on the W3C standard query language for linked data and the semantic web, and as far as I know the only book at all that covers the full range of SPARQL 1.1 features such as the ability to update data. The book steps you through simple examples that can all be performed with free software, and all sample queries, data, and output are available on the book’s website.

In the words of one reviewer, “It’s excellent—very well organized and written, a completely painless read. I not only feel like I understand SPARQL now, but I have a much better idea why RDF is useful (I was a little skeptical before!)” I’d like to thank everyone who helped in the review process and everyone who offered to help, especially those in the Charlottesville/UVa tech community.

You can follow news about the book and about SPARQL on Twitter at @learningsparql.

Remembering Bob’s “SGML CD,” I ordered a copy (electronic and print) of “Learning SPARQL” as soon as I saw the announcement in my inbox.

More comments to follow.

Neo4j: Super-Nodes and Indexed Relationships

Wednesday, July 27th, 2011

Neo4j: Super-Nodes and Indexed Relationships by Aleksa Vukotic.

From the post:

As part of my recent work for Open Credo, I have been involved in a project that was using the Neo4J graph database for application storage.

Neo4J is one of the first graph databases to appear on the global market. Being open source, in addition to its power and simplicity in supporting the graph data model, it represents a good choice for a production-ready graph database.

However, there has been one area I have struggled to get good-enough performance from Neo4j recently – super nodes.

Super nodes represent nodes with dense relationships (100K or more), which are quickly becoming bottlenecks in graph traversal algorithms when using Neo4J. I have tried many different approaches to get around this problem, but introduction of auto indexing in Neo4j 1.4 gave me an idea that I had success with. The approach I took is to try to fetch relationships of the super nodes using Lucene indexes, instead of using standard Neo APIs. In this entry I’ll share what I managed to achieve and how.

This looks very promising, particularly the retrieval of only the relationships of interest for traversal. To me that suggests we can keep indexes of relationships that may not be frequently consulted. I wonder if that means a facility to “expose” more or fewer relationships as the situation requires?
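Vukotic's actual implementation uses Neo4j's auto-indexing and Lucene from Java. The underlying idea, fetching only the relationships you need from a secondary index rather than iterating all of a super node's edges, can be sketched with a plain dictionary (all names here are hypothetical):

```python
from collections import defaultdict

# Instead of scanning every relationship on a dense ("super") node,
# keep a secondary index keyed by (node, relationship_type) so a
# traversal fetches only the edges it cares about.
rel_index = defaultdict(list)

def add_relationship(src, rel_type, dst):
    rel_index[(src, rel_type)].append(dst)

# A super node with 100K FOLLOWS edges and two WORKS_WITH edges.
for i in range(100_000):
    add_relationship("super", "FOLLOWS", f"user-{i}")
add_relationship("super", "WORKS_WITH", "alice")
add_relationship("super", "WORKS_WITH", "bob")

# The traversal touches 2 index entries, not 100,002 relationships.
print(rel_index[("super", "WORKS_WITH")])
```

In Neo4j the index lookup goes to Lucene rather than an in-memory dict, but the access pattern, keyed retrieval instead of exhaustive iteration, is the same.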

The possibilities of Hadoop for Big Data (video)

Wednesday, July 27th, 2011

The possibilities of Hadoop for Big Data (video)

Start off your thinking about topic map advertising with something effective and amusing!

It’s short, attention getting, doesn’t over promise or bore with details.

Enjoy! (And think about a topic map ad.)

A machine learning toolbox for musician-computer interaction

Tuesday, July 26th, 2011

A machine learning toolbox for musician-computer interaction

From the abstract:

This paper presents the SARC EyesWeb Catalog (SEC), a machine learning toolbox that has been specifically developed for musician-computer interaction. The SEC features a large number of machine learning algorithms that can be used in real-time to recognise static postures, perform regression and classify multivariate temporal gestures. The algorithms within the toolbox have been designed to work with any N-dimensional signal and can be quickly trained with a small number of training examples. We also provide the motivation for the algorithms used for the recognition of musical gestures to achieve a low intra-personal generalisation error, as opposed to the inter-personal generalisation error that is more common in other areas of human-computer interaction.

Recorded at: 11th International Conference on New Interfaces for Musical Expression. 30 May – 1 June 2011, Oslo, Norway.

The paper: A machine learning toolbox for musician-computer interaction

The software: SARC EyesWeb Catalog [SEC]

Although written in the context of musician-computer interaction, the techniques described here could just as easily be applied to exploration or authoring of a topic map. Or for that matter exploring a data stream that is being presented to a user.

Imagine that one hand gives “focus” to some particular piece of data and the other hand “overlays” a query onto that data that then displays a portion of a topic map with that data as the organizing subject. Based on that result the data can be simply dumped back into the data stream or “saved” for further review and analysis.

…Advanced Graph-Analysis Algorithms on Very Large Graphs

Tuesday, July 26th, 2011

Enabling Rapid Development and Execution of Advanced Graph-Analysis Algorithms on Very Large Graphs by Aydin Buluc, John Gilbert, Adam Lugowski, and Steve Reinhardt.

Great overview of the Knowledge Discovery Toolkit project and its goals.

From the website:

The Knowledge Discovery Toolbox (KDT) provides domain experts with a simple interface to analyze very large graphs quickly and effectively without requiring knowledge of the underlying graph representation or algorithms. The current version provides a tiny selection of functions on directed graphs, from simple exploratory functions to complex algorithms. Because KDT is open-source, it can be customized or extended by interested (and intrepid) users.

The Graph 500 List

Tuesday, July 26th, 2011

The Graph 500 List

From the website:

Data intensive supercomputer applications are increasingly important for HPC workloads, but are ill-suited for platforms designed for 3D physics simulations. Current benchmarks and performance metrics do not provide useful information on the suitability of supercomputing systems for data intensive applications. A new set of benchmarks is needed in order to guide the design of hardware architectures and software systems intended to support such applications and to help procurements. Graph algorithms are a core part of many analytics workloads.

Backed by a steering committee of over 30 international HPC experts from academia, industry, and national laboratories, Graph 500 will establish a set of large-scale benchmarks for these applications. The Graph 500 steering committee is in the process of developing comprehensive benchmarks to address three application kernels: concurrent search, optimization (single source shortest path), and edge-oriented (maximal independent set). Further, we are in the process of addressing five graph-related business areas: Cybersecurity, Medical Informatics, Data Enrichment, Social Networks, and Symbolic Networks.

This is the first serious approach to complement the Top 500 with data intensive applications. Additionally, we are working with the SPEC committee to include our benchmark in their CPU benchmark suite. We anticipate the list will rotate between ISC and SC in future years.
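Of the three kernels, single-source shortest path is the easiest to show in miniature. A textbook Dijkstra sketch in Python (a toy graph; real Graph 500 runs operate on graphs with billions of edges and are benchmarked in traversed edges per second):

```python
import heapq

def sssp(graph, source):
    """Dijkstra's algorithm: shortest distances from `source` in a
    weighted adjacency-list graph, the optimization kernel in miniature."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, ()):
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2), ("d", 6)],
    "c": [("d", 3)],
}
print(sssp(graph, "a"))
```

What makes the benchmark hard is not the algorithm but its memory behavior: the irregular, data-dependent access pattern is exactly what 3D-physics-oriented machines are not built for.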


py2neo

Tuesday, July 26th, 2011

py2neo

From the webpage:

The py2neo project provides bindings between Python and Neo4j via its RESTful web service interface. It attempts to be both Pythonic and consistent with the core Neo4j API.

We’re currently looking for beta testers. If you can help, simply clone the source code from GitHub, use the API docs for reference and let us know how you get on!


Oozie by Example

Tuesday, July 26th, 2011

Oozie by Example

From the post:

In our previous article [Introduction to Oozie] we described the Oozie workflow server and presented an example of a very simple workflow. We also described deployment and configuration of workflows for Oozie and tools for starting, stopping and monitoring Oozie workflows.

In this article we will describe a more complex Oozie example, which will allow us to discuss more Oozie features and demonstrate how to use them.
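For readers who have not seen one, an Oozie workflow is an XML document describing actions and the transitions between them. A minimal skeleton (element names follow the Oozie 0.1 workflow schema; the parameter names and paths are placeholders, not from the article):

```xml
<workflow-app xmlns="uri:oozie:workflow:0.1" name="example-wf">
  <start to="mr-step"/>
  <action name="mr-step">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Map/Reduce failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

Each action names its success and failure transitions, which is how Oozie strings individual Hadoop jobs into a directed graph.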

More on workflow for Hadoop!