Archive for February, 2012


Sunday, February 26th, 2012


From the description:

A .NET client for the neo4j REST API. neo4j is an open sourced, Java based transactional graph database. It’s pretty awesome.


Sunday, February 26th, 2012


NuoDB is in private beta but the homepage describes it as:

NuoDB is a NewSQL database. It looks and behaves like a traditional SQL database from the outside but under the covers it’s a revolutionary database solution. It is a new class of database for a new class of datacenter.

A technical webinar dated 14 December 2011, at slide 5, had a couple of points that puzzled me.

I need to check references on some of them, but this one:

Zero DBA: No backups, minimal performance tuning, automated everything

seems a bit over the top.

Would anyone involved in the private beta care to comment on that claim?

Neo4j in a .Net World

Sunday, February 26th, 2012

Neo4j in a .Net World

From the description:

This month, Tatham Oddie will be coming from Australia to present at the Neo4j User Group on Neo4j with .NET, and will cover:

  • the Neo4j client we have built for .NET
  • hosting it all in Azure
  • why our queries were 200ms slower in the cloud, and how we fixed it

Tatham will present a case study, explaining:

  • what our project is
  • why we chose a graph db
  • how we modelled it to start with
  • how our first attempts at modelling were wrong
  • what we’re doing now

Where to Publish and Find Ontologies? A Survey of Ontology Libraries

Sunday, February 26th, 2012

Where to Publish and Find Ontologies? A Survey of Ontology Libraries by Natasha F. Noy and Mathieu d’Aquin.


One of the key promises of the Semantic Web is its potential to enable and facilitate data interoperability. The ability of data providers and application developers to share and reuse ontologies is a critical component of this data interoperability: if different applications and data sources use the same set of well defined terms for describing their domain and data, it will be much easier for them to “talk” to one another. Ontology libraries are the systems that collect ontologies from different sources and facilitate the tasks of finding, exploring, and using these ontologies. Thus ontology libraries can serve as a link in enabling diverse users and applications to discover, evaluate, use, and publish ontologies. In this paper, we provide a survey of the growing—and surprisingly diverse—landscape of ontology libraries. We highlight how the varying scope and intended use of the libraries affects their features, content, and potential exploitation in applications. From reviewing eleven ontology libraries, we identify a core set of questions that ontology practitioners and users should consider in choosing an ontology library for finding ontologies or publishing their own. We also discuss the research challenges that emerge from this survey, for the developers of ontology libraries to address.

Speaking of semantic colonialism, this survey is an accounting of the continuing failure of that program. The examples cited as “ontology libraries” are for the most part not interoperable with each other.

Not that I disagree that greater data interoperability would be a good thing; it would be a very good thing, for some issues. The problem, as I see it, is the fixation of the Semantic Web community on a winner-takes-all model of semantics. It could well be (warning, heresy ahead) that RDF and OWL aren’t the most effective ways to represent or “reason” about data. Just saying, no proof, formal or otherwise, to be offered.

And certainly there is a lack of data written using RDF (or even linked data) or annotated using OWL. I don’t think there is a good estimate of all available data, so it is difficult to give a good figure for exactly how little of the overall amount of data is in any of the Semantic Web formats.

Any new format will only be applied to the creation of new data so that will leave us with the ever increasing mountains of legacy data which lack the new format.

Rather than seeking to reduce semantic diversity, which appears to be a losing bet, we should explore mechanisms to manage semantic diversity.

Semantic Colonialism

Sunday, February 26th, 2012

Here is a good example of semantic colonialism: UM Linguist Studies the Anumeric Language of an Amazonian Tribe. Not obvious from the title, is it?

Two studies of the Piraha people of the Amazon, who lack words for numbers, produced different results when they were tested with simple numeric problems with more than three items. One set of results said they could perform them, the other, not.

The explanation for the difference?

The study provides a simple explanation for the controversy. Unbeknown to other researchers, the villagers that participated in one of the previous studies had received basic numerical training by Keren Madora, an American missionary that has worked with the indigenous people of the Amazon for 33 years, and co-author of this study. “Her knowledge of what had happened in that village was crucial. I understood then why they got the results that they did,” Everett says.

Madora used the Piraha language to create number words. For instance she used the words “all the sons of the hand,” to indicate the number four. The introduction of number words into the village provides a reasonable explanation for the disagreement in the previous studies.

If you think that the Piraha are “better off” having number words, put yourself down as a semantic colonialist.

You will have no reason to complain when terms used by Amazon, Google, Nike, Starbucks, etc., start to displace your native terminology.

Even less reason to complain if some Semantic Web ontology displaces yours in the race to become the common ontology for some subject area.

After all, one semantic colonialist is much like any other. (Ask any former/current colony if you don’t believe me.)


Saturday, February 25th, 2012


Another Neo4j challenge contender!

Lists foods that go well together.

I tried “rice,” but “red beans” did not come up. 🙁

I will have to add that tomorrow.

Apache Giraph

Saturday, February 25th, 2012

Apache Giraph

From the webpage:

Web and online social graphs have been rapidly growing in size and scale during the past decade. In 2008, Google estimated that the number of web pages reached over a trillion. Online social networking and email sites, including Yahoo!, Google, Microsoft, Facebook, LinkedIn, and Twitter, have hundreds of millions of users and are expected to grow much more in the future. Processing these graphs plays a big role in relevant and personalized information for users, such as results from a search engine or news in an online social networking site.

Graph processing platforms to run large-scale algorithms (such as page rank, shared connections, personalization-based popularity, etc.) have become quite popular. Some recent examples include Pregel and HaLoop. For general-purpose big data computation, the map-reduce computing model has been well adopted and the most deployed map-reduce infrastructure is Apache Hadoop. We have implemented a graph-processing framework that is launched as a typical Hadoop job to leverage existing Hadoop infrastructure, such as Amazon’s EC2. Giraph builds upon the graph-oriented nature of Pregel but additionally adds fault-tolerance to the coordinator process with the use of ZooKeeper as its centralized coordination service.

Giraph follows the bulk-synchronous parallel model relative to graphs where vertices can send messages to other vertices during a given superstep. Checkpoints are initiated by the Giraph infrastructure at user-defined intervals and are used for automatic application restarts when any worker in the application fails. Any worker in the application can act as the application coordinator and one will automatically take over if the current application coordinator fails.

Giraph 0.1-incubating released. (Feb. 6th, 2012)
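To make the superstep idea a bit more concrete, here is a minimal single-process sketch of the bulk-synchronous model in Python. It is not Giraph's API (Giraph runs as a Java job on Hadoop); the function name `run_supersteps`, the input format, and the "propagate the maximum value" task are all my assumptions, chosen only for illustration. Checkpointing and coordinator failover are left out entirely.

```python
from collections import defaultdict

def run_supersteps(edges, values, max_supersteps=50):
    """Propagate the maximum vertex value through a graph, BSP style.

    edges:  dict vertex -> list of neighbour vertices (hypothetical format)
    values: dict vertex -> initial integer value
    Returns (final values, number of supersteps executed).
    """
    values = dict(values)
    # Superstep 0: every vertex sends its value to its neighbours.
    inbox = defaultdict(list)
    for v, nbrs in edges.items():
        for n in nbrs:
            inbox[n].append(values[v])
    supersteps = 1
    while inbox and supersteps < max_supersteps:
        outbox = defaultdict(list)
        # Messages sent during the previous superstep are delivered now.
        for v, msgs in inbox.items():
            best = max(msgs)
            if best > values[v]:
                values[v] = best
                # Only vertices whose value changed send messages again.
                for n in edges.get(v, []):
                    outbox[n].append(best)
        inbox = outbox  # barrier: the next superstep sees only these messages
        supersteps += 1
    return values, supersteps

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final, steps = run_supersteps(graph, {"a": 3, "b": 6, "c": 2, "d": 1})
print(final, steps)
```

The loop ends when no vertex has anything left to say, which is roughly how a Pregel-style computation halts once every vertex has gone quiet.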

Another graph contender.

How many do you have on your system?

Tracking People in the News with Newsle

Saturday, February 25th, 2012

Tracking People in the News with Newsle by Angela Guess.

From the post:

Newsle, a web app that tracks people in the news, has released a new version featuring instant news alerts about users’ friends, colleagues, favorite public figures, or themselves. The startup also announced $600,000 in seed funding from Lerer Media Ventures, SV Angel, and an independent investor. According to the company website, “Newsle’s private beta launched in January 2011, and was covered by TechCrunch. The current version is a major evolution of the original concept. Newsle now combs the web continuously, analyzing over 1 million articles each day – every major news article and blog post published online, as well as most minor ones. Newsle’s core technology is its disambiguation algorithm, which determines whether an article mentioning “John Smith” is about the right person.”

If the disambiguation algorithm is accurate enough, perhaps Newsle can sell results to the U.S. government and other interested parties.

At least then the government (or even private corporations) would not have to reinvent the wheel. Not to mention having a line item in the budget against which to judge ROI.

If tracking all the members of U.S. college cheerleader teams in the news for some term of years has yielded no terrorist leads, time to shave that off of the data stream.

Video Search – Webmaster EDU

Saturday, February 25th, 2012

Video Search – Webmaster EDU

From the webpage:

In order to deliver search results, Google crawls the web and collects information about each piece of content. Often the best results are online videos and Google wants to help users find the most useful videos. Every day, millions of people find videos on Google search and we want them to be able to find your relevant video content.

Google supports markup for videos, along with alternate ways to make sure Google can index your videos.

It is a fairly coarse start but beats no information about your videos at all.

Videos that can be easily found are more likely to be incorporated in topic maps (and other finding aids).

On nonmetric similarity search problems in complex domains

Saturday, February 25th, 2012

On nonmetric similarity search problems in complex domains by Tomáš Skopal and Benjamin Bustos.


The task of similarity search is widely used in various areas of computing, including multimedia databases, data mining, bioinformatics, social networks, etc. In fact, retrieval of semantically unstructured data entities requires a form of aggregated qualification that selects entities relevant to a query. A popular type of such a mechanism is similarity querying. For a long time, the database-oriented applications of similarity search employed the definition of similarity restricted to metric distances. Due to its topological properties, metric similarity can be effectively used to index a database which can then be queried efficiently by so-called metric access methods. However, together with the increasing complexity of data entities across various domains, in recent years there appeared many similarities that were not metrics—we call them nonmetric similarity functions. In this article we survey domains employing nonmetric functions for effective similarity search, and methods for efficient nonmetric similarity search. First, we show that the ongoing research in many of these domains requires complex representations of data entities. Simultaneously, such complex representations allow us to model also complex and computationally expensive similarity functions (often represented by various matching algorithms). However, the more complex similarity function one develops, the more likely it will be a nonmetric. Second, we review state-of-the-art techniques for efficient (fast) nonmetric similarity search, concerning both exact and approximate search. Finally, we discuss some open problems and possible future research trends.
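A small illustration of what “nonmetric” means in practice: a metric must satisfy the triangle inequality, d(x, z) ≤ d(x, y) + d(y, z), and that is the property metric access methods lean on. The sketch below is mine, not the survey's. It checks the inequality by brute force over a handful of points, and the helper name `violates_triangle` is hypothetical. Squared Euclidean distance is a familiar dissimilarity that fails the test.

```python
from itertools import permutations

def violates_triangle(dist, points):
    """Return the first (x, y, z) with dist(x, z) > dist(x, y) + dist(y, z), or None."""
    for x, y, z in permutations(points, 3):
        if dist(x, z) > dist(x, y) + dist(y, z):
            return (x, y, z)
    return None

def euclidean(a, b):
    # Ordinary distance on the number line: a metric.
    return abs(a - b)

def squared_euclidean(a, b):
    # Squaring keeps symmetry and non-negativity but breaks the triangle inequality.
    return (a - b) ** 2

points = [0.0, 1.0, 2.0]
metric_witness = violates_triangle(euclidean, points)
nonmetric_witness = violates_triangle(squared_euclidean, points)
print(metric_witness, nonmetric_witness)
```

For the points 0, 1 and 2, the squared distance from 0 to 2 is 4, while going by way of 1 costs only 1 + 1 = 2, so an index that prunes using the triangle inequality could wrongly discard candidates.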

The first paragraph of the conclusion of this survey on nonmetric similarity is an argument for topic maps (or at least the result of using a topic map):

In this article, we have surveyed the current situation concerning the employment of nonmetric similarity functions for effective and efficient similarity search in complex domains. One of the main results of the article is a surprising revelation that nonmetric similarity measuring is widely used in isolated domains, spanning many areas of interdisciplinary research. This includes multimedia databases, time series, and medical, scientific, chemical, and bioinformatic tasks, among others. (emphasis added)

True enough, survey articles such as this one may tempt a few researchers and possibly graduate students to peek over the discipline walls, however briefly. But research articles need to routinely cite the literature of other disciplines, betraying a current awareness of other fields. To take advantage of advances in other fields as well as to serve as an example for the next generation of researchers.

XML data clustering: An overview

Saturday, February 25th, 2012

XML data clustering: An overview by Alsayed Algergawy, Marco Mesiti, Richi Nayak, and Gunter Saake.


In the last few years we have observed a proliferation of approaches for clustering XML documents and schemas based on their structure and content. The presence of such a huge amount of approaches is due to the different applications requiring the clustering of XML data. These applications need data in the form of similar contents, tags, paths, structures, and semantics. In this article, we first outline the application contexts in which clustering is useful, then we survey approaches so far proposed relying on the abstract representation of data (instances or schema), on the identified similarity measure, and on the clustering algorithm. In this presentation, we aim to draw a taxonomy in which the current approaches can be classified and compared. We aim at introducing an integrated view that is useful when comparing XML data clustering approaches, when developing a new clustering algorithm, and when implementing an XML clustering component. Finally, the article moves into the description of future trends and research issues that still need to be faced.

I thought this survey article would be of particular interest since it covers the syntax and semantics of XML that contains data.

Not to mention that our old friend, heterogeneous data, isn’t far behind:

Since XML data are engineered by different people, they often have different structural and terminological heterogeneities. The integration of heterogeneous data sources requires many tools for organizing and making their structure and content homogeneous. XML data integration is a complex activity that involves reconciliation at different levels: (1) at schema level, reconciling different representations of the same entity or property, and (2) at instance level, determining if different objects coming from different sources represent the same real-world entity. Moreover, the integration of Web data increases the integration process challenges in terms of heterogeneity of data. Such data come from different resources and it is quite hard to identify the relationship with the business subjects. Therefore, a first step in integrating XML data is to find clusters of the XML data that are similar in semantics and structure [Lee et al. 2002; Viyanon et al. 2008]. This allows system integrators to concentrate on XML data within each cluster. We remark that reconciling similar XML data is an easier task than reconciling XML data that are different in structures and semantics, since the later involves more restructuring. (emphasis added)

Two comments to bear in mind while reading this paper.

First, print out or photocopy Table II on page 35, “Features of XML Clustering Approaches.” It will be a handy reminder/guide as you read the coverage of the various techniques.

Second, on the last page, page 41, note that the article was accepted in October of 2009 but not published until October of 2011. It’s great that the ACM has an abundance of excellent survey articles but a two year delay in publication is unreasonable.

Surveys in rapidly developing fields are of most interest when they are timely. Electronic publication upon final acceptance should be the rule at an organization such as the ACM.

A Survey of Automatic Query Expansion in Information Retrieval

Saturday, February 25th, 2012

A Survey of Automatic Query Expansion in Information Retrieval by Claudio Carpineto, Giovanni Romano.


The relative ineffectiveness of information retrieval systems is largely caused by the inaccuracy with which a query formed by a few keywords models the actual user information need. One well known method to overcome this limitation is automatic query expansion (AQE), whereby the user’s original query is augmented by new features with a similar meaning. AQE has a long history in the information retrieval community but it is only in the last years that it has reached a level of scientific and experimental maturity, especially in laboratory settings such as TREC. This survey presents a unified view of a large number of recent approaches to AQE that leverage various data sources and employ very different principles and techniques. The following questions are addressed. Why is query expansion so important to improve search effectiveness? What are the main steps involved in the design and implementation of an AQE component? What approaches to AQE are available and how do they compare? Which issues must still be resolved before AQE becomes a standard component of large operational information retrieval systems (e.g., search engines)?

Have you heard topic maps described as being the solution to the following problem?

The most critical language issue for retrieval effectiveness is the term mismatch problem: the indexers and the users do often not use the same words. This is known as the vocabulary problem Furnas et al. [1987], compounded by synonymy (different words with the same or similar meanings, such as “tv” and “television”) and polysemy (same word with different meanings, such as “java”). Synonymy, together with word inflections (such as with plural forms, “television” versus “televisions”), may result in a failure to retrieve relevant documents, with a decrease in recall (the ability of the system to retrieve all relevant documents). Polysemy may cause retrieval of erroneous or irrelevant documents, thus implying a decrease in precision (the ability of the system to retrieve only relevant documents).

That sounds like the XWindows index merging problem doesn’t it? (Different terms being used by *nix vendors who wanted to use a common set of XWindows documentation.)

The authors describe the amount of data on the web searched with only one, two or three terms:

In this situation, the vocabulary problem has become even more serious because the paucity of query terms reduces the possibility of handling synonymy while the heterogeneity and size of data make the effects of polysemy more severe.

But the size of the data isn’t a given. What if a topic map with scoped names were used to delimit the sites searched using a particular identifier?

For example, a topic could have the name: “TRIM19” and a scope of: “” If you try a search with “TRIM19” at the scoping site, you get a very different result than if you use “TRIM19” with say “”

Try it, I’ll wait.

Now, imagine that your scoping topic on “TRIM19” isn’t just that one site but a topic that represents all the gene database sites known to you. I don’t know the number but it can’t be very large, at least when compared to the WWW.

That simple act of delimiting the range of your searches makes them far less subject to polysemy.

Not to mention that a topic map could be used to supply terms for use in automated query expansion.
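Here is a rough sketch in Python of how scoped names might drive both ideas, delimiting the sites searched and supplying expansion terms. Everything in it is hypothetical: the `TOPICS` table stands in for a real topic map, the `example.org` hosts are placeholders (not the sites mentioned above), and `expand_query` is a name I made up. The “PML” alias for TRIM19 is included only as an illustrative alternate name.

```python
# Hypothetical, hand-built data; a real topic map would supply these.
TOPICS = {
    "TRIM19": {
        "names": ["TRIM19", "PML"],  # alternate names, for illustration
        "scope": ["genes.example.org", "proteins.example.org"],  # placeholder sites
    },
}

def expand_query(term, topics=TOPICS):
    """Return one (site, query) pair per scoped site.

    Terms without a topic fall back to an unscoped, unexpanded query.
    """
    topic = topics.get(term)
    if topic is None:
        return [(None, term)]
    # Expansion: OR together every name the topic carries.
    query = " OR ".join(f'"{name}"' for name in topic["names"])
    # Scoping: run the expanded query only against the scoped sites.
    return [(site, query) for site in topic["scope"]]

plans = expand_query("TRIM19")
for site, query in plans:
    print(site, query)
```

The point of the design is that the scope does the work of cutting down polysemy while the names do the work of covering synonymy; neither needs a smarter search engine.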

BTW, the survey is quite interesting and deserves a slow read with follow up on the cited references.

FrostyMug – Beer Rating/Recommendation Service

Saturday, February 25th, 2012

Similarity-based Recommendation Engines by Josh Adell.

From the post:

I am currently participating in the Neo4j-Heroku Challenge. My entry is a — as yet, unfinished — beer rating and recommendation service called FrostyMug. All the major functionality is complete, except for the actual recommendations, which I am currently working on. I wanted to share some of my thoughts and methods for building the recommendation engine.

I hear “similarity” as a measure of subject identity: beers recommended to X; movies enjoyed by Y users, even though those are group subjects.

Or perhaps better, as a possible means of subject identity. A person could list all the movies they have enjoyed and that list be the same as a recommendation list. Same subject, just a different method of identification. (Unless the means of subject identification has an impact on the subject you think is being identified.)

Ontological Conjunctive Query Answering over large, semi-structured knowledge bases

Saturday, February 25th, 2012

Ontological Conjunctive Query Answering over large, semi-structured knowledge bases

From the description:

Ontological Conjunctive Query Answering knows today a renewed interest in knowledge systems that allow for expressive inferences. Most notably in the Semantic Web domain, this problem is known as Ontology-Based Data Access. The problem consists in, given a knowledge base with some factual knowledge (very often a relational database) and universal knowledge (ontology), to check if there is an answer to a conjunctive query in the knowledge base. This problem has been successfully studied in the past, however the emergence of large and semi-structured knowledge bases and the increasing interest on non-relational databases have slightly changed its nature.

This presentation will highlight the following aspects. First, we introduce the problem and the manner we have chosen to address it. We then discuss how the size of the knowledge base impacts our approach. In a second time, we introduce the ALASKA platform, a framework for performing knowledge representation & reasoning operations over heterogeneously stored data. Finally we present preliminary results obtained by comparing efficiency of existing storage systems when storing knowledge bases of different sizes on disk and future implications.
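For readers new to the problem statement above, a toy version in Python may help. It is a sketch under my own assumptions, not the ALASKA approach: facts are plain tuples, the “ontology” is just a table of subclass rules, and the names `saturate` and `answers` are hypothetical. It applies the rules until nothing new appears, then matches the conjunctive query against the enlarged fact set.

```python
def saturate(facts, subclass_of):
    """Apply 'every X is a Y' rules to unary facts until no new facts appear."""
    facts = set(facts)
    frontier = list(facts)
    while frontier:
        fact = frontier.pop()
        if len(fact) == 2 and fact[0] in subclass_of:
            derived = (subclass_of[fact[0]], fact[1])
            if derived not in facts:
                facts.add(derived)
                frontier.append(derived)
    return facts

def answers(query, facts, binding=None):
    """Yield variable bindings that satisfy every atom of the conjunctive query.

    Atoms are tuples (predicate, term, ...); terms starting with '?' are variables.
    """
    binding = binding or {}
    if not query:
        yield dict(binding)
        return
    first, rest = query[0], query[1:]
    for fact in facts:
        if len(fact) != len(first) or fact[0] != first[0]:
            continue
        new = dict(binding)
        ok = True
        for term, value in zip(first[1:], fact[1:]):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from answers(rest, facts, new)

facts = {("Professor", "ann"), ("teaches", "ann", "db101")}
ontology = {"Professor": "Person"}  # every Professor is a Person
query = [("Person", "?x"), ("teaches", "?x", "?c")]

without_ontology = list(answers(query, facts))
with_ontology = list(answers(query, saturate(facts, ontology)))
print(without_ontology, with_ontology)
```

The query has no answer over the bare facts because nothing says ann is a Person; the answer only appears once the universal knowledge has been applied.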

Slides help as always.

Introduces the ALASKA – Abstract Logic-based Architecture Storage systems & Knowledge base Analysis.

Its goal is to enable OCQA to be performed in a logical, generic manner over existing, heterogeneous storage systems.

“ALASKA” is the author’s first acronym.

The results for Oracle software (slide 25) make me suspect the testing protocol. Not that Oracle wins every contest by any means but such poor performance indicates some issue other than its native capabilities.

Graph Mining: Laws, Generators, and Algorithms

Friday, February 24th, 2012

Graph Mining: Laws, Generators, and Algorithms by Deepayan Chakrabarti and Christos Faloutsos.


How does the Web look? How could we tell an abnormal social network from a normal one? These and similar questions are important in many fields where the data can intuitively be cast as a graph; examples range from computer networks to sociology to biology and many more. Indeed, any M : N relation in database terminology can be represented as a graph. A lot of these questions boil down to the following: “How can we generate synthetic but realistic graphs?” To answer this, we must first understand what patterns are common in real-world graphs and can thus be considered a mark of normality/realism. This survey gives an overview of the incredible variety of work that has been done on these problems. One of our main contributions is the integration of points of view from physics, mathematics, sociology, and computer science. Further, we briefly describe recent advances on some related and interesting graph problems.

If any readers of this blog have doubts about the need for mappings between terminology in fields of research (topic maps), consider the authors remarking:

…we need to detect patterns in graphs and then generate synthetic graphs matching such patterns automatically.

This is a hard problem. What patterns should we look for? What do such patterns mean? How can we generate them? A lot of research ink has been spent on this problem, not only by computer scientists but also physicists, mathematicians, sociologists, and others. However, there is little interaction among these fields with the result that they often use different terminology and do not benefit from each other’s advances. In this survey, we attempt to give an overview of the main ideas. Our focus is on combining sources from all the different fields to gain a coherent picture of the current state-of-the-art. The interested reader is also referred to some excellent and entertaining books on the topic, namely, Barabási [2002], Watts [2003], and Dorogovtsev and Mendes [2003]. (emphasis added)

Extremely detailed survey with copious references. Dates from 2006.

Do you know of a later cross-field survey that updates this article?

This would make a good article for the Reading club on Graph databases and distributed systems.

A Discussion on the Design of Graph Database Benchmarks

Friday, February 24th, 2012

A Discussion on the Design of Graph Database Benchmarks by David Dominguez-Sal, Norbert Martinez-Bazan, Victor Muntes-Mulero, Pere Baleta, and Josep Lluis Larriba-Pey.


Graph Database Management systems (GDBs) are gaining popularity. They are used to analyze huge graph datasets that are naturally appearing in many application areas to model interrelated data. The objective of this paper is to raise a new topic of discussion in the benchmarking community and allow practitioners having a set of basic guidelines for GDB benchmarking. We strongly believe that GDBs will become an important player in the market field of data analysis, and with that, their performance and capabilities will also become important. For this reason, we discuss those aspects that are important from our perspective, i.e. the characteristics of the graphs to be included in the benchmark, the characteristics of the queries that are important in graph analysis applications and the evaluation workbench.

An in-depth discussion of graph benchmarks with pointers to additional literature. I found Table 1, “Graph Operations, Areas of Interest and Categorization,” particularly useful as a quick reference when exploring graph benchmark literature.

Having a ChuQL at XML on the Cloud

Friday, February 24th, 2012

Having a ChuQL at XML on the Cloud by Shahan Khatchadourian, Mariano P. Consens, and Jérôme Siméon.


MapReduce/Hadoop has gained acceptance as a framework to process, transform, integrate, and analyze massive amounts of Web data on the Cloud. The MapReduce model (simple, fault tolerant, data parallelism on elastic clouds of commodity servers) is also attractive for processing enterprise and scientific data. Despite XML ubiquity, there is yet little support for XML processing on top of MapReduce.

In this paper, we describe ChuQL, a MapReduce extension to XQuery, with its corresponding Hadoop implementation. The ChuQL language incorporates records to support the key/value data model of MapReduce, leverages higher-order functions to provide clean semantics, and exploits side-effects to fully expose to XQuery developers the Hadoop framework. The ChuQL implementation distributes computation to multiple XQuery engines, providing developers with an expressive language to describe tasks over big data.

The aggregation and co-grouping were the most interesting examples for me.

The description of ChuQL was a bit thin. Pointers to more resources would be appreciated.

Entity Matching for Semistructured Data in the Cloud

Friday, February 24th, 2012

Entity Matching for Semistructured Data in the Cloud by Marcus Paradies.

From the slides:

Main Idea

  • Use MapReduce and ChuQL to process semistructured data
  • Use a search-based blocking to generate candidate pairs
  • Apply similarity functions to candidate pairs within a block
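The last two bullets can be sketched in a few lines of plain Python. To be clear about the assumptions: this is not MapReduce and not ChuQL, the blocking key (first token of the title) is a deliberately crude stand-in for the search-based blocking in the slides, and the Jaccard measure, the 0.5 threshold, and the sample records are all my own choices for illustration.

```python
from collections import defaultdict
from itertools import combinations

def block_key(record):
    """Blocking key: first lowercase token of the title (an assumed, crude choice)."""
    tokens = record["title"].lower().split()
    return tokens[0] if tokens else ""

def jaccard(a, b):
    """Jaccard similarity of the token sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def match_entities(records, threshold=0.5):
    """Compare records only within their block; return id pairs at or above threshold."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    matches = []
    for members in blocks.values():
        # Candidate pairs come only from inside a block, never across blocks.
        for r1, r2 in combinations(members, 2):
            if jaccard(r1["title"], r2["title"]) >= threshold:
                matches.append((r1["id"], r2["id"]))
    return matches

records = [
    {"id": 1, "title": "Graph Mining Laws Generators and Algorithms"},
    {"id": 2, "title": "graph mining laws generators and algorithms a survey"},
    {"id": 3, "title": "Graph Database Benchmarks"},
    {"id": 4, "title": "XML data clustering"},
]
matches = match_entities(records)
print(matches)
```

Blocking is what keeps the pairwise comparison affordable: record 4 lands in its own block and is never compared with anything, which is the whole point.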

Uses two of my favorite sources, CiteSeer and Wikipedia.

Looks like the start of an authoring stage of a topic map workflow to me. You?

16 Degrees of the WWW (Graph Database Benchmark)

Friday, February 24th, 2012

I was surprised to learn from Challenges in the Design of a Graph Database Benchmark by Marcus Paradies that 16 is the minimum number of hops between 97% of the nodes on the WWW. (Slide 20)

Kevin Bacon is only 6 degrees away. That should make you curious about the WWW as a graph if nothing else does.

You will also need: Slides – Challenges in the Design of a Graph Database Benchmark (The video is good but not good enough to capture the slide details.)

From the description:

Graph databases are one of the leading drivers in the emerging, highly heterogeneous landscape of database management systems for non-relational data management and processing. The recent interest and success of graph databases arises mainly from the growing interest in social media analysis and the exploration and mining of relationships in social media data. However, with a graph-based model as a very flexible underlying data model, a graph database can serve a large variety of scenarios from different domains such as travel planning, supply chain management and package routing.

During the past months, many vendors have designed and implemented solutions to satisfy the need to efficiently store, manage and query graph data. However, the solutions are very diverse in terms of the supported graph data model, supported query languages, and APIs. With a growing number of vendors offering graph processing and graph management functionality, there is also an increased need to compare the solutions on a functional level as well as on a performance level with the help of benchmarks. Graph database benchmarking is a challenging task. Already existing graph database benchmarks are limited in their functionality and portability to different graph-based data models and different application domains. Existing benchmarks and the supported workloads are typically based on a proprietary query language and on a specific graph-based data model derived from the mathematical notion of a graph. The variety and lack of standardization with respect to the logical representation of graph data and the retrieval of graph data make it hard to define a portable graph database benchmark. In this talk, we present a proposal and design guideline for a graph database benchmark. Typically, a database benchmark consists of a synthetically generated data set of varying size and varying characteristics and a workload driver. In order to generate graph data sets, we present parameters from graph theory, which influence the characteristics of the generated graph data set. Following, the workload driver issues a set of queries against a well-defined interface of the graph database and gathers relevant performance numbers. We propose a set of performance measures to determine the response time behavior on different workloads and also initial suggestions for typical workloads in graph data scenarios. Our main objective of this session is to open the discussion on graph database benchmarking. 
We believe that there is a need for a common understanding of different workloads for graph processing from different domains and the definition of a common subset of core graph functionality in order to provide a general-purpose graph database benchmark. We encourage vendors to participate and to contribute with their domain-dependent knowledge and to define a graph database benchmark proposal.
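The two pieces the description names, a synthetic data generator and a workload driver, can be mocked up with the Python standard library. This is only a sketch under my own assumptions: the graph lives in memory as adjacency sets rather than in any graph database, the single “query” is a breadth-first hop count, and the names `generate_graph`, `hops`, and `run_workload` are hypothetical.

```python
import random
import time
from collections import deque

def generate_graph(num_nodes, avg_degree, seed=0):
    """Synthetic undirected graph as adjacency sets.

    Size and density are the knobs; self-loops and duplicate edges are
    dropped, so the realized average degree is only approximate.
    """
    rng = random.Random(seed)
    adj = {n: set() for n in range(num_nodes)}
    for _ in range(num_nodes * avg_degree // 2):
        a, b = rng.randrange(num_nodes), rng.randrange(num_nodes)
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def hops(adj, source, target):
    """Breadth-first search; returns the hop count, or None if unreachable."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def run_workload(adj, num_queries, seed=1):
    """Workload driver: issue random hop queries and record per-query latency."""
    rng = random.Random(seed)
    nodes = list(adj)
    latencies = []
    for _ in range(num_queries):
        s, t = rng.choice(nodes), rng.choice(nodes)
        start = time.perf_counter()
        hops(adj, s, t)
        latencies.append(time.perf_counter() - start)
    return latencies

graph = generate_graph(num_nodes=1000, avg_degree=6)
latencies = run_workload(graph, num_queries=100)
print(f"mean latency: {sum(latencies) / len(latencies):.6f}s")
```

Swapping `hops` for calls against a real graph database interface is where the portability problems the talk describes would start to show up.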

What do you think of focusing benchmark efforts on a simple property graph model? (Slide 27)

Perhaps not a bad starting place but I would prefer a roadmap that includes multi-graphs and hypergraphs.
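
To make the difference concrete, here is a minimal Python sketch of why a benchmark built on binary edges leaves hypergraphs out. Everything in it (`Edge`, `Graph`, `reify_hyperedge`, the "member" label) is hypothetical and chosen for illustration; it is not taken from the slides or from any product.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass(frozen=True)
class Edge:
    """A binary, directed, labeled edge."""
    edge_id: int
    source: str
    target: str
    label: str


@dataclass
class Graph:
    """Binary edges keyed by id, so parallel edges (a multigraph) are allowed."""
    edges: dict = field(default_factory=dict)
    properties: dict = field(default_factory=dict)  # edge id -> property dict
    _ids: count = field(default_factory=count)

    def add_edge(self, source, target, label, **props):
        edge = Edge(next(self._ids), source, target, label)
        self.edges[edge.edge_id] = edge
        self.properties[edge.edge_id] = props
        return edge

    def edges_between(self, source, target):
        return [e for e in self.edges.values()
                if e.source == source and e.target == target]


def reify_hyperedge(graph, label, members):
    """Emulate one hyperedge over `members` with a connector vertex.

    A binary-edge model cannot store an n-ary relationship directly, so
    each member gets its own edge from a synthetic vertex that stands for
    the relationship.
    """
    connector = f"{label}#{len(graph.edges)}"
    return [graph.add_edge(connector, member, "member")
            for member in sorted(members)]
```

Keying edges by id rather than by (source, target) is what permits parallel edges. The hyperedge, by contrast, only survives as a workaround: one relationship among n vertices becomes n binary edges plus an invented vertex, so a benchmark query written against the binary model measures the workaround and not a native hypergraph operation.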

Cassandra Europe! Wednesday March 28 – London

Friday, February 24th, 2012

Cassandra Europe! Wednesday March 28 – London

From the announcement:

Acunu is proud to announce the first Apache Cassandra Europe Conference in London on March 28. This is a packed one-day event with two tracks – ‘Case Studies’ and ‘Cassandra 101 – Beat the learning curve’. Get your early bird ticket!

Who should attend?

If you’re using Cassandra and looking for better support or performance tips, or if you’re wondering what all the fuss is about and want to learn more, you should attend!

Experts from Acunu will be on hand to share insights and we’ll have a drop-in room where attendees can turn up for help and advice with Cassandra problems.

We’ll be tweeting with hashtag #cassandraeu. For any comments or questions, contact Konrad Kennedy.

Sign up – win an iPad2!

An iPad2 drawing isn’t enough to get me to London but a Cassandra conference could tip the balance. How about you?

Social Media & the FBI

Friday, February 24th, 2012

I pointed to the FBI RFI on Social Media mining innocently enough, before the privacy advocates got into full voice.

Your privacy isn’t in any danger from this proposal from the FBI.

Yes, it talks about mining social media but it also says its objectives are:

  • Provide a user defined operations pictures (UDOP) that are flexible to support a myriad of functional FBI missions. Examples include but are not limited to: Reconnaissance & Surveillance, NSSE Planning, NSSE Operations, SIOC Operations, Counter Intelligence, Terrorism, Cybercrime, etc.
  • To improve the FBI SIOC’s open source intelligence collection capabilities by establishing a robust open source platform that has the flexibility to change search parameters and geo-locate the search based on breaking events or emerging threats.
  • Improve and accelerate the speed by which the FBI SIOC is alerted, vetted and notified of breaking events and emerging threats to more effectively notify the appropriate FO, LEGAT or OGA. (push vs. pull)
  • Provide FBI Executive Management with enhanced strategic, operational and tactical information for improved decision making
  • Empower the FBI SIOC with rapid self-service application to quickly adjust open source “search” parameters to a breaking event, crisis, and emerging threats.

Do you wonder what they mean by “open source?” Or do they intend to contract for “open source” in the Apache sense for do-it-yourself spyware?

The “…include but are not limited to: Reconnaissance & Surveillance, NSSE Planning, NSSE Operations, SIOC Operations, Counter Intelligence, Terrorism, Cybercrime, etc.” reminds me of the more than 700,000 lines of code from the Virtual Case File project at the FBI.

The objective that makes me feel safe is: “Provide FBI Executive Management with enhanced strategic, operational and tactical information for improved decision making”

Does that help you realize this set of “objectives” was written by some FBI executive leafing through Wired magazine and just jotting down words and phrases?

I am sure there are some cutting-edge applications that could be developed for the FBI that would further its legitimate mission(s).

But unless and until the requirements for those applications are developed by, for and with the FBI personnel actively performing those missions, prior to seeking input from vendors, this is just another $170 Million rat-hole.

To be very clear, requirements should be developed by parties who have no interest in the final contract or services.

From Graph (batch) processing towards a distributed graph data base

Friday, February 24th, 2012

From Graph (batch) processing towards a distributed graph data base by René Pickhardt.

From the post:

Yesterday’s meeting of the reading club was quite nice. We all agreed that the papers were of good quality and we gained some nice insights. The only drawback of the papers was that they did not directly tell us how to achieve our goal for a real time distributed graph data base technology. In the readings for next meeting (which will take place Wednesday March 7th 2pm CET) we tried to choose papers that don’t discuss these distributed graph / data processing techniques but focus more on speed or point out the general challenges in parallel graph processing.

Reading list for next meeting (Wednesday March 7th 2pm CET)

Again, while reading and preparing stuff, feel free to add more reading wishes to the comments of this blog post or drop me a mail!

That’s two weeks from yesterday: Wednesday March 7th 2pm CET.

Effective #hashtags

Friday, February 24th, 2012

I ran across trying to be more effective with Twitter posts about topic maps.

If I post on the graph database Neo4j, which hashtag should I use: #neo4j or #Neo4j?

In efforts to communicate, saying that listeners “need to be educated,” a phrase from #semanticweb (SemanticWeb, semanticWeb?) circles, is a poor strategy. If your goal is to communicate.

Speakers should use words and phrases listeners are likely to understand.

For Twitter, that means using the most common hashtags for any given subject.

Otherwise you are sexting with different key combinations than everyone else.

Has to be frustrating. 😉

Oh, the numbers:

Variants: neo4j: 79%, Neo4j: 21%

Variants: semanticweb: 66%, SemanticWeb: 31%, semanticWeb: 3%

With some research you can improve your Twitter communication skills.
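
If you have a sample of tweets on hand, variant percentages like the ones above are easy to compute yourself. A minimal Python sketch follows; the function name `variant_shares` and the sample tweets are made up for illustration, and how you collect the tweets is left open.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")


def variant_shares(tweets, tag):
    """Return {spelling: percent} for the case variants of `tag` in tweets."""
    wanted = tag.lstrip("#").lower()
    counts = Counter(
        found
        for tweet in tweets
        for found in HASHTAG.findall(tweet)
        if found.lower() == wanted
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {spelling: round(100 * n / total)
            for spelling, n in counts.most_common()}


if __name__ == "__main__":
    # Invented sample data, not real tweets.
    sample = [
        "Trying out #neo4j this weekend",
        "New #Neo4j release notes",
        "#neo4j and #graphdb meetup",
    ]
    print(variant_shares(sample, "#neo4j"))
```

The comparison is case-insensitive but the counts keep the spelling as typed, which is the whole point: the most common spelling is the one to use.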

A Well-Woven Study of Graphs, Brains, and Gremlins

Friday, February 24th, 2012

A Well-Woven Study of Graphs, Brains, and Gremlins by Marko Rodriguez.

From the post:

What do graphs and brains have in common? First, they both share a relatively similar structure: Vertices/neurons are connected to each other by edges/axons. Second, they both share a similar process: traversers/action potentials propagate to effect some computation that is a function of the topology of the structure. If there exists a mapping between two domains, then it is possible to apply the processes of one domain (the brain) to the structure of the other (the graph). The purpose of this post is to explore the application of neural algorithms to graph systems.

Entertaining and informative post by Marko Rodriguez comparing graphs, brains and the graph query language Gremlin.

I agree with Marko on the potential of graphs but am less certain than I read him to be on how well we understand the brain. Both the brain and graphs have many dark areas yet to be explored. As we shine new light on one place, more unknown places are just beyond the reach of our light.
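
For readers who want a feel for “traversers/action potentials” propagating over a graph, here is a small spreading-activation sketch in Python. It is not Marko’s code and not Gremlin; the function name, the even split of energy among successors, and the `decay` and `threshold` parameters are my own assumptions for illustration.

```python
def spread_activation(graph, sources, steps, decay=0.5, threshold=0.01):
    """Propagate energy from source vertices along outgoing edges.

    graph maps each vertex to a list of successor vertices. At every step
    an active vertex splits its energy, scaled by `decay`, evenly among its
    successors; shares below `threshold` are dropped. Returns the total
    energy each vertex has received.
    """
    total = {vertex: 0.0 for vertex in graph}
    current = {vertex: 1.0 for vertex in sources}
    for vertex, energy in current.items():
        total[vertex] = total.get(vertex, 0.0) + energy
    for _ in range(steps):
        following = {}
        for vertex, energy in current.items():
            successors = graph.get(vertex, [])
            if not successors:
                continue
            share = energy * decay / len(successors)
            if share < threshold:
                continue
            for successor in successors:
                following[successor] = following.get(successor, 0.0) + share
        for vertex, energy in following.items():
            total[vertex] = total.get(vertex, 0.0) + energy
        current = following
    return total
```

The result depends only on the topology and the starting vertices, which is the sense in which the computation is “a function of the topology of the structure.”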

HyperDex: A Distributed, Searchable Key-Value Store for Cloud Computing

Thursday, February 23rd, 2012

HyperDex: A Distributed, Searchable Key-Value Store for Cloud Computing by Robert Escriva, Bernard Wong and Emin Gün Sirer.


Distributed key-value stores are now a standard component of high-performance web services and cloud computing applications. While key-value stores offer significant performance and scalability advantages compared to traditional databases, they achieve these properties through a restricted API that limits object retrieval—an object can only be retrieved by the (primary and only) key under which it was inserted. This paper presents HyperDex, a novel distributed key-value store that provides a unique search primitive that enables queries on secondary attributes. The key insight behind HyperDex is the concept of hyperspace hashing in which objects with multiple attributes are mapped into a multidimensional hyperspace. This mapping leads to efficient implementations not only for retrieval by primary key, but also for partially-specified secondary attribute searches and range queries. A novel chaining protocol enables the system to provide strong consistency guarantees while supporting replication. An evaluation of the full system shows that HyperDex is orders of magnitude faster than Cassandra and MongoDB for finding partially specified objects. Additionally, HyperDex achieves high performance for simple get/put operations compared to current state-of-the-art key-value stores, with stronger fault tolerance and comparable scalability properties.
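
The idea of hyperspace hashing can be sketched in a few lines of Python. This is my own toy reading of the abstract, not HyperDex’s implementation: the grid of four cells per axis, the SHA-256 based `axis_cell`, and the function names are all assumptions, and because the hash discards ordering the sketch covers equality searches only, not the range queries the paper describes.

```python
import hashlib
from itertools import product

CELLS_PER_AXIS = 4  # assumption: each attribute axis is cut into 4 intervals


def axis_cell(value, cells=CELLS_PER_AXIS):
    """Hash one attribute value to a cell index along its axis."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % cells


def region_of(obj, attributes):
    """Map an object to the region (one cell index per attribute) storing it."""
    return tuple(axis_cell(obj[name]) for name in attributes)


def regions_for_search(query, attributes, cells=CELLS_PER_AXIS):
    """Regions a partially specified search must contact.

    A specified attribute pins its axis to a single cell; an unspecified
    attribute leaves the whole axis open.
    """
    axes = [
        [axis_cell(query[name], cells)] if name in query else range(cells)
        for name in attributes
    ]
    return set(product(*axes))
```

With three attributes there are 4 × 4 × 4 = 64 regions. A search that specifies one attribute touches 16 of them, two attributes 4, and a fully specified object exactly 1, so every extra attribute in the query narrows the set of servers to ask.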

This paper merited a separate posting from the software.

Among many interesting points was the following one from the introduction:

A naive Euclidean space construction, however, can suffer from the “curse of dimensionality,” as the space exhibits an exponential increase in volume with each additional secondary attribute [8]. For objects with many attributes, the resulting Euclidean space would be large, and consequently, sparse. Nodes would then be responsible for large regions in the hyperspace, which would increase the number of nodes whose regions intersect search hyperplanes and thus limit the effectiveness of the basic approach. HyperDex addresses this problem by introducing an efficient and lightweight mechanism that partitions the data into smaller, limited-size sub-spaces, where each subspace covers a subset of object attributes in a lower dimensional hyperspace. Thus, by folding the hyperspace back into a lower number of dimensions, HyperDex can ensure higher node selectivity during searches.
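
A back-of-the-envelope model shows why the subspaces help. The Python sketch below is my own simplification, not the paper’s analysis: it assumes a uniform grid in which a fixed number of nodes is split evenly across the axes, that every subspace is spread over all the nodes, and that a search uses the single subspace covering most of its specified attributes. The function names are hypothetical.

```python
def nodes_contacted(num_nodes, dims, specified):
    """Nodes whose regions intersect a search in a uniform grid.

    Assumes num_nodes regions split evenly over `dims` axes, so each axis
    has num_nodes ** (1 / dims) intervals and every specified attribute
    pins one axis to a single interval.
    """
    if not 0 <= specified <= dims:
        raise ValueError("specified must be between 0 and dims")
    return num_nodes ** ((dims - specified) / dims)


def best_subspace(num_nodes, subspaces, query_attributes):
    """Pick the subspace for which the search contacts the fewest nodes."""
    query = set(query_attributes)
    return min(
        (nodes_contacted(num_nodes, len(space), len(query & set(space))),
         tuple(space))
        for space in subspaces
    )
```

Under these assumptions, with 1024 nodes and ten attributes in one hyperspace, a search on two attributes contacts about 256 nodes. Split the same ten attributes into five two-dimensional subspaces and a search on two attributes that share a subspace contacts about 1 node, or about 32 if they fall in different subspaces.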

Something keeps nagging at me about the use of the term Euclidean space. Since a Euclidean space is a metric space, I “get” how they can partition metric data into smaller sub-spaces.

Names don’t exist in metric spaces but sort orders and frequencies are known well enough to approximate such a solution. Or are they? I assume for more common languages that is the case but that is likely a poor assumption on my part.

What of other non-metric space values? On what basis would they be partitioned?

Extending Hadoop beyond MapReduce

Thursday, February 23rd, 2012

Extending Hadoop beyond MapReduce

From the webpage:

Wednesday, March 7, 2012 10:00 am
Pacific Standard Time (San Francisco, GMT-08:00)


Hortonworks has been developing the next generation of Apache Hadoop MapReduce that factors the framework into a generic resource management fabric to support MapReduce and other application paradigms such as Graph Processing, MPI etc. High-availability is built-in from the beginning; as are security and multi-tenancy to support multiple users and organizations on large, shared clusters. The new architecture will also increase innovation, agility and hardware utilization. NextGen MapReduce is already available in Hadoop 0.23. Join us for this webcast as we discuss the main architectural highlights of MapReduce and its utility to users and administrators.

I registered to learn more about the recent changes to Hadoop.

But I am also curious if the discussion is going to be “beyond MapReduce” as in the title or “the main architectural highlights of MapReduce” as in the last sentence. Hard to tell from the description.

HyperDex: A Searchable Distributed Key-Value Store

Thursday, February 23rd, 2012

HyperDex: A Searchable Distributed Key-Value Store

From the webpage:

HyperDex is a distributed, searchable key-value store. HyperDex provides a unique search primitive which enables searches over stored values. By design, HyperDex retains the performance of traditional key-value stores while enabling support for the search operation.

The key features of HyperDex are:

  • Fast: HyperDex has lower latency and higher throughput than most other key-value stores.
  • Searchable: HyperDex enables lookups of non-primary data attributes. Such searches are implemented efficiently and contact a small number of servers.
  • Scalable: HyperDex scales as more machines are added to the system.
  • Consistent: The value you GET is always the latest value you PUT. Not just "eventually," but immediately and always.
  • Fault tolerant: HyperDex handles failures. Data is automatically replicated on multiple machines so that failures do not cause data loss.

Source code is available subject to this license.

European Commission launches consultation into e-interoperability

Thursday, February 23rd, 2012

European Commission launches consultation into e-interoperability by Derek du Preez.

From the post:

The European Commission (EC) has launched a one month public consultation into the problem of incompatible vocabularies used by developers of public administration IT systems.

“Core vocabularies” are used to make sharing and reusing data easier, and the EC hopes that if they are defined properly, it will be able to quickly and effectively launch e-Government cross-border services.

The EC has divided the consultation into three separate core vocabularies; person, business and location.

Despite the minimal nature of the core vocabularies, I think the expectations for their use are set by the final paragraph of this report:

Once the public consultation is over, the working groups will seek endorsement from EU Member States. This means that the vocabularies will not become a legal obligation, but will give them further exposure for wider use.

If you have pointers to current incompatible vocabularies, I would appreciate a ping. Just so we can revisit those vocabularies in, say, five years to see the result of “exposure for wider use.”

Downloadable Version of FAST Now Available

Thursday, February 23rd, 2012

Downloadable Version of FAST Now Available

Just in case you are in need of “an enumerative, faceted subject heading schema derived from the Library of Congress Subject Headings (LCSH).”

Thought that would get your attention. Details from the announcement follow:

OCLC Research has made FAST (Faceted Application of Subject Terminology) available for bulk download, along with some minor improvements based on user feedback and routine updates. As with other FAST data, the bulk downloadable versions are available at no charge.

FAST is an enumerative, faceted subject heading schema derived from the Library of Congress Subject Headings (LCSH). OCLC made FAST available as Linked Open Data in December 2011.

The bulk downloadable versions of FAST are offered at no charge. Like FAST content available through the FAST Experimental Linked Data Service, the downloadable versions of FAST are made available under the Open Data Commons Attribution (ODC-By) license.

FAST may be downloaded in either SKOS/RDF format or MARC XML (Authorities format). Users may download the entire FAST file including all eight facets (Personal Names, Corporate Names, Event, Uniform Titles, Chronological, Topical, Geographic, Form/Genre) or choose to download individual facets (see the download information page for more details).

OCLC has enhanced the VoID (“Vocabulary of Interlinked Datasets”) dataset description for improved ease of processing of the license references. Several additions and changes to FAST headings have been made in the normal course of processing new and changed headings in LCSH. OCLC will continue to periodically update FAST based on new and changed headings in LCSH.

About FAST

The FAST authority file, which underlies the FAST Linked Data release, has been created through a multi-year collaboration of OCLC Research and the Library of Congress. Specifically, it is designed to make the rich LCSH vocabulary available as a post-coordinate system in a Web environment. For more information, see the FAST activity page.