Archive for May, 2012

Applied topology and Dante: an interview with Robert Ghrist [Sept., 2010]

Monday, May 28th, 2012

Applied topology and Dante: an interview with Robert Ghrist by John D. Cook. (September 13, 2010)

From the post:

A few weeks ago I discovered Robert Ghrist via his web site. Robert is a professor of mathematics and electrical engineering. He describes his research as applied topology, something I’d never heard of. (Topology has countless applications to other areas of mathematics, but I’d not heard of much work directly applying topology to practical physical problems.) In addition to his work in applied topology, I was intrigued by Robert’s interest in old books.

The following is a lightly-edited transcript of a phone conversation Robert and I had September 9, 2010.

If the interview sounds interesting, you may want to read/skim:

[2008] R. Ghrist, “Three examples of applied and computational homology,” Nieuw Archief voor Wiskunde 5/9(2).


[2010] R. Ghrist, “Applied Algebraic Topology & Sensor Networks,” a manuscript text. (caveat! file>50megs!)

Applied Topology & Sensor Networks are the notes for an AMS short course. Ghrist recommends continuing with Algebraic Topology by Allen Hatcher. (Let me know if you need my shipping address.)

Q: Are sensors always mechanical sensors? We speak of them as though that were the case.

What if I can’t afford unmanned drones (to say nothing of their pilots) and have $N$ people with cellphones?

How does a more “discriminating” “sensor” impact the range of capabilities/solutions?

“AvocadoDB” becomes “ArangoDB”

Monday, May 28th, 2012

“AvocadoDB” becomes “ArangoDB”

From the post:

to avoid legal issues with some other Avocado lovers we have to change the name of our database. We want to stick to Avocados and selected a variety from Mexico/Guatemala called “Arango”.

So in short words: AvocadoDB will become ArangoDB in the next days, everything else remains the same. 🙂

We are making great progress towards version 1 (deadline is end of May). The simple query language is finished and documented and the more complex ArangoDB query language (AQL) is mostly done. So stay tuned. And: in case you know someone who is a node.js user and interested in writing an API for ArangoDB: let me know!

We will all shop with more confidence knowing the “avocado” at Kroger isn’t a noSQL database masquerading as a piece of fruit.

Another topic map type issue: There are blogs, emails (public and private), all of which refer to “AvocadoDB.” Hard to pretend those aren’t “facts.” The question will be how to index “ArangoDB” so that we pick up prior traffic on “AvocadoDB?”

Such as design or technical choices made in “AvocadoDB” that are the answers to issues with “ArangoDB.”
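One way to approach the renaming question is to fold every known name into one canonical key, both when indexing and when querying. A minimal sketch in Python, where the alias table, the document ids, the sample text and the function names are all hypothetical:

```python
from collections import defaultdict

# Hypothetical alias table: every known name maps to one canonical key.
ALIASES = {"avocadodb": "arangodb", "arangodb": "arangodb"}


def canonical(term):
    """Return the canonical form of a term, or the lowercased term itself."""
    lowered = term.lower()
    return ALIASES.get(lowered, lowered)


def build_index(docs):
    """Map each canonical term to the set of document ids that mention it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[canonical(word.strip(".,;:!?\"'()"))].add(doc_id)
    return index


def search(index, term):
    """Look up a term under its canonical form, so either name finds both."""
    return sorted(index.get(canonical(term), set()))


# Made-up documents for illustration only.
docs = {
    "blog-2012-03": "AvocadoDB uses a simple query language.",
    "mail-2012-06": "ArangoDB version 1 ships with AQL.",
}
index = build_index(docs)
print(search(index, "ArangoDB"))  # both documents, old name and new
```

The alias table is the part that needs curating: someone has to record that the two names identify the same subject, which is the topic map part of the problem.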

Short Intro to Graph Databases, Manipulating and Traversing With Gremlin

Monday, May 28th, 2012

Short Intro to Graph Databases, Manipulating and Traversing With Gremlin

Alex Popescu at myNoSQL captures a slide deck by Pierre De Wilde, “A Walk in Graph Databases.”

Has extensive examples using Gremlin after a short graph theory introduction.

Amusing graphic of everything looking like a table if all you have is a relational database.

Truth is that everything looks like a graph from a certain point of view.

Design question: What graph qualities, if any, are appropriate for your data and goals?

Always possible that graph representation or properties are inappropriate for your project.

Streaming Analytics: with sparse distributed representations

Monday, May 28th, 2012

Streaming Analytics: with sparse distributed representations by Jeff Hawkins.


Sparse distributed representations appear to be the means by which brains encode information. They have several advantageous properties including the ability to encode semantic meaning. We have created a distributed memory system for learning sequences of sparse distributed representations. In addition we have created a means of encoding structured and unstructured data into sparse distributed representations. The resulting memory system learns in an on-line fashion making it suitable for high velocity data streams. We are currently applying it to commercially valuable data streams for prediction, classification, and anomaly detection. In this talk I will describe this distributed memory system and illustrate how it can be used to build models and make predictions from data streams.


Looking forward to learning more about “sparse distributed representation (SDR).”

Not certain about Jeff’s claim that matching across SDRs = semantic similarity.

Design of the SDR determines the meaning of each bit and consequently of matching.

Which feeds back into the encoders that produce the SDRs.
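To make the point about matching concrete: if an SDR is modeled as the set of indices of its active bits, matching reduces to counting shared bits, and what that count means depends on what the encoder assigned to each bit. A rough sketch in Python; the scalar encoder here is my own toy assumption, not the encoder used in Grok or described in the HTM paper:

```python
def overlap(sdr_a, sdr_b):
    """Number of active bits two SDRs share."""
    return len(sdr_a & sdr_b)


def encode_scalar(value, n_bits=40, width=5, max_value=100):
    """Toy encoder (an assumption): a run of `width` active bits whose
    position tracks the value, so nearby values share bits."""
    start = int(value / max_value * (n_bits - width))
    return frozenset(range(start, start + width))


a = encode_scalar(50)
b = encode_scalar(52)
c = encode_scalar(90)

print(overlap(a, b))  # nearby values share most of their bits
print(overlap(a, c))  # distant values share none
```

With this encoder, overlap means "numerically close." A different encoder would give the same overlap count an entirely different meaning.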

Other resources:

The core paper: Hierarchical Temporal Memory including HTM Cortical Learning Algorithms. Check the FAQ link if you need the paper in Chinese, Japanese, Korean, Portuguese, Russian, or Spanish. (unverified translations)

Grok – Frequently Asked Questions

A very good FAQ that goes a long way to explaining the capabilities and limitations (currently) of Grok. “Unstructured text” for example isn’t appropriate input into Grok.

Jeff Hawkins and Sandra Blakeslee co-authored On Intelligence in 2004. The FAQ describes the current work as an extension of “On Intelligence.”

BTW, if you think you have heard the name Jeff Hawkins before, you have. Inventor of the Palm Pilot among other things.


Monday, May 28th, 2012


From the documentation:

Whoosh is a fast, pure Python search engine library.

The primary design impetus of Whoosh is that it is pure Python. You should be able to use Whoosh anywhere you can use Python, no compiler or Java required.

Like one of its ancestors, Lucene, Whoosh is not really a search engine, it’s a programmer library for creating a search engine [1].

Practically no important behavior of Whoosh is hard-coded. Indexing of text, the level of information stored for each term in each field, parsing of search queries, the types of queries allowed, scoring algorithms, etc. are all customizable, replaceable, and extensible.

[1] It would of course be possible to build a turnkey search engine on top of Whoosh, like Nutch and Solr use Lucene.

Haven’t inventoried script based search engines but perhaps I should.

Experiments with indexing/search behaviors might be easier (read more widespread) with scripting languages.
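As an illustration of how little code such an experiment needs, here is a tiny in-memory inverted index with a swappable analyzer. This is plain Python and not the Whoosh API; the class and function names are hypothetical:

```python
import re
from collections import defaultdict


def simple_analyzer(text):
    """Lowercase and split on non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


class TinyIndex:
    """In-memory inverted index with a swappable analyzer."""

    def __init__(self, analyzer=simple_analyzer):
        self.analyzer = analyzer
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Record doc_id under every token the analyzer produces."""
        for token in self.analyzer(text):
            self.postings[token].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every query token."""
        tokens = self.analyzer(query)
        if not tokens:
            return []
        result = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        return sorted(result)


index = TinyIndex()
index.add(1, "Whoosh is a pure Python search library")
index.add(2, "Lucene is a Java search library")
print(index.search("python search"))  # [1]
```

Swapping in a different analyzer (stemming, n-grams, stop words) is a one-line change, which is the sort of experiment the documentation has in mind when it says nothing important is hard-coded.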


Compare SearchBlox 7.0 vs. Solr

Monday, May 28th, 2012

Compare SearchBlox 7.0 vs. Solr by Timo Selvaraj.

From the post:

SearchBlox 7 is a (free) enterprise solution for website, ecommerce, intranet and portal search. The new 7.0 version makes it easy to add faceted search without the hassles of managing a schema and scales horizontally without any manual configuration or external software/scripts. SearchBlox enables you to achieve term, range and date based faceted search without manually maintaining a schema file as in Solr. SearchBlox enables you to have distributed indexing and searching abilities without using any separate scripts/programs as in SolrCloud. SearchBlox provides on demand dynamic faceting of fields without specifying them through a config or script.

Expecting a comparison of SearchBlox 7.0 and Solr?

You are going to be disappointed.

Summary of what Timo thinks about SearchBlox 7.0.

Not a bad thing, just not a basis for comparison.

That you have to supply yourself.

I am going to throw a copy of SearchBlox 7.0 on the fire later this week.

Mihai Surdeanu

Sunday, May 27th, 2012

I ran across Mihai Surdeanu’s publication page while hunting down an NLP article.

There are pages for software and other resources as well.


Facebook-class social network analysis with R and Hadoop

Sunday, May 27th, 2012

Facebook-class social network analysis with R and Hadoop

From the post:

In computing, social networks are traditionally represented as graphs: a collection of nodes (people), pairs of which may be connected by edges (friend relationships). Visually, the social networks can then be represented like this:

[graphic omitted]

Social network analysis often amounts to calculating the statistics on a graph like this: the number of edges (friends) connected to a particular node (person), and the distribution of the number of edges connected to nodes across the entire graph. When the graph consists of up to 10 billion elements (nodes and edges), such computations can be done on a single server with dedicated graph software like Neo4j. But bigger networks — like Facebook’s social network, which is a graph with more than 60 billion elements — require a distributed solution.
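The two statistics named in that paragraph are easy to see at toy scale. A small Python sketch, with a made-up edge list (at the scale the post discusses the same computation would be distributed rather than run in one process):

```python
from collections import Counter

# Made-up friendship edges; each pair is an undirected edge.
edges = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("cat", "dan")]

# Degree: number of edges (friends) connected to each node (person).
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# Degree distribution: how many nodes have each degree.
distribution = Counter(degree.values())

print(dict(degree))
print(dict(distribution))
```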

Pointer to a Marko A. Rodriguez post that describes how to use R and Hadoop on networks of scale.

Worth your time.

The Search Is Over: Integrating Solr and Hadoop to Simplify Big Data Analytics

Sunday, May 27th, 2012

The Search Is Over: Integrating Solr and Hadoop to Simplify Big Data Analytics

From MapR Technologies.

Show of hands. How many of you can name the solution found in these slides?


Slides are great for entertainment.

Solutions require more, a great deal more.

For the “more” on MapR, see: Download Hadoop Software Datasheets, Product Documentation, White Papers

The Seven Deadly Sins of Solr

Sunday, May 27th, 2012

The Seven Deadly Sins of Solr by Jay Hill.

From the post:

Working at Lucid Imagination gives me the opportunity to analyze and evaluate a great many instances of Solr implementations, running in some of the largest Fortune 500 companies as well as some of the smallest start-ups. This experience has enabled me to identify many common mistakes and pitfalls that occur, either when starting out with a new Solr implementation, or by not keeping up with the latest improvements and changes. Thanks to my colleague Simon Rosenthal for suggesting the title, and to Simon, Lance Norskog, and Tom Hill for helpful input and suggestions. So, without further ado…the Seven Deadly Sins of Solr.

Not recent and to some degree Solr specific.

You will encounter one or more of these “sins” with every IT solution, including topic maps.

This should be fancy printed, laminated and handed out as swag.

LREC Conferences

Sunday, May 27th, 2012

LREC Conferences

From the webpage:

The International Conference on Language Resources and Evaluation is organised by ELRA biennially with the support of institutions and organisations involved in HLT.

LREC Conferences bring together a large number of people working and interested in HLT.

Full proceedings, including workshops, tutorials, papers, etc., are available from 2002 forward!

I almost forgot to hit “save” for this post because I was reading a tutorial on Arabic parsing. 😉

You really owe it to yourself to see this resource.

Hundreds of papers at each conference on issues relevant to your processing of texts.

Getting a paper accepted here should be your goal after seeing the prior proceedings!

Once you get excited about the prior proceedings and perhaps attending in the future, here is my question:

How do you make the proceedings from prior conferences effectively available?

Subject to indexing/search over the WWW now but that isn’t what I mean.

How do you trace the development of techniques or ideas across conferences or papers, without having to read each and every paper?

Moreover, can you save those who follow you the time/trouble of reading every paper to duplicate your results?

SPLeT 2012: Workshop on Semantic Processing of Legal Texts

Sunday, May 27th, 2012

SPLeT 2012: Workshop on Semantic Processing of Legal Texts

Legal Informatics has a listing of the papers from the SPLeT 2012 workshop.

You know, with an acronym like that, you wonder why you missed it in the past. 😉

In case you did miss it in the past:

SPLeT 2010 Proceedings at Legal Informatics

SPLeT 2008 Workshop on Semantic Processing of Legal Texts.

Selected papers from SPLeT 2008 were expanded into: Semantic Processing of Legal Texts: Where the Language of Law Meets the Law of Language, edited by Enrico Francesconi, Simonetta Montemagni, Wim Peters and Daniela Tiscornia.

Sixty-three (63) pages (2008 proceedings) versus two hundred forty-nine (249) for the Springer title. I don’t have the printed volume so can’t comment on the value of the expansion. (Even “used” the paperback is > $50.00 US. I would borrow a copy before ordering.)

Assuming meetings every two years, SPLeT 2006 should have been the first workshop. That workshop apparently did not co-locate with LREC. A pointer to the workshop and proceedings if possible would be appreciated.

Neo4j 1.8.M03 – Related Coffee [1st comments on RELATE]

Saturday, May 26th, 2012

Neo4j 1.8.M03 – Related Coffee

From the post:

Released today, Neo4j 1.8.M03 introduces the RELATE clause, a two-step dance of MATCH or CREATE. Also, new Transaction support in the shell lets you call for a do-over in case you misstep.

The blog entry also points you to the Neo4j Manual entry on Cypher RELATE

My main machine is still down so I will have to wait to test the latest release (new equipment arrives on Tuesday of the coming week).

But in the meantime, if you read the manual entry, the first paragraph says:

RELATE is in the middle of MATCH and CREATE – it will match what it can, and create what is missing. RELATE will always make the least change possible to the graph – if it can use parts of the existing graph, it will.

I don’t need a working copy of Neo4j/Cypher to say that “RELATE will always make the least change possible to the graph – if it can use parts of the existing graph, it will.” is an odd statement.

How is RELATE going to determine the “…least change possible to the graph…”?

That sounds like a very hard problem.
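For a single relationship pattern, “match what it can, create what is missing” is easy enough to picture. The toy Python sketch below is not Neo4j code and says nothing about how RELATE is implemented; the class and names are hypothetical:

```python
class TinyGraph:
    """Toy graph: a set of node names and a set of typed edges."""

    def __init__(self):
        self.nodes = set()
        self.edges = set()  # (start, rel_type, end) triples

    def relate(self, start, rel_type, end):
        """Reuse whatever already exists; create only the missing pieces.

        Returns the list of things that had to be created.
        """
        created = []
        for node in (start, end):
            if node not in self.nodes:
                self.nodes.add(node)
                created.append(("node", node))
        edge = (start, rel_type, end)
        if edge not in self.edges:
            self.edges.add(edge)
            created.append(("edge", edge))
        return created


g = TinyGraph()
first = g.relate("alice", "DRINKS", "coffee")   # creates two nodes and an edge
second = g.relate("alice", "DRINKS", "coffee")  # everything matches, creates nothing
third = g.relate("bob", "DRINKS", "coffee")     # reuses "coffee", creates the rest
print(first, second, third, sep="\n")
```

With one edge there is only one way to reuse the existing graph. With longer patterns there can be several partial matches, each requiring a different set of additions, and that is where “least change” stops being obvious.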

More comments to follow on RELATE.

Starcounter To Be Fastest ACID Adherent NewSQL Database

Saturday, May 26th, 2012

Starcounter To Be Fastest ACID Adherent NewSQL Database by Sudheer Vatsavaya.

Starcounter has last week said that its launch of in-memory database is capable to process millions of transactions per second on a single server. Such a database is designed on its patent pending VMDBMS technology which offers combined power of virtual machine (VM) and Database management system (DBMS) to process the data at required volumes and speeds.

The company claims Starcounter to be more than 100 times faster than traditional databases and 10 times faster than high performance databases, the new in-memory database is ideal for highly transactional large-scale and real-time applications. It can handle millions of users, integrate with applications to increase performance, and guarantee consistency by processing millions of ACID-compliant database transactions per second while managing up to a terabyte of updatable data on a single server.

A few things that clearly come out in the design and ambition of the company is the belief that the way ahead is not SQL or NoSQL but NewSQL, which adheres to ACID attributes and at the same time overcomes the issue of being scalable to today’s data scalability needs. This cannot be achieved in either of the former types of databases. While SQL structured databases cannot scale up to the needs, NoSQL databases are built around the CAP theorem that says one of the three parameters Availability, Consistency or Partition tolerance has to be compromised.

Sounds interesting but runs on .Net.

I will have to rely on reports from others.

Outlier detection in two review articles (Part 2) (TM use case on Identifiers)

Saturday, May 26th, 2012

Outlier detection in two review articles (Part 2) by Sandro Saitta.

From the post:

Here we go with the second review article about outlier detection (this post is the continuation of Part I).

A Survey of Outlier Detection Methodologies

This paper, from Hodge and Austin, is also an excellent review of the field. Authors give a list of keywords in the field: outlier detection, novelty detection, anomaly detection, noise detection, deviation detection and exception mining. For the authors, “An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs (Grubbs, 1969)”. Before listing several application in the field, authors mention that an outlier can be “surprising veridical data“. It may only be situated in the wrong class.

An interesting list of possible reasons for outliers is given: human error, instrument error, natural deviations in population, fraudulent behavior, changes in behavior of system and faults in system. Like in the first article, Hodge and Austin define three types of approaches to outlier detection (unsupervised, supervised and semi-supervised). In the last one, they mention that some algorithms can allow a confidence in the fact that the observation is an outlier. Main drawback of the supervised approach is its inability to discover new types of outliers.
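As a small example of the unsupervised flavor, the Python sketch below flags values that sit far from the median, using the modified z-score (median absolute deviation) rule of thumb. This is my own illustration with made-up data and a conventional threshold of 3.5, not a method quoted from the survey:

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in values if 0.6745 * abs(v - center) / mad > threshold]


readings = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0]  # made-up sample
print(mad_outliers(readings))  # [25.0]
```

Whether 25.0 is an instrument error or “surprising veridical data” is exactly what the detector cannot tell you.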

While you are examining the techniques, do note the alternative ways to identify the problem.

Can you say topic map? 😉

Simple query expansion, assuming that any single term returns hundreds of papers, isn’t all that helpful. Instead of several hundred papers you get several thousand. Gee, thanks.

But that isn’t an indictment of alternative identifications of subjects, that is a problem of granularity.

Returning documents forces users to wade through large amounts of potentially irrelevant content.

The question is how to retain alternative identifications of subjects while returning a manageable (or configurable) amount of content?


Berkeley DB at Yammer: Application Specific NoSQL Data Stores for Everyone

Saturday, May 26th, 2012

Berkeley DB at Yammer: Application Specific NoSQL Data Stores for Everyone

Alex Popescu calls attention to Ryan Kennedy of Yammer presenting on transitioning from PostgreSQL to Berkeley DB.

Is that the right direction?

Watch the presentation and see what you think.

Doug Mahugh Live! was: MongoDB Replica Sets

Saturday, May 26th, 2012

Doug Mahugh spotted on MongoDB Replica Sets.

The video also teaches you about MongoDB replica sets on Windows. Replica sets are the means MongoDB uses for high reliability and read performance. An expert from 10gen, Sridhar Nanjundeswaran, covers the MongoDB stuff.

PS: Kudos to Doug on his new role at MS on reaching out to open source projects!

OData submitted to OASIS for standardization

Friday, May 25th, 2012

OData submitted to OASIS for standardization by Doug Mahugh.

From the post:

Citrix, IBM, Microsoft, Progress Software, SAP AG, and WSO2 have submitted a proposal to OASIS to begin the formal standardization process for OData. You can find all the details here, and OData architect Pablo Castro also provides some context for this announcement over on the blog. It’s an exciting time for the OData community!

OData is a REST-based web protocol for querying and updating data, and it’s built on standardized technologies such as HTTP, Atom/XML, and JSON. If you’re not already familiar with OData, the web site is the best place to learn more.
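For those who have not seen one, an OData query is an ordinary HTTP GET whose URL carries system query options such as $filter, $orderby and $top. A minimal Python sketch of building such a URL; the service root and the Products entity set are hypothetical, and nothing here contacts a server:

```python
from urllib.parse import quote, urlencode


def odata_url(service_root, entity_set, **options):
    """Build a query URL from OData system query options such as $filter."""
    params = {"$" + name: str(value) for name, value in options.items()}
    # Keep "$" literal and encode spaces as %20 rather than "+".
    query = urlencode(params, safe="$", quote_via=quote)
    return f"{service_root.rstrip('/')}/{entity_set}?{query}"


url = odata_url(
    "https://example.org/odata",  # hypothetical service root
    "Products",                   # hypothetical entity set
    filter="Price gt 20",
    orderby="Name",
    top=5,
)
print(url)
```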

It’s nice to see all the usual suspects lined up in favor of the same data query/update standard!

Not that it solves all the problems: there are still questions of semantics, to say nothing of a lot of details to be worked out along the way.

My suggestion would be that you drop by OASIS, which qualifies as one of the “best buys” in standards land, to see what role you want to take in the OData standardization process.

(Full disclosure: I am a long-term member of OASIS and a member of the Technical Advisory Board (TAB).)

Popular Queries

Friday, May 25th, 2012

Popular Queries by Hugh Williams.

From the post:

I downloaded the (infamous) AOL query logs a few days back, so I could explore caching in search. Here’s a few things I learnt about popular queries along the way.

There isn’t that much user search data around so I thought it would be worth recording the post for that reason if no other.
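The link between popular queries and caching can be shown in a few lines of Python. The query log below is made up, and the estimate assumes a static cache filled with the most frequent queries in hindsight, which is friendlier than anything a live cache would see:

```python
from collections import Counter

# Made-up query log; a real log would have millions of entries.
log = ["weather", "news", "weather", "maps", "weather", "news", "cat videos"]

counts = Counter(log)


def hit_rate(counts, cache_size):
    """Fraction of all queries answered by caching the top `cache_size` queries."""
    cached = sum(n for _, n in counts.most_common(cache_size))
    return cached / sum(counts.values())


print(counts.most_common(2))
print(round(hit_rate(counts, 2), 3))
```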

Apache Hadoop 2.0 (Alpha) Released

Friday, May 25th, 2012

Apache Hadoop 2.0 (Alpha) Released by Arun Murthy.

From the post:

As the release manager for the Apache Hadoop 2.0 release, it gives me great pleasure to share that the Apache Hadoop community has just released Apache Hadoop 2.0.0 (alpha)! While only an alpha release (read: not ready to run in production), it is still an important step forward as it represents the very first release that delivers new and important capabilities, including:

In addition to these new capabilities, there are several planned enhancements that are on the way from the community, including HDFS Snapshots and auto-failover for HA NameNode, along with further improvements to the stability and performance with the next generation of MapReduce (YARN). There are definitely good times ahead.

Let the good times roll!

Image compositing in TileMill

Friday, May 25th, 2012

Image compositing in TileMill by Kim Rees.

From the post:

TileMill is a tool that makes it easy to create interactive maps. Soon they will be adding some new features that will treat maps more like images in terms of modifying the look and feel. This will allow you to apply blending to polygons and GIS data.

BTW, a direct link for TileMill.

On brief glance, the TileMill site is very impressive.

Are you tying topic maps to GIS or other types of maps?

Bruce on Legislative Identifier Granularity

Friday, May 25th, 2012

Bruce on Legislative Identifier Granularity

From the post:

In this post, Tom [Bruce] explores legislative identifier granularity, or the level of specificity at which such an identifier functions. The post discusses related issues such as the incorporation of semantics in identifiers; the use of “pure” (semantics-free) legislative identifiers; and how government agency authority and procedural rules influence the use, “persistence, and uniqueness” of identifiers. The latter discussion leads Tom to conclude that

a “gold standard” system of identifiers, specified and assigned by a relatively independent body, is needed at the core. That gold standard can then be extended via known, stable relationships with existing identifier systems, and designed for extensible use by others outside the immediate legislative community.

Interesting and useful reading.

Even though a “gold standard” of identifiers for something as dynamic as legislation isn’t likely.

Or rather, isn’t going to happen.

There are too many stakeholders in present systems for any proposal to carry the day.

Not to mention decades, if not centuries, of references in other systems.

Silo Indictment #1,000,001

Friday, May 25th, 2012

Derek Miers writes what may be the 1,000,001st indictment of silos in Silos and Functional Decomposition:

I think we would all agree that BPM and business architecture set out to overcome the issues associated with silos. And I think we would also agree that the problems associated with silos derive from functional decomposition.

While strategy development usually takes a broad, organization-wide view, so many change programs still cater to the sub-optimization perspectives of individual silos. Usually, these individual change programs consist of projects that deal with the latest problem to rise to the top of the political agenda — effectively applying a Band-Aid to fix a broken customer-facing process or put out a fire associated with some burning platform.

Silo-based thinking is endemic to Western culture — it’s everywhere. This approach to management is very much a command-and-control mentality injected into our culture by folks like Smith, Taylor, Newton and Descartes. Let’s face it: the world has moved on, and the network is now far more important than the hierarchy.

But guess what technique about 99.9% of us use to fix the problems associated with functional decomposition? You guessed it: yet more functional decomposition. I think Einstein had something to say about using the same techniques and expecting different results. This is a serious groupthink problem!

When we use functional decomposition to model processes, we usually conflate the organizational structure with the work itself. Rather than breaking down the silos, this approach reinforces them — effectively putting them on steroids. And when other techniques emerge that explicitly remove the conflation of process and organizational structure, those who are wedded to old ways of thinking come out of the woodwork to shoot them down. Examples include role activity diagrams (Martyn Ould), value networks (Verna Allee), and capability mapping (various authors, including Forrester analysts).

Or it may be silo indictment #1,000,002; it is hard to keep an accurate count.

I don’t doubt a word that Derek says, although I might put a different emphasis on parts of it.

But in any case, let’s just emphasize agreement that silos are a problem.

Now what?

The Data Lifecycle, Part One: Avroizing the Enron Emails

Friday, May 25th, 2012

The Data Lifecycle, Part One: Avroizing the Enron Emails by Russell Jurney.

From the post:

Series Introduction

This is part one of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data. In a series of posts, we’re going to explore the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in HIVE, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.

The Berkeley Enron Emails

In this project we will convert a MySQL database of Enron emails into Avro document format for analysis on Hadoop with Pig. Complete code for this example is available here on GitHub.

Email is a rich source of information for analysis by many means. During the investigation of the Enron scandal of 2001, 517,431 messages from 114 inboxes of key Enron executives were collected. These emails were published and have become a common dataset for academics to analyze document collections and social networks. Andrew Fiore and Jeff Heer at UC Berkeley have cleaned this email set and provided it as a MySQL archive.

We hope that this dataset can become a sort of common set for examples and questions, as anonymizing one’s own data in public forums can make asking questions and getting quality answers tricky and time consuming.

More information about the Enron Emails is available:

Covering the data lifecycle in any detail is a rare event.

To do so with a meaningful data set is even rarer.

You will get the maximum benefit from this series by “playing along” and posting your comments and observations.

Build your own twitter like real time analytics – a step by step guide

Friday, May 25th, 2012

Build your own twitter like real time analytics – a step by step guide

Where else but High Scalability would you find a “how-to” article like this one? Complete with guide and source code.

Good DIY project for the weekend.

Major social networking platforms like Facebook and Twitter have developed their own architectures for handling the need for real-time analytics on huge amounts of data. However, not every company has the need or resources to build their own Twitter-like solution.

In this example we have taken the same Twitter/Facebook-like blueprint, and made it simple enough for developers to implement. We have taken the following approach in our implementation:

  1. Use In Memory Data Grid (XAP) for handling the real time stream data-processing.
  2. Use BigData data-base (Cassandra) for storing the historical data and managing the trend analytics.
  3. Use Cloudify for managing and automating the deployment on private or public cloud.

The example demonstrates a simple case of word count analytics. It uses Spring Social to plug-in to real twitter feeds. The solution is designed to efficiently cope with getting and processing the large volume of tweets. First, we partition the tweets so that we can process them in parallel, but we have to decide on how to partition them efficiently. Partitioning by user might not be sufficiently balanced, therefore we decided to partition by the tweet ID, which we assume to be globally unique.

Then we need to persist and process the data with low latency, and for this we store the tweets in memory.
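The partition-then-count idea can be tried without any of the infrastructure. A plain Python stand-in follows; it uses none of the XAP, Cassandra or Spring Social APIs, and the partition count and sample tweets are assumptions for illustration:

```python
from collections import Counter

NUM_PARTITIONS = 3  # assumed number of parallel workers


def partition_for(tweet_id):
    """Route a tweet to a partition by its numeric ID."""
    return tweet_id % NUM_PARTITIONS


def word_counts(tweets):
    """Count words per partition, then merge the partial counts."""
    partials = [Counter() for _ in range(NUM_PARTITIONS)]
    for tweet_id, text in tweets:
        partials[partition_for(tweet_id)].update(text.lower().split())
    total = Counter()
    for partial in partials:
        total.update(partial)
    return partials, total


# Made-up tweets as (id, text) pairs.
tweets = [(101, "graphs are fun"), (102, "Graphs everywhere"), (103, "fun with data")]
partials, total = word_counts(tweets)
print(total.most_common(2))
```

Because sequential IDs spread evenly under the modulo, each partition gets a similar share of the stream, which is the balance argument the post makes against partitioning by user.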

Automated harvesting of tweets has real potential, even with clear text transmission. Or perhaps because of it.

Role Modeling

Friday, May 25th, 2012

Role Modeling

From the webpage:

Roles are about objects and how they interact to achieve some purpose. For thirty years I have tried to get them into the main stream, but haven’t succeeded. I believe the reason is that our programming languages are class oriented rather than object oriented. So why model in terms of objects when you cannot program them?

Almost all my documents are about role modeling in one form or another. There are two very useful abstractions on objects. One abstraction classifies objects according to their properties. The other studies how objects work together to achieve one or more of the users’ goals. I have for the past 30 years tried to make our profession aware of this important dichotomy, but have met with very little success. The Object Management Group (OMG) has standardized the Unified Modeling Language, UML. We were members of the core team defining this language and our role modeling became part of the language under the name of Collaborations. Initially, very few people seemed to appreciate the importance of the notion of Collaborations. I thought that this would change when Ivar Jacobson came out with his Use Cases because a role model shows how a system of interacting objects realizes a use case, but it is still heavy going. There are encouraging signs in the concept of Components in the emerging UML version 2.0. Even more encouraging is the ongoing work with Web Services where people and components are in the center of interest while classes are left to the specialists. My current project, BabyUML, binds it all together: algorithms coded as classes + declaration of semantic model + coding of object interaction as collaborations/role models.

The best reference is my book Working With Objects. Out of print, but is still available from some bookshops including Amazon as of January 2010.

You can download the pdf of Working with Objects (version before publication). A substantial savings over the Amazon “new” price of $100+ US.

This webpage has links to a number of resources from Trygve M. H. Reenskaug on role modeling.

I saw this reference in a tweet by Inge Henriksen.

Latent Multi-group Membership Graph Model

Thursday, May 24th, 2012

Latent Multi-group Membership Graph Model by Myunghwan Kim and Jure Leskovec.


We develop the Latent Multi-group Membership Graph (LMMG) model, a model of networks with rich node feature structure. In the LMMG model, each node belongs to multiple groups and each latent group models the occurrence of links as well as the node feature structure. The LMMG can be used to summarize the network structure, to predict links between the nodes, and to predict missing features of a node. We derive efficient inference and learning algorithms and evaluate the predictive performance of the LMMG on several social and document network datasets.

Oddly enough, the cited literature in this article cuts off with 1997 and Bearman’s adolescent health survey. I distinctly remember there being network, node, document research prior to 1997.

Not a bad article but I have the feeling I have seen this or something very close to it before. HyTime? With less formalism?

Unless one of my European readers has contributed the solution before I get to the keyboard in the morning (US East Coast), I will take another run at it.

Just so you know, this paragraph, from the introduction, is what caught my eye:

Node features along with the links between them provide rich and complementary sources of information and should be used simultaneously for uncovering, understanding and exploiting the latent structure in the data. In this respect, we develop a new network model considering both the emergence of links of the network and the structure of node features such as user profile information or text of a document.

Visual and semantic interpretability of projections of high dimensional data for classification tasks

Thursday, May 24th, 2012

Visual and semantic interpretability of projections of high dimensional data for classification tasks by Ilknur Icke and Andrew Rosenberg.

A number of visual quality measures have been introduced in visual analytics literature in order to automatically select the best views of high dimensional data from a large number of candidate data projections. These methods generally concentrate on the interpretability of the visualization and pay little attention to the interpretability of the projection axes. In this paper, we argue that interpretability of the visualizations and the feature transformation functions are both crucial for visual exploration of high dimensional labeled data. We present a two-part user study to examine these two related but orthogonal aspects of interpretability. We first study how humans judge the quality of 2D scatterplots of various datasets with varying number of classes and provide comparisons with ten automated measures, including a number of visual quality measures and related measures from various machine learning fields. We then investigate how the user perception on interpretability of mathematical expressions relate to various automated measures of complexity that can be used to characterize data projection functions. We conclude with a discussion of how automated measures of visual and semantic interpretability of data projections can be used together for exploratory analysis in classification tasks.

Rather small group of test subjects (20) so I don’t think you can say much other than more work is needed.

Then it occurred to me that I often speak of studies applying to “users” without stopping to remember that for many tasks, I fall into that self-same category. Subject to the same influences, fatigues and even mistakes.

Anyone know of research by researchers being applied to the same researchers?

TinkerPop2 Release

Thursday, May 24th, 2012

A message from Marko Rodriguez announced the release of TinkerPop2 with notes on the major features of each:

– Massive changes to blueprints-core API
– TreePipe added for exposing the spanning tree of a traversal
– Automatic path and query optimizations
– FramedGraph is simply a wrapper graph in the Blueprints sense
– Synchronicity with the Blueprints API

BTW, Marko says:

As you may know, there are big changes to the API: package renaming, new core API method names, etc. While this may be shocking, it is all worth it. In 2 weeks, there is going to be a release of something very big for which TinkerPop2 will be a central piece of the puzzle. Stay tuned and get ready for a summer of insane, crazy graph madness.

So, something to look forward to!

neo4j: What question do you want to answer?

Thursday, May 24th, 2012

neo4j: What question do you want to answer? by Mark Needham.

From the post:

Over the past few weeks I’ve been modelling ThoughtWorks project data in neo4j and I realised that the way that I’ve been doing this is by considering what question I want to answer and then building a graph to answer it.

Mark’s post and the following comments raise interesting issues about modeling with a graph.

What captures my curiosity is whether “building a graph” is meant in some static sense or can the graph I see be “built” in some more dynamic sense? It may only exist for so long as I view it under certain conditions.