Archive for June, 2013

Big Data Benchmark

Thursday, June 27th, 2013

Big Data Benchmark

From the webpage:

This is an open source benchmark which compares the performance of several large scale data-processing frameworks.


Several analytic frameworks have been announced in the last six months. Among them are inexpensive data-warehousing solutions based on traditional Massively Parallel Processor (MPP) architectures (Redshift), systems which impose MPP-like execution engines on top of Hadoop (Impala, HAWQ) and systems which optimize MapReduce to improve performance on analytical workloads (Shark, Stinger). This benchmark provides quantitative and qualitative comparisons of four systems. It is entirely hosted on EC2 and can be reproduced directly from your computer.

  • Redshift – a hosted MPP database offered by Amazon.com based on the ParAccel data warehouse.
  • Hive – a Hadoop-based data warehousing system. (v0.10, 1/2013 Note: Hive v0.11, which advertises improved performance, was recently released but is not yet included)
  • Shark – a Hive-compatible SQL engine which runs on top of the Spark computing framework. (v0.8 preview, 5/2013)
  • Impala – a Hive-compatible* SQL engine with its own MPP-like execution engine. (v1.0, 4/2013)

This remains a work in progress and will evolve to include additional frameworks and new capabilities. We welcome contributions.

What is being evaluated?

This benchmark measures response time on a handful of relational queries: scans, aggregations, joins, and UDF’s, across different data sizes. Keep in mind that these systems have very different sets of capabilities. MapReduce-like systems (Shark/Hive) target flexible and large-scale computation, supporting complex User Defined Functions (UDF’s), tolerating failures, and scaling to thousands of nodes. Traditional MPP databases are strictly SQL compliant and heavily optimized for relational queries. The workload here is simply one set of queries that most of these systems can complete.
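To make the three relational query shapes concrete (scan, aggregation, join), here is a minimal sketch using Python's built-in sqlite3 module. The table names, columns, and rows are hypothetical stand-ins chosen for illustration, not the benchmark's actual schema or data, and SQLite is not one of the systems being compared.

```python
import sqlite3


def run_query_shapes():
    """Run one scan, one aggregation, and one join on tiny made-up tables."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Hypothetical tables, loosely in the spirit of a web-analytics workload.
    cur.execute("CREATE TABLE rankings (page_url TEXT, page_rank INTEGER)")
    cur.execute(
        "CREATE TABLE uservisits (source_ip TEXT, dest_url TEXT, ad_revenue REAL)"
    )
    cur.executemany(
        "INSERT INTO rankings VALUES (?, ?)",
        [("a.example", 10), ("b.example", 120), ("c.example", 300)],
    )
    cur.executemany(
        "INSERT INTO uservisits VALUES (?, ?, ?)",
        [
            ("10.0.0.1", "a.example", 1.0),
            ("10.0.0.1", "b.example", 2.5),
            ("10.0.0.2", "c.example", 4.0),
        ],
    )

    # Scan: filter rows of a single table.
    scan = cur.execute(
        "SELECT page_url FROM rankings WHERE page_rank > 100 ORDER BY page_url"
    ).fetchall()

    # Aggregation: group rows and sum a column.
    agg = cur.execute(
        "SELECT source_ip, SUM(ad_revenue) FROM uservisits "
        "GROUP BY source_ip ORDER BY source_ip"
    ).fetchall()

    # Join: combine the two tables on the page URL.
    join = cur.execute(
        "SELECT v.source_ip, r.page_rank FROM uservisits v "
        "JOIN rankings r ON r.page_url = v.dest_url "
        "WHERE r.page_rank > 100 ORDER BY r.page_rank"
    ).fetchall()

    conn.close()
    return scan, agg, join
```

A benchmark like the one described would time queries of these shapes at several data sizes on each system; the sketch only shows what the shapes look like.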

Benchmarks were mentioned in a discussion at the XTM group on LinkedIn.

Not sure these would be directly applicable, but they should prove to be useful background material.

I first saw this at Danny Bickson’s Shark @ SIGMOD workshop.

Danny points to Reynold Xin’s Shark talk at SIGMOD GRADES workshop. General overview but worth your time.

Danny also points out that Reynold Xin will be presenting on GraphX at the GraphLab workshop Monday July 1st in SF.

I can’t imagine why that came to mind. 😉

What a Great Year for Hue Users! [Semantics?]

Thursday, June 27th, 2013

What a Great Year for Hue Users! by Eva Andreasson.

From the post:

With the recent release of CDH 4.3, which contains Hue 2.3, I’d like to report on the fantastic progress of Hue in the past year.

For those who are unfamiliar with it, Hue is a very popular, end-user focused, fully open source Web UI designed for interaction with Apache Hadoop and its ecosystem components. Founded by Cloudera employees, Hue has been around for quite some time, but only in the last 12 months has it evolved into the great ramp-up and interaction tool it is today. It’s fair to say that Hue is the most popular open source GUI for the Hadoop ecosystem among beginners — as well as a valuable tool for seasoned Hadoop users (and users generally in an enterprise environment) – and it is the only end-user tool that ships with Hadoop distributions today. In fact, Hue is even redistributed and marketed as part of other user-experience and ramp-up-on-Hadoop VMs in the market.

We have reached where we are today – 1,000+ commits later – thanks to the talented Cloudera Hue team (special kudos needed to Romain, Enrico, and Abe) and our customers and users in the community. Therefore it is time to celebrate with a classy new logo and community website at!

See Eva’s post for her reflections but I have to say, I do like the new logo:


If Hue has the capability to document the semantics of structures or data, I have overlooked it.

Seems like a golden area for a topic map contribution.

Apache Solr volume 1 -….

Thursday, June 27th, 2013

Apache Solr V[olume] 1 – Introduction, Features, Recency Ranking and Popularity Ranking by Ramzi Alqrainy.

I amended the title to expand v for volume. Just seeing the “v” made me think version. Not true in this case.

Nothing new or earthshaking but a nice overview of Solr.

It is a “read along” slide deck so the absence of a presenter won’t impair its usefulness.

Astronomy and Computing

Thursday, June 27th, 2013

Astronomy and Computing

A new journal on astronomy and computing. Potentially a source of important new techniques and algorithms for data processing.

The first volume is online for free but following issues will be behind an Elsevier pay wall.

I will try to keep you advised of interesting new articles.

I first saw this at Bruce Berriman’s Astronomy and Computing: A New Peer Reviewed Astronomy Journal.

elasticsearch 0.90.2

Thursday, June 27th, 2013

0.90.2 released by Clinton Gormley.

From the post:

The Elasticsearch dev team are pleased to announce the release of elasticsearch 0.90.2, which is based on Lucene 4.3.1. You can download it here.

We recommend upgrading to 0.90.2 from 0.90.1, especially if you are using the terms-lookup filter, as this release includes some enhancements and bug fixes to terms lookup.

Besides the other enhancements and bug-fixes, which you can read about on the issues list, there is one new feature that is particularly worth mentioning: improved support for geohashes on geopoints:

A geohash is a string representing an area on earth – the longer the string the more precise the geohash. A geohash just one character long refers to an area with a very rough precision: +/- 2500 km. A geohash of length 8 would be accurate to within 20m, etc. Because a geohash is just a string, we can index it in Elasticsearch and take advantage of the inverted index to make blazingly fast geo-location queries.

Wikipedia on Geohash. Numerous external links, including one where you enter geo-coordinates or a geohash and it displays a map with the location marked.
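For readers who have not met geohashes before, here is a minimal sketch of the standard public encoding: bits are taken alternately from longitude and latitude by repeated halving of the range, and every five bits become one base-32 character. The function name and the default length are my own choices for illustration; this is not Elasticsearch code.

```python
# Base-32 alphabet used by geohashes (no a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, length=8):
    """Encode a latitude/longitude pair as a geohash string of the given length."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # the first bit always comes from longitude

    while len(chars) < length:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1

        if bit_count == 5:  # five bits make one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0

    return "".join(chars)


print(geohash_encode(57.64911, 10.40744, 5))  # u4pru
```

Because each extra character only narrows the same cell, a shorter geohash is always a prefix of a longer one for the same point. That prefix property is what makes a plain string index useful for location queries.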

Scaling Through Partitioning and Shard Splitting in Solr 4 (Webinar)

Wednesday, June 26th, 2013

Scaling Through Partitioning and Shard Splitting in Solr 4 by Timothy Potter.

Date: Thursday, July 18, 2013
Time: 10:00am Pacific Time

From the post:

Over the past several months, Solr has reached a critical milestone of being able to elastically scale-out to handle indexes reaching into the hundreds of millions of documents. At Dachis Group, we’ve scaled our largest Solr 4 index to nearly 900M documents and growing. As our index grows, so does our need to manage this growth.

In practice, it’s common for indexes to continue to grow as organizations acquire new data. Over time, even the best designed Solr cluster will reach a point where individual shards are too large to maintain query performance. In this Webinar, you’ll learn about new features in Solr to help manage large-scale clusters. Specifically, we’ll cover data partitioning and shard splitting.

Partitioning helps you organize subsets of data based on data contained in your documents, such as a date or customer ID. We’ll see how to use custom hashing to route documents to specific shards during indexing. Shard splitting allows you to split a large shard into 2 smaller shards to increase parallelism during query execution.

Attendees will come away from this presentation with a real-world use case that proves Solr 4 is elastically scalable, stable, and is production ready.
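The two ideas in the webinar description, routing documents to shards by hashing a key and splitting one shard into two, can be sketched in a few lines. This is a toy model under loud assumptions: the names (`make_shards`, `route`, `split_shard`), the use of CRC32 as the hash, and the dictionary-based "shards" are all hypothetical and are not Solr's implementation or API.

```python
import zlib

HASH_SPACE = 2 ** 32  # zlib.crc32 returns an unsigned 32-bit value


def hash_key(key):
    """Stand-in hash function (assumption: Solr uses its own hashing)."""
    return zlib.crc32(key.encode("utf-8"))


def make_shards(n):
    """Divide the hash space into n contiguous ranges, one per shard."""
    step = HASH_SPACE // n
    shards = []
    for i in range(n):
        lo = i * step
        hi = HASH_SPACE if i == n - 1 else (i + 1) * step
        shards.append({"name": f"shard{i + 1}", "lo": lo, "hi": hi, "docs": {}})
    return shards


def route(shards, routing_key):
    """Find the shard whose hash range covers the routing key."""
    h = hash_key(routing_key)
    for shard in shards:
        if shard["lo"] <= h < shard["hi"]:
            return shard
    raise ValueError("hash outside all shard ranges")


def index(shards, doc_id, customer_id, body):
    """Route by customer ID so one customer's documents share a shard."""
    shard = route(shards, customer_id)
    shard["docs"][doc_id] = {"customer": customer_id, "body": body}
    return shard["name"]


def split_shard(shards, name):
    """Replace one shard with two, each covering half of its hash range."""
    idx = next(i for i, s in enumerate(shards) if s["name"] == name)
    parent = shards[idx]
    mid = (parent["lo"] + parent["hi"]) // 2
    left = {"name": name + "_0", "lo": parent["lo"], "hi": mid, "docs": {}}
    right = {"name": name + "_1", "lo": mid, "hi": parent["hi"], "docs": {}}
    for doc_id, doc in parent["docs"].items():
        target = left if hash_key(doc["customer"]) < mid else right
        target["docs"][doc_id] = doc
    shards[idx:idx + 1] = [left, right]
    return left, right
```

Since documents are placed by the same hash that later drives the split, routing still finds every document after a split without re-indexing from the original source.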

Just in time for when you finish your current Solr reading! 😉

Definitely on the calendar!

Developing an Ontology of Legal Research

Wednesday, June 26th, 2013

Developing an Ontology of Legal Research by Amy Taylor.

From the post:

This session will describe my efforts to develop a legal ontology for teaching legal research. There are currently more than twenty legal ontologies worldwide that encompass legal knowledge, legal problem solving, legal drafting and information retrieval, and subjects such as IP, but no ontology of legal research. A legal research ontology could be useful because the transition from print to digital sources has shifted the way research is conducted and taught. Legal print sources have much of the structure of legal knowledge built into them (see the attached slide comparing screen shots from Westlaw and WestlawNext), so teaching students how to research in print also helps them learn the subject they are researching. With the shift to digital sources, this structure is now only implicit, and researchers must rely more upon a solid foundation in the structure of legal knowledge. The session will also describe my choice of OWL as the language that best meets the needs in building this ontology. The session will also explore the possibilities of representing this legal ontology in a more compact visual form to make it easier to incorporate into legal research instruction.

Plus slides and:

Leaving aside Amy’s choice of an ontology, OWL, etc., I would like to focus on her statement:

Legal print sources have much of the structure of legal knowledge built into them (see the attached slide comparing screen shots from Westlaw and WestlawNext), so teaching students how to research in print also helps them learn the subject they are researching. With the shift to digital sources, this structure is now only implicit, and researchers must rely more upon a solid foundation in the structure of legal knowledge.

First, Amy is comparing “Westlaw Classic” and “WestlawNext,” both digital editions.

Second, the “structure” in question appeared in the “digests” published by West, for example:


And in case head notes as:

head notes

That is, the tradition of reporting structure in the digest, and only isolated topics in case reports, did not start with electronic versions.

That has been the organization of West materials since its beginning in the 19th century.

Third, an “ontology” of the law is quite a different undertaking from the “taxonomy” used by the West system.

The West American Digest System organized law reports to enable researchers to get “close enough” to relevant authorities.

That is, the “last semantic mile” was up to the researcher, not the West system.

Even at that degree of coarseness in the West system, it was still an ongoing labor of decades by thousands of editors, and it remains so today.

The amount of effort expended to obtain a coarse but useful taxonomy of the law should be a fair warning to anyone attempting an “ontology” of the same.

Choosing Crowdsourced Transcription Platforms at SSA 2013

Wednesday, June 26th, 2013

Choosing Crowdsourced Transcription Platforms at SSA 2013

Transcript of Ben Brumfield’s presentation at the Society of Southwestern Archivists. Audio available.

Ben covers the principles of crowd sourced projects, such as:

Now, I’m an open source developer, and in the open source world we tend to differentiate between “free as in beer” or “free as in speech”.


Crowdsourcing projects are really “free as in puppy”. The puppy is free, but you have to take care of it; you have to do a lot of work. Because volunteers that are participating in these things don’t like being ignored. They don’t like having their work lost. They’re doing something that they feel is meaningful and engaging with you, therefore you need to make sure their work is meaningful and engage with them.

For the details on tools, Ben points us to: Collaborative Transcription Tools.

You will need the technology side for a crowd sourced project, topic map related or not.

But don’t neglect the human side of such a project. At least if you want a successful project.

Hunk: Splunk Analytics for Hadoop Beta

Wednesday, June 26th, 2013

Hunk: Splunk Analytics for Hadoop Beta

From the post:

Hunk is a new software product to explore, analyze and visualize data in Hadoop. Building upon Splunk’s years of experience with big data analytics technology deployed at thousands of customers, it drives dramatic improvements in the speed and simplicity of interacting with and analyzing data in Hadoop without programming, costly integrations or forced data migrations.

  • Splunk Virtual Indexing (patent pending) – Explore, analyze and visualize data across multiple Hadoop distributions as if it were stored in a Splunk index
  • Easy to Deploy and Drive Fast Value – Simply point Hunk at your Hadoop cluster and start exploring data immediately
  • Interactive Analysis of Data in Hadoop – Drive deep analysis, pattern detection and find anomalies across terabytes and petabytes of data. Correlate data to spot trends and identify patterns of interest

I think this is the line that will catch most readers:

Hunk is compatible with virtually every leading Hadoop distribution. Simply point it at your Hadoop cluster and start exploring and analyzing your data within minutes.

Professional results may take longer but results within minutes will please most users.

Apache Bigtop: The “Fedora of Hadoop”…

Wednesday, June 26th, 2013

Apache Bigtop: The “Fedora of Hadoop” is Now Built on Hadoop 2.x by Roman Shaposhnik.

From the post:

Just in time for Hadoop Summit 2013, the Apache Bigtop team is very pleased to announce the release of Bigtop 0.6.0: The very first release of a fully integrated Big Data management distribution built on the currently most advanced Hadoop 2.x, Hadoop 2.0.5-alpha.

Bigtop, as many of you might already know, is a project aimed at creating a 100% open source and community-driven Big Data management distribution based on Apache Hadoop. (You can learn more about it by reading one of our previous blog posts on Apache Blogs.) Bigtop also plays an important role in CDH, which utilizes its packaging code from Bigtop — Cloudera takes pride in developing open source packaging code and contributing the same back to the community.

The very astute readers of this blog will notice that given our quarterly release schedule, Bigtop 0.6.0 should have been called Bigtop 0.7.0. It is true that we skipped a quarter. Our excuse is that we spent all this extra time helping the Hadoop community stabilize the Hadoop 2.x code line and making it a robust kernel for all the applications that are now part of the Bigtop distribution.

And speaking of applications, we haven’t forgotten to grow the Bigtop family: Bigtop 0.6.0 adds Apache HCatalog and Apache Giraph to the mix. The full list of Hadoop applications available as part of the Bigtop 0.6.0 release is:

  • Apache Zookeeper 3.4.5
  • Apache Flume 1.3.1
  • Apache HBase 0.94.5
  • Apache Pig 0.11.1
  • Apache Hive 0.10.0
  • Apache Sqoop 2 (AKA 1.99.2)
  • Apache Oozie 3.3.2
  • Apache Whirr 0.8.2
  • Apache Mahout 0.7
  • Apache Solr (SolrCloud) 4.2.1
  • Apache Crunch (incubating) 0.5.0
  • Apache HCatalog 0.5.0
  • Apache Giraph 1.0.0
  • LinkedIn DataFu 0.0.6
  • Cloudera Hue 2.3.0

And we were just talking about YARN and applications, weren’t we? 😉


(Participate if you can but at least send a note of appreciation to Cloudera.)

Semantic Queries. Who Knew?

Wednesday, June 26th, 2013

The New Generation of Database Technology Includes Semantics and Search: David Gorbet, VP of Engineering for MarkLogic, chatted with Bloor Group Principal Robin Bloor in a recent Briefing Room.

From near the end of the interview:

There’s still a lot of opportunity to light up new scenarios for our customers. That’s why we’re excited about our semantics capabilities in MarkLogic 7. We believe that semantics technology is the next generation of search and discovery, allowing queries based on the concepts you’re looking for and not just the words and phrases. MarkLogic 7 will be the only database to allow semantics queries combined with document search and element/value queries all in one place. Our customers are excited about this.

Need to watch the marketing literature from MarkLogic for riffs and themes to repeat for topic map-based solutions.

Not to mention that topic maps can point into and add semantics to existing data stores and their contents.

Re-using current data stores sounds more attractive than ripping out all your data to migrate to another platform.


Hadoop YARN

Wednesday, June 26th, 2013

Hadoop YARN by Steve Loughran, Devaraj Das & Eric Baldeschwieler.

From the post:

A next-generation framework for Hadoop data processing.

Apache™ Hadoop® YARN is a sub-project of Hadoop at the Apache Software Foundation introduced in Hadoop 2.0 that separates the resource management and processing components. YARN was born of a need to enable a broader array of interaction patterns for data stored in HDFS beyond MapReduce. The YARN-based architecture of Hadoop 2.0 provides a more general processing platform that is not constrained to MapReduce.


As part of Hadoop 2.0, YARN takes the resource management capabilities that were in MapReduce and packages them so they can be used by new engines. This also streamlines MapReduce to do what it does best, process data. With YARN, you can now run multiple applications in Hadoop, all sharing a common resource management. Many organizations are already building applications on YARN in order to bring them IN to Hadoop.



One of the more accessible explanations of the importance of Hadoop YARN.

Likely not anything new to you but may be helpful when talking to others.

Better Content on Memory Stick?

Wednesday, June 26th, 2013

Sandbox on Memory Stick (pic)

There was talk over at LinkedIn about marketing for topic maps.

Here’s an idea.

No mention of topic maps on the outside, but with no install, no configuring of paths, etc., the user gets a topic map engine plus content.

Topical content for the forum where the sticks are being distributed.

Plug and compare results to your favorite search sewer.

Limited range of data.

But if I am supposed to be searching SEC mandated financial reports and related data, not being able to access Latvian lingerie ads is probably ok. With management at least.*

I first saw this in a tweet by shaunconnolly.

Suggestions for content?

* Just an aside but curated content could provide not only better search results but also eliminate results that may distract staff from the task at hand.

Better than filters, etc. Other content would simply not be an option.

Firefox Delivers 3D Gaming,…

Wednesday, June 26th, 2013

Firefox Delivers 3D Gaming, Video Calls and File Sharing to the Web

From the post:

Rich activities like games and video calls were some of the last remaining challenges to prove that the Web is a capable and powerful platform for complex tasks. We conquered these challenges as part of Mozilla’s mission to advance the Web as the platform for openness, innovation and opportunity for all.

Firefox allows developers to create amazing high-performance Web applications and enables video calls and file-sharing directly in the browser, all without the need for plugins or third-party software. What has been difficult to develop on the Web before is now much easier, faster and more fun.

Mozilla described a supercharged subset of JavaScript (asm.js) that enables developers to create high-intensity applications, like 3D games and photo processing, directly on the Web without having to install additional software or use plugins. Using the Emscripten cross-compiler, which can emit asm.js, it is possible to bring high-performance native applications, like games, to the Web.

The gaming lead almost caused me to skip this item. I haven’t played a video game since Boulder Dash on the Commodore 128. 😉

But support for games also means more mundane applications, like editors, perhaps even collaborative editors, spreadsheets, graphics programs and even topic maps could be supported as well.

Programming model for supercomputers of the future

Wednesday, June 26th, 2013

Programming model for supercomputers of the future

From the post:

The demand for even faster, more effective, and also energy-saving computer clusters is growing in every sector. The new asynchronous programming model GPI from Fraunhofer ITWM might become a key building block towards realizing the next generation of supercomputers.

High-performance computing is one of the key technologies for numerous applications that we have come to take for granted – everything from Google searches to weather forecasting and climate simulation to bioinformatics requires an ever increasing amount of computing resources. Big data analysis additionally is driving the demand for even faster, more effective, and also energy-saving computer clusters. The number of processors per system has now reached the millions and looks set to grow even faster in the future. Yet something has remained largely unchanged over the past 20 years and that is the programming model for these supercomputers. The Message Passing Interface (MPI) ensures that the microprocessors in the distributed systems can communicate. For some time now, however, it has been reaching the limits of its capability.

“I was trying to solve a calculation and simulation problem related to seismic data,” says Dr. Carsten Lojewski from the Fraunhofer Institute for Industrial Mathematics ITWM. “But existing methods weren’t working. The problems were a lack of scalability, the restriction to bulk-synchronous, two-sided communication, and the lack of fault tolerance. So out of my own curiosity I began to develop a new programming model.” This development work ultimately resulted in the Global Address Space Programming Interface – or GPI – which uses the parallel architecture of high-performance computers with maximum efficiency.

GPI is based on a completely new approach: an asynchronous communication model, which is based on remote completion. With this approach, each processor can directly access all data – regardless of which memory it is on and without affecting other parallel processes. Together with Rui Machado, also from Fraunhofer ITWM, and Dr. Christian Simmendinger from T-Systems Solutions for Research, Dr. Carsten Lojewski is receiving a Joseph von Fraunhofer prize this year.

The post concludes with the observation that “…GPI is a tool for specialists….”

Rather surprising, since it was not that many years ago that Hadoop was a tool for specialists. Or that “data mining” was a tool for specialists.

In the last year both Hadoop and “data mining” have come within reach of nearly average users.

If GPI is successful for a broad range of problems, a few years will find it under the hood of any nearby cluster.

Perhaps sooner if you take an interest in it.

Where in the World is Edward Snowden?
(Ask PRISM? No. Ask Putin? Yes.)

Tuesday, June 25th, 2013

Are you surprised?

The NSA has Snowden’s name, photograph, cellphone number, names and emails of some of his supporters.

Not to mention the NSA is:

  1. Tracking all air travel and reservations.
  2. Monitoring all cell telephone traffic.
  3. Monitoring all email and Internet traffic to and from known supporters (like Julian Assange and company).
  4. Monitoring all other electronic traffic.

If PRISM can’t help confirm the location of one known individual, how is it going to locate people that are unknown?

Short answer: It’s not. No matter how much dirty data they collect. In fact, the more data they collect, the harder the task becomes.

Targeted data collection, the traditional electronic intercepts used by law enforcement, has been successful for decades on end.

Traditional law enforcement has enough sense not to try to boil the ocean when it wants a cup of tea.

Tracking Snowden is one more demonstration that widespread data collection benefits only contractors, agencies and lobbyists.

As presently collected and processed, the NSA data haystack has no value for national security.

Apache Solr Reference Guide (Solr v4.3)

Tuesday, June 25th, 2013

Apache Solr Reference Guide (Solr v4.3) by Cassandra Targett.

From the TOC page:

Getting Started: This section guides you through the installation and setup of Solr.

Using the Solr Administration User Interface: This section introduces the Solr Web-based user interface. From your browser you can view configuration files, submit queries, view logfile settings and Java environment settings, and monitor and control distributed configurations.

Documents, Fields, and Schema Design: This section describes how Solr organizes its data for indexing. It explains how a Solr schema defines the fields and field types which Solr uses to organize data within the document files it indexes.

Understanding Analyzers, Tokenizers, and Filters: This section explains how Solr prepares text for indexing and searching. Analyzers parse text and produce a stream of tokens, lexical units used for indexing and searching. Tokenizers break field data down into tokens. Filters perform other transformational or selective work on token streams.

Indexing and Basic Data Operations: This section describes the indexing process and basic index operations, such as commit, optimize, and rollback.

Searching: This section presents an overview of the search process in Solr. It describes the main components used in searches, including request handlers, query parsers, and response writers. It lists the query parameters that can be passed to Solr, and it describes features such as boosting and faceting, which can be used to fine-tune search results.

The Well-Configured Solr Instance: This section discusses performance tuning for Solr. It begins with an overview of the solrconfig.xml file, then tells you how to configure cores with solr.xml, how to configure the Lucene index writer, and more.

Managing Solr: This section discusses important topics for running and monitoring Solr. It describes running Solr in the Apache Tomcat servlet runner and Web server. Other topics include how to back up a Solr instance, and how to run Solr with Java Management Extensions (JMX).

SolrCloud: This section describes the newest and most exciting of Solr’s new features, SolrCloud, which provides comprehensive distributed capabilities.

Legacy Scaling and Distribution: This section tells you how to grow a Solr distribution by dividing a large index into sections called shards, which are then distributed across multiple servers, or by replicating a single index across multiple services.

Client APIs: This section tells you how to access Solr through various client APIs, including JavaScript, JSON, and Ruby.
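The tokenizer-then-filters pipeline in the TOC entry on analyzers is easier to picture with a toy version. The sketch below is plain Python, not Solr configuration or Lucene code; the function names and the tiny stop-word list are hypothetical and chosen only to mirror the roles the guide describes.

```python
import re


def whitespace_tokenizer(text):
    """Tokenizer: break field data into tokens on whitespace."""
    return re.findall(r"\S+", text)


def lowercase_filter(tokens):
    """Filter: transform each token to lower case."""
    return [t.lower() for t in tokens]


def strip_punctuation_filter(tokens):
    """Filter: trim leading and trailing punctuation, dropping empty tokens."""
    cleaned = (re.sub(r"^\W+|\W+$", "", t) for t in tokens)
    return [t for t in cleaned if t]


def stop_filter(tokens, stop_words=frozenset({"a", "an", "the", "and", "of"})):
    """Filter: selectively remove very common words."""
    return [t for t in tokens if t not in stop_words]


def analyze(text):
    """Analyzer: one tokenizer followed by a chain of filters."""
    tokens = whitespace_tokenizer(text)
    for token_filter in (lowercase_filter, strip_punctuation_filter, stop_filter):
        tokens = token_filter(tokens)
    return tokens


print(analyze("The Art of Indexing, and Searching!"))
# ['art', 'indexing', 'searching']
```

Running the same chain at index time and at query time is what lets a query for "searching" match a document that said "Searching!".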

Well, I know what I am going to be reading in the immediate future. 😉

ScalaDays 2013 Presentations

Tuesday, June 25th, 2013

ScalaDays 2013 Presentations

Great collection of videos from ScalaDays 2013.

I haven’t had the time to create a better listing but wanted to pass the videos along for your enjoyment.

Introducing Hoya – HBase on YARN

Tuesday, June 25th, 2013

Introducing Hoya – HBase on YARN by Steve Loughran, Devaraj Das & Eric Baldeschwieler.

From the post:

In the last few weeks, we have been getting together a prototype, Hoya, running HBase On YARN. This is driven by a few top level use cases that we have been trying to address. Some of them are:

  • Be able to create on-demand HBase clusters easily, by and/or in apps
    • With different versions of HBase potentially (for testing etc.)
  • Be able to configure different HBase instances differently
    • For example, different configs for read/write workload instances
  • Better isolation
    • Run arbitrary co-processors in user’s private cluster
    • User will own the data that the hbase daemons create
  • MR jobs should find it simple to create (transient) HBase clusters
    • For Map-side joins where table data is all in HBase, for example
  • Elasticity of clusters for analytic / batch workload processing
    • Stop / Suspend / Resume clusters as needed
    • Expand / shrink clusters as needed
  • Be able to utilize cluster resources better
    • Run MR jobs while maintaining HBase’s low latency SLAs

If you are interested in getting in on the ground floor on a promising project, here’s your chance!

True, it is an HBase cluster management project but cluster management abounds in as many subjects as any other IT management area.

Not to mention that few of us ever do just “one job,” at most places. Having multiple skills makes you more marketable.

Getting Started with Cassandra: Overview

Tuesday, June 25th, 2013

Getting Started with Cassandra: Overview by Patricia Gorla.

The start of a four-part introduction to Cassandra.

From the post:

Instead, Cassandra column families (tables) are modeled around the queries you intend to ask.
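A rough way to picture "modeled around the queries you intend to ask": write the same data into one table per query, each keyed by what that query looks up. The sketch below uses plain Python dictionaries, not Cassandra or CQL, and every name in it (`QueryFirstStore`, `posts_by_author`, `posts_by_tag`) is hypothetical.

```python
from collections import defaultdict


class QueryFirstStore:
    """Toy store with one lookup table per intended query."""

    def __init__(self):
        # Query 1: "all posts by this author" -> keyed by author.
        self.posts_by_author = defaultdict(list)
        # Query 2: "all posts with this tag" -> keyed by tag.
        self.posts_by_tag = defaultdict(list)

    def add_post(self, post_id, author, tags, title):
        """Write the same row into every table that a query will read."""
        row = {"post_id": post_id, "author": author, "title": title}
        self.posts_by_author[author].append(row)
        for tag in tags:
            self.posts_by_tag[tag].append(row)

    def titles_by_author(self, author):
        """Answer query 1 with a single key lookup, no join."""
        return [r["title"] for r in self.posts_by_author.get(author, [])]

    def titles_by_tag(self, tag):
        """Answer query 2 with a single key lookup, no join."""
        return [r["title"] for r in self.posts_by_tag.get(tag, [])]
```

The cost is duplicated writes; the gain is that each read is a lookup on one key. A query nobody planned for has no table to read from, which is the "not for every use case" part.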

Not for every use case but no technology meets every possible use case.

A start to a promising series.

Introducing Datameer 3.0 [Pushbutton Analytics]

Tuesday, June 25th, 2013

Introducing Datameer 3.0 by Stefan Groschupf.

From the post:

Today, we are doubling down on our promise of making big data analytics on Hadoop self-service and a business user function with the introduction of Smart Analytics in Datameer 3.0. You can get the full details in our press release, or on our website, but in a single sentence, we’re giving subject matter experts like doctors, marketeers, or financial analysts a way to do actual data science with simple point and clicks. What once were complex algorithms are now buttons you can click that will “automagically” identify groups, relationships, patterns, and even build recommendations based on your data. A data scientist would call what we’re empowering business users to do ‘data mining’ or ‘machine learning,’ but we aren’t building a tool for data scientists. This is Smart Analytics.

A very good example that “data mining” and “machine learning” are useful, but not on the radar of the average user.

Users have some task they want to accomplish. Whether that takes “data mining” or “machine learning” or enslaved fairies, they couldn’t care less.

The same can be said for promoting topic maps.

Subject identity, associations, etc., are interesting to a very narrow slice of the world’s population.

What is of interest to a very large slice of the world’s population is gaining some advantage over competitors or a benefit others don’t enjoy.

To the extent that subject identity and robust merging can help in those tasks, they are interested. But otherwise, not.

The Problem with RDF and Nuclear Power

Tuesday, June 25th, 2013

The Problem with RDF and Nuclear Power by Manu Sporny.

Manu starts his post:

Full disclosure: I am the chair of the RDFa Working Group, the JSON-LD Community Group, a member of the RDF Working Group, as well as other Semantic Web initiatives. I believe in this stuff, but am critical about the path we’ve been taking for a while now.


RDF shares a number of these similarities with nuclear power. RDF is one of the best data modeling mechanisms that humanity has created. Looking into the future, there is no equally-powerful, viable alternative. So, why has progress been slow on this very exciting technology? There was no public mis-information campaign, so where did this negative view of RDF come from?

In short, RDF/XML was the Semantic Web’s 3 Mile Island incident. When it was released, developers confused RDF/XML (bad) with the RDF data model (good). There weren’t enough people and time to counter-act the negative press that RDF was receiving as a result of RDF/XML and thus, we are where we are today because of this negative perception of RDF. Even Wikipedia’s page on the matter seems to imply that RDF/XML is RDF. Some purveyors of RDF think that the public perception problem isn’t that bad. I think that when developers hear RDF, they think: “Not in my back yard”.

The solution to this predicament: Stop mentioning RDF and the Semantic Web. Focus on tools for developers. Do more dogfooding.

Over the years I have become more and more agnostic towards data models.

The real question for any data model is whether it fits your requirements. What other test would you have?

When merging data held in different data models, or in data models that don’t recognize the same subject identified differently, subject identity and its management come into play.

Subject identity and its management is not an area that has only one answer for any particular problem.

Manu does have concrete suggestions that could also advance topic maps, either as a practice of subject identity or as a particular data model:

  1. The message shouldn’t be about the technology. It should be about the problems we have today and a concrete solution on how to address those problems.
  2. Demonstrate real value. Stop talking about the beauty of RDF, theoretical value, or design. Deliver production-ready, open-source software tools.
  3. Build a network of believers by spending more of your time working with Web developers and open-source projects to convince them to publish Linked Data. Dogfood our work.

A topic map version of those suggestions:

  1. The message shouldn’t be about the technology. It should be about the problems we have today and a concrete solution on how to address those problems.
  2. Demonstrate real value. Stop talking about the beauty of topic maps, theoretical value, or design. Deliver high quality content from merging diverse data sources. (Tools will take care of themselves if the content is valuable enough.)
  3. Build a network of customers by spending more of your time using topic maps to distinguish your content from content from the average web sewer.

As an information theorist I should be preaching to myself. Yes?


As the semantic impedance of the “Semantic Web,” “big data,” and the “NSA Data Cloud” increases, the opportunities for competitive, military, and industrial advantage from reliable semantic integration will increase.

Looking for showcase opportunities.


The Homotopy Type Theory Book is out!

Tuesday, June 25th, 2013

The Homotopy Type Theory Book is out! by Robert Harper.

From the post:

By now many of you have heard of the development of Homotopy Type Theory (HoTT), an extension of intuitionistic type theory that provides a natural foundation for doing synthetic homotopy theory. Last year the Institute for Advanced Study at Princeton sponsored a program on the Univalent Foundations of Mathematics, which was concerned with developing these ideas. One important outcome of the year-long program is a full-scale book presenting the main ideas of Homotopy Type Theory itself and showing how to apply them to various branches of mathematics, including homotopy theory, category theory, set theory, and constructive analysis. The book is the product of a joint effort by dozens of participants in the program, and is intended to document the state of the art as it is known today, and to encourage its further development by the participation of others interested in the topic (i.e., you!). Among the many directions in which one may take these ideas, the most important (to me) is to develop a constructive (computational) interpretation of HoTT. Some partial results in this direction have already been obtained, including fascinating work by Thierry Coquand on developing a constructive version of Kan complexes in ITT, by Mike Shulman on proving homotopy canonicity for the natural numbers in a two-dimensional version of HoTT, and by Dan Licata and me on a weak definitional canonicity theorem for a similar two-dimensional theory. Much work remains to be done to arrive at a fully satisfactory constructive interpretation, which is essential for application of these ideas to computer science. Meanwhile, though, great progress has been made on using HoTT to formulate and formalize significant pieces of mathematics in a new, and strikingly beautiful, style, that are well-documented in the book.

The book is freely available on the web in various formats, including a PDF version with active references, an ebook version suitable for your reading device, and may be purchased in hard- or soft-cover from Lulu. The book itself is open source, and is available at the Hott Book Git Hub. The book is under the Creative Commons CC BY-SA license, and will be freely available in perpetuity.

Readers may also be interested in the posts on Homotopy Type Theory, the n-Category Cafe, and Mathematics and Computation which describe more about the book and the process of its creation.

I can’t promise you that Homotopy Type Theory is going to be immediately useful in your topic map practice.

However, any theory that aims at replacing set theory (and its definitions of equality) is potentially useful for topic maps.

There are doctrines of subject equivalence far beyond simple string matches and no doubt clients who are willing to pay for them.

W3C Open Annotation: Status and Use Cases

Tuesday, June 25th, 2013

W3C Open Annotation: Status and Use Cases by Robert Sanderson and Paolo Ciccarese.

Presentation slides from OAI8: Innovations in Scholarly Communication, June 19-21 2013, Geneva, Switzerland.

For more details about the OpenAnnotation group:

Annotation, particularly if data storage becomes immutable, will become increasingly important.

Perhaps a revival of HyTime-based addressing or a robust version of XLink is in the offing.

As we have recently learned from the NSA, “web scale” data isn’t very much data at all.

Our addressing protocols should not be limited to any particular data subset.

Hadoop for Everyone: Inside Cloudera Search

Tuesday, June 25th, 2013

Hadoop for Everyone: Inside Cloudera Search by Eva Andreasson.

From the post:

CDH, Cloudera’s 100% open source distribution of Apache Hadoop and related projects, has successfully enabled Big Data processing for many years. The typical approach is to ingest a large set of a wide variety of data into HDFS or Apache HBase for cost-efficient storage and flexible, scalable processing. Over time, various tools to allow for easier access have emerged — so you can now interact with Hadoop through various programming methods and the very familiar structured query capabilities of SQL.

However, many users with less interest in programmatic interaction have been shut out of the value that Hadoop creates from Big Data. And teams trying to achieve more innovative processing struggle with a time-efficient way to interact with, and explore, the data in Hadoop or HBase.

Helping these users find the data they need without the need for Java, SQL, or scripting languages inspired integrating full-text search functionality, via Cloudera Search (currently in beta), with the powerful processing platform of CDH. The idea of using search on the same platform as other workloads is the key — you no longer have to move data around to satisfy your business needs, as data and indices are stored in the same scalable and cost-efficient platform. You can also not only find what you are looking for, but within the same infrastructure actually “do” things with your data. Cloudera Search brings simplicity and efficiency for large and growing data sets that need to enable mission-critical staff, as well as the average user, to find a needle in an unstructured haystack!

As a workload natively integrated with CDH, Cloudera Search benefits from the same security model, access to the same data pool, and cost-efficient storage. In addition, it is added to the services monitored and managed by Cloudera Manager on the cluster, providing a unified production visibility and rich cluster management – a priceless tool for any cluster admin.

In the rest of this post, I’ll describe some of Cloudera Search’s most important features.

You have heard the buzz about Cloudera Search, now get a quick list of facts and pointers to more resources!

The most significant fact?

Cloudera Search uses Apache Solr.

If you are looking for search capabilities, what more need I say?

Balisage 2013 Program Finalized

Monday, June 24th, 2013

It seems to happen every year when Balisage finalizes its program.

There is a burst of not very interesting or important stories that drive the Balisage final program off the home page at CNN.

Instead, you can read about an old lecher, the nine ride again, and who wants to go to Ecuador?

Why any of that would kick the final Balisage program off CNN’s homepage, I can’t say.

What I can say is how excellent the late additions to the program appear:

Topics added include:
  • The new W3C publishing activity
  • Marking up Changes in XML Documents
  • Comparing Document Grammars using XQuery
  • User interface styles for a web interface design framework
  • Rights metadata standards
  • A general purpose architecture for making slides from XML documents
  • Architectural forms for the 21st century

I particularly want to hear about “Architectural forms for the 21st century!”


Schedule At A Glance:
Detailed program:

Annual update released for TeX Live (2013)

Monday, June 24th, 2013

Annual update released for TeX Live

From the post:

The developers of the TeX Live distribution of LaTeX have released their annual update. However, after 17 years of development, the changes in TeX Live 2013 mostly amount to technical details.

The texmf/ directory, for example, has been merged into texmf-dist/, while the TEXMFMAIN and TEXMFDIST Kpathsea variables now point to texmf-dist. The developers have also merged several language collections for easier installation. Users will find native support for PNG output and floating-point numbers in MetaPost. LuaTeX now uses version 5.2 of Lua and includes a new library (pdfscanner) for processing external PDF data, and xdvi now uses freetype instead of t1lib for rendering.

Several updates have been made to XeTeX: HarfBuzz is now used instead of ICU for font layout and has been combined with Graphite2 to replace SilGraphite for Graphite layout; support has also been improved for OpenType.

TeX Live 2013 is open source software, licensed under a combination of the LaTeX Project Public License (LPPL) and a number of other licences. The software works on all of the major operating systems, although the program no longer runs on AIX systems using PowerPCs. Mac OS X users may want to take a look at MacTeX, which is based on – and has been updated in line with – TeX Live.

No major changes but we should be grateful for the effort that resulted in this release.

Building Distributed Systems With The Right Tools:…

Monday, June 24th, 2013

Building Distributed Systems With The Right Tools: Akka Looks Promising

From the post:

Modern day developers are building complex applications that span multiple machines. As a result, availability, scalability, and fault tolerance are important considerations that must be addressed if we are to successfully meet the needs of the business.

As developers building distributed systems, then, being aware of concepts and tools that help in dealing with these considerations is not just important – but allows us to make a significant difference to the success of the projects we work on.

One emerging tool is Akka and its clustering facilities. Shortly I’ll show a few concepts to get your mind thinking about where you could apply tools like Akka, but I’ll also show a few code samples to emphasise that these benefits are very accessible.

Code sample for this post is on github.

Why Should I Care About Akka?

Let’s start with a problem… We’re building a holidays aggregation and distribution platform. What our system does is fetch data from 200 different package providers, and distribute it to over 50 clients via ftp. This is a continuous process.

Competition in this market is fierce and clients want holidays and up-to-date availability in their systems as fast as possible – there’s a lot of money to be made on last-minute deals, and a lot of money to be lost in selling holidays that have already been sold elsewhere.

One key feature then is that the system needs to always be running – it needs high availability. Another important feature is performance – if this is to be maintained as the system grows with new providers and clients it needs to be scalable.

Just think to yourself now, how would you achieve this with the technologies you currently work with? I can’t think of too many things in the .NET world that would guide me towards highly-available, scalable applications, out of the box. There would be a lot of home-rolled infrastructure, and careful designing for scalability I suspect.

Akka Wants to Help You Solve These Problems ‘Easily’

Using Akka you don’t call methods – you send messages. This is because the programming model makes the assumption that you are building distributed, asynchronous applications. It’s just a bonus if a message gets sent and handled on the same machine.

This arises from the fact that the framework is engineered, fundamentally, to guide you into creating highly-available, scalable, fault-tolerant distributed applications…. There is no home-rolled infrastructure (you can add small bits and pieces if you need to).

Instead, with Akka you mostly focus on business logic as message flows. Check out the docs or pick up a book if you want to learn about the enabling concepts like supervision.
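The “send messages, don’t call methods” idea in the quoted post can be sketched in a few lines. What follows is not Akka’s API. It is a toy Python illustration that assumes one thread and one mailbox per actor, and the `Actor` and `AvailabilityTracker` names are hypothetical, loosely borrowed from the holiday example above.

```python
import queue
import threading

_STOP = object()  # sentinel message that tells an actor to shut down


class Actor:
    """A toy actor: private state, a mailbox, and one thread draining it."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message):
        """Fire-and-forget: enqueue the message and return immediately."""
        self._mailbox.put(message)

    def stop(self):
        """Let the actor finish the messages already queued, then wait for it."""
        self._mailbox.put(_STOP)
        self._thread.join()

    def _run(self):
        # Messages are handled one at a time, in arrival order.
        while True:
            message = self._mailbox.get()
            if message is _STOP:
                break
            self.receive(message)

    def receive(self, message):
        raise NotImplementedError


class AvailabilityTracker(Actor):
    """Hypothetical actor that keeps the latest seat count per holiday."""

    def __init__(self):
        self.available = {}  # set up state before the mailbox thread starts
        super().__init__()

    def receive(self, message):
        holiday_id, seats = message
        self.available[holiday_id] = seats


tracker = AvailabilityTracker()
tracker.send(("H-1", 4))
tracker.send(("H-2", 0))
tracker.send(("H-1", 3))
tracker.stop()
print(tracker.available)  # {'H-1': 3, 'H-2': 0}
```

The caller never touches `available` while the actor is running. All changes arrive as messages, which is what makes it possible, in a real framework, to move the actor to another machine without changing the caller. Supervision, clustering and remote delivery are exactly the parts this sketch leaves out.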

If you are contemplating a distributed topic map application, Akka should be of interest.

Work flow could result in different locations reflecting different topic map content.

Hybrid Indexes for Repetitive Datasets

Monday, June 24th, 2013

Hybrid Indexes for Repetitive Datasets by H. Ferrada, T. Gagie, T. Hirvola, and S. J. Puglisi.


Advances in DNA sequencing mean databases of thousands of human genomes will soon be commonplace. In this paper we introduce a simple technique for reducing the size of conventional indexes on such highly repetitive texts. Given upper bounds on pattern lengths and edit distances, we preprocess the text with LZ77 to obtain a filtered text, for which we store a conventional index. Later, given a query, we find all matches in the filtered text, then use their positions and the structure of the LZ77 parse to find all matches in the original text. Our experiments show this also significantly reduces query times.
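To make the LZ77 step in the abstract concrete, here is a minimal sketch of a greedy LZ77 parse in Python. It is a naive quadratic-time illustration of the parse only, not the authors’ implementation, and it does not build the filtered text or the index. The function names and the phrase encoding (a literal character or a `(source, length)` pair) are assumptions made for this example.

```python
def lz77_parse(text):
    """Greedy LZ77 factorization (naive and quadratic, for illustration only).

    Each phrase is either a single literal character or a (source, length)
    pair meaning: copy `length` characters starting at earlier position `source`.
    """
    phrases = []
    i, n = 0, len(text)
    while i < n:
        best_len, best_src = 0, -1
        for src in range(i):
            length = 0
            # A source may overlap the phrase being built, as in classic LZ77.
            while i + length < n and text[src + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, src
        if best_len == 0:
            phrases.append(text[i])  # character not seen before
            i += 1
        else:
            phrases.append((best_src, best_len))
            i += best_len
    return phrases


def lz77_decode(phrases):
    """Rebuild the original text from the phrases produced by lz77_parse."""
    out = []
    for phrase in phrases:
        if isinstance(phrase, str):
            out.append(phrase)
        else:
            src, length = phrase
            for k in range(length):
                out.append(out[src + k])  # may read characters just written
    return "".join(out)


print(lz77_parse("abababab"))  # ['a', 'b', (0, 6)]
```

The more repetitive the text, the fewer phrases the parse needs, which is the property the paper exploits: long copied phrases point back at material that has already been seen.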

Need another repetitive data set?

Have you thought about topic maps?

If there is to be any merging in a topic map there are multiple topics that represent the same subjects.

This technique may be overkill for topic maps with little merging, but if you had the endless repetition that you find in linked data versions of Wikipedia data, it would be quite useful.

That might knock down the “Some-Smallish-Number” of triples count and so would be disfavored.

On the other hand, there are other data sets with massive replication (think phone records) where fast querying could be an advantage.

Succinct data structures for representing equivalence classes

Monday, June 24th, 2013

Succinct data structures for representing equivalence classes by Moshe Lewenstein, J. Ian Munro, and Venkatesh Raman.


Given a partition of an n element set into equivalence classes, we consider time-space tradeoffs for representing it to support the query that asks whether two given elements are in the same equivalence class. This has various applications including for testing whether two vertices are in the same component in an undirected graph or in the same strongly connected component in a directed graph.

We consider the problem in several models.

– Concerning labeling schemes where we assign labels to elements and the query is to be answered just by examining the labels of the queried elements (without any extra space): if each vertex is required to have a unique label, then we show that a label space of $\sum_{i=1}^n \lfloor {n \over i} \rfloor$ is necessary and sufficient. In other words, $\lg n + \lg \lg n + O(1)$ bits of space are necessary and sufficient for representing each of the labels. This slightly strengthens the known lower bound and is in contrast to the known necessary and sufficient bound of $\lceil \lg n \rceil$ for the label length, if each vertex need not get a unique label.

– Concerning succinct data structures for the problem when the n elements are to be uniquely assigned labels from label set $\{1, 2, \ldots, n\}$, we first show that $\Theta(\sqrt n)$ bits are necessary and sufficient to represent the equivalence class information. This space includes the space for implicitly encoding the vertex labels. We can support the query in such a structure in $O(\lg n)$ time in the standard word RAM model. We then develop structures resulting in one where the queries can be supported in constant time using $O({\sqrt n} \lg n)$ bits of space. We also develop space efficient structures where union operation along with the equivalence query can be answered fast.
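The “in other words” step in the first model, from the size of the label space to the length of each label, can be checked with the standard bound on harmonic numbers. A short derivation, offered as a reader’s check and not taken from the paper:

```latex
\begin{align*}
n H_n - n \;<\; \sum_{i=1}^{n} \left\lfloor \frac{n}{i} \right\rfloor \;\le\; n H_n,
\qquad H_n = \sum_{i=1}^{n} \frac{1}{i} = \ln n + O(1), \\
\text{so}\quad \sum_{i=1}^{n} \left\lfloor \frac{n}{i} \right\rfloor = \Theta(n \lg n)
\quad\text{and}\quad
\lg \sum_{i=1}^{n} \left\lfloor \frac{n}{i} \right\rfloor = \lg n + \lg \lg n + O(1).
\end{align*}
```

The lower bound uses $\lfloor x \rfloor > x - 1$ on each of the $n$ terms. Unique labels therefore cost about $\lg \lg n$ bits more per label than the $\lceil \lg n \rceil$ bits that suffice when labels may repeat.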

On the downside, this technique would not support merging based on an arbitrary choice of properties.

On the upside, this technique does support merging based on pre-determined properties.

The latter being the more common case, I commend this article to you for a close read.
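For readers who want to experiment with merging on pre-determined properties, the equivalence query and union operation can be tried out with an ordinary disjoint-set (union-find) structure. This is a swapped-in textbook technique, not the succinct structures from the paper, and the topic names and `example.org` identifiers are made up for illustration.

```python
class DisjointSet:
    """Textbook union-find with path compression (not space-succinct)."""

    def __init__(self):
        self._parent = {}

    def find(self, item):
        """Return the representative of item's class, adding item if new."""
        self._parent.setdefault(item, item)
        root = item
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every node on the path directly at the root.
        while self._parent[item] != root:
            self._parent[item], item = root, self._parent[item]
        return root

    def union(self, a, b):
        """Merge the classes containing a and b."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same(self, a, b):
        """Equivalence query: are a and b in the same class?"""
        return self.find(a) == self.find(b)


# Hypothetical topics, each with a set of subject identifiers.
topics = {
    "t1": {"http://example.org/id/puccini"},
    "t2": {"http://example.org/id/puccini", "http://example.org/id/g-puccini"},
    "t3": {"http://example.org/id/g-puccini"},
    "t4": {"http://example.org/id/verdi"},
}

classes = DisjointSet()
first_topic_with = {}  # identifier -> first topic seen carrying it
for topic, identifiers in topics.items():
    classes.find(topic)  # register the topic even if it never merges
    for identifier in identifiers:
        if identifier in first_topic_with:
            classes.union(first_topic_with[identifier], topic)
        else:
            first_topic_with[identifier] = topic

print(classes.same("t1", "t3"))  # True: linked through t2
print(classes.same("t1", "t4"))  # False
```

The merging rule (a shared subject identifier) is fixed before the structure is built, which matches the pre-determined case. Changing the rule means rebuilding the classes from scratch.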