Archive for the ‘Aggregation’ Category

Exploring and Visualizing Pre-Topic Map Data

Monday, November 2nd, 2015

AggreSet: Rich and Scalable Set Exploration using Visualizations of Element Aggregations by M. Adil Yalçın, Niklas Elmqvist, and Benjamin B. Bederson.


Datasets commonly include multi-value (set-typed) attributes that describe set memberships over elements, such as genres per movie or courses taken per student. Set-typed attributes describe rich relations across elements, sets, and the set intersections. Increasing the number of sets results in a combinatorial growth of relations and creates scalability challenges. Exploratory tasks (e.g. selection, comparison) have commonly been designed in separation for set-typed attributes, which reduces interface consistency. To improve on scalability and to support rich, contextual exploration of set-typed data, we present AggreSet. AggreSet creates aggregations for each data dimension: sets, set-degrees, set-pair intersections, and other attributes. It visualizes the element count per aggregate using a matrix plot for set-pair intersections, and histograms for set lists, set-degrees and other attributes. Its non-overlapping visual design is scalable to numerous and large sets. AggreSet supports selection, filtering, and comparison as core exploratory tasks. It allows analysis of set relations including subsets, disjoint sets and set intersection strength, and also features perceptual set ordering for detecting patterns in set matrices. Its interaction is designed for rich and rapid data exploration. We demonstrate results on a wide range of datasets from different domains with varying characteristics, and report on expert reviews and a case study using student enrollment and degree data with assistant deans at a major public university.
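The aggregations the paper names (per-set counts, the set-degree histogram, the set-pair intersection matrix) are easy to sketch. A minimal Python illustration with made-up movies/genres data (my example, not the authors'):

```python
from collections import Counter
from itertools import combinations

# Hypothetical set-typed data: genres per movie.
movies = {
    "Alien":   {"Horror", "SciFi"},
    "Heat":    {"Crime", "Drama"},
    "Seven":   {"Crime", "Drama", "Horror"},
    "Gravity": {"SciFi", "Drama"},
    "Psycho":  {"Horror"},
}

# Element count per set (the histograms over set lists).
set_sizes = Counter(g for genres in movies.values() for g in genres)

# Set-degree histogram: how many sets each element belongs to.
degree_hist = Counter(len(genres) for genres in movies.values())

# Element count per set-pair intersection (the matrix plot).
pair_counts = Counter()
for genres in movies.values():
    for a, b in combinations(sorted(genres), 2):
        pair_counts[(a, b)] += 1

print(set_sizes["Horror"])              # 3: three movies are Horror
print(degree_hist[2])                   # 3: three movies have exactly 2 genres
print(pair_counts[("Crime", "Drama")])  # 2: Crime/Drama co-occur in 2 movies
```

The point of AggreSet is that each of these aggregate views stays linear in the number of sets and elements, so the interface scales where an explicit set-intersection diagram would not.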

These two videos will give you a better overview of AggreSet than I can. The first one is about 30 seconds and the second one about 5 minutes.

The visualization of characters from Les Misérables (the second video) is a dynamite demonstration of how you could explore pre-topic map data with an eye towards creating roles and associations between characters as well as with the text.

The first use case that pops to mind would be harvesting fan posts on Harry Potter and crossing them with a similar listing of characters from the Harry Potter book series, with author, date, book, character, etc. relationships.

While you are at the GitHub site, be sure to bounce up a level to Keshif:

Keshif is a web-based tool that lets you browse and understand datasets easily.

To start using Keshif:

  • Get the source code from github,
  • Explore the existing datasets and their source code, and
  • Check out the wiki.

Or just go directly to the Keshif site, with 110 datasets (as of today).

For the impatient, see Loading Data.

For the even more impatient:

You can load data to Keshif from:

  • Google Sheets
  • Text File
    • On Google Drive
    • On Dropbox
    • File on your webserver

Text File Types

Keshif can be used with the following data file types:

  • CSV / TSV
  • JSON
  • XML
  • Any other file type that you can load and parse in JavaScript. See Custom Data Loading

Hint: The dataset explorer on the front page indexes demos by file type and resource. Filter by data source to find example source code showing how to apply a specific file loading approach.

The critical factor, in addition to its obvious usefulness, is that it works in a web browser. You don’t have to install software, set Java paths, download additional libraries, etc.

Are you using the modern web browser as your target for user-facing topic map applications?

I first saw this in a tweet by Christophe Lalanne.

New Spatial Aggregation Tutorial for GIS Tools for Hadoop

Sunday, March 29th, 2015

New Spatial Aggregation Tutorial for GIS Tools for Hadoop by Sarah Ambrose.

Sarah motivates you to learn about spatial aggregation, aka spatial binning, with two visualizations of New York taxi data:

No aggregation vs. aggregated: (the post's two visualizations of the taxi data are not reproduced here)
Now that I have your attention, ;-), from the post:

This spatial analysis ability is available using the GIS Tools for Hadoop. There is a new tutorial posted that takes you through the steps of aggregating taxi data. You can find the tutorial here.
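Spatial binning itself is simple: snap each point to a grid cell and count. A toy version of what the tutorial does with taxi pickups, in plain Python (the tutorial itself uses Hive and the GIS Tools for Hadoop; the coordinates below are made up):

```python
from collections import Counter

def bin_points(points, cell_size):
    """Aggregate (lon, lat) points into square bins of cell_size degrees."""
    counts = Counter()
    for lon, lat in points:
        cell = (int(lon // cell_size), int(lat // cell_size))
        counts[cell] += 1
    return counts

# Hypothetical pickups around Manhattan; the first two share a cell.
pickups = [(-73.98, 40.752), (-73.97, 40.758),
           (-73.99, 40.733), (-74.012, 40.712)]
bins = bin_points(pickups, cell_size=0.05)
print(max(bins.values()))   # 2: the densest cell holds two pickups
```

At Hadoop scale the same cell key becomes the group-by key for a reduce step; the visualization is then just a heat map of the counts.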


NewsStand: A New View on News (+ Underwear Down Under)

Saturday, March 28th, 2015

NewsStand: A New View on News by Benjamin E. Teitler, et al.


News articles contain a wealth of implicit geographic content that if exposed to readers improves understanding of today’s news. However, most articles are not explicitly geotagged with their geographic content, and few news aggregation systems expose this content to users. A new system named NewsStand is presented that collects, analyzes, and displays news stories in a map interface, thus leveraging on their implicit geographic content. NewsStand monitors RSS feeds from thousands of online news sources and retrieves articles within minutes of publication. It then extracts geographic content from articles using a custom-built geotagger, and groups articles into story clusters using a fast online clustering algorithm. By panning and zooming in NewsStand’s map interface, users can retrieve stories based on both topical significance and geographic region, and see substantially different stories depending on position and zoom level.

Of particular interest to topic map fans:

NewsStand’s geotagger must deal with three problematic cases in disambiguating terms that could be interpreted as locations: geo/non-geo ambiguity, where a given phrase might refer to a geographic location, or some other kind of entity; aliasing, where multiple names refer to the same geographic location, such as “Los Angeles” and “LA”; and geographic name ambiguity or polysemy, where a given name might refer to any of several geographic locations. For example, “Springfield” is the name of many cities in the USA, and thus it is a challenge for disambiguation algorithms to associate with the correct location.

Unless you want to hand disambiguate all geographic references in your sources, this paper merits a close read!
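To see why the aliasing and polysemy cases need context, here is a crude gazetteer lookup in Python. This is my illustration, not NewsStand's algorithm, and the entries and populations are invented:

```python
# Toy gazetteer: canonical name -> (aliases, population).
GAZETTEER = {
    "Los Angeles, CA": ({"Los Angeles", "LA"}, 3_900_000),
    "Springfield, IL": ({"Springfield"}, 114_000),
    "Springfield, MA": ({"Springfield"}, 155_000),
}

def resolve(name, context=()):
    """Prefer a candidate matching a context cue (e.g. a state abbreviation
    nearby in the article), else fall back on the common largest-population
    heuristic."""
    candidates = [loc for loc, (aliases, _) in GAZETTEER.items()
                  if name in aliases]
    for loc in candidates:
        if any(cue in loc for cue in context):
            return loc
    return max(candidates, key=lambda loc: GAZETTEER[loc][1], default=None)

print(resolve("LA"))                    # Los Angeles, CA  (alias resolved)
print(resolve("Springfield"))           # Springfield, MA  (largest population)
print(resolve("Springfield", ("IL",)))  # Springfield, IL  (context wins)
```

The geo/non-geo case ("Reading the paper" vs. Reading, PA) is the hard one, and it is where the paper's custom geotagger earns its keep.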

BTW, the paper dates from 2008 and I saw it in a tweet by Kirk Borne, where Kirk pointed to a recent version of NewsStand. Well, sort of “recent.” The latest story I could find was 490 days ago, a tweet from CBS News about the 50th anniversary of the Kennedy assassination in Dallas.

Undaunted, I checked out TwitterStand, but it seems to suffer from the same staleness of content, although it is difficult to tell because links don’t lead to the tweets.

Finally I did try PhotoStand, which judging from the pop-up information on the images, is quite current.

I noticed for Perth, Australia, “A special section of the exhibition has been dedicated to famous dominatrix Madame Lash.”

Sadly this appears to be one the algorithm got wrong, so members of Congress should not book their travel arrangements just yet.

Sarah Carty for Daily Mail Australia reports in From modest bloomers to racy corsets: New exhibition uncovers the secret history of women’s underwear… including a unique collection from dominatrix Madam Lash:

From the modesty of bloomers to the seductiveness of lacy corsets, a new exhibition gives us a rare glimpse into the most intimate and private parts of history.

The Powerhouse Museum in Sydney have unveiled their ‘Undressed: 350 Years of Underwear in Fashion’ collection, which features undergarments from the 17th-century to more modern garments worn by celebrities such as Emma Watson, Cindy Crawford and even Queen Victoria.

Apart from a brief stint in Bendigo and Perth, the collection has never been seen by any members of the public before and lead curator Edwina Ehrman believes people will be both shocked and intrigued by what’s on display.

So the collection was once shown in Perth, but for airline reservations you had best book for Sydney.

And no, I won’t leave you without the necessary details:

Undressed: 350 Years of Underwear in Fashion opens at the Powerhouse Museum on March 28 and runs until 12 July 2015. Tickets can be bought here.

Ticket prices do not include transportation expenses to Sydney.

Spoiler alert: The exhibition page says:

Please note that photography is not permitted in this exhibition.



CUBE and ROLLUP: Two Pig Functions That Every Data Scientist Should Know

Tuesday, February 11th, 2014

CUBE and ROLLUP: Two Pig Functions That Every Data Scientist Should Know by Joshua Lande.

From the post:

I recently found two incredible functions in Apache Pig called CUBE and ROLLUP that every data scientist should know. These functions can be used to compute multi-level aggregations of a data set. I found the documentation for these functions to be confusing, so I will work through a simple example to explain how they work.

Joshua starts his post with a demonstration of using GROUP BY in Pig for simple aggregations. That sets the stage for demonstrating how important CUBE and ROLLUP can be for data aggregation in Pig.
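If you don't have Pig handy, the semantics are easy to emulate: ROLLUP aggregates at each prefix of the grouping columns, while CUBE aggregates over every subset of them. A sketch in Python (hypothetical sales data, not Joshua's example):

```python
from collections import defaultdict

rows = [  # (state, city, sales): made-up data
    ("CA", "SF", 10), ("CA", "LA", 20), ("NY", "NYC", 30),
]

def rollup(rows):
    """Totals for (state, city), (state,), and the grand total ()."""
    totals = defaultdict(int)
    for state, city, sales in rows:
        for key in [(state, city), (state,), ()]:
            totals[key] += sales
    return dict(totals)

def cube(rows):
    """Totals for every subset of the grouping columns (None = 'all')."""
    totals = defaultdict(int)
    for state, city, sales in rows:
        for key in [(state, city), (state, None), (None, city), (None, None)]:
            totals[key] += sales
    return dict(totals)

print(rollup(rows)[("CA",)])     # 30: all CA sales
print(cube(rows)[(None, "LA")])  # 20: LA sales across all states
print(rollup(rows)[()])          # 60: grand total
```

Pig computes all of these levels in a single pass, which is exactly why the two operators beat issuing one GROUP BY per level.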

Interesting possibilities suggest themselves by the time you finish Joshua’s posting.

I first saw this in a tweet by Dmitriy Ryaboy.

Open Source Release: java-hll

Friday, January 3rd, 2014

Open Source Release: java-hll

From the post:

We’re happy to announce our newest open-source project, java-hll, a HyperLogLog implementation in Java that is storage-compatible with the previously released postgresql-hll and js-hll implementations. And as the rule of three dictates, we’ve also extracted the storage specification that makes them interoperable into its own repository. Currently, all three implementations support reading storage specification v1.0.0, while only the PostgreSQL and Java implementations fully support writing v1.0.0. We hope to bring the JS implementation up to speed, with respect to serialization, shortly.

For reasons to be excited, see my HyperLogLog post archive.
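If HyperLogLog is new to you: it estimates the number of distinct values in a stream using a few kilobytes, by recording, per register, the maximum number of leading zero bits seen in hashed values. A bare-bones Python sketch of the idea (real implementations like java-hll add bias correction and sparse representations, which this omits):

```python
import hashlib

def _hash64(value):
    """Deterministic 64-bit hash for the sketch."""
    return int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")

class TinyHLL:
    def __init__(self, p=10):            # 2**p registers
        self.p, self.m = p, 2 ** p
        self.registers = [0] * self.m

    def add(self, value):
        x = _hash64(value)
        idx = x >> (64 - self.p)          # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        return alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)

hll = TinyHLL()
for i in range(100_000):
    hll.add(i)
print(round(hll.cardinality()))   # close to 100,000, within a few percent
```

With 1,024 one-byte registers the standard error is about 1.04/√1024 ≈ 3%, which is the whole trade the data structure offers: tiny, mergeable state for approximate distinct counts.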

ElasticSearch 1.0.0.Beta2 released

Monday, December 2nd, 2013

ElasticSearch 1.0.0.Beta2 released by Clinton Gormley.

From the post:

Today we are delighted to announce the release of elasticsearch 1.0.0.Beta2, the second beta release on the road to 1.0.0 GA. The new features we have planned for 1.0.0 have come together more quickly than we expected, and this beta release is chock full of shiny new toys. Christmas has come early!

We have added:

Please download elasticsearch 1.0.0.Beta2, try it out, break it, figure out what is missing and tell us about it. Our next release will focus on cleaning up inconsistent APIs and usability, plus fixing any bugs that are reported in the new functionality, so your early bug reports are an important part of ensuring that 1.0.0 GA is solid.

WARNING: This is a beta release – it is not production ready, features are not set in stone and may well change in the next version, and once you have made any changes to your data with this release, it will no longer be readable by older versions!

Suggestion: Pay close attention to the documentation on the new aggregation capabilities.

For example:

There are many different types of aggregations, each with its own purpose and output. To better understand these types, it is often easier to break them into two main families:

Bucketing: A family of aggregations that build buckets, where each bucket is associated with a key and a document criterion. When the aggregation is executed, each bucket’s criterion is evaluated against every document in the context and, when it matches, the document is considered to “fall in” the relevant bucket. By the end of the aggregation process, we’ll end up with a list of buckets – each one with a set of documents that “belong” to it.

Metric: Aggregations that keep track and compute metrics over a set of documents

The interesting part comes next: since each bucket effectively defines a document set (all documents belonging to the bucket), one can potentially associate aggregations at the bucket level, and those will execute within the context of that bucket. This is where the real power of aggregations kicks in: aggregations can be nested!
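The bucket/metric split, and the nesting, can be mimicked in a few lines of Python, which may make the documentation easier to follow (this is an illustration of the concept, not the elasticsearch API; the documents are made up):

```python
from collections import defaultdict

docs = [  # hypothetical e-commerce documents
    {"color": "red",  "make": "honda",  "price": 10},
    {"color": "red",  "make": "toyota", "price": 20},
    {"color": "blue", "make": "toyota", "price": 15},
]

def terms_bucket(docs, field):
    """Bucketing aggregation: one bucket per distinct field value."""
    buckets = defaultdict(list)
    for d in docs:
        buckets[d[field]].append(d)
    return buckets

def avg_metric(docs, field):
    """Metric aggregation: a single number computed over a document set."""
    return sum(d[field] for d in docs) / len(docs)

# Nesting: run the metric inside each bucket's narrowed context.
result = {color: avg_metric(ds, "price")
          for color, ds in terms_bucket(docs, "color").items()}
print(result)   # {'red': 15.0, 'blue': 15.0}
```

Nesting a bucketing aggregation inside another bucket (color, then make, then an average) composes the same way, which is exactly what facets could not do.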

Interesting, yes?

[A]ggregation is really hard

Tuesday, November 5th, 2013

The quote in full reads:

The reason that big data hasn’t taken economics by storm is that aggregation is really hard.

As near as I can tell, the quote originates from Klint Finley.

Let’s assume aggregation is “really hard,” which of necessity means it is also expensive.

It might be nice, useful, interesting, etc., to have “all” the available data for a subject organized together, but what is the financial (or other) incentive to make it so?

And that incentive has to outweigh the expense of reaching that goal. By some stated margin.

Suggestions on how to develop such a metric?

Aggregation Options on Big Data Sets Part 1… [MongoDB]

Friday, August 23rd, 2013

Aggregation Options on Big Data Sets Part 1: Basic Analysis using a Flights Data Set by Daniel Alabi and Sweet Song, MongoDB Summer Interns.

From the post:

Flights Dataset Overview

This is the first of three blog posts from this summer internship project showing how to answer questions concerning big datasets stored in MongoDB using MongoDB’s frameworks and connectors.

The first dataset explored was a domestic flights dataset. The Bureau of Transportation Statistics provides information for every commercial flight from 1987, but we narrowed down our project to focus on the most recent available data for the past year (April 2012-March 2013).

We were particularly attracted to this dataset because it contains a lot of fields that are well suited for manipulation using the MongoDB aggregation framework.

To get started, we wanted to answer a few basic questions concerning the dataset:

  1. When is the best time of day/day of week/time of year to fly to minimize delays?
  2. What types of planes suffer the most delays? How old are these planes?
  3. How often does a delay cascade into other flight delays?
  4. What was the effect of Hurricane Sandy on air transportation in New York? How quickly did the state return to normal?

A series of blog posts to watch!

I thought the comment:

We were particularly attracted to this dataset because it contains a lot of fields that are well suited for manipulation using the MongoDB aggregation framework.

was remarkably honest.

The Department of Transportation Table/Field guide reveals that the fields are mostly populated by codes, IDs and date/time values.

Values that lend themselves to easy aggregation.

Looking forward to harder aggregation examples as this series develops.
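For what it's worth, question 1 above reduces, in aggregation-framework terms, to a group-by over a date field plus an average. A plain-Python sketch of the same shape (hypothetical records, not their dataset):

```python
from collections import defaultdict

flights = [  # (day_of_week, departure_delay_minutes): hypothetical
    ("Mon", 5), ("Mon", 15), ("Fri", 30), ("Fri", 50), ("Sun", 0),
]

sums = defaultdict(lambda: [0, 0])   # day -> [total_delay, flight_count]
for day, delay in flights:
    sums[day][0] += delay
    sums[day][1] += 1

avg_delay = {day: total / n for day, (total, n) in sums.items()}
best_day = min(avg_delay, key=avg_delay.get)
print(avg_delay)   # {'Mon': 10.0, 'Fri': 40.0, 'Sun': 0.0}
print(best_day)    # Sun
```

In MongoDB's pipeline this is a `$group` on the day field with a `$avg` accumulator followed by a `$sort`, which is why code/ID/date fields make such comfortable material for the framework.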

Aggregation Module – Phase 1 – Functional Design (ElasticSearch #3300)

Friday, July 12th, 2013

Aggregation Module – Phase 1 – Functional Design (ElasticSearch Issue #3300)

From the post:

The new aggregations module is due for the elasticsearch 1.0 release, and aims to serve as the next generation replacement for the functionality we currently refer to as “faceting”. Facets currently provide a great way to aggregate data within a document set context. This context is defined by the executed query in combination with the different levels of filters that are defined (filtered queries, top level filters, and facet level filters). Although powerful as is, the current facets implementation was not designed from the ground up to support complex aggregations and is thus limited. The main problem with the current implementation stems from the fact that facets are hard coded to work on one level and that the different types of facets (which account for the different types of aggregations we support) cannot be mixed and matched dynamically at query time. It is not possible to compose facets out of other facets, and the user is effectively bound to the top level aggregations that we defined and nothing more than that.

The goal with the new aggregations module is to break the barriers the current facet implementation puts in place. The new name (“Aggregations”) also indicates the intention here – a generic yet extremely powerful framework for defining aggregations – any type of aggregation. The idea here is to have each aggregation defined as a “standalone” aggregation that can perform its task within any context (as a top level aggregation or embedded within other aggregations that can potentially narrow its computation scope). We would like to take all the knowledge and experience we’ve gained over the years working with facets and apply it when building the new framework.


If you have been following the discussion about “what would we do differently with topic maps” in the XTM group at LinkedIn, this will be of interest.

What is an aggregation if it is not a selection of items matching some criteria, which you can then “merge” together for presentation to a user?

Or “merge” together for further querying?

That is inconsistent with the imperative programming model of the TMDM, but it has the potential to open up distributed and parallel processing of topic maps.

Same paradigm but with greater capabilities.

Open Source Release: postgresql-hll

Saturday, May 25th, 2013

Open Source Release: postgresql-hll

From the post:

We’re happy to announce the first open-source release of AK’s PostgreSQL extension for building and manipulating HyperLogLog data structures in SQL, postgresql-hll. We are releasing this code under the Apache License, Version 2.0 which we feel is an excellent balance between permissive usage and liability limitation.

What is it and what can I do with it?

The extension introduces a new data type, hll, which represents a probabilistic distinct value counter that is a hybrid between a HyperLogLog data structure (for large cardinalities) and a simple set (for small cardinalities). These structures support the basic HLL methods: insert, union, and cardinality, and we’ve also provided aggregate and debugging functions that make using and understanding these things a breeze. We’ve also included a way to do schema versioning of the binary representations of hlls, which should allow a clear path to upgrading the algorithm, as new engineering insights come up.

A quick overview of what’s included in the release:

  • C-based extension that provides the hll data structure and algorithms
  • Austin Appleby’s MurmurHash3 implementation and SQL-land wrappers for integer numerics, bytes, and text
  • Full storage specification in STORAGE.markdown
  • Full function reference in REFERENCE.markdown
  • .spec file for rpmbuild
  • Full test suite

A quick note on why we included MurmurHash3 in the extension: we’ve done a good bit of research on the importance of a good hash function when using sketching algorithms like HyperLogLog and we came to the conclusion that it wouldn’t be very user-friendly to force the user to figure out how to get a good hash function into SQL-land. Sure, there are plenty of cryptographic hash functions available, but those are (computationally) overkill for what is needed. We did the research and found MurmurHash3 to be an excellent non-cryptographic hash function in both theory and practice. We’ve been using it in production for a while now with excellent results. As mentioned in the README, it’s of crucial importance to reliably hash the inputs to hlls.

Would you say topic maps aggregate data?

I thought so.

How would you adapt HLL to synonymous identifications?

I ask because of this line in the post:

Essentially, we’re doing interactive set-intersection of operands with millions or billions of entries in milliseconds. This is intersection computed using the inclusion-exclusion principle as applied to hlls:

Performing “…interactive set-intersection of operands with millions or billions of entries in milliseconds…” sounds like an attractive feature for topic map software.
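Since hlls support only insert, union, and cardinality, the intersection is derived arithmetic: |A ∩ B| = |A| + |B| − |A ∪ B|. With exact Python sets standing in for the hll sketches (my simplification; with real hlls every term is an estimate, so the intersection inherits the error of all three):

```python
# Exact sets stand in for hll sketches here.
a = set(range(0, 6000))          # |A| = 6000
b = set(range(4000, 10000))      # |B| = 6000

union_card = len(a | b)          # with hlls: union(a, b).cardinality()
intersection = len(a) + len(b) - union_card
print(intersection)              # 2000, matching len(a & b)
```

The inclusion-exclusion error compounds as you intersect more operands, which is why the post stresses hash quality and register sizing before trusting those millisecond answers.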


Real-Time Data Aggregation [White Paper Registration Warning]

Tuesday, April 30th, 2013

Real-Time Data Aggregation by Caroline Lim.

From the post:

Fast response times generate cost savings and greater revenue. Enterprise data architectures are incomplete unless they can ingest, analyze, and react to data in real-time as it is generated. While previously inaccessible or too complex — scalable, affordable real-time solutions are now finally available to any enterprise.

Infochimps Cloud::Streams

Read Infochimps’ newest whitepaper on how Infochimps Cloud::Streams is a proprietary stream processing framework based on four years of experience with sourcing and analyzing both bulk and in-motion data sources. It offers a linearly scalable and fault-tolerant stream processing engine that leverages a number of well-proven web-scale solutions built by Twitter and LinkedIn engineers, with an emphasis on enterprise-class scalability, robustness, and ease of use.

The price of this whitepaper is disclosure of your contact information.

Annoying considering the lack of substantive content about the solution. The use cases are mildly interesting but admit to any number of similar solutions.

If you need real-time data aggregation, skip the white paper and contact your IT consultant/vendor. (Including Infochimps, who do very good work, which is why a non-substantive white paper is so annoying.)

Aggregate Skyline Join Queries: Skylines with Aggregate Operations over Multiple Relations

Tuesday, January 29th, 2013

Aggregate Skyline Join Queries: Skylines with Aggregate Operations over Multiple Relations by Arnab Bhattacharya and B. Palvali Teja.
(Submitted on 28 Jun 2012)


The multi-criteria decision making, which is possible with the advent of skyline queries, has been applied in many areas. Though most of the existing research is concerned with only a single relation, several real world applications require finding the skyline set of records over multiple relations. Consequently, the join operation over skylines where the preferences are local to each relation, has been proposed. In many of those cases, however, the join often involves performing aggregate operations among some of the attributes from the different relations. In this paper, we introduce such queries as “aggregate skyline join queries”. Since the naive algorithm is impractical, we propose three algorithms to efficiently process such queries. The algorithms utilize certain properties of skyline sets, and process the skylines as much as possible locally before computing the join. Experiments with real and synthetic datasets exhibit the practicality and scalability of the algorithms with respect to the cardinality and dimensionality of the relations.

The authors illustrate a “skyline” query with a search for a hotel that has a good price and is close to the beach. A “skyline” set of hotels excludes any hotel that is beaten on both criteria by some hotel in the set. They then observe:

In real applications, however, there often exists a scenario when a single relation is not sufficient for the application, and the skyline needs to be computed over multiple relations [16]. For example, consider a flight database. A person traveling from city A to city B may use stopovers, but may still be interested in flights that are cheaper, have a less overall journey time, better ratings and more amenities. In this case, a single relation specifying all direct flights from A to B may not suffice or may not even exist. The join of multiple relations consisting of flights starting from A and those ending at B needs to be processed before computing the preferences.

The above problem becomes even more complex if the person is interested in the travel plan that optimizes both on the total cost as well as the total journey time for the two flights (other than the ratings and amenities of each airline). In essence, the skyline now needs to be computed on attributes that have been aggregated from multiple relations in addition to attributes whose preferences are local within each relation. The common aggregate operations are sum, average, minimum, maximum, etc.

No doubt the travel industry thinks it has conquered semantic diversity in travel arrangements. If it has, it happened after I stopped traveling several years ago.

Even simple tasks such as coordination of air and train schedules was unnecessarily difficult.

I suspect that is still the case and so mention “skyline” queries as a topic to be aware of and if necessary, to include in a topic map application that brings sanity to travel arrangements.

True, you can get a travel service that handles all the details, but only for a price and only if you are that trusting.
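The skyline itself is just the Pareto-optimal subset, and a naive version fits in a few lines (the paper's contribution is doing this efficiently across joins and aggregated attributes, which this sketch does not attempt; the hotel data is invented):

```python
def dominates(x, y):
    """x dominates y if it is no worse in every criterion and strictly
    better in at least one. Both criteria are minimized here:
    (price, distance_to_beach)."""
    return (all(a <= b for a, b in zip(x, y))
            and any(a < b for a, b in zip(x, y)))

def skyline(points):
    """Keep the points no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

hotels = [(100, 2.0), (80, 3.0), (120, 1.0),
          (90, 2.5), (150, 2.5), (85, 2.5)]
print(skyline(hotels))
# (90, 2.5) and (150, 2.5) drop out: (85, 2.5) beats both on price
```

This naive version is quadratic in the number of records, and computing it after a join multiplies the input, which is exactly why the authors push the skyline computation into each relation before joining.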

MongoDB 2.2 Released [Aggregation News – Expiring Data From Merges?]

Thursday, August 30th, 2012

MongoDB 2.2 Released

From the post:

We are pleased to announce the release of MongoDB version 2.2. This release includes over 1,000 new features, bug fixes, and performance enhancements, with a focus on improved flexibility and performance. For additional details on the release:

Of particular interest to topic map fans:

Aggregation Framework

The Aggregation Framework is available in its first production-ready release as of 2.2. The aggregation framework makes it easier to manipulate and process documents inside of MongoDB, without needing to use Map Reduce, or separate application processes for data manipulation.

See the aggregation documentation for more information.

The H Open also mentions TTL (time to live), which can remove documents from collections.

MongoDB documentation: Expire Data from Collections by Setting TTL.

Have you considered “expiring” data from merges?
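The mechanism is a single index definition; in the 2.2 shell it looks something like the line below, where the collection and field names are of course placeholders of mine:

```javascript
// Documents whose createdAt is more than an hour old are removed by a
// background thread (which runs roughly once a minute, not instantly).
db.merges.ensureIndex({ createdAt: 1 }, { expireAfterSeconds: 3600 })
```

Stamp each merge result with a creation time and the database quietly ages your merges out, which is one answer to the question above.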

Building LinkedIn’s Real-time Activity Data Pipeline

Thursday, August 16th, 2012

Building LinkedIn’s Real-time Activity Data Pipeline by Ken Goodhope, Joel Koshy, Jay Kreps, Neha Narkhede, Richard Park, Jun Rao, and Victor Yang Ye. (pdf)


One trend in the implementation of modern web systems is the use of activity data in the form of log or event messages that capture user and server activity. This data is at the heart of many internet systems in the domains of advertising, relevance, search, recommendation systems, and security, as well as continuing to fulfill its traditional role in analytics and reporting. Many of these uses place real-time demands on data feeds. Activity data is extremely high volume and real-time pipelines present new design challenges. This paper discusses the design and engineering problems we encountered in moving LinkedIn’s data pipeline from a batch-oriented file aggregation mechanism to a real-time publish-subscribe system called Kafka. This pipeline currently runs in production at LinkedIn and handles more than 10 billion message writes each day with a sustained peak of over 172,000 messages per second. Kafka supports dozens of subscribing systems and delivers more than 55 billion messages to these consumer processes each day. We discuss the origins of this system, missteps on the path to real-time, and the design and engineering problems we encountered along the way.

More details on Kafka (see Choking Cassandra Bolt).

What if you think about message feeds as being pipelines that are large enough to see and configure?

Chip level pipelines are more efficient but harder to configure.

Perhaps passing messages is efficient and flexible enough for a class of use cases.
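A publish-subscribe pipeline at "visible" scale reduces to a few lines, which is roughly the mental model Kafka's topics and consumer offsets generalize (a sketch of the idea only; Kafka adds partitioning, persistence, and replication on top):

```python
from collections import defaultdict

class MiniBroker:
    """An append-only log per topic; each consumer tracks its own offset."""
    def __init__(self):
        self.logs = defaultdict(list)
        self.offsets = defaultdict(int)   # (topic, consumer) -> position

    def publish(self, topic, message):
        self.logs[topic].append(message)

    def consume(self, topic, consumer):
        pos = self.offsets[(topic, consumer)]
        new = self.logs[topic][pos:]
        self.offsets[(topic, consumer)] = len(self.logs[topic])
        return new

broker = MiniBroker()
broker.publish("page-views", {"user": 1, "url": "/jobs"})
broker.publish("page-views", {"user": 2, "url": "/home"})
print(len(broker.consume("page-views", "analytics")))  # 2
print(len(broker.consume("page-views", "analytics")))  # 0: offset advanced
```

Keeping the log and letting each subscriber manage its own position is the design choice that lets dozens of subscribing systems read the same feed at different speeds.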

The Case for Curation: The Relevance of Digest and Citator Results in Westlaw and Lexis

Wednesday, July 25th, 2012

The Case for Curation: The Relevance of Digest and Citator Results in Westlaw and Lexis by Susan Nevelow Mart and Jeffrey Luftig.


Humans and machines are both involved in the creation of legal research resources. For legal information retrieval systems, the human-curated finding aid is being overtaken by the computer algorithm. But human-curated finding aids still exist. One of them is the West Key Number system. The Key Number system’s headnote classification of case law, started back in the nineteenth century, was and is the creation of humans. The retrospective headnote classification of the cases in Lexis’s case databases, started in 1999, was created primarily although not exclusively with computer algorithms. So how do these two very different systems deal with a similar headnote from the same case, when they link the headnote to the digesting and citator functions in their respective databases? This paper continues an investigation into this question, looking at the relevance of results from digest and citator search run on matching headnotes in ninety important federal and state cases, to see how each performs. For digests, where the results are curated – where a human has made a judgment about the meaning of a case and placed it in a classification system – humans still have an advantage. For citators, where algorithm is battling algorithm to find relevant results, it is a matter of the better algorithm winning. But no one algorithm is doing a very good job of finding all the relevant results; the overlap between the two citator systems is not that large. The lesson for researchers: know how your legal research system was created, what involvement, if any, humans had in the curation of the system, and what a researcher can and cannot expect from the system you are using.

A must read for library students and legal researchers.

For legal research, the authors conclude:

The intervention of humans as curators in online environments is being recognized as a way to add value to an algorithm’s results, in legal research tools as well as web-based applications in other areas. Humans still have an edge in predicting which cases are relevant. And the intersection of human curation and algorithmically-generated data sets is already well underway. More curation will improve the quality of results in legal research tools, and most particularly can be used to address the algorithmic deficit that still seems to exist where analogic reasoning is needed. So for legal research, there is a case for curation. [footnotes omitted]

The distinction between curation (human gathering of relevant material) and aggregation (machine gathering of potentially relevant material) looks quite useful.

Curation anyone?

I first saw this at Legal Informatics.

Implementing Aggregation Functions in MongoDB

Tuesday, June 26th, 2012

Implementing Aggregation Functions in MongoDB by Arun Viswanathan and Shruthi Kumar.

From the post:

With the amount of data that organizations generate exploding from gigabytes to terabytes to petabytes, traditional databases are unable to scale up to manage such big data sets. Using these solutions, the cost of storing and processing data will significantly increase as the data grows. This is resulting in organizations looking for other economical solutions such as NoSQL databases that provide the required data storage and processing capabilities, scalability and cost effectiveness. NoSQL databases do not use SQL as the query language. There are different types of these databases such as document stores, key-value stores, graph database, object database, etc.

Typical use cases for NoSQL database includes archiving old logs, event logging, ecommerce application log, gaming data, social data, etc. due to its fast read-write capability. The stored data would then require to be processed to gain useful insights on customers and their usage of the applications.

The NoSQL database we use in this article is MongoDB which is an open source document oriented NoSQL database system written in C++. It provides a high performance document oriented storage as well as support for writing MapReduce programs to process data stored in MongoDB documents. It is easily scalable and supports auto partitioning. Map Reduce can be used for aggregation of data through batch processing. MongoDB stores data in BSON (Binary JSON) format, supports a dynamic schema and allows for dynamic queries. The Mongo Query Language is expressed as JSON and is different from the SQL queries used in an RDBMS. MongoDB provides an Aggregation Framework that includes utility functions such as count, distinct and group. However more advanced aggregation functions such as sum, average, max, min, variance and standard deviation need to be implemented using MapReduce.

This article describes the method of implementing common aggregation functions like sum, average, max, min, variance and standard deviation on a MongoDB document using its MapReduce functionality. Typical applications of aggregations include business reporting of sales data such as calculation of total sales by grouping data across geographical locations, financial reporting, etc.
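The accumulator pattern the article builds in JavaScript can be sketched in pure Python, no MongoDB required: the map step emits a count, sum, and sum of squares per key, the reduce step adds the partial results, and the finalize step derives mean, variance, and standard deviation. The documents and field names below are invented for illustration.

```python
import math

# Toy documents, standing in for a MongoDB collection.
docs = [{"price": 10}, {"price": 20}, {"price": 30}]

def map_fn(doc):
    # Emit count, sum, and sum of squares under a single key,
    # mirroring what a MapReduce map function would emit.
    v = doc["price"]
    return ("price", {"count": 1, "sum": v, "sumsq": v * v})

def reduce_fn(values):
    # Combine partial results; this must be associative so MongoDB
    # can apply it to any grouping of emitted values.
    out = {"count": 0, "sum": 0, "sumsq": 0}
    for v in values:
        out["count"] += v["count"]
        out["sum"] += v["sum"]
        out["sumsq"] += v["sumsq"]
    return out

def finalize_fn(acc):
    # Derive the aggregate statistics from the combined accumulators.
    mean = acc["sum"] / acc["count"]
    variance = acc["sumsq"] / acc["count"] - mean * mean
    return {"mean": mean, "variance": variance, "stddev": math.sqrt(variance)}

emitted = [map_fn(d)[1] for d in docs]
result = finalize_fn(reduce_fn(emitted))
```

The same count/sum/sum-of-squares trio is what the article's JavaScript functions carry through MongoDB's map-reduce machinery.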

Not terribly advanced but enough to get you started with creating aggregation functions.

Includes “testing” of the aggregation functions that are written in the article.

If Python is more your cup of tea, see: Aggregation in MongoDB (Part 1) and Aggregation in MongoDB (Part 2).

OData Extensions for Data Aggregation

Friday, June 15th, 2012

OData Extensions for Data Aggregation by Chris Webb.

Chris writes:

I was just reading the following blog post on the OASIS OData Technical Committee Call for Participation:

…when I saw this:

In addition to the core OData version 3.0 protocol found here, the Technical Committee will be defining some key extensions in the first version of the OASIS Standard:

OData Extensions for Data Aggregation – Business Intelligence provides the ability to get the right set of aggregated results from large data warehouses. OData Extensions for Analytics enable OData to support Business Intelligence by allowing services to model data analytic “cubes” (dimensions, hierarchies, measures) and consumers to query aggregated data

Follow the link in the quoted text – it’s very interesting reading! Here’s just one juicy quote:

You have to go to Chris’ post to see the “juicy quote.” 😉

With more data becoming available, at higher speeds, data aggregation is going to be the norm.

Some people will do it well. Some people will do it not so well.

Which one will describe you?

Participation in the OData TC at OASIS may help shape that answer:

First meeting details:

The first meeting of the Technical Committee will be a face-to-face meeting to be held in Redmond, Washington on July 26-27, 2012 from 9 AM PT to 5 PM PT. This meeting will be sponsored by Microsoft. Dial-in conference calling bridge numbers will be available for those unable to attend in person.

At least the meeting is on a Thursday/Friday slot! Any comments on the weather to expect in late July?

Real-time Analytics with HBase [Aggregation is a form of merging.]

Monday, June 11th, 2012

Real-time Analytics with HBase

From the post:

Here are slides from another talk we gave at both Berlin Buzzwords and at HBaseCon in San Francisco last month. In this presentation Alex describes one approach to real-time analytics with HBase, which we use at Sematext via HBaseHUT. If you like these slides you will also like HBase Real-time Analytics Rollbacks via Append-based Updates.

The slides come in a long and short version. Both are very good but I suggest the long version.

I particularly liked the “Background: pre-aggregation” slide (8 in the short version, 9 in the long version).

Aggregation as a form of merging.

What information is lost as part of aggregation? (That assumes we know the aggregation process. Without that, we can’t say what is lost.)

What information (read subjects/relationships) do we want to preserve through an aggregation process?

What properties should those subjects/relationships have?

(Those are topic map design/modeling questions.)

Using MongoDB’s New Aggregation Framework in Python (MongoDB Aggregation Part 2)

Monday, June 4th, 2012

Using MongoDB’s New Aggregation Framework in Python (MongoDB Aggregation Part 2) by Rick Copeland.

From the post:

Continuing on in my series on MongoDB and Python, this article will explore the new aggregation framework introduced in MongoDB 2.1. If you’re just getting started with MongoDB, you might want to read the previous articles in the series first:

And now that you’re all caught up, let’s jump right in….

Why a new framework?

If you’ve been following along with this article series, you’ve been introduced to MongoDB’s mapreduce command, which up until MongoDB 2.1 has been the go-to aggregation tool for MongoDB. (There’s also the group() command, but it’s really no more than a less-capable and un-shardable version of mapreduce(), so we’ll ignore it here.) So if you already have mapreduce() in your toolbox, why would you ever want something else?

Mapreduce is hard; let’s go shopping

The first motivation behind the new framework is that, while mapreduce() is a flexible and powerful abstraction for aggregation, it’s really overkill in many situations, as it requires you to re-frame your problem into a form that’s amenable to calculation using mapreduce(). For instance, when I want to calculate the mean value of a property in a series of documents, trying to break that down into appropriate map, reduce, and finalize steps imposes some extra cognitive overhead that we’d like to avoid. So the new aggregation framework is (IMO) simpler.
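To make the contrast concrete: computing a mean with map-reduce takes three coordinated functions, while the equivalent $group/$avg stage is a one-liner. Here is that single stage hand-evaluated in plain Python (field names are assumed) rather than sent to MongoDB.

```python
# The aggregation-framework pipeline for a mean looks roughly like
#   [{"$group": {"_id": None, "avg": {"$avg": "$value"}}}]
# Evaluating that one $group stage by hand over toy documents:
docs = [{"value": 2}, {"value": 4}, {"value": 9}]

def group_avg(documents, field):
    total = sum(d[field] for d in documents)
    return {"_id": None, "avg": total / len(documents)}

result = group_avg(docs, "value")
```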

Other than the obvious utility of the new aggregation framework in MongoDB, there is another reason to mention this post: You should use only as much aggregation or in topic map terminology, “merging,” as you need.

It isn’t possible to create a system that will correctly aggregate/merge all possible content. Take that as a given.

In part because new semantics are emerging every day and there are too many previous semantics that are poorly documented or unknown.

What we can do is establish requirements for particular semantics for given tasks and document those to facilitate their possible re-use in the future.

Aggregation in MongoDB (Part 1)

Monday, June 4th, 2012

Aggregation in MongoDB (Part 1) by Rick Copeland.

From the post:

In some previous posts on mongodb and python, pymongo, and gridfs, I introduced the NoSQL database MongoDB, how to use it from Python, and how to use it to store large (more than 16 MB) files. Here, I’ll be showing you a few of the features that the current (2.0) version of MongoDB includes for performing aggregation. In a future post, I’ll give you a peek into the new aggregation framework included in MongoDB version 2.1.

An index “aggregates” information about a subject (called an ‘entry’), where the information is traditionally found between the covers of a book.

MongoDB offers predefined as well as custom “aggregations,” where the information field can be larger than a single book.

Good introduction to aggregation in MongoDB, although you (and I) really should get around to reading the MongoDB documentation.

City Dashboard: Aggregating All Spatial Data for Cities in the UK

Saturday, April 28th, 2012

City Dashboard: Aggregating All Spatial Data for Cities in the UK

You need to try this out for yourself before reading the rest of this post.

Go ahead, I’ll wait…, …, …, ok.

To some extent this “aggregation” may reflect on the sort of questions we ask users about topic maps.

It’s possible to aggregate data about any number of things. But even if you could, would you want to?

Take the “aggregation” for Birmingham, UK, this evening. One of the components informed me that a choir director had been arrested for rape. That concerns the choir director a good deal, but why would it interest me?

Isn’t that the problem of aggregation? The definition of “useful” aggregation varies from person to person, even task to task.

Try London while you are at the site. There is a Slightly Unhappier/Significantly Unhappier “Mood” indicator. It has what turns out to be a countdown timer for the next reset of the indicator.

I thought the changing count reflected people becoming more and more unhappy.

Looked like London was going to “flatline” while I was watching. 😉

Fortunately, that turned out not to be the case.

There are dangers to personalization but aggregation without relevance just pumps up the noise.

Not sure that helps either.


The joy of algorithms and NoSQL revisited: the MongoDB Aggregation Framework

Thursday, February 9th, 2012

The joy of algorithms and NoSQL revisited: the MongoDB Aggregation Framework by Davy Suvee.

From the post:

Part 1 of this article describes the use of MongoDB to implement the computation of molecular similarities. Part 2 discusses the refactoring of this solution by making use of MongoDB’s built-in map-reduce functionality to improve overall performance. Part 3, finally, illustrates the use of the new MongoDB Aggregation Framework, which boosts performance beyond the capabilities of the map-reduce implementation.

In part 1 of this article, I described the use of MongoDB to solve a specific Chemoinformatics problem, namely the computation of molecular similarities through Tanimoto coefficients. When employing a low target Tanimoto coefficient however, the number of returned compounds increases exponentially, resulting in a noticeable data transfer overhead. To circumvent this problem, part 2 of this article describes the use of MongoDB’s built-in map-reduce functionality to perform the Tanimoto coefficient calculation local to where the compound data is stored. Unfortunately, the execution of these map-reduce algorithms through JavaScript is rather slow, and a performance improvement can only be achieved when multiple shards are employed within the same MongoDB cluster.

Recently, MongoDB introduced its new Aggregation Framework. This framework provides a more simple solution to calculating aggregate values instead of relying upon the powerful map-reduce constructs. With just a few simple primitives, it allows you to compute, group, reshape and project documents that are contained within a certain MongoDB collection. The remainder of this article describes the refactoring of the map-reduce algorithm to make optimal use of the new MongoDB Aggregation Framework. The complete source code can be found on the Datablend public GitHub repository.
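The quantity at the heart of the series is easy to state outside MongoDB. A minimal sketch, with binary fingerprints represented as sets of their "on" bit positions (the fingerprints themselves are made up):

```python
# Tanimoto coefficient of two binary fingerprints, each given as the
# set of its set-bit positions: |A ∩ B| / |A ∪ B|.
def tanimoto(a, b):
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

fp1 = {1, 2, 3, 4}        # hypothetical compound fingerprints
fp2 = {2, 3, 4, 5, 6}
similarity = tanimoto(fp1, fp2)   # 3 shared bits / 6 distinct bits
```

The articles' point is not the formula but where it runs: computing this inside the database, next to the compound data, avoids shipping exponentially many candidate compounds over the wire.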

Does it occur to you that aggregation results in one or more aggregates? And if we are presented with one or more aggregates, we could persist those aggregates and add properties to them. Or have relationships between aggregates. Or point to occurrences of aggregates.

Kristina Chodorow demonstrated use of aggregation in MongoDB in Hacking Chess with the MongoDB Pipeline for analysis of chess games. Rather than summing the number of games in which the move “e4” is the first move for White, links to all 696 games could be treated as occurrences of that subject. That would support discovery of the player of White as well as of Black.

Think of aggregation as a flexible means for merging information about subjects and their relationships. (Blind interchange requires more but this is a step in the right direction.)

The Comments Conundrum

Sunday, February 5th, 2012

The Comments Conundrum by Kristina Chodorow.

From the post:

One of the most common questions I see about MongoDB schema design is:

I have a collection of blog posts and each post has an array of comments. How do I get…
…all comments by a given author
…the most recent comments
…the most popular commenters?

And so on. The answer to this has always been “Well, you can’t do that on the server side…” You can either do it on the client side or store comments in their own collection. What you really want is the ability to treat embedded documents like a “real” collection.

The aggregation pipeline gives you this ability by letting you “unwind” arrays into separate documents, then doing whatever else you need to do in subsequent pipeline operators.
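A rough pure-Python stand-in for what $unwind does, assuming a hypothetical posts-with-embedded-comments schema: each array element becomes its own document, after which "all comments by a given author" is an ordinary filter over the flattened documents.

```python
# A minimal stand-in for the $unwind stage: each element of the named
# array becomes its own document, which later stages ($match, $group,
# $sort) can then treat like any other document.
def unwind(documents, field):
    out = []
    for doc in documents:
        for item in doc.get(field, []):
            flat = dict(doc)      # shallow copy of the parent document
            flat[field] = item    # replace the array with one element
            out.append(flat)
    return out

posts = [
    {"title": "A", "comments": [{"author": "kristina"}, {"author": "mike"}]},
    {"title": "B", "comments": [{"author": "kristina"}]},
]
flat = unwind(posts, "comments")
by_author = [d for d in flat if d["comments"]["author"] == "kristina"]
```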

Kristina continues her coverage of the aggregation pipeline in MongoDB.

Question: What is the result of an aggregation? (In a topic map sense?)

Hacking Chess with the MongoDB Pipeline

Thursday, February 2nd, 2012

Hacking Chess with the MongoDB Pipeline

Kristina Chodorow* writes:

MongoDB’s new aggregation framework is now available in the nightly build! This post demonstrates some of its capabilities by using it to analyze chess games.

Make sure you have the “Development Release (Unstable)” nightly running before trying out the stuff in this post. The aggregation framework will be in 2.1.0, but as of this writing it’s only in the nightly build.

First, we need some chess games to analyze. Download games.json, which contains 1132 games that were won in 10 moves or less (crush their soul and do it quick).

You can use mongoimport to import games.json into MongoDB.

If you think of this example of “aggregation” as merging where the subjects have a uniform identifier (chess piece/move), you will understand why I find this interesting.

Aggregation, as is shown by Kristina’s post, can form the basis for analysis of data.

Analysis that isn’t possible in the absence of aggregation (read merging).
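As a toy version of that analysis, here is first-move counting for White over a few invented game records (the real games.json uses its own richer document shape):

```python
from collections import Counter

# Count how often each first move for White appears -- the kind of
# $group/$sum aggregation the pipeline runs over the 1132 games.
games = [
    {"moves": ["e4", "e5", "Qh5"]},
    {"moves": ["e4", "e6", "d4"]},
    {"moves": ["d4", "d5", "c4"]},
]
first_moves = Counter(g["moves"][0] for g in games)
```

Summing gives you the count; keeping the links to the underlying games instead gives you the occurrences I am arguing for above.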

I am looking forward to additional posts on the aggregation framework and need to drop by the MongoDB project to see what the future holds for aggregation/merging.

*Kristina is the author of two O’Reilly titles, MongoDB: The Definitive Guide and Scaling MongoDB.

Aggregation and Restructuring data (from “R in Action”)

Tuesday, January 10th, 2012

Aggregation and Restructuring data (from “R in Action”) by Dr. Robert I. Kabacoff.

From the post:

R provides a number of powerful methods for aggregating and reshaping data. When you aggregate data, you replace groups of observations with summary statistics based on those observations. When you reshape data, you alter the structure (rows and columns) determining how the data is organized. This article describes a variety of methods for accomplishing these tasks.

We’ll use the mtcars data frame that’s included with the base installation of R. This dataset, extracted from Motor Trend magazine (1974), describes the design and performance characteristics (number of cylinders, displacement, horsepower, mpg, and so on) for 34 automobiles. To learn more about the dataset, see help(mtcars).
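The core operation the excerpt describes, replacing groups of observations with summary statistics, fits in a few lines of Python as well. The rows below are illustrative stand-ins for mtcars records:

```python
from collections import defaultdict

# Mean mpg per cylinder count, the shape of R's
# aggregate(mpg ~ cyl, data=mtcars, FUN=mean). Values are invented.
rows = [
    {"cyl": 4, "mpg": 30.0},
    {"cyl": 4, "mpg": 26.0},
    {"cyl": 8, "mpg": 15.0},
    {"cyl": 8, "mpg": 13.0},
]
groups = defaultdict(list)
for r in rows:
    groups[r["cyl"]].append(r["mpg"])
mean_mpg = {cyl: sum(v) / len(v) for cyl, v in groups.items()}
```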

How do you recognize what data you want to aggregate or transpose?

Or communicate that knowledge to future users?

The data set for Motor Trend magazine is an easy one.

If you have access to the electronic text for Motor Trend magazine (one or more issues) for 1974, drop me a line. I am thinking of a way to illustrate the “semantic” problem.

SQL to MongoDB: An Updated Mapping

Saturday, December 17th, 2011

SQL to MongoDB: An Updated Mapping from Kristina Chodorow.

From the post:

The aggregation pipeline code has finally been merged into the main development branch and is scheduled for release in 2.2. It lets you combine simple operations (like finding the max or min, projecting out fields, taking counts or averages) into a pipeline of operations, making a lot of things that were only possible by using MapReduce doable with a “normal” query.

In celebration of this, I thought I’d re-do the very popular MySQL to MongoDB mapping using the aggregation pipeline, instead of MapReduce.
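A taste of the mapping, with the pipeline expressed as plain Python dicts and its single $group stage evaluated by hand (the collection and field names here are hypothetical, not Kristina's examples):

```python
from collections import Counter

# SQL:  SELECT author, COUNT(*) FROM posts GROUP BY author
# As an aggregation pipeline, that becomes one $group stage:
pipeline = [{"$group": {"_id": "$author", "count": {"$sum": 1}}}]

# Hand-rolled evaluation of that stage over toy documents:
posts = [{"author": "a"}, {"author": "b"}, {"author": "a"}]
counts = Counter(p["author"] for p in posts)
result = [{"_id": author, "count": n} for author, n in counts.items()]
```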

If you are interested in MongoDB-based solutions, this will be very interesting.


QL.IO – A declarative, data-retrieval and aggregation gateway for quickly consuming HTTP APIs

Thursday, December 8th, 2011

QL.IO – A declarative, data-retrieval and aggregation gateway for quickly consuming HTTP APIs.

From the about page:

A SQL and JSON inspired DSL

SQL is quite a powerful DSL to retrieve, filter, project, and join data — see efforts like A co-Relational Model of Data for Large Shared Data Banks, LINQ, YQL, or unQL for examples. ql.io combines SQL, JSON, and a few procedural-style constructs into a compact language. Scripts written in this language can make HTTP requests to retrieve data, perform joins between API responses, project responses, or even make requests in a loop. But note that ql.io’s scripting language is not SQL – it is SQL inspired.


Most real-world client apps need to mash up data from multiple APIs in one go. Data mashup is often complicated, as client apps need to worry about order of requests, inter-dependencies, error handling, and parallelization to reduce overall latency. ql.io’s scripts are procedural in appearance but are executed out of order based on dependencies. Some statements may be scheduled in parallel and some in series, based on a dependency analysis done at script compile time. The compilation is an on-the-fly process.

Consumer Centric Interfaces

APIs are designed for reuse, and hence they cater to the common denominator. Getting new fields added, optimizing responses, or combining multiple requests into one involves drawn-out negotiations between API-producing teams and API-consuming teams. ql.io lets API-consuming teams move fast by creating consumer-centric interfaces that are optimized for the client – such optimized interfaces can reduce bandwidth usage and the number of HTTP requests.

I can believe the “SQL inspired” part since it looks like keys/column headers are opaque. That is, you can specify a key/column header, but you cannot specify the identity of the subject it represents.

So, if you don’t know the correct term, you are SOL. Which isn’t the state of being inspired.

Still, it looks like an interesting effort that could develop to be non-opaque with regard to keys and possibly values. (The next stage: how do you learn what properties a subject representative has, for the purpose of subject recognition?)

SEISA: set expansion by iterative similarity aggregation

Friday, April 1st, 2011

SEISA: set expansion by iterative similarity aggregation by Yeye He, University of Wisconsin-Madison, Madison, WI, USA, and Dong Xin, Microsoft Research, Redmond, WA, USA.

In this paper, we study the problem of expanding a set of given seed entities into a more complete set by discovering other entities that also belong to the same concept set. A typical example is to use “Canon” and “Nikon” as seed entities, and derive other entities (e.g., “Olympus”) in the same concept set of camera brands. In order to discover such relevant entities, we exploit several web data sources, including lists extracted from web pages and user queries from a web search engine. While these web data are highly diverse with rich information that usually cover a wide range of the domains of interest, they tend to be very noisy. We observe that previously proposed random walk based approaches do not perform very well on these noisy data sources. Accordingly, we propose a new general framework based on iterative similarity aggregation, and present detailed experimental results to show that, when using general-purpose web data for set expansion, our approach outperforms previous techniques in terms of both precision and recall.
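The aggregation idea can be caricatured in a few lines: score each candidate entity by how many (noisy) web lists it shares with a seed entity, and promote the top scorer. The real SEISA framework iterates and weights similarities far more carefully; the lists below are invented.

```python
# Toy "similarity aggregation" for set expansion.
seeds = {"canon", "nikon"}
web_lists = [
    {"canon", "nikon", "olympus"},
    {"canon", "olympus", "pentax"},
    {"nikon", "olympus"},
    {"canon", "toaster"},          # noisy list
]

def score(candidate):
    # Aggregate evidence: count lists where the candidate co-occurs
    # with at least one seed entity.
    return sum(1 for lst in web_lists
               if candidate in lst and lst & seeds)

candidates = set().union(*web_lists) - seeds
best = max(candidates, key=score)
```

Even this crude scorer picks "olympus" over the noise, which is the behavior the iterative version strengthens round by round.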

To the uses of set expansion mentioned by the authors:

Set expansion systems are of practical importance and can be used in various applications. For instance, web search engines may use the set expansion tools to create a comprehensive entity repository (for, say, brand names of each product category), in order to deliver better results to entity-oriented queries. As another example, the task of named entity recognition can also leverage the results generated by set expansion tools [13]

I would add:

  • augmented authoring of navigation tools for text corpora
  • discovery of related entities (for associations)

While the authors concentrate on web-based documents, which for the most part are freely available, the techniques shown here could be just as easily applied to commercial texts or used to generate pay-for-view results.

It would really have to be a step up to get people to pay a premium for navigation of free content, but given the noisy nature of most information sites, that is certainly possible.

Unified analysis of streaming news

Thursday, March 31st, 2011

Unified analysis of streaming news by Amr Ahmed, Qirong Ho, Jacob Eisenstein, and Eric Xing of Carnegie Mellon University, Pittsburgh, USA, and Alexander J. Smola and Choon Hui Teo of Yahoo! Research, Santa Clara, CA, USA.

News clustering, categorization and analysis are key components of any news portal. They require algorithms capable of dealing with dynamic data to cluster, interpret and to temporally aggregate news articles. These three tasks are often solved separately. In this paper we present a unified framework to group incoming news articles into temporary but tightly-focused storylines, to identify prevalent topics and key entities within these stories, and to reveal the temporal structure of stories as they evolve. We achieve this by building a hybrid clustering and topic model. To deal with the available wealth of data we build an efficient parallel inference algorithm by sequential Monte Carlo estimation. Time and memory costs are nearly constant in the length of the history, and the approach scales to hundreds of thousands of documents. We demonstrate the efficiency and accuracy on the publicly available TDT dataset and data of a major internet news site.

From the article:

Such an approach combines the strengths of clustering and topic models. We use topics to describe the content of each cluster, and then we draw articles from the associated story. This is a more natural fit for the actual process of how news is created: after an event occurs (the story), several journalists write articles addressing various aspects of the story. While their vocabulary and their view of the story may differ, they will by necessity agree on the key issues related to a story (at least in terms of their vocabulary). Hence, to analyze a stream of incoming news we need to infer a) which (possibly new) cluster could have generated the article and b) which topic mix describes the cluster best.

I single out that part of the paper to note the tension in it: the authors first say that the journalists’ vocabulary and view of a story may differ, then in the next breath say that they will by necessity agree on the vocabulary for a story’s key issues.

Given the success of their results, it may be that news reporting is more homogeneous in its vocabulary than other forms of writing?

Perhaps news compression, where duplicated content is suppressed but the “fact” of reportage is retained, could make an interesting topic map.

Provenance for Aggregate Queries

Friday, January 7th, 2011

Provenance for Aggregate Queries by Yael Amsterdamer, Daniel Deutch, and Val Tannen.


We study in this paper provenance information for queries with aggregation. Provenance information was studied in the context of various query languages that do not allow for aggregation, and recent work has suggested to capture provenance by annotating the different database tuples with elements of a commutative semiring and propagating the annotations through query evaluation. We show that aggregate queries pose novel challenges rendering this approach inapplicable. Consequently, we propose a new approach, where we annotate with provenance information not just tuples but also the individual values within tuples, using provenance to describe the values computation. We realize this approach in a concrete construction, first for “simple” queries where the aggregation operator is the last one applied, and then for arbitrary (positive) relational algebra queries with aggregation; the latter queries are shown to be more challenging in this context. Finally, we use aggregation to encode queries with difference, and study the semantics obtained for such queries on provenance annotated databases.
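To fix intuition (this is a drastic simplification of the paper's semiring machinery): annotate each tuple, and let an aggregate value remember which annotated tuples contributed to it.

```python
# Toy provenance for an aggregate: a SUM that carries along the
# annotations ("t1", "t2", ...) of the tuples it was computed from.
# The paper annotates individual values with semiring elements; here
# the annotation is just a list of source-tuple tags.
rows = [
    ("apples", 3, "t1"),
    ("apples", 4, "t2"),
    ("pears", 5, "t3"),
]

def sum_with_provenance(rows, key):
    total, prov = 0, []
    for k, v, tag in rows:
        if k == key:
            total += v
            prov.append(tag)   # record which tuple fed the aggregate
    return total, prov

total, prov = sum_with_provenance(rows, "apples")
```

For merging, the analogous move is obvious: a merged topic should be able to say which source topics it was aggregated from.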

Not reading for the faint of heart.

But, provenance for merging is one obvious application of this paper.

For that matter, provenance should also be a consideration for TMQL.