Archive for the ‘Data Streams’ Category

Streaming 101 & 102 – [Stream Processing with Batch Identities?]

Sunday, February 21st, 2016

The world beyond batch: Streaming 101 by Tyler Akidau.

From part 1:

Streaming data processing is a big deal in big data these days, and for good reasons. Amongst them:

  • Businesses crave ever more timely data, and switching to streaming is a good way to achieve lower latency.
  • The massive, unbounded data sets that are increasingly common in modern business are more easily tamed using a system designed for such never-ending volumes of data.
  • Processing data as they arrive spreads workloads out more evenly over time, yielding more consistent and predictable consumption of resources.

Despite this business-driven surge of interest in streaming, the majority of streaming systems in existence remain relatively immature compared to their batch brethren, which has resulted in a lot of exciting, active development in the space recently.

Since I have quite a bit to cover, I’ll be splitting this across two separate posts:

  1. Streaming 101: This first post will cover some basic background information and clarify some terminology before diving into details about time domains and a high-level overview of common approaches to data processing, both batch and streaming.
  2. The Dataflow Model: The second post will consist primarily of a whirlwind tour of the unified batch + streaming model used by Cloud Dataflow, facilitated by a concrete example applied across a diverse set of use cases. After that, I’ll conclude with a brief semantic comparison of existing batch and streaming systems.

The world beyond batch: Streaming 102

In this post, I want to focus further on the data-processing patterns from last time, but in more detail, and within the context of concrete examples. The arc of this post will traverse two major sections:

  • Streaming 101 Redux: A brief stroll back through the concepts introduced in Streaming 101, with the addition of a running example to highlight the points being made.
  • Streaming 102: The companion piece to Streaming 101, detailing additional concepts that are important when dealing with unbounded data, with continued use of the concrete example as a vehicle for explaining them.

By the time we’re finished, we’ll have covered what I consider to be the core set of principles and concepts required for robust out-of-order data processing; these are the tools for reasoning about time that truly get you beyond classic batch processing.
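As a rough illustration of what “beyond classic batch” means, here is a toy sketch (not Dataflow/Beam code) of event-time windowing with a naive watermark, where a late, out-of-order event is folded into an already-emitted window as an accumulating refinement:

```python
from collections import defaultdict

WINDOW = 60  # fixed window size, in seconds of event time

def window_start(event_time):
    return event_time - (event_time % WINDOW)

def process(events):
    """Group (event_time, value) pairs into event-time windows, emitting a
    window's sum once the watermark passes its end; late events refine the
    already-emitted result."""
    open_windows = defaultdict(int)  # window start -> running sum
    results = {}                     # emitted (or refined) window sums
    watermark = 0                    # naive watermark: max event time seen
    for event_time, value in events:
        start = window_start(event_time)
        if start + WINDOW <= watermark:
            # late event: its window already closed, so refine the result
            results[start] = results.get(start, 0) + value
        else:
            open_windows[start] += value
        watermark = max(watermark, event_time)
        # emit every window whose end the watermark has passed
        for s in [s for s in open_windows if s + WINDOW <= watermark]:
            results[s] = results.get(s, 0) + open_windows.pop(s)
    results.update(open_windows)  # flush windows still open at end of input
    return results

# The event at t=10 arrives out of order, after the watermark reached 70.
print(process([(5, 1), (70, 2), (10, 3), (130, 4)]))  # {0: 4, 60: 2, 120: 4}
```

A real system would use a heuristic watermark and configurable triggers rather than “max event time seen,” but the shape of the reasoning is the same.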

You should also catch the paper by Tyler and others, The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing.

Cloud Dataflow, known as Beam at the Apache incubator, offers a variety of operations for combining and/or merging collections of values in data.

I mention that because I would hate to hear of you doing stream processing with batch identities. You know, where you decide on some fixed set of terms and those are applied across dynamic data.

Hmmm, fixed terms applied to dynamic data. Doesn’t even sound right, does it?

Sometimes, fixed terms (read schema, ontology) are fine but in linguistically diverse environments (read real life), that isn’t always adequate.

Enjoy the benefits of stream processing but don’t artificially limit them with batch identities.

I first saw this in a tweet by Bob DuCharme.

Architectural Patterns for Near Real-Time Data Processing with Apache Hadoop

Monday, June 1st, 2015

Architectural Patterns for Near Real-Time Data Processing with Apache Hadoop by Ted Malaska.

From the post:

Evaluating which streaming architectural pattern is the best match to your use case is a precondition for a successful production deployment.

The Apache Hadoop ecosystem has become a preferred platform for enterprises seeking to process and understand large-scale data in real time. Technologies like Apache Kafka, Apache Flume, Apache Spark, Apache Storm, and Apache Samza are increasingly pushing the envelope on what is possible. It is often tempting to bucket large-scale streaming use cases together but in reality they tend to break down into a few different architectural patterns, with different components of the ecosystem better suited for different problems.

In this post, I will outline the four major streaming patterns that we have encountered with customers running enterprise data hubs in production, and explain how to implement those patterns architecturally on Hadoop.

Streaming Patterns

The four basic streaming patterns (often used in tandem) are:

  • Stream ingestion: Involves low-latency persisting of events to HDFS, Apache HBase, and Apache Solr.
  • Near Real-Time (NRT) Event Processing with External Context: Takes actions like alerting, flagging, transforming, and filtering of events as they arrive. Actions might be taken based on sophisticated criteria, such as anomaly detection models. Common use cases, such as NRT fraud detection and recommendation, often demand low latencies under 100 milliseconds.
  • NRT Event Partitioned Processing: Similar to NRT event processing, but deriving benefits from partitioning the data—like storing more relevant external information in memory. This pattern also requires processing latencies under 100 milliseconds.
  • Complex Topology for Aggregations or ML: The holy grail of stream processing: gets real-time answers from data with a complex and flexible set of operations. Here, because results often depend on windowed computations and require more active data, the focus shifts from ultra-low latency to functionality and accuracy.

In the following sections, we’ll get into recommended ways for implementing such patterns in a tested, proven, and maintainable way.
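The second pattern above, NRT event processing with external context, can be sketched in a few lines; the account profiles and threshold here are invented for illustration, standing in for a real anomaly-detection model:

```python
# Hypothetical in-memory "external context": per-account spending profiles.
CONTEXT = {"acct-1": {"avg_spend": 50.0}, "acct-2": {"avg_spend": 400.0}}

def flag_events(events, multiplier=3.0):
    """Flag events whose amount exceeds `multiplier` times the account's
    historical average spend, as each event arrives."""
    alerts = []
    for event in events:
        profile = CONTEXT.get(event["account"])
        if profile and event["amount"] > multiplier * profile["avg_spend"]:
            alerts.append(event["id"])
    return alerts

events = [
    {"id": "e1", "account": "acct-1", "amount": 40.0},
    {"id": "e2", "account": "acct-1", "amount": 500.0},  # 10x the average
    {"id": "e3", "account": "acct-2", "amount": 900.0},
]
print(flag_events(events))  # ['e2']
```

Keeping the context in memory, rather than looking it up per event, is what makes the sub-100-millisecond latencies mentioned above plausible.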

Great post on patterns for near real-time data processing.

What I have always wondered is how much of a use case there really is for “near real-time processing” of data. If human decision makers are in the loop, that is, outside of ecommerce and algorithmic trading, what is the value-add of “near real-time processing” of data?

For example, Kai Wähner in Real-Time Stream Processing as Game Changer in a Big Data World with Hadoop and Data Warehouse gives the following as common use cases for “near real-time processing” of data:

  • Network monitoring
  • Intelligence and surveillance
  • Risk management
  • E-commerce
  • Fraud detection
  • Smart order routing
  • Transaction cost analysis
  • Pricing and analytics
  • Market data management
  • Algorithmic trading
  • Data warehouse augmentation

Ecommerce, smart order routing, algorithmic trading all fall into the category of no human involved so those may need real-time processing.

But take network monitoring for example. From the news reports I understand that hackers had free run of the Sony network for months. You must have network monitoring in the first place before real-time network monitoring can be useful.

I would probe to make sure that “real-time” is necessary for the use cases at hand before simply assuming it. In smaller organizations, demands for access to data and “real-time” results are more often a symptom of control issues than of any actual use case for the data.

Open Challenges for Data Stream Mining Research

Friday, October 3rd, 2014

Open Challenges for Data Stream Mining Research, SIGKDD Explorations, Volume 16, Number 1, June 2014.


Every day, huge volumes of sensory, transactional, and web data are continuously generated as streams, which need to be analyzed online as they arrive. Streaming data can be considered as one of the main sources of what is called big data. While predictive modeling for data streams and big data have received a lot of attention over the last decade, many research approaches are typically designed for well-behaved controlled problem settings, overlooking important challenges imposed by real-world applications. This article presents a discussion on eight open challenges for data stream mining. Our goal is to identify gaps between current research and meaningful applications, highlight open problems, and define new application-relevant research directions for data stream mining. The identified challenges cover the full cycle of knowledge discovery and involve such problems as: protecting data privacy, dealing with legacy systems, handling incomplete and delayed information, analysis of complex data, and evaluation of stream algorithms. The resulting analysis is illustrated by practical applications and provides general suggestions concerning lines of future research in data stream mining.

Under entity stream mining, the authors describe the challenge of aggregation:

The first challenge of entity stream mining task concerns information summarization: how to aggregate into each entity e at each time point t the information available on it from the other streams? What information should be stored for each entity? How to deal with differences in the speeds of individual streams? How to learn over the streams efficiently? Answering those questions in a seamless way would allow us to deploy conventional stream mining methods for entity stream mining after aggregation.

Sounds remarkably like an issue for topic maps, doesn’t it? Well, not topic maps in the sense that every entity has an IRI subjectIdentifier, but in the sense that merging rules define the basis on which two or more entities are considered to represent the same subject.
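A minimal sketch of that idea, with hypothetical identifying properties standing in for merging rules (two records represent the same subject if they share a value on any identifying property, no IRI required):

```python
def merge_entities(records, key_props=("email", "orcid")):
    """Merge records that share any value on the given identifying
    properties -- a property-based merging rule rather than a single IRI."""
    entities = []  # each entity is a dict of merged properties
    index = {}     # (property, value) -> entity
    for record in records:
        target = None
        for prop in key_props:
            if prop in record and (prop, record[prop]) in index:
                target = index[(prop, record[prop])]
                break
        if target is None:
            target = {}
            entities.append(target)
        for prop, value in record.items():
            target.setdefault(prop, value)
        for prop in key_props:
            if prop in target:
                index[(prop, target[prop])] = target
    return entities

records = [
    {"name": "J. Doe", "email": "jd@example.com"},
    {"email": "jd@example.com", "orcid": "0000-0001"},
    {"orcid": "0000-0001", "affiliation": "Acme U"},
]
print(len(merge_entities(records)))  # 1
```

A production version would also need union-find style handling for a record that bridges two previously separate entities, which this sketch ignores.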

The entire issue is on “big data” and if you are looking for research “gaps,” it is a great starting point. Table of Contents: SIGKDD explorations, Volume 16, Number 1, June 2014.

I included the TOC link because for reasons only known to staff at the ACM, the articles in this issue don’t show up in the library index. One of the many “features” of the ACM Digital Library.

That is in addition to the committee that oversees the Digital Library being undisclosed to members and available for contact only through staff.

Visual Sedimentation

Sunday, October 20th, 2013

Visual Sedimentation

From the webpage:

VisualSedimentation.js is a JavaScript library for visualizing streaming data, inspired by the process of physical sedimentation. Visual Sedimentation is built on top of existing toolkits such as D3.js (to manipulate documents based on data), jQuery (to facilitate HTML and Javascript development) and Box2DWeb (for physical world simulation).

I had trouble with the video but what I saw was very impressive!

From the Visual Sedimentation paper by Samuel Huron, Romain Vuillemot, and Jean-Daniel Fekete:


We introduce Visual Sedimentation, a novel design metaphor for visualizing data streams directly inspired by the physical process of sedimentation. Visualizing data streams (e.g., Tweets, RSS, Emails) is challenging as incoming data arrive at unpredictable rates and have to remain readable. For data streams, clearly expressing chronological order while avoiding clutter, and keeping aging data visible, are important. The metaphor is drawn from the real-world sedimentation processes: objects fall due to gravity, and aggregate into strata over time. Inspired by this metaphor, data is visually depicted as falling objects using a force model to land on a surface, aggregating into strata over time. In this paper, we discuss how this metaphor addresses the specific challenge of smoothing the transition between incoming and aging data. We describe the metaphor’s design space, a toolkit developed to facilitate its implementation, and example applications to a range of case studies. We then explore the generative capabilities of the design space through our toolkit. We finally illustrate creative extensions of the metaphor when applied to real streams of data.

If you are processing data streams, definitely worth a close look!

Stock Trading – The Movie

Wednesday, May 15th, 2013

How Half a Second of High Frequency Stock Trading Looks Like by Andrew Vande Moere.


If you fancy your application as handling data at velocity with a capital V, you need to see the movie of half a second of stock trades.

The rate is slowed down so you can see the trades at millisecond intervals.

From the post:

In the movie, one can observe how High Frequency Traders (HFT) jam thousands of quotes at the millisecond level, and how every exchange must process every quote from the others for proper trade through price protection. This complex web of technology must run flawlessly every millisecond of the trading day, or arbitrage (HFT profit) opportunities will appear. However, it is easy for HFTs to cause delays in one or more of the connections between each exchange. Yet if any of the connections are not running perfectly, High Frequency Traders tend to profit from the price discrepancies that result.

See Andrew’s post for the movie and more details.

Introducing Drake, a kind of ‘make for data’

Sunday, April 28th, 2013

Introducing Drake, a kind of ‘make for data’ by Aaron Crow.

From the post:

Here at Factual we’ve felt the pain of managing data workflows for a very long time. Here are just a few of the issues:

  • a multitude of steps, with complicated dependencies
  • code and input can change frequently – it’s tiring and error-prone to figure out what needs to be re-built
  • inputs scattered all over (home directories, NFS, HDFS, etc.), tough to maintain, tough to sustain repeatability

Paul Butler, a self-described Data Hacker, recently published an article called “Make for Data Scientists“, which explored the challenges of managing data processing work. Paul went on to explain why GNU Make could be a viable tool for easing this pain. He also pointed out some limitations with Make, for example the assumption that all data is local.

We were gladdened to read Paul’s article, because we’d been hard at work building an internal tool to help manage our data workflows. A defining goal was to end up with a kind of “Make for data”, but targeted squarely at the problems of managing data workflow.

A really nice introduction to Drake, with a simple example and pointers to more complete resources.

Not hard to see how Drake could fit into a topic map authoring work flow.

Massive online data stream mining with R

Tuesday, March 26th, 2013

Massive online data stream mining with R

From the post:

A few weeks ago, the stream package was released on CRAN. It allows you to do real-time analytics on data streams. This can be very useful if you are working with large datasets that are hard to fit into RAM completely, let alone build a statistical model on without running into RAM problems.

The stream package is currently focused on clustering algorithms available in MOA, and it also eases interfacing with some clustering algorithms already available in R that are suited for data stream clustering. Classification algorithms based on MOA are on the to-do list. The currently available clustering algorithms are BIRCH, CluStream, ClusTree, DBSCAN, DenStream, Hierarchical, K-means and Threshold Nearest Neighbor.
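The flavor of one-pass stream clustering can be shown with a tiny online k-means update; this is a sketch of the general technique, not the stream/MOA API:

```python
def stream_kmeans(points, centers, lr=0.1):
    """One-pass online k-means: each arriving point nudges its nearest
    center toward itself. `centers` is a list of [x, y] seeds, updated
    in place; no point is ever revisited."""
    for px, py in points:
        nearest = min(centers, key=lambda c: (c[0] - px) ** 2 + (c[1] - py) ** 2)
        nearest[0] += lr * (px - nearest[0])
        nearest[1] += lr * (py - nearest[1])
    return centers

centers = [[0.0, 0.0], [10.0, 10.0]]
points = [(0.5, 0.5), (9.5, 10.2), (0.2, -0.1), (10.3, 9.8)] * 50
stream_kmeans(points, centers)
print(centers)  # centers drift toward the two clusters
```

Nothing about the data beyond the current point and the centers is ever in memory, which is exactly what makes this viable when the dataset will not fit in RAM.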

What if data were always encountered as a stream?

You could request a “re-streaming” of the data, but it is best to do your analysis in one pass.

How would that impact your notion of subject identity?

How would you compensate for information learned later in the stream?


First look: Spundge

Saturday, December 1st, 2012

First look: Spundge is software to help journalists to manage real-time data streams by Andrew Phelps.

From the post:

“Spundge is a platform that’s built to take a journalist from information discovery and tracking all the way to publishing, regardless of whatever internal systems they have to contend with,” he told me.

A user creates notebooks to organize material (a scheme familiar to Evernote users). Inside a notebook, a user can add streams from multiple sources and activate filters to refine by keyword, time (past few minutes, last week), location, and language.

Spundge extracts links from those sources and displays headlines and summaries in a blog-style river. A user can choose to save individual items to the notebook or hide them from view, and Spundge’s algorithms begin to learn what kind of content to show more or less of. A user can also save clippings from around the web with a bookmarklet (another Evernote-like feature). If a notebook is public, the stream can be embedded in webpages, à la Storify. (Here’s an example of a notebook tracking the ONA 2012 conference.)

Looks interesting, but I wonder about the monochrome view it presents to the user.

That is, a particular user chooses their settings, and until and unless they change those settings, the content they are shown is limited by those choices.

As opposed to, say, a human-curated source like the New York Times. (Give me human editors and the New York Times.)

Or is the problem a lack of human curated data feeds?

Rx for Asynchronous Data Streams in the Clouds

Wednesday, November 7th, 2012

Claudio Caldato wrote: MS Open Tech Open Sources Rx (Reactive Extensions) – a Cure for Asynchronous Data Streams in Cloud Programming.

I was tired by the time I got to the end of the title! His is more descriptive than mine but if you know the context, you don’t need the description.

From the post:

If you are a developer that writes asynchronous code for composite applications in the cloud, you know what we are talking about, for everybody else Rx Extensions is a set of libraries that makes asynchronous programming a lot easier. As Dave Sexton describes it, “If asynchronous spaghetti code were a disease, Rx is the cure.”

Reactive Extensions (Rx) is a programming model that allows developers to glue together asynchronous data streams. This is particularly useful in cloud programming because it helps create a common interface for writing applications that work with diverse data sources, e.g., stock quotes, Tweets, computer events, Web service requests.

Today, Microsoft Open Technologies, Inc., is open sourcing Rx. Its source code is now hosted on CodePlex to increase the community of developers seeking a more consistent interface to program against, and one that works across several development languages. The goal is to expand the number of frameworks and applications that use Rx in order to achieve better interoperability across devices and the cloud.

Rx was developed by Microsoft Corp. architect Erik Meijer and his team, and is currently used on products in various divisions at Microsoft. Microsoft decided to transfer the project to MS Open Tech in order to capitalize on MS Open Tech’s best practices with open development.

There are applications that you probably touch every day that are using Rx under the hood. A great example is GitHub for Windows.

According to Paul Betts at GitHub, “GitHub for Windows uses the Reactive Extensions for almost everything it does, including network requests, UI events, managing child processes (git.exe). Using Rx and ReactiveUI, we’ve written a fast, nearly 100% asynchronous, responsive application, while still having 100% deterministic, reliable unit tests. The desktop developers at GitHub loved Rx so much, that the Mac team created their own version of Rx and ReactiveUI, called ReactiveCocoa, and are now using it on the Mac to obtain similar benefits.”
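The core idea, gluing together push-based streams with composable operators, can be sketched in a few lines of Python; this is only a toy illustration of the pattern, not the Rx API:

```python
class Observable:
    """A minimal push-based observable, loosely in the spirit of Rx."""
    def __init__(self, subscribe):
        self._subscribe = subscribe

    def subscribe(self, on_next):
        self._subscribe(on_next)

    def map(self, fn):
        return Observable(
            lambda on_next: self.subscribe(lambda v: on_next(fn(v))))

    def filter(self, pred):
        return Observable(
            lambda on_next: self.subscribe(lambda v: pred(v) and on_next(v)))

def from_iterable(items):
    return Observable(lambda on_next: [on_next(i) for i in items])

received = []
(from_iterable([1, 2, 3, 4, 5])
    .filter(lambda n: n % 2 == 0)
    .map(lambda n: n * 10)
    .subscribe(received.append))
print(received)  # [20, 40]
```

The point of the real thing is that the same pipeline shape works whether the source is a list, a socket, or a stream of UI events, with error handling and completion signals layered in.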

What if the major cloud players started competing on the basis of interoperability? So your app here will work there.

Reducing the impedance for developers enables more competition between developers. Resulting in better services/product for consumers.

Cloud owners get more options to offer their customers.

Topic map applications have an easier time mining, identifying and recombining subjects across diverse sources and even clouds.

Does anyone see a downside here?

Wrapping Up TimesOpen: Sockets and Streams

Saturday, September 15th, 2012

Wrapping Up TimesOpen: Sockets and Streams by Joe Fiore.

From the post:

This past Wednesday night, more than 80 developers came to the Times building for the second TimesOpen event of 2012, “Sockets and Streams.”

If you were one of the 80 developers, good for you! The rest of us will have to wait for the videos.

Links to the slides are given but a little larger helping of explanation would be useful.

Data streams have semantic diversity, just like static data, only with less time to deal with it.

Ups the semantic integration bar.

Are you ready?

Sockets and Streams [Registration Open – Event 12 September – Hurry]

Tuesday, September 4th, 2012

Sockets and Streams

Wednesday, September 12
7 p.m.–10 p.m.

The New York Times
620 Eighth Avenue
New York, NY
15th Floor

From the webpage:

Explore innovations in real-time web systems and content, as well as related topics in interaction design.

Nice way to spend an evening in New York City!

Expect to hear good reports!

Building LinkedIn’s Real-time Activity Data Pipeline

Thursday, August 16th, 2012

Building LinkedIn’s Real-time Activity Data Pipeline by Ken Goodhope, Joel Koshy, Jay Kreps, Neha Narkhede, Richard Park, Jun Rao, and Victor Yang Ye. (pdf)


One trend in the implementation of modern web systems is the use of activity data in the form of log or event messages that capture user and server activity. This data is at the heart of many internet systems in the domains of advertising, relevance, search, recommendation systems, and security, as well as continuing to fulfill its traditional role in analytics and reporting. Many of these uses place real-time demands on data feeds. Activity data is extremely high volume and real-time pipelines present new design challenges. This paper discusses the design and engineering problems we encountered in moving LinkedIn’s data pipeline from a batch-oriented file aggregation mechanism to a real-time publish-subscribe system called Kafka. This pipeline currently runs in production at LinkedIn and handles more than 10 billion message writes each day with a sustained peak of over 172,000 messages per second. Kafka supports dozens of subscribing systems and delivers more than 55 billion messages to these consumer processes each day. We discuss the origins of this system, missteps on the path to real-time, and the design and engineering problems we encountered along the way.

More details on Kafka (see Choking Cassandra Bolt).

What if you think about message feeds as being pipelines that are large enough to see and configure?

Chip level pipelines are more efficient but harder to configure.

Perhaps passing messages is efficient and flexible enough for a class of use cases.
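To make the publish-subscribe idea concrete, here is a toy in-memory sketch; real Kafka adds persistence, partitioning, replication, and consumer groups on top of this basic shape:

```python
from collections import defaultdict, deque

class MiniBroker:
    """A toy in-memory publish-subscribe log: producers append to a topic,
    each consumer tracks its own read offset into that topic's log."""
    def __init__(self):
        self.topics = defaultdict(deque)
        self.offsets = defaultdict(int)  # (topic, consumer) -> next offset

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def poll(self, topic, consumer):
        """Return unread messages for this consumer, advancing its offset."""
        log = self.topics[topic]
        start = self.offsets[(topic, consumer)]
        self.offsets[(topic, consumer)] = len(log)
        return list(log)[start:]

broker = MiniBroker()
broker.publish("clicks", {"user": "u1"})
broker.publish("clicks", {"user": "u2"})
print(broker.poll("clicks", "analytics"))  # both messages
broker.publish("clicks", {"user": "u3"})
print(broker.poll("clicks", "analytics"))  # only the new one
```

Because the log is retained and each subscriber keeps its own offset, many independent systems can consume the same feed at their own pace, which is the property the LinkedIn paper leans on.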

Themes in streaming algorithms (workshop at TU Dortmund)

Saturday, August 11th, 2012

Themes in streaming algorithms (workshop at TU Dortmund) by Anna C. Gilbert.

From the post:

I recently attended the streaming algorithms workshop at Technische Universität Dortmund. It was a follow-on to the very successful series of streaming algorithms workshops held in Kanpur over the last six years. Suresh and Andrew have both given excellent summaries of the individual talks at the workshop (see day 1, day 2, and day 3) so, as both a streaming algorithms insider and outsider, I thought it would be good to give a high-level overview of what themes there are in streaming algorithms research these days, to identify new areas of research and to highlight advances in existing areas.

Anna gives the briefest of summaries but I think they will entice you to look further.

Curious: how would you distinguish a “stream of data” from “read-once data”?

That is, in the second case you get only one pass at reading the data. Errors are just that, errors, and you can’t look back to be sure.

Is data “integrity” an artifact of small data sets and under-powered computers?
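Reservoir sampling is a classic answer to the read-once constraint: it maintains a uniform sample in a single pass, with no second look at the data:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream read exactly
    once (Algorithm R): the i-th item replaces a reservoir slot with
    probability k/i."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)
        else:
            j = random.randrange(i)
            if j < k:
                reservoir[j] = item
    return reservoir

random.seed(0)
sample = reservoir_sample(range(1_000_000), 10)
print(len(sample))  # 10
```

Notice that the stream's length never needs to be known in advance, which is precisely what distinguishes stream algorithms from their batch counterparts.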

Cascading 2.0

Thursday, June 7th, 2012

Cascading 2.0

From the post:

We are happy to announce that Cascading 2.0 is now publicly available for download.

This release includes a number of new features. Specifically:

  • Apache 2.0 Licensing
  • Support for Hadoop 1.0.2
  • Local and Hadoop planner modes, where local runs in memory without Hadoop dependencies
  • HashJoin pipe for “map side joins”
  • Merge pipe for “map side merges”
  • Simple Checkpointing for capturing intermediate data as a file
  • Improved Tap and Scheme APIs

We have also created a new top-level project on GitHub for all community-sponsored Cascading projects.

From the documentation:

What is Cascading?

Cascading is a data processing API and processing query planner used for defining, sharing, and executing data-processing workflows on a single computing node or distributed computing cluster. On a single node, Cascading’s “local mode” can be used to efficiently test code and process local files before being deployed on a cluster. On a distributed computing cluster using Apache Hadoop platform, Cascading adds an abstraction layer over the Hadoop API, greatly simplifying Hadoop application development, job creation, and job scheduling.
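A loose Python analogy to the pipe-assembly idea (not the Cascading API): a source tap feeds transformation pipes into a grouping sink, and the same assembly could run locally or be planned onto a cluster:

```python
def source(lines):
    """Stand-in "tap": yields records from an in-memory list."""
    yield from lines

def each(stream, fn):
    """Apply fn to every record flowing through the pipe."""
    for record in stream:
        yield fn(record)

def group_count(stream):
    """Terminal "sink": group identical records and count them."""
    counts = {}
    for key in stream:
        counts[key] = counts.get(key, 0) + 1
    return counts

lines = ["the quick fox", "the lazy dog", "the fox"]
words = (w for line in source(lines) for w in line.split())
result = group_count(each(words, str.lower))
print(result)  # {'the': 3, 'quick': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

The value of the abstraction is that the workflow is defined once, against taps and pipes, while the planner decides whether it runs in memory or as MapReduce jobs.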

Cascading homepage.

Don’t miss the extensions to Cascading: Cascading Extensions. Any summary would be unfair. Take a look for yourself. Is there coverage of any of these you would like to point out?

I first spotted Cascading 2.0 at Alex Popescu’s myNoSQL.

Apache Camel Tutorial

Wednesday, June 6th, 2012

If you haven’t seen Apache Camel Tutorial Business Partners (other tutorials here), you need to give it a close look:

So there’s a company, which we’ll call Acme. Acme sells widgets, in a fairly unusual way. Their customers are responsible for telling Acme what they purchased. The customer enters into their own systems (ERP or whatever) which widgets they bought from Acme. Then at some point, their systems emit a record of the sale which needs to go to Acme so Acme can bill them for it. Obviously, everyone wants this to be as automated as possible, so there needs to be integration between the customer’s system and Acme.

Sadly, Acme’s sales people are, technically speaking, doormats. They tell all their prospects, “you can send us the data in whatever format, using whatever protocols, whatever. You just can’t change once it’s up and running.”

The result is pretty much what you’d expect. Taking a random sample of 3 customers:

  • Customer 1: XML over FTP
  • Customer 2: CSV over HTTP
  • Customer 3: Excel via e-mail

Now on the Acme side, all this has to be converted to a canonical XML format and submitted to the Acme accounting system via JMS. Then the Acme accounting system does its stuff and sends an XML reply via JMS, with a summary of what it processed (e.g. 3 line items accepted, line item #2 in error, total invoice 123.45). Finally, that data needs to be formatted into an e-mail, and sent to a contact at the customer in question (“Dear Joyce, we received an invoice on 1/2/08. We accepted 3 line items totaling 123.45, though there was an error with line items #2 [invalid quantity ordered]. Thank you for your business. Love, Acme.”).

You don’t have to be a “doormat” to take data as you find it.

Intercepted communications are unlikely to use your preferred terminology for locations or actions. Ditto for web/blog pages.

If you are thinking about normalization of data streams by producing subject-identity enhanced data streams, then you are thinking what I am thinking about Apache Camel.
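The normalization step can be sketched as a routing table from inbound format to one canonical record shape; the formats and field names below are invented for illustration, and Camel itself would express this with routes and type converters:

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

def from_xml(text):
    root = ET.fromstring(text)
    return [{"sku": i.get("sku"), "qty": int(i.get("qty"))}
            for i in root.iter("item")]

def from_csv(text):
    return [{"sku": row["sku"], "qty": int(row["qty"])}
            for row in csv.DictReader(io.StringIO(text))]

def from_json(text):
    return [{"sku": o["sku"], "qty": int(o["qty"])}
            for o in json.loads(text)]

# One canonical record shape, whatever the inbound format.
ROUTES = {"xml": from_xml, "csv": from_csv, "json": from_json}

def normalize(fmt, payload):
    return ROUTES[fmt](payload)

print(normalize("csv", "sku,qty\nW-1,3\nW-2,5"))
print(normalize("xml", '<order><item sku="W-1" qty="3"/></order>'))
```

Adding a fourth customer means adding one converter, not touching the downstream accounting logic, which is the whole argument for canonicalizing at the edge.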

For further information:

Apache Camel Documentation

Apache Camel homepage

From Tweets to Results: How to obtain, mine, and analyze Twitter data

Thursday, May 31st, 2012

From Tweets to Results: How to obtain, mine, and analyze Twitter data by Derek Ruths (McGill University).


Since Twitter’s creation in 2006, it has become one of the most popular microblogging platforms in the world. By virtue of its popularity, the relative structural simplicity of Twitter posts, and a tendency towards relaxed privacy settings, Twitter has also become a popular data source for research on a range of topics in sociology, psychology, political science, and anthropology. Nonetheless, despite its widespread use in the research community, there are many pitfalls when working with Twitter data.

In this day-long workshop, we will lead participants through the entire Twitter-based research pipeline: from obtaining Twitter data all the way through performing some of the sophisticated analyses that have been featured in recent high-profile publications. In the morning, we will cover the nuts and bolts of obtaining and working with a Twitter dataset including: using the Twitter API, the firehose, and rate limits; strategies for storing and filtering Twitter data; and how to publish your dataset for other researchers to use. In the afternoon, we will delve into techniques for analyzing Twitter content including constructing retweet, mention, and follower networks; measuring the sentiment of tweets; and inferring the gender of users from their profiles and unstructured text.

We assume that participants will have little to no prior experience with mining Twitter or other social network datasets. As the workshop will be interactive, participants are encouraged to bring a laptop. Code examples and exercises will be given in Python, thus participants should have some familiarity with the language. However, all concepts and techniques covered will be language-independent, so any individual with some background in scripting or programming will benefit from the workshop.
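Constructing a mention network, one of the analyses the workshop covers, reduces to very little code once you have (author, text) pairs; the tweets below are invented:

```python
import re
from collections import defaultdict

MENTION = re.compile(r"@(\w+)")

def mention_network(tweets):
    """Build a directed, weighted edge list: author -> mentioned user."""
    edges = defaultdict(int)
    for author, text in tweets:
        for mentioned in MENTION.findall(text):
            edges[(author, mentioned)] += 1
    return dict(edges)

tweets = [
    ("alice", "great talk @bob!"),
    ("alice", "cc @bob @carol"),
    ("bob", "thanks @alice"),
]
print(mention_network(tweets))
# {('alice', 'bob'): 2, ('alice', 'carol'): 1, ('bob', 'alice'): 1}
```

Retweet and follower networks are built the same way, only with different extraction rules on each record.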

Any plans to use Twitter feeds for your topic maps?

I first saw a reference to this workshop at: Do you haz teh (twitter) codez? by Ricard Nielson.

Annotations in Data Streams

Sunday, March 18th, 2012

Annotations in Data Streams by Amit Chakrabarti, Graham Cormode, Andrew McGregor, and Justin Thaler.


The central goal of data stream algorithms is to process massive streams of data using sublinear storage space. Motivated by work in the database community on outsourcing database and data stream processing, we ask whether the space usage of such algorithms can be further reduced by enlisting a more powerful “helper” who can annotate the stream as it is read. We do not wish to blindly trust the helper, so we require that the algorithm be convinced of having computed a correct answer. We show upper bounds that achieve a non-trivial tradeoff between the amount of annotation used and the space required to verify it. We also prove lower bounds on such tradeoffs, often nearly matching the upper bounds, via notions related to Merlin-Arthur communication complexity. Our results cover the classic data stream problems of selection, frequency moments, and fundamental graph problems such as triangle-freeness and connectivity. Our work is also part of a growing trend — including recent studies of multi-pass streaming, read/write streams and randomly ordered streams — of asking more complexity-theoretic questions about data stream processing. It is a recognition that, in addition to practical relevance, the data stream model raises many interesting theoretical questions in its own right.

I have a fairly simple question as I start to read this paper: When is digital data not a stream?

When it is read from a memory device, it is a stream.

When it is read into a memory device, it is a stream.

When it is read into a cache on a CPU, it is a stream.

When it is read from the cache by a CPU, it is a stream.

When it is placed back in a cache by a CPU, it is a stream.

What would you call digital data on a storage device? It may not be a stream, but you can’t look at it without it becoming a stream. Yes?

The Britney Spears Problem

Wednesday, July 20th, 2011

The Britney Spears Problem by Brian Hayes.

From the article:

Back in 1999, the operators of the Lycos Internet portal began publishing a weekly list of the 50 most popular queries submitted to their Web search engine. Britney Spears—initially tagged a “teen songstress,” later a “pop tart“—was No. 2 on that first weekly tabulation. She has never fallen off the list since then—440 consecutive appearances when I last checked. Other perennials include ­Pamela Anderson and Paris Hilton. What explains the enduring popularity of these celebrities, so famous for being famous? That’s a fascinating question, and the answer would doubtless tell us something deep about modern culture. But it’s not the question I’m going to take up here. What I’m trying to understand is how we can know Britney’s ranking from week to week. How are all those queries counted and categorized? What algorithm tallies them up to see which terms are the most frequent? (emphasis added)

Deeply interesting discussion on the analysis of stream data and algorithms for the same. Very much worth a close read if you are working on or interested in such issues.
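One classic answer to Hayes's question of "what algorithm tallies them up" is the Misra-Gries frequent-items summary. The Python below is my own illustrative sketch, not code from the article: it tracks at most k-1 candidate heavy hitters, and any item occurring more than a 1/k fraction of the time is guaranteed to survive in the counters.

```python
def misra_gries(stream, k: int) -> dict:
    # Keep at most k-1 counters; any item with frequency > len(stream)/k
    # is guaranteed to remain in the summary at the end.
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free slot: decrement every counter, dropping zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

queries = ["britney"] * 6 + ["pamela"] * 4 + ["paris"] * 3 + list("abcdefg")
# "britney" occurs 6 times in a stream of 20, above the 20/4 = 5 threshold,
# so it must appear in the summary.
print(misra_gries(queries, k=4))
```

The counter values are underestimates of the true frequencies, so a second pass (or a tolerance for approximation) is needed for exact counts; that approximation-versus-space tradeoff is the heart of the article.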

The article concludes:

All this mathematics and algorithmic engineering seems like a lot of work just for following the exploits of a famous “pop tart.” But I like to think the effort might be justified. Years from now, someone will type “Britney Spears” into a search engine and will stumble upon this article listed among the results. Perhaps then a curious reader will be led into new lines of inquiry. (emphasis added)

But what if the user enters "pop tart"? Will they still find this article? Or will it be "hit" number 100,000, which almost no one reaches? As of 20 July 2011, there were some 13 million "hits" for "pop tart" on a popular search engine. I suspect at least some of them are not about Britney Spears.

So, should I encounter resources about Britney Spears that use the term "pop tart," how am I going to gather them up for posterity?

Or do we all have to winnow search chaff for ourselves?*

*Question for office managers: How much time do you think your staff spends winnowing search chaff already winnowed by another user in your office?

Data Serialization

Tuesday, May 17th, 2011

Three Reasons Why Apache Avro Data Serialization is a Good Choice for OpenRTB

From the post:

I recently evaluated several serialization frameworks including Thrift, Protocol Buffers, and Avro for a solution to address our needs as a demand side platform, but also for a protocol framework to use for the OpenRTB marketplace as well. The working draft of OpenRTB 2.0 uses simple JSON encoding, which has many advantages including simplicity and ubiquity of support. Many OpenRTB contributors requested we support at least one binary standard as well, to improve bandwidth usage and CPU processing time for real-time bidding at scale.

If you are in need of a data serialization framework this is a good place to start reading.
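The bandwidth argument in the quote is easy to see in miniature. The sketch below is my own toy example (the field names are illustrative, not the real OpenRTB schema); Python's struct module stands in for what a schema-driven format like Avro achieves on the wire: field names live in the shared schema, not in every message.

```python
import json
import struct

# Toy bid record; field names are invented for illustration.
bid = {"id": 12345, "price": 1.27, "width": 300, "height": 250}

json_bytes = json.dumps(bid, separators=(",", ":")).encode("utf-8")

# Fixed binary layout: unsigned 32-bit int, 64-bit float, two unsigned
# 16-bit ints. The receiver recovers the field names from the schema.
binary_bytes = struct.pack(
    "<IdHH", bid["id"], bid["price"], bid["width"], bid["height"]
)

print(len(json_bytes), len(binary_bytes))
```

Here the binary record is 16 bytes, a fraction of the JSON size, and at real-time-bidding volumes that difference multiplies into the bandwidth and CPU savings the post describes.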

Annotations: dynamic semantics in stream processing

Monday, May 16th, 2011

Annotations: dynamic semantics in stream processing by Juan Amiguet, Andreas Wombacher, and Tim E. Klifman.

Abstract:

In the field of e-science, stream data processing is commonplace, facilitating sensor networks, in particular for prediction and supporting decision making. However, sensor data may be erroneous, e.g. due to measurement errors (outliers) or changes of the environment. While it can be foreseen that there will be outliers, there are many environmental changes which are not foreseen by scientists and therefore are not considered in the data processing. However, these unforeseen semantic changes – represented as annotations – have to be propagated through the processing. Since the annotations represent unforeseen, hence un-understandable, information, the propagation has to be independent of the annotation semantics. It nevertheless has to preserve the significance of the annotation on the data despite structural and temporal transformations, and it should remain meaningful for a user at the end of the data processing. In this paper, we identify the relevant research questions. In particular, the propagation of annotations is based on structural, temporal, and significance contributions, while the consumption of the annotations by the user focuses on clustering information to ease accessibility.

The paper examines the narrow case of temperature sensors, but I don't know of any reason why semantic change could not occur in a stream of data reported by a person.
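The core idea, propagating annotations through operators without interpreting them, can be sketched in a few lines. This is my own minimal illustration (all names invented), not the propagation model from the paper: a windowed-average operator treats annotations as opaque tags and simply carries the union of its inputs' tags onto each output.

```python
from dataclasses import dataclass, field

@dataclass
class Reading:
    value: float
    annotations: list = field(default_factory=list)  # opaque tags; semantics unknown

def windowed_mean(stream, size: int):
    # Average non-overlapping windows. Annotations are never inspected,
    # only forwarded: each output carries every tag from its inputs.
    window = []
    for reading in stream:
        window.append(reading)
        if len(window) == size:
            mean = sum(r.value for r in window) / size
            tags = [t for r in window for t in r.annotations]
            yield Reading(mean, tags)
            window = []

readings = [Reading(20.0), Reading(21.0, ["sensor moved"]), Reading(22.0), Reading(23.0)]
for out in windowed_mean(readings, size=2):
    print(out.value, out.annotations)
# -> 20.5 ['sensor moved']
# -> 22.5 []
```

The operator never needs to understand what "sensor moved" means; it only guarantees the tag survives the temporal transformation, which is the independence property the abstract calls for.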

Data Stream Mining Techniques

Wednesday, May 11th, 2011

An analytical framework for data stream mining techniques based on challenges and requirements by Mahnoosh Kholghi and Mohammadreza Keyvanpour.

Abstract:

A growing number of applications that generate massive streams of data need intelligent data processing and online analysis. Real-time surveillance systems, telecommunication systems, sensor networks and other dynamic environments are such examples. The imminent need for turning such data into useful information and knowledge augments the development of systems, algorithms and frameworks that address streaming challenges. The storage, querying and mining of such data sets are highly computationally challenging tasks. Mining data streams is concerned with extracting knowledge structures, represented in models and patterns, from non-stopping streams of information. Generally, the two main challenges are designing fast mining methods for data streams and the need to promptly detect changing concepts and data distributions because of the highly dynamic nature of data streams. The goal of this article is to analyze and classify the application of diverse data mining techniques to different challenges of data stream mining. In this paper, we present the theoretical foundations of data stream analysis and propose an analytical framework for data stream mining techniques.
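The second challenge the abstract names, promptly detecting changing concepts and data distributions, can be illustrated with a deliberately simple drift detector (my own sketch, not from the paper): freeze a reference window, then flag positions where a sliding window's mean diverges from it by several reference standard deviations.

```python
from collections import deque

def detect_mean_shift(stream, window: int = 50, threshold: float = 3.0):
    # Freeze the first `window` items as a reference; then slide a window
    # over the rest and yield each index where its mean drifts more than
    # `threshold` reference standard deviations away.
    ref = []
    recent = deque(maxlen=window)
    for i, x in enumerate(stream):
        if len(ref) < window:
            ref.append(x)
            continue
        recent.append(x)
        if len(recent) == window:
            ref_mean = sum(ref) / window
            ref_std = (sum((v - ref_mean) ** 2 for v in ref) / window) ** 0.5
            cur_mean = sum(recent) / window
            if abs(cur_mean - ref_mean) / max(ref_std, 1e-9) > threshold:
                yield i

# A stream whose mean jumps from about 0.5 to about 10.5 halfway through:
data = [0.0, 1.0] * 100 + [10.0, 11.0] * 100
alerts = list(detect_mean_shift(data))
print(alerts[:1])  # first alert lands shortly after the shift at index 200
```

Real drift detectors in the stream-mining literature are more principled about statistics and memory, but even this toy version shows the tension the abstract describes: detecting the change promptly while touching each item only once.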

The paper is an interesting collection of work on mining data streams and its authors should be encouraged to continue their research in this field.

However, the current version is in serious need of editing, both in language usage and in organization. For example, it is hard to relate table 2 (data stream mining techniques) to the analytical framework that was the focus of the article.