Archive for the ‘Data Streams’ Category

Stock Trading – The Movie

Wednesday, May 15th, 2013

How Half a Second of High Frequency Stock Trading Looks Like by Andrew Vande Moere.

If you fancy your application as handling data at velocity with a capital V, you need to see the movie of half a second of stock trades.

The rate is slowed down so you can see the trades at millisecond intervals.

From the post:

In the movie, one can observe how High Frequency Traders (HFT) jam thousands of quotes at the millisecond level, and how every exchange must process every quote from the others for proper trade through price protection. This complex web of technology must run flawlessly every millisecond of the trading day, or arbitrage (HFT profit) opportunities will appear. However, it is easy for HFTs to cause delays in one or more of the connections between each exchange. Yet if any of the connections are not running perfectly, High Frequency Traders tend to profit from the price discrepancies that result.

See Andrew’s post for the movie and more details.

Introducing Drake, a kind of ‘make for data’

Sunday, April 28th, 2013

Introducing Drake, a kind of ‘make for data’ by Aaron Crow.

From the post:

Here at Factual we’ve felt the pain of managing data workflows for a very long time. Here are just a few of the issues:

  • a multitude of steps, with complicated dependencies
  • code and input can change frequently – it’s tiring and error-prone to figure out what needs to be re-built
  • inputs scattered all over (home directories, NFS, HDFS, etc.), tough to maintain, tough to sustain repeatability

Paul Butler, a self-described Data Hacker, recently published an article called “Make for Data Scientists”, which explored the challenges of managing data processing work. Paul went on to explain why GNU Make could be a viable tool for easing this pain. He also pointed out some limitations with Make, for example the assumption that all data is local.

We were gladdened to read Paul’s article, because we’d been hard at work building an internal tool to help manage our data workflows. A defining goal was to end up with a kind of “Make for data”, but targeted squarely at the problems of managing data workflow.

A really nice introduction to Drake, with a simple example and pointers to more complete resources.

It is not hard to see how Drake could fit into a topic map authoring workflow.
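To make the “figure out what needs to be re-built” problem concrete, here is a minimal Python sketch of the timestamp rule that Make-style tools rely on. It is not Drake's implementation or syntax; the function names `needs_rebuild` and `run_step` are made up for illustration, and it assumes all files are local, which is exactly the limitation the post mentions.

```python
import os


def needs_rebuild(target, inputs):
    """Return True if the target is missing or older than any input file."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(path) > target_mtime for path in inputs)


def run_step(target, inputs, build):
    """Call build(inputs, target) only when the target is stale.

    Returns True if the step ran, False if it was skipped.
    """
    if needs_rebuild(target, inputs):
        build(inputs, target)
        return True
    return False
```

A workflow is then a list of such steps run in dependency order. Steps whose outputs are newer than their inputs are skipped.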

Massive online data stream mining with R

Tuesday, March 26th, 2013

Massive online data stream mining with R

From the post:

A few weeks ago, the stream package has been released on CRAN. It allows to do real time analytics on data streams. This can be very useful if you are working with large datasets which are already hard to put in RAM completely, let alone to build some statistical model on it without getting into RAM problems.

The stream package is currently focussed on clustering algorithms available in MOA (http://moa.cms.waikato.ac.nz/details/stream-clustering/) and also eases interfacing with some clustering already available in R which are suited for data stream clustering. Classification algorithms based on MOA are on the todo list. Current available clustering algorithms are BIRCH, CluStream, ClusTree, DBSCAN, DenStream, Hierarchical, Kmeans and Threshold Nearest Neighbor.

What if data were always encountered as a stream?

You could request a “re-streaming” of the data, but it is best to do the analysis in a single pass.

How would that impact your notion of subject identity?

How would you compensate for information learned later in the stream?
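One way to think about these questions is a single-pass clustering sketch. The Python below is a MacQueen-style sequential k-means, chosen here only as an illustration; it is not the API of the R stream package and not one of the MOA algorithms listed above. The function name and the starting centers are assumptions.

```python
def sequential_kmeans(stream, initial_centers):
    """Cluster points in one pass, keeping only the centers and their counts.

    Each point is assigned to its nearest center, which then moves toward
    the point by a running-mean update. Points are never revisited.
    """
    centers = [list(c) for c in initial_centers]
    counts = [0] * len(centers)
    for point in stream:
        # Find the nearest center by squared Euclidean distance.
        nearest = min(
            range(len(centers)),
            key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
        )
        counts[nearest] += 1
        rate = 1.0 / counts[nearest]
        # The first point assigned to a center replaces the initial guess.
        centers[nearest] = [
            c + rate * (p - c) for p, c in zip(point, centers[nearest])
        ]
    return centers, counts
```

Note what the sketch cannot do: an early point stays counted in the cluster it was first assigned to, even if later data shows it belonged elsewhere. That is the “information learned later in the stream” problem in miniature.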

Spundge

Saturday, December 1st, 2012

First look: Spundge is software to help journalists to manage real-time data streams by Andrew Phelps.

From the post:

“Spundge is a platform that’s built to take a journalist from information discovery and tracking all the way to publishing, regardless of whatever internal systems they have to contend with,” he told me.

A user creates notebooks to organize material (a scheme familiar to Evernote users). Inside a notebook, a user can add streams from multiple sources and activate filters to refine by keyword, time (past few minutes, last week), location, and language.

Spundge extracts links from those sources and displays headlines and summaries in a blog-style river. A user can choose to save individual items to the notebook or hide them from view, and Spundge’s algorithms begin to learn what kind of content to show more or less of. A user can also save clippings from around the web with a bookmarklet (another Evernote-like feature). If a notebook is public, the stream can be embedded in webpages, à la Storify. (Here’s an example of a notebook tracking the ONA 2012 conference.)

Looks interesting but I wonder about the monochrome view it presents the user?

That is, a particular user chooses their settings and, until they change those settings, the content they are shown is limited by that user’s own choices.

As opposed to, say, a human-curated source like the New York Times. (Give me human editors and the New York Times.)

Or is the problem a lack of human-curated data feeds?

Rx for Asynchronous Data Streams in the Clouds

Wednesday, November 7th, 2012

Claudio Caldato wrote: MS Open Tech Open Sources Rx (Reactive Extensions) – a Cure for Asynchronous Data Streams in Cloud Programming.

I was tired by the time I got to the end of the title! His is more descriptive than mine but if you know the context, you don’t need the description.

From the post:

If you are a developer that writes asynchronous code for composite applications in the cloud, you know what we are talking about, for everybody else Rx Extensions is a set of libraries that makes asynchronous programming a lot easier. As Dave Sexton describes it, “If asynchronous spaghetti code were a disease, Rx is the cure.”

Reactive Extensions (Rx) is a programming model that allows developers to glue together asynchronous data streams. This is particularly useful in cloud programming because it helps create a common interface for writing applications that come from diverse data sources, e.g., stock quotes, Tweets, computer events, Web service requests.

Today, Microsoft Open Technologies, Inc., is open sourcing Rx. Its source code is now hosted on CodePlex to increase the community of developers seeking a more consistent interface to program against, and one that works across several development languages. The goal is to expand the number of frameworks and applications that use Rx in order to achieve better interoperability across devices and the cloud.

Rx was developed by Microsoft Corp. architect Erik Meijer and his team, and is currently used on products in various divisions at Microsoft. Microsoft decided to transfer the project to MS Open Tech in order to capitalize on MS Open Tech’s best practices with open development.

There are applications that you probably touch every day that are using Rx under the hood. A great example is GitHub for Windows.

According to Paul Betts at GitHub, “GitHub for Windows uses the Reactive Extensions for almost everything it does, including network requests, UI events, managing child processes (git.exe). Using Rx and ReactiveUI, we’ve written a fast, nearly 100% asynchronous, responsive application, while still having 100% deterministic, reliable unit tests. The desktop developers at GitHub loved Rx so much, that the Mac team created their own version of Rx and ReactiveUI, called ReactiveCocoa, and are now using it on the Mac to obtain similar benefits.”
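The idea of “gluing together” asynchronous data streams, as the quoted post puts it, can be sketched without Rx itself. The Python below uses the standard asyncio module, not the Rx API; `ticker`, `merge`, and `collect` are hypothetical names, and the quote and tweet values are invented sample data.

```python
import asyncio


async def ticker(name, values, delay):
    """A stand-in asynchronous source that emits (name, value) pairs."""
    for value in values:
        await asyncio.sleep(delay)
        yield (name, value)


async def merge(*sources):
    """Interleave items from several async sources as they arrive."""
    queue = asyncio.Queue()
    done = object()  # sentinel marking the end of one source

    async def pump(source):
        async for item in source:
            await queue.put(item)
        await queue.put(done)

    # Keep references to the tasks so they are not garbage collected.
    tasks = [asyncio.create_task(pump(s)) for s in sources]
    remaining = len(tasks)
    while remaining:
        item = await queue.get()
        if item is done:
            remaining -= 1
        else:
            yield item


async def collect():
    """Merge two unrelated sources into one stream and gather the items."""
    quotes = ticker("quote", [101.5, 101.7], 0.01)
    tweets = ticker("tweet", ["hello"], 0.015)
    return [item async for item in merge(quotes, tweets)]
```

Running `asyncio.run(collect())` returns the three items as one list. A consumer of `merge` sees a single stream and does not need to know how many sources feed it, which is the common-interface point the post makes. Error handling and cancellation are left out of this sketch.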

What if the major cloud players started competing on the basis of interoperability? So your app here will work there.

Reducing the impedance for developers enables more competition between developers, resulting in better services and products for consumers.

Cloud owners get more options to offer their customers.

Topic map applications have an easier time mining, identifying and recombining subjects across diverse sources and even clouds.

Does anyone see a downside here?

Wrapping Up TimesOpen: Sockets and Streams

Saturday, September 15th, 2012

Wrapping Up TimesOpen: Sockets and Streams by Joe Fiore.

From the post:

This past Wednesday night, more than 80 developers came to the Times building for the second TimesOpen event of 2012, “Sockets and Streams.”

If you were one of the 80 developers, good for you! The rest of us will have to wait for the videos.

Links to the slides are given but a little larger helping of explanation would be useful.

Data streams have semantic diversity, just like static data, only with less time to deal with it.

Ups the semantic integration bar.

Are you ready?

Sockets and Streams [Registration Open - Event 12 September - Hurry]

Tuesday, September 4th, 2012

Sockets and Streams

Wednesday, September 12
7 p.m.–10 p.m.

The New York Times
620 Eighth Avenue
New York, NY
15th Floor

From the webpage:

Explore innovations in real-time web systems and content, as well as related topics in interaction design.

Nice way to spend an evening in New York City!

Expect to hear good reports!

Building LinkedIn’s Real-time Activity Data Pipeline

Thursday, August 16th, 2012

Building LinkedIn’s Real-time Activity Data Pipeline by Ken Goodhope, Joel Koshy, Jay Kreps, Neha Narkhede, Richard Park, Jun Rao, and Victor Yang Ye. (pdf)

Abstract:

One trend in the implementation of modern web systems is the use of activity data in the form of log or event messages that capture user and server activity. This data is at the heart of many internet systems in the domains of advertising, relevance, search, recommendation systems, and security, as well as continuing to fulfill its traditional role in analytics and reporting. Many of these uses place real-time demands on data feeds. Activity data is extremely high volume and real-time pipelines present new design challenges. This paper discusses the design and engineering problems we encountered in moving LinkedIn’s data pipeline from a batch-oriented file aggregation mechanism to a real-time publish-subscribe system called Kafka. This pipeline currently runs in production at LinkedIn and handles more than 10 billion message writes each day with a sustained peak of over 172,000 messages per second. Kafka supports dozens of subscribing systems and delivers more than 55 billion messages to these consumer processes each day. We discuss the origins of this system, missteps on the path to real-time, and the design and engineering problems we encountered along the way.

More details on Kafka (see Choking Cassandra Bolt).

What if you think about message feeds as being pipelines that are large enough to see and configure?

Chip level pipelines are more efficient but harder to configure.

Perhaps passing messages is efficient and flexible enough for a class of use cases.
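As a way to picture a message feed that is “large enough to see and configure,” here is a toy publish-subscribe log in Python. It is a simplified sketch, not Kafka's API or design: the class name `TopicLog` is hypothetical, everything lives in memory, and keeping each subscriber's read position inside the log object is an assumption made for brevity.

```python
class TopicLog:
    """An in-memory, append-only message log with per-consumer positions."""

    def __init__(self):
        self._messages = []
        self._offsets = {}  # consumer name -> index of next unread message

    def publish(self, message):
        """Append a message and return its position in the log."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, consumer, max_messages=10):
        """Return the next unread batch for this consumer and advance it."""
        start = self._offsets.get(consumer, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch
```

Because messages are never removed when read, a slow subscriber and a fast one can read the same feed independently, each at its own pace.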

Themes in streaming algorithms (workshop at TU Dortmund)

Saturday, August 11th, 2012

Themes in streaming algorithms (workshop at TU Dortmund) by Anna C. Gilbert.

From the post:

I recently attended the streaming algorithms workshop at Technische Universität Dortmund. It was a follow-on to the very successful series of streaming algorithms workshops held in Kanpur over the last six years. Suresh and Andrew have both given excellent summaries of the individual talks at the workshop (see day 1, day 2, and day 3) so, as both a streaming algorithms insider and outsider, I thought it would be good to give a high-level overview of what themes there are in streaming algorithms research these days, to identify new areas of research and to highlight advances in existing areas.

Anna gives the briefest of summaries but I think they will entice you to look further.

Curious: how would you distinguish a “stream of data” from “read once data”?

That is, in the second case you get only one pass at reading the data. Errors are just that, errors, but you can’t look back to be sure.

Is data “integrity” an artifact of small data sets and under-powered computers?

Cascading 2.0

Thursday, June 7th, 2012

Cascading 2.0

From the post:

We are happy to announce that Cascading 2.0 is now publicly available for download.

http://www.cascading.org/downloads/

This release includes a number of new features. Specifically:

  • Apache 2.0 Licensing
  • Support for Hadoop 1.0.2
  • Local and Hadoop planner modes, where local runs in memory without Hadoop dependencies
  • HashJoin pipe for “map side joins”
  • Merge pipe for “map side merges”
  • Simple Checkpointing for capturing intermediate data as a file
  • Improved Tap and Scheme APIs

We have also created a new top-level project on GitHub for all community sponsored Cascading projects:

https://github.com/Cascading

From the documentation:

What is Cascading?

Cascading is a data processing API and processing query planner used for defining, sharing, and executing data-processing workflows on a single computing node or distributed computing cluster. On a single node, Cascading’s “local mode” can be used to efficiently test code and process local files before being deployed on a cluster. On a distributed computing cluster using Apache Hadoop platform, Cascading adds an abstraction layer over the Hadoop API, greatly simplifying Hadoop application development, job creation, and job scheduling.

Cascading homepage.

Don’t miss the extensions to Cascading: Cascading Extensions. Any summary would be unfair. Take a look for yourself. Coverage of any of these you would like to point out?

I first spotted Cascading 2.0 at Alex Popescu’s myNoSQL.

Apache Camel Tutorial

Wednesday, June 6th, 2012

If you haven’t seen Apache Camel Tutorial Business Partners (other tutorials here), you need to give it a close look:

So there’s a company, which we’ll call Acme. Acme sells widgets, in a fairly unusual way. Their customers are responsible for telling Acme what they purchased. The customer enters into their own systems (ERP or whatever) which widgets they bought from Acme. Then at some point, their systems emit a record of the sale which needs to go to Acme so Acme can bill them for it. Obviously, everyone wants this to be as automated as possible, so there needs to be integration between the customer’s system and Acme.

Sadly, Acme’s sales people are, technically speaking, doormats. They tell all their prospects, “you can send us the data in whatever format, using whatever protocols, whatever. You just can’t change once it’s up and running.”

The result is pretty much what you’d expect. Taking a random sample of 3 customers:

  • Customer 1: XML over FTP
  • Customer 2: CSV over HTTP
  • Customer 3: Excel via e-mail

Now on the Acme side, all this has to be converted to a canonical XML format and submitted to the Acme accounting system via JMS. Then the Acme accounting system does its stuff and sends an XML reply via JMS, with a summary of what it processed (e.g. 3 line items accepted, line item #2 in error, total invoice 123.45). Finally, that data needs to be formatted into an e-mail, and sent to a contact at the customer in question (“Dear Joyce, we received an invoice on 1/2/08. We accepted 3 line items totaling 123.45, though there was an error with line items #2 [invalid quantity ordered]. Thank you for your business. Love, Acme.”).

You don’t have to be a “doormat” to take data as you find it.

Intercepted communications are unlikely to use your preferred terminology for locations or actions. Ditto for web/blog pages.

If you are thinking about normalization of data streams by producing subject-identity enhanced data streams, then you are thinking what I am thinking about Apache Camel.
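A small Python sketch of the normalization step for the CSV-sending customer in the tutorial follows. This is not Camel's API (Camel is a Java framework with its own routing language); the function name, the `widget` and `quantity` column names, and the element names of the “canonical” XML are all hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_canonical(text, customer):
    """Convert one customer's CSV sales record into a canonical XML string."""
    invoice = ET.Element("invoice", customer=customer)
    for row in csv.DictReader(io.StringIO(text)):
        item = ET.SubElement(invoice, "lineItem")
        ET.SubElement(item, "widget").text = row["widget"].strip()
        # int() rejects malformed quantities instead of passing them along.
        ET.SubElement(item, "quantity").text = str(int(row["quantity"]))
    return ET.tostring(invoice, encoding="unicode")
```

Each input format gets its own small adapter like this one, and everything downstream sees only the canonical form. That adapter is also the natural place to attach subject identity to the names a particular source happens to use.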

For further information:

Apache Camel Documentation

Apache Camel homepage

From Tweets to Results: How to obtain, mine, and analyze Twitter data

Thursday, May 31st, 2012

From Tweets to Results: How to obtain, mine, and analyze Twitter data by Derek Ruths (McGill University).

Description:

Since Twitter’s creation in 2006, it has become one of the most popular microblogging platforms in the world. By virtue of its popularity, the relative structural simplicity of Twitter posts, and a tendency towards relaxed privacy settings, Twitter has also become a popular data source for research on a range of topics in sociology, psychology, political science, and anthropology. Nonetheless, despite its widespread use in the research community, there are many pitfalls when working with Twitter data.

In this day-long workshop, we will lead participants through the entire Twitter-based research pipeline: from obtaining Twitter data all the way through performing some of the sophisticated analyses that have been featured in recent high-profile publications. In the morning, we will cover the nuts and bolts of obtaining and working with a Twitter dataset including: using the Twitter API, the firehose, and rate limits; strategies for storing and filtering Twitter data; and how to publish your dataset for other researchers to use. In the afternoon, we will delve into techniques for analyzing Twitter content including constructing retweet, mention, and follower networks; measuring the sentiment of tweets; and inferring the gender of users from their profiles and unstructured text.

We assume that participants will have little to no prior experience with mining Twitter or other social network datasets. As the workshop will be interactive, participants are encouraged to bring a laptop. Code examples and exercises will be given in Python, thus participants should have some familiarity with the language. However, all concepts and techniques covered will be language-independent, so any individual with some background in scripting or programming will benefit from the workshop.

Any plans to use Twitter feeds for your topic maps?

I first saw a reference to this workshop at: Do you haz teh (twitter) codez? by Ricard Nielson.

Annotations in Data Streams

Sunday, March 18th, 2012

Annotations in Data Streams by Amit Chakrabarti, Graham Cormode, Andrew McGregor, and Justin Thaler.

Abstract:

The central goal of data stream algorithms is to process massive streams of data using sublinear storage space. Motivated by work in the database community on outsourcing database and data stream processing, we ask whether the space usage of such algorithms can be further reduced by enlisting a more powerful “helper” who can annotate the stream as it is read. We do not wish to blindly trust the helper, so we require that the algorithm be convinced of having computed a correct answer. We show upper bounds that achieve a non-trivial tradeoff between the amount of annotation used and the space required to verify it. We also prove lower bounds on such tradeoffs, often nearly matching the upper bounds, via notions related to Merlin-Arthur communication complexity. Our results cover the classic data stream problems of selection, frequency moments, and fundamental graph problems such as triangle-freeness and connectivity. Our work is also part of a growing trend — including recent studies of multi-pass streaming, read/write streams and randomly ordered streams — of asking more complexity-theoretic questions about data stream processing. It is a recognition that, in addition to practical relevance, the data stream model raises many interesting theoretical questions in its own right.

I have a fairly simple question as I start to read this paper: When is digital data not a stream?

When it is read from a memory device, it is a stream.

When it is read into a memory device, it is a stream.

When it is read into a cache on a CPU, it is a stream.

When it is read from the cache by a CPU, it is a stream.

When it is placed back in a cache by a CPU, it is a stream.

What would you call digital data on a storage device? It may not be a stream, but you can’t look at it without it becoming a stream. Yes?

The Britney Spears Problem

Wednesday, July 20th, 2011

The Britney Spears Problem by Brian Hayes.

From the article:

Back in 1999, the operators of the Lycos Internet portal began publishing a weekly list of the 50 most popular queries submitted to their Web search engine. Britney Spears—initially tagged a “teen songstress,” later a “pop tart”—was No. 2 on that first weekly tabulation. She has never fallen off the list since then—440 consecutive appearances when I last checked. Other perennials include Pamela Anderson and Paris Hilton. What explains the enduring popularity of these celebrities, so famous for being famous? That’s a fascinating question, and the answer would doubtless tell us something deep about modern culture. But it’s not the question I’m going to take up here. What I’m trying to understand is how we can know Britney’s ranking from week to week. How are all those queries counted and categorized? What algorithm tallies them up to see which terms are the most frequent? (emphasis added)

Deeply interesting discussion on the analysis of stream data and algorithms for the same. Very much worth a close read if you are working on or interested in such issues.
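For a taste of the kind of algorithm involved, here is a Python sketch of one well-known frequent-items method, the Misra-Gries summary. I am offering it as an illustration of the problem, not as a transcription of the article's own presentation; the function name and the sample queries are invented.

```python
def misra_gries(stream, k):
    """Track candidate frequent items using at most k - 1 counters.

    Any item that makes up more than 1/k of the stream is guaranteed to be
    among the returned keys. The counts are lower bounds, not exact tallies.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter, dropping those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

The point is that memory use depends on k, not on the number of distinct queries. Confirming the exact counts of the surviving candidates would take a second pass over the data, which a true stream does not offer.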

The article concludes:

All this mathematics and algorithmic engineering seems like a lot of work just for following the exploits of a famous “pop tart.” But I like to think the effort might be justified. Years from now, someone will type “Britney Spears” into a search engine and will stumble upon this article listed among the results. Perhaps then a curious reader will be led into new lines of inquiry. (emphasis added)

But what if the user enters “pop tart?” Will they still find this article? Or will it be “hit” number 100,000, which almost no one reaches? As of 20 July 2011, there were some 13 million “hits” for “pop tart” on a popular search engine. I suspect at least some of them are not about Britney Spears.

So, should I encounter a resource about Britney Spears that uses the term “pop tart,” how am I going to accumulate such finds for posterity?

Or do we all have to winnow search chaff for ourselves?*

*Question for office managers: How much time do you think your staff spends winnowing search chaff already winnowed by another user in your office?

Data Serialization

Tuesday, May 17th, 2011

Three Reasons Why Apache Avro Data Serialization is a Good Choice for OpenRTB

From the post:

I recently evaluated several serialization frameworks including Thrift, Protocol Buffers, and Avro for a solution to address our needs as a demand side platform, but also for a protocol framework to use for the OpenRTB marketplace as well. The working draft of OpenRTB 2.0 uses simple JSON encoding, which has many advantages including simplicity and ubiquity of support. Many OpenRTB contributors requested we support at least one binary standard as well, to improve bandwidth usage and CPU processing time for real-time bidding at scale.

If you are in need of a data serialization framework this is a good place to start reading.
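To see why contributors asked for a binary option, compare the size of one record encoded both ways. The Python sketch below uses the standard library's struct module as a stand-in for a binary format; it is not Avro, Thrift, or Protocol Buffers, and the record fields are hypothetical, not taken from the OpenRTB specification.

```python
import json
import struct

# Fixed layout: little-endian uint32 id, float64 price, uint16 width, uint16 height.
BID_LAYOUT = "<IdHH"


def encode_json(bid):
    """Encode the record as UTF-8 JSON, field names included."""
    return json.dumps(bid).encode("utf-8")


def encode_binary(bid):
    """Encode the record as 16 packed bytes; the schema lives in BID_LAYOUT."""
    return struct.pack(
        BID_LAYOUT, bid["id"], bid["price"], bid["width"], bid["height"]
    )


def decode_binary(data):
    """Rebuild the record from its packed form."""
    bid_id, price, width, height = struct.unpack(BID_LAYOUT, data)
    return {"id": bid_id, "price": price, "width": width, "height": height}
```

The binary form is smaller because field names and punctuation are not repeated in every message. The cost is that both sides must agree on the schema in advance, which is the part frameworks like Avro manage for you.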

Annotations: dynamic semantics in stream processing

Monday, May 16th, 2011

Annotations: dynamic semantics in stream processing by Juan Amiguet, Andreas Wombacher, and Tim E. Klifman.

Abstract:

In the field of e-science stream data processing is common place facilitating sensor networks, in particular for prediction and supporting decision making. However, sensor data may be erroneous, like e.g. due to measurement errors (outliers) or changes of the environment. While it can be foreseen that there will be outliers, there are a lot of environmental changes which are not foreseen by scientists and therefore are not considered in the data processing. However, these unforeseen semantic changes – represented as annotations – have to be propagated through the processing. Since the annotations represent an unforeseen, hence un-understandable, annotation, the propagation has to be independent of the annotation semantics. It nevertheless has to preserve the significance of the annotation on the data despite structural and temporal transformations. And should remain meaningful for a user at the end of the data processing. In this paper, we identify the relevant research questions. In particular, the propagation of annotations is based on structural, temporal, and significance contribution. While the consumption of the annotation by the user is focusing on clustering information to ease accessibility.

Examines the narrow case of temperature sensors but I don’t know of any reason why semantic change could not occur in a stream of data reported by a person.

Data Stream Mining Techniques

Wednesday, May 11th, 2011

An analytical framework for data stream mining techniques based on challenges and requirements by Mahnoosh Kholghi and Mohammadreza Keyvanpour.

Abstract:

A growing number of applications that generate massive streams of data need intelligent data processing and online analysis. Real-time surveillance systems, telecommunication systems, sensor networks and other dynamic environments are such examples. The imminent need for turning such data into useful information and knowledge augments the development of systems, algorithms and frameworks that address streaming challenges. The storage, querying and mining of such data sets are highly computationally challenging tasks. Mining data streams is concerned with extracting knowledge structures represented in models and patterns in non stopping streams of information. Generally, two main challenges are designing fast mining methods for data streams and need to promptly detect changing concepts and data distribution because of highly dynamic nature of data streams. The goal of this article is to analyze and classify the application of diverse data mining techniques in different challenges of data stream mining. In this paper, we present the theoretical foundations of data stream analysis and propose an analytical framework for data stream mining techniques.

The paper is an interesting collection of work on mining data streams and its authors should be encouraged to continue their research in this field.

However, the current version is in serious need of editing, both in language usage and in organization. For example, it is hard to relate table 2 (data stream mining techniques) to the analytical framework that was the focus of the article.