Archive for the ‘Stream Analytics’ Category

…Efficient Approximate Data De-Duplication in Streams [Approximate Merging?]

Tuesday, December 18th, 2012

Advanced Bloom Filter Based Algorithms for Efficient Approximate Data De-Duplication in Streams by Suman K. Bera, Sourav Dutta, Ankur Narang, Souvik Bhattacherjee.

Abstract:

Applications involving telecommunication call data records, web pages, online transactions, medical records, stock markets, climate warning systems, etc., necessitate efficient management and processing of such massively exponential amount of data from diverse sources. De-duplication or Intelligent Compression in streaming scenarios for approximate identification and elimination of duplicates from such unbounded data stream is a greater challenge given the real-time nature of data arrival. Stable Bloom Filters (SBF) addresses this problem to a certain extent.
.
In this work, we present several novel algorithms for the problem of approximate detection of duplicates in data streams. We propose the Reservoir Sampling based Bloom Filter (RSBF) combining the working principle of reservoir sampling and Bloom Filters. We also present variants of the novel Biased Sampling based Bloom Filter (BSBF) based on biased sampling concepts. We also propose a randomized load balanced variant of the sampling Bloom Filter approach to efficiently tackle the duplicate detection. In this work, we thus provide a generic framework for de-duplication using Bloom Filters. Using detailed theoretical analysis we prove analytical bounds on the false positive rate, false negative rate and convergence rate of the proposed structures. We exhibit that our models clearly outperform the existing methods. We also demonstrate empirical analysis of the structures using real-world datasets (3 million records) and also with synthetic datasets (1 billion records) capturing various input distributions.

If you think of more than one representative for a subject as “duplication,” then merging is a special class of “deduplication.”

Deduplication that discards redundant information but that preserves unique additional information and relationships to other subjects.

As you move away from static and towards transient topic maps, representations of subjects in real time data streams, this and similar techniques will become very important.

I first saw this in a tweet from Stefano Bertolo.

PS: A new equivalent term (to me) for deduplication: “intelligent compression.” Pulls about 46K+ “hits” in a popular search engine today. May want to add it to your routine search queries.

MOA Massively Online Analysis

Saturday, December 1st, 2012

MOA Massively Online Analysis : Real Time Analytics for Data Streams

From the homepage:

What is MOA?

MOA is an open source framework for data stream mining. It includes a collection of machine learning algorithms (classification, regression, and clustering) and tools for evaluation. Related to the WEKA project, MOA is also written in Java, while scaling to more demanding problems.

What can MOA do for you?

MOA performs BIG DATA stream mining in real time, and large scale machine learning. MOA can be easily used with Hadoop, S4 or Storm, and extended with new mining algorithms, and new stream generators or evaluation measures. The goal is to provide a benchmark suite for the stream mining community. Details.

Short tutorials and a manual are available. Enough to get started but you will need additional resources on machine learning if it isn’t already familiar.

A small niggle about documentation. Many projects have files named “tutorial” or in this case “Tutorial1,” or “Manual.” Those files are easier to discover/save, if the project name, version(?), is prepended to tutorial or manual. Thus “Moa-2012-08-tutorial1″ or “Moa-2012-08-manual.”

If data streams are in your present or future, definitely worth a look.

Design a Twitter Like Application with Nati Shalom

Thursday, November 1st, 2012

Design a Twitter Like Application with Nati Shalom

From the description:

Design a large scale NoSQL/DataGrid application similar to Twitter with Nati Shalom.

The use case is solved with Gigaspaces and Cassandra but other NoSQL and DataGrids solutions could be used.

Slides : xebia-video.s3-website-eu-west-1.amazonaws.com/2012-02/realtime-analytics-for-big-data-a-twitter-case-study-v2-ipad.pdf

If you enjoyed the posts I pointed to at: Building your own Facebook Realtime Analytics System, you will enjoy the video. (Same author.)

Not to mention Nati teaches patterns, the specific software being important but incidental.

Sketching and streaming algorithms for processing massive data [Heraclitus and /dev/null]

Saturday, October 27th, 2012

Sketching and streaming algorithms for processing massive data by Jelani Nelson.

From the introduction:

Several modern applications require handling data so massive that traditional algorithmic models do not provide accurate means to design and evaluate efficient algorithms. Such models typically assume that all data fits in memory, and that running time is accurately modeled as the number of basic instructions the algorithm performs. However in applications such as online social networks, large-scale modern scientific experiments, search engines, online content delivery, and product and consumer tracking for large retailers such as Amazon and Walmart, data too large to fit in memory must be analyzed. This consideration has led to the development of several models for processing such large amounts of data: The external memory model [1, 2] and cache-obliviousness [3, 4], where one aims to minimize the number of blocks fetched from disk; property testing [5], where it is assumed the data is so massive that we do not wish to even look at it all and thus aim to minimize the number of probes made into the data; and massively parallel algorithms operating in such systems as MapReduce and Hadoop [6, 7]. Also in some applications, data arrives in a streaming fashion and must be processed on the fly. Such cases arise, for example, with packet streams in network traffic monitoring, or query streams arriving at a Web-based service such as a search engine.

In this article we focus on this latter streaming model of computation, where a given algorithm must make one pass over a data set to then compute some function. We pursue such streaming algorithms, which use memory that is sublinear in the amount of data, since we assume the data is too large to fit in memory. Sometimes it becomes useful to consider algorithms that are allowed not just one, but a few passes over the data, in cases where the data set lives on disk and the number of passes may dominate the overall running time. We also occasionally discuss sketches. A sketch is with respect to some function f, and a sketch of a data set x is a compressed representation of x from which one can compute f(x). Of course under this definition f(x) is itself a valid sketch of x, but we often require more of our sketch than just being able to compute f(x). For example, we typically require that it should be possible for the sketch to be updated as more data arrives, and sometimes we also require sketches of two different data sets that are prepared independently can be compared to compute some function of the aggregate data, or similarity or difference measures across different data sets.

Our goal in this article is not to be comprehensive in our coverage of streaming algorithms. Rather, we discuss in some detail a few surprising results in order to convince the reader that it is possible to obtain some non-trivial algorithms within this model. Those interested in learning more about this area are encouraged to read the surveys [8, 9], or view the notes online for streaming courses taught by Chakrabarti at Dartmouth [10], Indyk at MIT [11], and McGregor at UMass Amherst [12].

Projections of data growth are outdated nearly as soon as they are uttered.

Suffice it to say that whatever data we are called upon to process today, will be larger next year. How much larger depends on the domain, the questions to be answered and a host of other factors. But it will be larger.

We need to develop methods of subject recognition, when like Heraclitus, we cannot ever step in the same stream twice.

If we miss it on the first pass, there isn’t going to be a second one. Next stop for some data streams is going to be /dev/null.

What approaches are you working on?

Wrapping Up TimesOpen: Sockets and Streams

Saturday, September 15th, 2012

Wrapping Up TimesOpen: Sockets and Streams by Joe Fiore.

From the post:

This past Wednesday night, more than 80 developers came to the Times building for the second TimesOpen event of 2012, “Sockets and Streams.”

If you were one of the 80 developers, good for you! The rest of us will have to wait for the videos.

Links to the slides are given but a little larger helping of explanation would be useful.

Data streams have semantic diversity, just like static data, only less time to deal with it.

Ups the semantic integration bar.

Are you ready?

Sockets and Streams [Registration Open - Event 12 September - Hurry]

Tuesday, September 4th, 2012

Sockets and Streams

Wednesday, September 12
7 p.m.–10 p.m

The New York Times
620 Eighth Avenue
New York, NY
15th Floor

From the webpage:

Explore innovations in real-time web systems and content, as well as related topics in interaction design.

Nice way to spend an evening in New York City!

Expect to hear good reports!

Streaming Data Mining Tutorial slides (and more)

Tuesday, August 21st, 2012

Streaming Data Mining Tutorial slides (and more) by Igor Carron.

From the post:

Jelani Nelson.and Edo Liberty just released an important tutorial they gave at KDD 12 on the state of the art and practical algorithms used in mining streaming data, entitled: Streaming Data Mining I personally marvel at the development of these deep algorithms which, because of the large data streams constraints, get to redefine what it means to do seemingly simple functions such as counting in the Big Data world. Here are some slides that got my interest, but the 111 pages are worth the read:

Pointers to more slides and videos follow.

Themes in streaming algorithms (workshop at TU Dortmund)

Saturday, August 11th, 2012

Themes in streaming algorithms (workshop at TU Dortmund) by Anna C. Gilbert.

From the post:

I recently attended the streaming algorithms workshop at Technische Universitat Dortumund. It was a follow-on to the very successful series of streaming algorithms workshops held in Kanpur over the last six years. Suresh and Andrew have both given excellent summaries of the individual talks at the workshop (see day 1, day 2, and day 3) so, as both a streaming algorithms insider and outsider, I thought it would be good to give a high-level overview of what themes there are in streaming algorithms research these days, to identify new areas of research and to highlight advances in existing areas.

Anna gives the briefest of summaries but I think they will entice you to look further.

Curious, how would you distinguish a “stream of data” from “read once data?”

That is in the second case you only get one pass at reading the data. Errors are just that, errors, but you can’t look back to be sure.

Is data “integrity” an artifact of small data sets and under-powered computers?

Hadoop Streaming Made Simple using Joins and Keys with Python

Monday, July 9th, 2012

Hadoop Streaming Made Simple using Joins and Keys with Python

From the post:

There are a lot of different ways to write MapReduce jobs!!!

Sample code for this post https://github.com/joestein/amaunet

I find streaming scripts a good way to interrogate data sets (especially when I have not worked with them yet or are creating new ones) and enjoy the lifecycle when the initial elaboration of the data sets lead to the construction of the finalized scripts for an entire job (or series of jobs as is often the case).

When doing streaming with Hadoop you do have a few library options. If you are a Ruby programmer then wukong is awesome! For Python programmers you can use dumbo and more recently released mrjob.

I like working under the hood myself and getting down and dirty with the data and here is how you can too.

Interesting post and good tips on data exploration. Can’t really query/process the unknown.

Suggestions of other data exploration examples? (Not so much processing the known but looking to “learn” about data sources.)

Visualizing Streaming Text Data with Dynamic Maps

Monday, July 2nd, 2012

Visualizing Streaming Text Data with Dynamic Maps by Emden Gansner, Yifan Hu, and Stephen North.

Abstract:

The many endless rivers of text now available present a serious challenge in the task of gleaning, analyzing and discovering useful information. In this paper, we describe a methodology for visualizing text streams in real time. The approach automatically groups similar messages into “countries,” with keyword summaries, using semantic analysis, graph clustering and map generation techniques. It handles the need for visual stability across time by dynamic graph layout and Procrustes projection techniques, enhanced with a novel stable component packing algorithm. The result provides a continuous, succinct view of evolving topics of interest. It can be used in passive mode for overviews and situational awareness, or as an interactive data exploration tool. To make these ideas concrete, we describe their application to an online service called TwitterScope.

Or, see: TwitterScope, at http://bit.ly/HA6KIR.

Worth the visit to see the static pics in the paper in action.

Definitely a tool with a future in data exploration.

I know “Procrustes” from the classics so had to look up Procrustes transformation. Which was reported to mean:

A Procrustes transformation is a geometric transformation that involves only translation, rotation, uniform scaling, or a combination of these transformations. Hence, it may change the size, but not the shape of a geometric object.

Sounds like abuse of “Procrustes” because I would think having my limbs cut off would change my shape. ;-)

Intrigued by the notion of not changing “…the shape of a geometric object.”

Could we say that adding identifications to a subject representative does not change the subject it identifies?

From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications

Friday, May 18th, 2012

From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications

From the post:

Here is the first series of videos from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012). Congratulations to the local Organizing committee: Joshua Bloom, Damian Eads, Berian James, Peter Nugent, John Rice, Joseph Richards and Dan Starr for making the meeting happen and putting it all on videos for others to learn from. (in near real time!).The titles of the talks are linked to the presentation slides. The full program which ends tomorrow is here.. All the videos are here: Part 1, Part 2, Part 3, Part 4, Part 5.

Posted by Igor Carron at Nuit Blanche.

Finding enough hours to watch all of these is going to be a problem!

Which ones do you like best?

Event Stream Processor Matrix

Thursday, October 6th, 2011

Event Stream Processor Matrix

From the post:

We published our first ever UI-focused post on Top JavaScript Dynamic Table Libraries the other day and got some valuable feedback – thanks!

We are back to talking about the backend again. Our Search Analytics and Scalable Performance Monitoring services/products accept, process, and store huge amounts of data. One thing both of these services do is process a stream of events in real-time (and batch, of course). So what solutions are there that help one process data in real-time and perform some operations on a rolling window of data, such as the last 5 or 30 minutes of incoming event stream? We know of several solutions that fit that bill, so we decided to put together a matrix with essential attributes of those tools in order to compare them and make our pick. Below is the matrix we came up with. If you are viewing this on our site, the table is likely going to be too wide, but it should look find in a proper feed reader.

Another great collection of resources from Sematext!

So, must have subjects because something is being recognized in order to have “events.” That implies to me that with subjects being recognized, it is likely enterprises have other information about those subjects that would be useful to have together, “merged” I think is the term I want. ;-)

Has anyone seen “events” recognized in one system being populated with information from another system? With different criteria for the same subject?