## Archive for the ‘Sampling’ Category

### Speculative Popcount Data Creation

Tuesday, January 14th, 2014

Cognitive systems speculate on big data by Ravi Arimilli.

From the post:

Our brains don’t need to tell our lungs to breathe or our hearts to pump blood. Unfortunately, computers require instructions for everything they do. But what if machines could analyze big data and determine what to do, based on the content of the data, without specific instructions? Patent #8,387,065 establishes a way for computer systems to analyze data in a whole new way, using “speculative” population count (popcount) operations.

Popcount technology has been around for several years. It uses algorithms to pair down the number of traditional instructions a system has to run through to solve a problem. For example, if a problem takes 10,000 instructions to be solved using standard computing, popcount techniques can reduce the number of instructions by more than half.

This is how IBM Watson played Jeopardy! It did not need to be given instructions to look for every possible bit of data to answer a question. Its Power 7-based system used popcount operations to make assumptions about the domain of data in question, to come up with a real time answer.

Reading the patent: Patent #8,387,065, you will find this statement:

An actual method or mechanism by which the popcount is calculated is not described herein because the invention applies to any one of the various popcount algorithms that may be executed by CPU to determine a popcount. (under DETAILED DESCRIPTION OF AN ILLUSTRATIVE EMBODIMENT. There are no section/paragraph numbers, etc.)

IBM patented a process to house a sampling method without ever describing the sampling method. As Ben Stein would say: “wow.”

When I think of IBM patents, I think of eWeek’s IBM Patent: 100 Years of High-Tech Innovations top ten (10) list:

Sampling methods, just like naive Bayes classifiers, work if and only if certain assumptions are met. Naive Bayes classifiers assume all features are independent. Sampling methods, on the other hand, assume a data set is uniform. Meaning that a sample is an accurate reflection of an entire data set.

Uniformity is a chancy assumption because in order to confirm that is the right choice, you have to process data that sampling allows you to avoid.

There are methods to reduce the risks of sampling but it isn’t possible to tell from IBM’s “patent” in this case which of any of them are being used.

### Inferring General Relations between Network Characteristics from Specific Network Ensembles

Sunday, June 10th, 2012

Inferring General Relations between Network Characteristics from Specific Network Ensembles by Stefano Cardanobile, Volker Pernice, Moritz Deger, and Stefan Rotter.

Abstract:

Different network models have been suggested for the topology underlying complex interactions in natural systems. These models are aimed at replicating specific statistical features encountered in real-world networks. However, it is rarely considered to which degree the results obtained for one particular network class can be extrapolated to real-world networks. We address this issue by comparing different classical and more recently developed network models with respect to their ability to generate networks with large structural variability. In particular, we consider the statistical constraints which the respective construction scheme imposes on the generated networks. After having identified the most variable networks, we address the issue of which constraints are common to all network classes and are thus suitable candidates for being generic statistical laws of complex networks. In fact, we find that generic, not model-related dependencies between different network characteristics do exist. This makes it possible to infer global features from local ones using regression models trained on networks with high generalization power. Our results confirm and extend previous findings regarding the synchronization properties of neural networks. Our method seems especially relevant for large networks, which are difficult to map completely, like the neural networks in the brain. The structure of such large networks cannot be fully sampled with the present technology. Our approach provides a method to estimate global properties of under-sampled networks in good approximation. Finally, we demonstrate on three different data sets (C. elegans neuronal network, R. prowazekii metabolic network, and a network of synonyms extracted from Roget’s Thesaurus) that real-world networks have statistical relations compatible with those obtained using regression models.

The key insight is that sampling can provide the basis for reliable estimates of global properties of networks too large to fully model.

If you don’t already get one of the ScienceDaily newsletters, you should consider it.

### Distributed Systems Tracing with Zipkin [Sampling @ Twitter w/ UI]

Saturday, June 9th, 2012

Distributed Systems Tracing with Zipkin

From the post:

Zipkin is a distributed tracing system that we created to help us gather timing data for all the disparate services involved in managing a request to the Twitter API. As an analogy, think of it as a performance profiler, like Firebug, but tailored for a website backend instead of a browser. In short, it makes Twitter faster. Today we’re open sourcing Zipkin under the APLv2 license to share a useful piece of our infrastructure with the open source community and gather feedback.

Hmmm, tracing based on the Dapper paper that comes with a web-based UI for a number of requests. Hard to beat that!

Thinking more about the sampling issue, what if I were to sample a very large stream of proxies and decided to only merge a certain percentage and pipe the rest to /dev/null?

For example, I have an UPI feed and that is my base set of “news” proxies. I have feeds from the various newspaper, radio and TV outlets around the United States. If the proxies from the non-UPI feeds are without some distance of the UPI feed proxies, they are simply discarded.

True, I am losing the information of which newspapers carried the stories, whose bylines consisted of changing the order of the words or dumbing them down, but those may not fall under my requirements.

I would rather than a few dozen very good sources than say 70,000 sources that say the same thing.

If you were testing for news coverage or the spread of news stories, your requirements might be different.

I first saw this at Alex Popescu’s myNoSQL.

### Dapper, a Large-Scale Distributed Systems Tracing Infrastructure [Data Sampling Lessons For “Big Data”]

Saturday, June 9th, 2012

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure by Benjamin H. Sigelman, Luiz Andr´e Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag.

Abstract:

Modern Internet services are often implemented as complex, large-scale distributed systems. These applications are constructed from collections of software modules that may be developed by different teams, perhaps in different programming languages, and could span many thousands of machines across multiple physical facilities. Tools that aid in understanding system behavior and reasoning about performance issues are invaluable in such an environment.

Here we introduce the design of Dapper, Google’s production distributed systems tracing infrastructure, and describe how our design goals of low overhead, application-level transparency, and ubiquitous deployment on a very large scale system were met. Dapper shares conceptual similarities with other tracing systems, particularly Magpie [3] and X-Trace [12], but certain design choices were made that have been key to its success in our environment, such as the use of sampling and restricting the instrumentation to a rather small number of common libraries.

The main goal of this paper is to report on our experience building, deploying and using the system for over two years, since Dapper’s foremost measure of success has been its usefulness to developer and operations teams. Dapper began as a self-contained tracing tool but evolved into a monitoring platform which has enabled the creation of many different tools, some of which were not anticipated by its designers. We describe a few of the analysis tools that have been built using Dapper, share statistics about its usage within Google, present some example use cases, and discuss lessons learned so far.

A very important paper for anyone working with large and complex systems.

With lessons on data sampling as well:

we have found that a sample of just one out of thousands of requests provides sufficient information for many common uses of the tracing data.

You have to wonder in “data in the petabyte range” cases, how many of them could be reduced to gigabyte (or smaller) size with no loss in accuracy?

Which would reduce storage requirements, increase analysis speed, increase the complexity of analysis, etc.

Have you sampled your “big data” recently?

I first saw this at Alex Popescu’s myNoSQL.

### Tall Big Data, Wide Big Data

Saturday, December 17th, 2011

Tall Big Data, Wide Big Data by Luis Apiolaza.

From the post:

After attending two one-day workshops last week I spent most days paying attention to (well, at least listening to) presentations in this biostatistics conference. Most presenters were R users—although Genstat, Matlab and SAS fans were also present and not once I heard “I can’t deal with the current size of my data sets”. However, there were some complaints about the speed of R, particularly when dealing with simulations or some genomic analyses.

Some people worried about the size of coming datasets; nevertheless that worry was across statistical packages or, more precisely, it went beyond statistical software. How will we able to even store the data from something like the Square Kilometer Array, let alone analyze it?

Luis makes a distinction between “tall data,” large data set but few predictors per item, “tall data,” versus small data set but large number of predictors per item, “wide data.” Not certain that sampling works with wide data.

Sampling wide data is a question that can be settled by experimentation. Takers?

### Mining Interesting Subgraphs….

Tuesday, November 23rd, 2010

Mining Interesting Subgraphs by Output Space Sampling

Mohammad Al Hasan’s dissertation was the winner of the SIGKDD Ph.D. Dissertation Award.

From the dissertation:

Output space sampling is an entire paradigm shift in frequent pattern mining (FPM) that holds enormous promise. While traditional FPM strives for completeness, OSS targets to obtain a few interesting samples. The definition of interestingness can be very generic, so user can sample patterns from different target distributions by choosing different interestingness functions. This is very beneficial as mined patterns are subject to subsequent use in various knowledge discovery tasks, like classification, clustering, outlier detection, etc. and the interestingness score of a pattern varies for various tasks. OSS can adapt to this requirement just by changing the interestingness function. OSS also solves pattern redundancy problem by finding samples that are very different from each other. Note that, pattern redundancy hurts any knowledge based system that builds metrics based on the structural similarity of the patterns.

Nice to see recognition that for some data sets we don’t need (or require) full enumeration of all occurrences.

Something topic map advocates need to remember when proselytizing for topic maps.

The goal is not all the information known about a subject.

The goal is all the information a user wants about a subject.

Not the same thing.

Questions:

1. What criteria of “interestingness” would you apply in gathering data for easy access by your patrons? (3-5 pages, no citations)
2. How would you use this technique for authoring and/or testing a topic map? (3-5 pages, not citations. Think of “testing” a topic map as its representativeness of a particular data collection.
3. Bibliography of material citing the paper or applying this technique.