Archive for the ‘Time Series’ Category

KairosDB

Saturday, April 6th, 2013

KairosDB

From the webpage:

KairosDB is a fast distributed scalable time series database written primarily for Cassandra but works with HBase as well.

It is a rewrite of the original OpenTSDB project started at Stumble Upon. Many thanks go out to the original authors for laying the groundwork and direction for this great product. See a list of changes here.

Because it is written on top of Cassandra (or HBase) it is very fast and scalable. With a single node we are able to capture 40,000 points of data per second.

Why do you need a time series database? The quick answer is so you can be data driven in your IT decisions. With KairosDB you can use it to track the number of hits on your web server and compare that with the load average on your MySQL database.

Getting Started

Metrics

KairosDB stores metrics. Each metric consists of a name, data points (measurements), and tags. Tags are used to classify the metric.

Metrics can be submitted to KairosDB via telnet protocol or a REST API.

Metrics can be queried using a REST API. Aggregators can be used to manipulate the data as it is returned. This allows downsampling, summing, averaging, etc.
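To make "downsampling" and "averaging" concrete, here is a minimal local sketch of an averaging downsampler. It runs on plain Python tuples and does not call KairosDB; the function name, the millisecond timestamps, and the bucket width are assumptions chosen for illustration.

```python
from collections import defaultdict

def downsample_avg(points, bucket_ms):
    """Average (timestamp_ms, value) points into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of the bucket that contains it.
        buckets[ts - ts % bucket_ms].append(value)
    return [(start, sum(vals) / len(vals))
            for start, vals in sorted(buckets.items())]

# Hypothetical data: five raw points reduced to one average per second.
points = [(0, 10), (400, 20), (1000, 30), (1900, 50), (2500, 5)]
result = downsample_avg(points, 1000)
```

Swapping the average for `sum(vals)` would give a summing aggregator over the same buckets.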

Do be aware that values must be either longs or doubles.
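A rough sketch of what a metric looks like as a data structure, with the long-or-double restriction enforced before anything is sent. The helper `make_metric` and the dictionary field names are assumptions for illustration, not a statement of the KairosDB wire format.

```python
def make_metric(name, datapoints, tags):
    """Build a metric record: a name, (timestamp, value) data points, and tags."""
    for _ts, value in datapoints:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"value {value!r} is not a long or a double")
    return {
        "name": name,
        "datapoints": [list(p) for p in datapoints],
        "tags": dict(tags),
    }

# Hypothetical metric: web server hits, classified by host.
metric = make_metric("web.hits", [(1365206400000, 42)], {"host": "web01"})
```

Anything that is not numeric (an event description, say) fails the check, which is the limitation noted above.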

If your data can be mapped into metric space, KairosDB may be quite useful.

The intersection of time series data with non-metric data or events awaits a different solution.

I first saw this at Alex Popescu’s Kairosdb – Fast Scalable Time Series Database.

Quandl [> 2 million financial/economic datasets]

Tuesday, December 25th, 2012

Quandl (alpha)

From the homepage:

Quandl is a collaboratively curated portal to over 2 million financial and economic time-series datasets from over 250 sources. Our long-term mission is to make all numerical data on the internet easy to find and easy to use.

Interesting enough, but the details on the “about” page are even more so:

Our Vision

The internet offers a rich collection of high quality numerical data on thousands of subjects. But the potential of this data is not being reached at all because the data is very difficult to actually find. Furthermore, it is also difficult to extract, validate, format, merge, and share.

We have a solution: We’re building an intelligent search engine for numerical data. We’ve developed technology that lets people quickly and easily add data to Quandl’s index. Once this happens, the data instantly becomes easy to find and easy to use because it gains 8 essential attributes:

Findability: Quandl is essentially a search engine for numerical data. Every search result on Quandl is an actual data set that you can use right now. Once data from anywhere on the internet becomes known to Quandl, it becomes findable by search and (soon) by browse.
Structure: Quandl is a universal translator for data formats. It accepts numerical data no matter what format it happens to be published in and then delivers it in any format you request it. When you find a dataset on Quandl, you’ll be able to export anywhere you want, in any format you want.
Validity: Every dataset on Quandl has a simple link back to the same data on the publisher’s web site which gives you 100% certainty on validity.
Fusibility: Any data set on Quandl is totally compatible with any and all other data on Quandl. You can merge multiple datasets on Quandl quickly and easily (coming soon).
Permanence: Once a dataset is on Quandl, it stays there forever. It is always up-to-date and available at a permanent, unchanging URL.
Connectivity: Every dataset on Quandl is accessible by a simple API. Whether or not the original publisher offered an API no longer matters because Quandl always does. Quandl is the universal API for numerical data on the internet.
Recency: Every single dataset on Quandl is guaranteed to be the most recent version of that data, retrieved afresh directly from the original publisher.
Utility: Data on Quandl is organized and presented for maximum utility: Actual data is examinable immediately; the data is graphed (properly); description, attribution, units, and export tools are clear and concise.

I have my doubts about the “fusibility” claims. You can check the US Leading Indicators data list and note that “level” and “units” use different units of measurement. Other semantic issues lurk just beneath the surface.

Still, the name of the engine does not begin with “B” or “G” and illustrates there is enormous potential for curated data collections.

Come to think of it, topic maps are curated data collections.

Are you in need of a data curator?

I first saw this in a tweet by Gregory Piatetsky.

Predicting what topics will trend on Twitter [Predicting Merging?]

Friday, November 2nd, 2012

Predicting what topics will trend on Twitter

From the post:

Twitter’s home page features a regularly updated list of topics that are “trending,” meaning that tweets about them have suddenly exploded in volume. A position on the list is highly coveted as a source of free publicity, but the selection of topics is automatic, based on a proprietary algorithm that factors in both the number of tweets and recent increases in that number.

At the Interdisciplinary Workshop on Information and Decision in Social Networks at MIT in November, Associate Professor Devavrat Shah and his student, Stanislav Nikolov, will present a new algorithm that can, with 95 percent accuracy, predict which topics will trend an average of an hour and a half before Twitter’s algorithm puts them on the list — and sometimes as much as four or five hours before.

If you can’t attend the Interdisciplinary Workshop on Information and Decision in Social Networks, which has an exciting final program, try Stanislav Nikolov’s thesis, Trend or No Trend: A Novel Nonparametric Method for Classifying Time Series.

Abstract:

In supervised classification, one attempts to learn a model of how objects map to labels by selecting the best model from some model space. The choice of model space encodes assumptions about the problem. We propose a setting for model specification and selection in supervised learning based on a latent source model. In this setting, we specify the model by a small collection of unknown latent sources and posit that there is a stochastic model relating latent sources and observations. With this setting in mind, we propose a nonparametric classification method that is entirely unaware of the structure of these latent sources. Instead, our method relies on the data as a proxy for the unknown latent sources. We perform classification by computing the conditional class probabilities for an observation based on our stochastic model. This approach has an appealing and natural interpretation — that an observation belongs to a certain class if it sufficiently resembles other examples of that class.

We extend this approach to the problem of online time series classification. In the binary case, we derive an estimator for online signal detection and an associated implementation that is simple, efficient, and scalable. We demonstrate the merit of our approach by applying it to the task of detecting trending topics on Twitter. Using a small sample of Tweets, our method can detect trends before Twitter does 79% of the time, with a mean early advantage of 1.43 hours, while maintaining a 95% true positive rate and a 4% false positive rate. In addition, our method provides the flexibility to perform well under a variety of tradeoffs between types of error and relative detection time.
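The "resembles other examples of that class" idea can be sketched as a distance-weighted vote. This is a simplified illustration in the spirit of the abstract, not the estimator from the thesis: the squared Euclidean distance, the `gamma` scaling, the threshold, and the toy signals are all assumptions.

```python
import math

def distance(a, b):
    """Squared Euclidean distance between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def trend_score(observation, trending, non_trending, gamma=1.0):
    """Ratio of similarity-weighted votes: trending vs. non-trending examples."""
    pos = sum(math.exp(-gamma * distance(observation, r)) for r in trending)
    neg = sum(math.exp(-gamma * distance(observation, r)) for r in non_trending)
    # Note: neg can underflow to zero for very distant examples; a real
    # implementation would need to guard against that.
    return pos / neg

def is_trending(observation, trending, non_trending, gamma=1.0, threshold=1.0):
    """Label the observation as trending if the vote ratio exceeds the threshold."""
    return trend_score(observation, trending, non_trending, gamma) > threshold

# Hypothetical labeled examples of tweet-volume signals.
trending = [[1, 2, 4, 8], [1, 3, 6, 12]]
non_trending = [[2, 2, 2, 2], [3, 2, 3, 2]]

rising = is_trending([1, 2, 5, 9], trending, non_trending)
flat = is_trending([2, 2, 3, 2], trending, non_trending)
```

Raising or lowering `threshold` is one way to picture the tradeoff between types of error that the abstract mentions.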

This will be interesting in many classification contexts.

Particularly for predicting which topics a user will say represent the same subject.

Windows into Relational Events: Data Structures for Contiguous Subsequences of Edges

Friday, September 28th, 2012

Windows into Relational Events: Data Structures for Contiguous Subsequences of Edges by Michael J. Bannister, Christopher DuBois, David Eppstein, Padhraic Smyth.

Abstract:

We consider the problem of analyzing social network data sets in which the edges of the network have timestamps, and we wish to analyze the subgraphs formed from edges in contiguous subintervals of these timestamps. We provide data structures for these problems that use near-linear preprocessing time, linear space, and sublogarithmic query time to handle queries that ask for the number of connected components, number of components that contain cycles, number of vertices whose degree equals or is at most some predetermined value, number of vertices that can be reached from a starting set of vertices by time-increasing paths, and related queries.
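For orientation, here is the naive baseline for one of the queries in the abstract: counting connected components among the edges whose timestamps fall in a window. It rescans the edge list with union-find on every query, which is exactly the cost the paper's data structures are designed to avoid. The function name, the edge tuple layout, and the choice to count only vertices touched by an edge in the window are assumptions.

```python
def count_components(edges, t_start, t_end):
    """Count connected components formed by edges with t_start <= ts <= t_end.

    edges is a list of (timestamp, u, v) tuples. Only vertices that appear
    on at least one edge inside the window are counted.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = 0
    for ts, u, v in edges:
        if not (t_start <= ts <= t_end):
            continue
        for w in (u, v):
            if w not in parent:
                parent[w] = w
                components += 1
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            components -= 1
    return components

# Hypothetical relational events: (timestamp, sender, receiver).
edges = [(1, "a", "b"), (2, "c", "d"), (3, "b", "c"), (4, "e", "f")]
early = count_components(edges, 1, 2)
merged = count_components(edges, 1, 3)
```

The same edge list gives different answers for different windows, which is the point of the question about what time span constitutes the network.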

Among other interesting questions, this raises the issue of what time span of connections constitutes a network of interest. That is more than a matter of being “dynamic”; it is a definitional issue for the social network in question.

If you are working with social networks, a must read.

PS: You probably need to read: Relational events vs graphs, a posting by David Eppstein.

David details several different terms for “relational event data,” and says there are probably others they did not find. (Topic maps anyone?)

UCR Time Series Classification/Clustering Page

Friday, July 6th, 2012

UCR Time Series Classification/Clustering Page

I encountered this while hunting down references on the insect identification contest.

How does your thinking about topic maps or other semantic solutions fare against:

Machine learning research has, to a great extent, ignored an important aspect of many real world applications: time. Existing concept learners predominantly operate on a static set of attributes; for example, classifying flowers described by leaf size, petal colour and petal count. The values of these attributes is assumed to be unchanging — the flower never grows or loses leaves.

However, many real datasets are not “static”; they cannot sensibly be represented as a fixed set of attributes. Rather, the examples are expressed as features that vary temporally, and it is the temporal variation itself that is used for classification. Consider a simple gesture recognition domain, in which the temporal features are the position of the hands, finger bends, and so on. Looking at the position of the hand at one point in time is not likely to lead to a successful classification; it is only by analysing changes in position that recognition is possible.

(Temporal Classification: Extending the Classification Paradigm to Multivariate Time Series by Mohammed Waleed Kadous (2002))

A decade old now but still a nice summary of the issue.

Can we substitute “identification” for “machine learning research?”

Are you relying “…on a static set of attributes” for identity purposes?

Basic Time Series with Cassandra

Thursday, June 21st, 2012

Basic Time Series with Cassandra

From the post:

One of the most common use cases for Cassandra is tracking time-series data. Server log files, usage, sensor data, SIP packets, stuff that changes over time. For the most part this is a straight forward process but given that Cassandra has real-world limitations on how much data can or should be in a row, there are a few details to consider.
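One common way to respect a limit on row size is to fold a time bucket into the row key, so each row holds a bounded slice of the series. The sketch below shows only the key scheme in plain Python; it does not talk to Cassandra, and the key format, the daily bucket, and the helper names are assumptions rather than the post's exact design.

```python
from collections import defaultdict
from datetime import datetime, timezone

def row_key(source, ts, bucket_format="%Y%m%d"):
    """Combine the data source with a time bucket (one row per source per day)."""
    return f"{source}:{ts.strftime(bucket_format)}"

def bucket_events(source, events):
    """Group (datetime, value) events into rows keyed by source and day."""
    rows = defaultdict(dict)
    for ts, value in events:
        # Within a row, the timestamp plays the role of the column name.
        rows[row_key(source, ts)][ts.isoformat()] = value
    return dict(rows)

# Hypothetical sensor readings that straddle midnight UTC.
events = [
    (datetime(2012, 6, 21, 23, 59, tzinfo=timezone.utc), 1.5),
    (datetime(2012, 6, 22, 0, 1, tzinfo=timezone.utc), 2.5),
]
rows = bucket_events("sensor42", events)
```

A busier source might use an hourly bucket (`"%Y%m%d%H"`) to keep rows smaller, at the cost of reading more rows for a long range query.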

As it says in the title, “basic” time series, the post concludes with:

Indexing and Aggregation

Indexing and aggregation of time-series data is a more complicated topic as they are highly application dependent. Various new and upcoming features of Cassandra also change the best practices for how things like aggregation are done so I won’t go into that. For more details, hit #cassandra on irc.freenode and ask around. There is usually somebody there to help.

But why would you collect time-series data if you weren’t going to index and/or aggregate it?

Anyone care to suggest “best practices?”

iSAX

Monday, April 2nd, 2012

iSAX

An extension of the SAX software for larger data sets. Detailed in: iSAX: Indexing and Mining Terabyte Sized Time Series.

Abstract:

Current research in indexing and mining time series data has produced many interesting algorithms and representations. However, it has not led to algorithms that can scale to the increasingly massive datasets encountered in science, engineering, and business domains. In this work, we show how a novel multiresolution symbolic representation can be used to index datasets which are several orders of magnitude larger than anything else considered in the literature. Our approach allows both fast exact search and ultra fast approximate search. We show how to exploit the combination of both types of search as sub-routines in data mining algorithms, allowing for the exact mining of truly massive real world datasets, containing millions of time series.
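A small sketch of what "multiresolution" can mean for a symbolic representation: if a value is discretized against equiprobable Gaussian breakpoints and the cardinality is a power of two, the coarser symbol is just the high-order bits of the finer one. This shows only that nesting property, not the iSAX index; the function names and the numbering of symbols from the lowest region upward are assumptions.

```python
from statistics import NormalDist

def breakpoints(cardinality):
    """Cut points that split the standard normal into equiprobable regions."""
    nd = NormalDist()
    return [nd.inv_cdf(i / cardinality) for i in range(1, cardinality)]

def symbol(value, cardinality):
    """Index of the region containing value (0 is the lowest region)."""
    return sum(value >= b for b in breakpoints(cardinality))

def demote(sym, from_card, to_card):
    """Coarsen a symbol by dropping low-order bits (cardinalities are powers of two)."""
    shift = from_card.bit_length() - to_card.bit_length()
    return sym >> shift

fine = symbol(0.5, 8)        # symbol at cardinality 8
coarse = demote(fine, 8, 4)  # same value viewed at cardinality 4
```

Because the coarse symbol is recoverable from the fine one, symbols of different cardinalities can be compared without going back to the raw series.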

There are a number of data sets at this page with “…warning 500meg file.”

SAX (Symbolic Aggregate approXimation)

Monday, April 2nd, 2012

SAX (Symbolic Aggregate approXimation)

From the webpage:

SAX is the first symbolic representation for time series that allows for dimensionality reduction and indexing with a lower-bounding distance measure. In classic data mining tasks such as clustering, classification, index, etc., SAX is as good as well-known representations such as Discrete Wavelet Transform (DWT) and Discrete Fourier Transform (DFT), while requiring less storage space. In addition, the representation allows researchers to avail of the wealth of data structures and algorithms in bioinformatics or text mining, and also provides solutions to many challenges associated with current data mining tasks. One example is motif discovery, a problem which we defined for time series data. There is great potential for extending and applying the discrete representation on a wide class of data mining tasks.
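The usual SAX pipeline can be sketched in a few lines: z-normalize the series, average it over equal-length segments (piecewise aggregate approximation), then map each segment mean to a letter using equiprobable Gaussian breakpoints. This is a minimal sketch, assuming the series length divides evenly by the word length and the series is not constant; the use of the population standard deviation is also an assumption.

```python
from statistics import NormalDist, mean, pstdev

def sax(series, word_length, alphabet="abcd"):
    """Convert a numeric series into a SAX word of the given length."""
    if len(series) % word_length != 0:
        raise ValueError("series length must be a multiple of word_length")
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        raise ValueError("constant series cannot be z-normalized")
    z = [(x - mu) / sigma for x in series]

    # Piecewise aggregate approximation: one mean per segment.
    seg = len(z) // word_length
    paa = [mean(z[i * seg:(i + 1) * seg]) for i in range(word_length)]

    # Breakpoints that split the standard normal into equiprobable regions.
    size = len(alphabet)
    cuts = [NormalDist().inv_cdf(i / size) for i in range(1, size)]
    return "".join(alphabet[sum(v >= c for c in cuts)] for v in paa)

# Hypothetical series: eight points reduced to a four-letter word.
word = sax([0, 0, 4, 4, 6, 6, 10, 10], word_length=4)
```

Once series are words over a small alphabet, string techniques such as edit distance or suffix structures become applicable, which is the connection to text mining and bioinformatics made in the quote.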

From a testimonial on the webpage:

the performance SAX enables is amazing, and I think a real breakthrough. As an example, we can find similarity searches using edit distance over 10,000 time series in 50 milliseconds. Ray Cromwell, Timepedia.org

Don’t usually see “testimonials” on an academic website but they appear to be merited in this case.

Serious similarity software. Take the time to look.

BTW, you may also be interested in a SAX time series/Shape tutorial. (120 slides about what makes SAX special.)

UCR Time Series Classification/Clustering Page

Monday, April 2nd, 2012

UCR Time Series Classification/Clustering Page

From the webpage:

This webpage has been created as a public service to the data mining/machine learning community, to encourage reproducible research for time series classification and clustering.

While chasing the details on Eamonn Keogh and his time series presentation, I encountered this collection of data sets.

dygraphs JavaScript Visualization Library

Sunday, February 13th, 2011

dygraphs JavaScript Visualization Library

From the website:

dygraphs is an open source JavaScript library that produces interactive, zoomable charts of time series. It is designed to display dense data sets and enable users to explore and interpret them.

If your topic map contains or can be viewed as a time series, this graphics library may be of interest to you.