Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

May 24, 2013

Wakari.IO Web-based Python Data Analysis

Filed under: Data Analysis,Python — Patrick Durusau @ 6:20 pm

Wakari.IO Web-based Python Data Analysis

From: Continuum Analytics Launches Full-Featured, In-Browser Data Analytics Environment by Corinna Bahr.

Continuum Analytics, the premier provider of Python-based data analytics solutions and services, today announced the release of Wakari version 1.0, an easy-to-use, cloud-based, collaborative Python environment for analyzing, exploring and visualizing large data sets.

Hosted on Amazon’s Elastic Compute Cloud (EC2), Wakari gives users the ability to share analyses and results via IPython notebook, visualize with Matplotlib, easily switch between multiple versions of Python and its scientific libraries, and quickly collaborate on analyses without having to download data locally to their laptops or workstations. Users can share code and results as simple web URLs, from which other users can easily create their own copies to modify and explore.

Previously in beta, the version 1.0 release of Wakari boasts a number of new features, including:

  • Premium access to SSH, ipcluster configuration, and the full range of Amazon compute nodes and clusters via a drop-down menu
  • Enhanced IPython notebook support, most notably an IPython notebook gallery and an improved UI for sharing
  • Bundles for simplified sharing of files, folders, and Python library dependencies
  • Expanded Wakari documentation
  • Numerous enhancements to the user interface

This looks quite interesting. There is a free option if you are undecided.

I first saw this at: Wakari: Continuum In-Browser Data Analytics Environment.

April 12, 2013

“Almost there….” (Computing Homology)

Filed under: Data Analysis,Feature Spaces,Homology,Topological Data Analysis,Topology — Patrick Durusau @ 4:03 pm

We all remember the pilot in Star Wars who kept saying, “Almost there….” Jeremy Kun has us “almost there…” in his latest installment: Computing Homology.

To give you some encouragement, Jeremy concludes the post saying:

The reader may be curious as to why we didn’t come up with a more full-bodied representation of a simplicial complex and write an algorithm which accepts a simplicial complex and computes all of its homology groups. We’ll leave this direct approach as a (potentially long) exercise to the reader, because coming up in this series we are going to do one better. Instead of computing the homology groups of just one simplicial complex by repeating one algorithm many times, we’re going to compute all the homology groups of a whole family of simplicial complexes in a single bound. This family of simplicial complexes will be constructed from a data set, and so, in grandiose words, we will compute the topological features of data.

If it sounds exciting, that’s because it is! We’ll be exploring a cutting-edge research field known as persistent homology, and we’ll see some of the applications of this theory to data analysis. (bold emphasis added)
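To make “computing homology” a bit more concrete, here is a tiny sketch of my own (Python, not Jeremy’s code) that reads Betti numbers over the rationals off the boundary matrices of a hollow triangle, using beta_k = dim C_k - rank(d_k) - rank(d_{k+1}):

# A minimal illustration, not from Jeremy Kun's series: Betti numbers of a
# hollow triangle (3 vertices, 3 edges, no filled 2-simplices), computed
# over the rationals from its boundary matrices.
import numpy as np

# Columns of d1 are the edges (v0v1, v0v2, v1v2); rows are the vertices.
d1 = np.array([[-1, -1,  0],
               [ 1,  0, -1],
               [ 0,  1,  1]])
d0 = np.zeros((0, 3))  # the boundary of a vertex is zero
d2 = np.zeros((3, 0))  # no triangles are filled in

def betti(dim_chain, d_k, d_k_plus_1):
    """beta_k = dim C_k - rank(d_k) - rank(d_{k+1})."""
    rank_k = np.linalg.matrix_rank(d_k) if d_k.size else 0
    rank_k1 = np.linalg.matrix_rank(d_k_plus_1) if d_k_plus_1.size else 0
    return dim_chain - rank_k - rank_k1

print("beta_0 =", betti(3, d0, d1))  # 1: one connected component
print("beta_1 =", betti(3, d1, d2))  # 1: one loop (the hollow triangle)

Persistent homology, the subject of the upcoming posts, repeats this bookkeeping across a whole filtration of complexes built from data.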

Data analysts are needed at all levels.

Do you want to be a spreadsheet data analyst or something a bit harder to find?

April 11, 2013

Clojure Data Analysis Cookbook

Filed under: Clojure,Data Analysis — Patrick Durusau @ 6:01 am

Clojure Data Analysis Cookbook by Eric Rochester.

I don’t have a copy of Clojure Data Analysis Cookbook but strongly suggest that you read the sample chapter before deciding to buy it.

You will find that two of its eleven chapters, Chapter 6, Working with Incanter Datasets, and Chapter 7, Preparing for and Performing Statistical Data Analysis with Incanter, are focused on Incanter.

The Incanter site, incanter.org, bills itself as “Incanter Data Sorcery.”

If you go to the blog tab, you will find the most recent entry is December 29, 2010.

Twitter tab shows the most recent tweet as July 21, 2012.

The discussion tab does point to recent discussions, but activity since the first of the year (2013) has been light.

I am concerned that a March 2013 title would devote two chapters to what appears not to be a very active project.

Particularly in a rapidly moving area like data analysis.

April 7, 2013

Big Data Is Not the New Oil

Filed under: BigData,Data Analysis — Patrick Durusau @ 3:05 pm

Big Data Is Not the New Oil by Jer Thorp.

From the post:

Every 14 minutes, somewhere in the world, an ad exec strides on stage with the same breathless declaration:

“Data is the new oil!”

It’s exciting stuff for marketing types, and it’s an easy equation: big data equals big oil, equals big profits. It must be a helpful metaphor to frame something that is not very well understood; I’ve heard it over and over and over again in the last two years.

The comparison, at the level it’s usually made, is vapid. Information is the ultimate renewable resource. Any kind of data reserve that exists has not been lying in wait beneath the surface; data are being created, in vast quantities, every day. Finding value from data is much more a process of cultivation than it is one of extraction or refinement.

Jer’s last point, “more a process of cultivation than it is one of extraction or refinement,” and his last recommendation:

…we need to change the way that we collectively think about data, so that it is not a new oil, but instead a new kind of resource entirely.

resonate the most with me.

Everyone can apply the same processes to oil and get out largely the same results.

Data, on the other hand, cannot be processed or analyzed until some user assigns it values.

Data, and the results of analyzing data, have value only because of the assignment of meaning by some user.

Assignment of meaning is fraught with peril, as we saw in K-Nearest Neighbors: dangerously simple.

You can turn the crank on big data, but the results will disappoint unless there is an understanding of the data.

I first saw this at: Big Data Is Not the New Oil

March 28, 2013

…The Analytical Sandbox [Topic Map Sandbox?]

Filed under: Analytics,Data,Data Analysis — Patrick Durusau @ 6:38 pm

Analytics Best Practices: The Analytical Sandbox by Rick Sherman.

From the post:

So this situation sounds familiar, and you are wondering if you need an analytical sandbox…

The goal of an analytical sandbox is to enable business people to conduct discovery and situational analytics. This platform is targeted for business analysts and “power users” who are the go-to people that the entire business group uses when they need reporting help and answers. This target group is the analytical elite of the enterprise.

The analytical elite have been building their own makeshift sandboxes, referred to as data shadow systems or spreadmarts. The intent of the analytical sandbox is to provide the dedicated storage, tools and processing resources to eliminate the need for the data shadow systems.

Rick outlines what he thinks is needed for an analytical sandbox.

What would you include in a topic map sandbox?

March 20, 2013

Pyrallel – Parallel Data Analytics in Python

Filed under: Data Analysis,Parallel Programming,Programming,Python — Patrick Durusau @ 6:12 am

Pyrallel – Parallel Data Analytics in Python by Olivier Grisel.

From the webpage:

Overview: experimental project to investigate distributed computation patterns for machine learning and other semi-interactive data analytics tasks.

Scope:

  • focus on small to medium datasets that fit in memory on a small (10+ nodes) to medium (100+ nodes) cluster.
  • focus on small to medium data (with data locality when possible).
  • focus on CPU bound tasks (e.g. training Random Forests) while trying to limit disk / network access to a minimum.
  • do not focus on HA / Fault Tolerance (yet).
  • do not try to invent a new set of high level programming abstractions (yet): use a low level programming model (IPython.parallel) to finely control the cluster elements and messages transferred, and help identify the practical underlying constraints in a distributed machine learning setting.

Disclaimer: the public API of this library will probably not be stable soon as the current goal of this project is to experiment.
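For a sense of the low-level IPython.parallel model Pyrallel builds on, here is a hypothetical, minimal sketch of my own (not Pyrallel code) that farms a CPU-bound function out to a running IPython cluster:

# Hypothetical sketch, not from Pyrallel: dispatching a CPU-bound task with
# IPython.parallel, assuming a cluster started with `ipcluster start -n 4`.
from IPython.parallel import Client

def noisy_work(seed):
    # Stand-in for a CPU-bound job such as training one tree of a random forest.
    import random
    random.seed(seed)
    return sum(random.random() for _ in range(10 ** 6))

rc = Client()                    # connect to the running ipcluster
view = rc.load_balanced_view()   # send tasks to whichever engine is free
results = view.map_sync(noisy_work, range(16))
print("completed %d tasks" % len(results))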

This project brought to mind two things:

  1. Experimentation can lead to new approaches, such as “Think like a vertex.” (GraphLab: A Distributed Abstraction…), and
  2. A conference anecdote about a Python application deliberately written so the customer would need to upgrade for higher performance. The prototype performed so well that the customer never needed the fuller version. I thought that was a tribute to Python and the programmer. Opinions differed.

March 2, 2013

Kepler Data Tutorial: What can you do?

Filed under: Astroinformatics,Data,Data Analysis — Patrick Durusau @ 4:55 pm

Kepler Data Tutorial: What can you do?

The Kepler mission was designed to hunt for planets orbiting foreign stars. When a planet passes between the Kepler satellite and its home star, the brightness of the light from the star dips.

That isn’t the only reason for changes in brightness, but officially Kepler has to ignore those other reasons. Unofficially, Kepler has encouraged professional and amateur astronomers to search the Kepler data for other explanations of its light curves.

As I mentioned last year in Kepler Telescope Data Release: The Power of Sharing Data, a group of amateurs discovered the first system with four (4) suns and at least one (1) planet.

The Kepler Data Tutorial introduces you to analysis of this data set.
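To make the transit idea concrete, here is a toy sketch (synthetic data and my own thresholds, not the Kepler pipeline) that flags dips in a light curve against a running-median baseline:

# Toy transit detection on a synthetic light curve; not the Kepler pipeline.
import numpy as np

rng = np.random.default_rng(0)
flux = 1.0 + 0.001 * rng.standard_normal(2000)   # normalized stellar brightness
flux[500:520] -= 0.02                            # inject two exaggerated 2% dips
flux[1500:1520] -= 0.02

half = 50
baseline = np.array([np.median(flux[max(0, i - half): i + half + 1])
                     for i in range(len(flux))])  # running-median baseline
depth = baseline - flux
in_transit = depth > 5 * np.std(depth)            # crude 5-sigma threshold
print("points flagged as in-transit:", int(in_transit.sum()))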

February 3, 2013

Text as Data:…

Filed under: Data Analysis,Text Analytics,Text Mining,Texts — Patrick Durusau @ 6:58 pm

Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts by Justin Grimmer and Brandon M. Stewart.

Abstract:

Politics and political conflict often occur in the written and spoken word. Scholars have long recognized this, but the massive costs of analyzing even moderately sized collections of texts have hindered their use in political science research. Here lies the promise of automated text analysis: it substantially reduces the costs of analyzing large collections of text. We provide a guide to this exciting new area of research and show how, in many instances, the methods have already obtained part of their promise. But there are pitfalls to using automated methods—they are no substitute for careful thought and close reading and require extensive and problem-specific validation. We survey a wide range of new methods, provide guidance on how to validate the output of the models, and clarify misconceptions and errors in the literature. To conclude, we argue that for automated text methods to become a standard tool for political scientists, methodologists must contribute new methods and new methods of validation.

As a former political science major, I had to stop to read this article.

A wide-ranging survey of an “exciting new area of research,” but I remember content/text analysis as an undergraduate, north of forty years ago now.

True, some of the measures are new, along with better visualization techniques.

On the other hand, many of the problems of textual analysis now were the problems in textual analysis then (and before).

Highly recommended as a survey of current techniques.

A history of the “problems” of textual analysis and their resistance to various techniques will have to await another day.

Need to discover, access, analyze and visualize big and broad data? Try F#.

Filed under: Data Analysis,Data Mining,F#,Microsoft — Patrick Durusau @ 6:58 pm

Need to discover, access, analyze and visualize big and broad data? Try F#. by Oliver Bloch.

From the post:

Microsoft Research just released a new iteration of Try F#, a set of tools designed to make it easy for anyone – not just developers – to learn F# and take advantage of its big data, cross-platform capabilities.

F# is the open-source, cross-platform programming language invented by Don Syme and his team at Microsoft Research to help reduce the time-to-deployment for analytical software components in the modern enterprise.

Big data definitively is big these days and we are excited about this new iteration of Try F#. Regardless of your favorite language, or if you’re on a Mac, a Windows PC, Linux or Android, if you need to deal with complex problems, you will want to take a look at F#!

Kerry Godes from Microsoft’s Openness Initiative connected with Evelyne Viegas, Director of Semantic Computing at Microsoft Research, to find out more about how you can use “Try F# to seamlessly discover, access, analyze and visualize big and broad data.” For the complete interview, go to the Openness blog or check out www.tryfsharp.org to get started “writing simple code for complex problems”.

Are you an F# user?

Curious how F# compares to other languages for “complexity”?

Visualization gurus: Does the complexity of languages go up or down with the complexity of licensing terms?

Inquiring minds want to know. 😉

January 27, 2013

Information field theory

Filed under: Data Analysis,Information Field Theory,Mathematics,Uncertainty — Patrick Durusau @ 5:41 pm

Information field theory

From the webpage:

Information field theory (IFT) is information theory, the logic of reasoning under uncertainty, applied to fields. A field can be any quantity defined over some space, e.g. the air temperature over Europe, the magnetic field strength in the Milky Way, or the matter density in the Universe. IFT describes how data and knowledge can be used to infer field properties. Mathematically it is a statistical field theory and exploits many of the tools developed for such. Practically, it is a framework for signal processing and image reconstruction.

IFT is fully Bayesian. How else can infinitely many field degrees of freedom be constrained by finite data?

It can be used without the knowledge of Feynman diagrams. There is a full toolbox of methods.

It reproduces many known well working algorithms. This should be reassuring.

And, there were certainly previous works in a similar spirit. See below for IFT publications and previous works.

Anyhow, in many cases IFT provides novel rigorous ways to extract information from data.

Please, have a look! The specific literature is listed below and more general highlight articles on the right hand side.

Just in case you want to be on the cutting edge of information extraction. 😉

And you might note that Feynman diagrams are graphic representations (maps) of complex mathematical equations.
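For a taste of what “fully Bayesian” means here, the simplest IFT reconstruction is the classic Wiener filter, standard textbook material rather than anything specific to the page above. With data $d = Rs + n$, a Gaussian signal prior $s \sim \mathcal{G}(s, S)$ and Gaussian noise $n \sim \mathcal{G}(n, N)$, the posterior mean field is

$$ m = D\,j, \qquad D = \left(S^{-1} + R^{\dagger} N^{-1} R\right)^{-1}, \qquad j = R^{\dagger} N^{-1} d, $$

where $D$ is the posterior covariance (the “information propagator”) and $j$ is the information source. Everything beyond this Gaussian, linear case is where the Feynman diagram machinery mentioned above comes in.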

NIFTY: Numerical information field theory for everyone

NIFTY: Numerical information field theory for everyone

From the post:

Signal reconstruction algorithms can now be developed more elegantly because scientists at the Max Planck Institute for Astrophysics released a new software package for data analysis and imaging, NIFTY, that is useful for mapping in any number of dimensions or spherical projections without encoding the dimensional information in the algorithm itself. The advantage is that once a special method for image reconstruction has been programmed with NIFTY it can easily be applied to many other applications. Although it was originally developed with astrophysical imaging in mind, NIFTY can also be used in other areas such as medical imaging.

Behind most of the impressive telescopic images that capture events at the depths of the cosmos is a lot of work and computing power. The raw data from many instruments are not vivid enough even for experts to have a chance at understanding what they mean without the use of highly complex imaging algorithms. A simple radio telescope scans the sky and provides long series of numbers. Networks of radio telescopes act as interferometers and measure the spatial vibration modes of the brightness of the sky rather than an image directly. Space-based gamma ray telescopes identify sources by the pattern that is generated by the shadow mask in front of the detectors. There are sophisticated algorithms necessary to generate images from the raw data in all of these examples. The same applies to medical imaging devices, such as computer tomographs and magnetic resonance scanners.

Previously each of these imaging problems needed a special computer program that is adapted to the specifications and geometry of the survey area to be represented. But many of the underlying concepts behind the software are generic and ideally would just be programmed once if only the computer could automatically take care of the geometric details.

With this in mind, the researchers in Garching have developed and now released the software package NIFTY that makes this possible. An algorithm written using NIFTY to solve a problem in one dimension can just as easily be applied, after a minor adjustment, in two or more dimensions or on spherical surfaces. NIFTY handles each situation while correctly accounting for all geometrical quantities. This allows imaging software to be developed much more efficiently because testing can be done quickly in one dimension before application to higher dimensional spaces, and code written for one application can easily be recycled for use in another.

NIFTY stands for “Numerical Information Field Theory”. The relatively young field of Information Field Theory aims to provide recipes for optimal mapping, completely exploiting the information and knowledge contained in data. NIFTY now simplifies the programming of such formulas for imaging and data analysis, regardless of whether they come from the information field theory or from somewhere else, by providing a natural language for translating mathematics into software.

Your computer is more powerful than those used to develop generations of atomic bombs.

A wealth of scientific and other data is as close as the next Ethernet port.

Some of the best software in the world is available for free download.

So, what have you discovered lately?

NIFTY is a reminder that discovery is a question of will, not availability of resources.

NIFTY – Numerical Information Field Theory Documentation, download, etc.

From the NIFTY webpage:

NIFTY [1], “Numerical Information Field Theory”, is a versatile library designed to enable the development of signal inference algorithms that operate regardless of the underlying spatial grid and its resolution. Its object-oriented framework is written in Python, although it accesses libraries written in Cython, C++, and C for efficiency.

NIFTY offers a toolkit that abstracts discretized representations of continuous spaces, fields in these spaces, and operators acting on fields into classes. Thereby, the correct normalization of operations on fields is taken care of automatically without concerning the user. This allows for an abstract formulation and programming of inference algorithms, including those derived within information field theory. Thus, NIFTY permits its user to rapidly prototype algorithms in 1D and then apply the developed code in higher-dimensional settings of real world problems. The set of spaces on which NIFTY operates comprises point sets, n-dimensional regular grids, spherical spaces, their harmonic counterparts, and product spaces constructed as combinations of those.
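The “prototype in 1D, reuse in higher dimensions” idea is easy to demonstrate even outside NIFTY. The sketch below is plain NumPy/SciPy, deliberately not the NIFTY API, showing one smoothing routine applied unchanged to a 1D signal and a 2D image:

# Dimension-agnostic smoothing with NumPy/SciPy, to illustrate the idea;
# this is not the NIFTY API.
import numpy as np
from scipy.ndimage import gaussian_filter

def denoise(field, sigma=3.0):
    """Smooth a field of any dimensionality with a Gaussian kernel."""
    return gaussian_filter(field, sigma=sigma)

rng = np.random.default_rng(1)
signal_1d = np.sin(np.linspace(0, 10, 500)) + 0.3 * rng.standard_normal(500)
image_2d = rng.standard_normal((128, 128))

print(denoise(signal_1d).shape)   # (500,)     same routine...
print(denoise(image_2d).shape)    # (128, 128) ...different geometry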

I first saw this at: Software Package for All Types of Imaging, with the usual fun and games of running down useful links.

January 13, 2013

Outlier Analysis

Filed under: Data Analysis,Outlier Detection,Probability,Statistics — Patrick Durusau @ 8:15 pm

Outlier Analysis by Charu Aggarwal (Springer, January 2013). Post by Gregory Piatetsky.

From the post:

This is an authored text book on outlier analysis. The book can be considered a first comprehensive text book in this area from a data mining and computer science perspective. Most of the earlier books in outlier detection were written from a statistical perspective, and precede the emergence of the data mining field over the last 15-20 years.

Each chapter contains carefully organized content on the topic, case studies, extensive bibliographic notes and the future direction of research in this field. Thus, the book can also be used as a reference aid. Emphasis was placed on simplifying the content, so that the material is relatively easy to assimilate. The book assumes relatively little prior background, other than a very basic understanding of probability and statistical concepts. Therefore, in spite of its deep coverage, it can also provide a good introduction to the beginner. The book includes exercises as well, so that it can be used as a teaching aid.

Table of Contents and Introduction. Includes exercises and a 500+ reference bibliography.

Definitely a volume for the short reading list.

Caveat: As an outlier by any measure, my opinions here may be biased. 😉

January 11, 2013

Starting Data Analysis with Assumptions

Filed under: Data Analysis,Data Mining,Data Models — Patrick Durusau @ 7:33 pm

Why you don’t get taxis in Singapore when it rains? by Zafar Anjum.

From the post:

It is common experience that when it rains, it is difficult to get a cab in Singapore, even when you try to call one in or use your smartphone app to book one.

Why does it happen? What could be the reason behind it?

Most people would think that this unavailability of taxis during rain is because of high demand for cab services.

Well, Big Data has a very surprising answer for you, as astonishing as it was for researcher Oliver Senn.

When Senn was first given his assignment to compare two months of weather satellite data with 830 million GPS records of 80 million taxi trips, he was a little disappointed. “Everyone in Singapore knows it’s impossible to get a taxi in a rainstorm,” says Senn, “so I expected the data to basically confirm that assumption.” As he sifted through the data related to a vast fleet of more than 16,000 taxicabs, a strange pattern emerged: it appeared that many taxis weren’t moving during rainstorms. In fact, the GPS records showed that when it rained (a frequent occurrence in this tropical island state), many drivers pulled over and didn’t pick up passengers at all.

Senn did discover the reason for the patterns in the data, which is being addressed.

The first question should have been: Is this a big data problem?

True, Senn had lots of data to crunch, but that isn’t necessarily an indicator of a big data problem.

Interviews of a few taxi drivers would have dispelled the original assumption of high demand for taxis. It would also have led to the cause of the patterns Senn recognized.

That is, the patterns were a symptom, not a cause.

I first saw this in So you want to be a (big) data hero? by Vinnie Mirchandani.

December 26, 2012

Educated Guesses Decorated With Numbers

Filed under: Data,Data Analysis,Open Data — Patrick Durusau @ 1:48 pm

Researchers Say Much to Be Learned from Chicago’s Open Data by Sam Cholke.

From the post:

HYDE PARK — Chicago is a vain metropolis, publishing every minute detail about the movement of its buses and every little skirmish in its neighborhoods. A team of researchers at the University of Chicago is taking that flood of data and using it to understand and improve the city.

“Right now we have more data than we’re able to make use of — that’s one of our motivations,” said Charlie Catlett, director of the new Urban Center for Computation and Data at the University of Chicago.

Over the past two years the city has unleashed a torrent of data about bus schedules, neighborhood crimes, 311 calls and other information. Residents have put it to use, but Catlett wants his team of computational experts to get a crack at it.

“Most of what is happening with public data now is interesting, but it’s people building apps to visualize the data,” said Catlett, a computer scientist at the university and Argonne National Laboratory.

Catlett and a collection of doctors, urban planners and social scientists want to analyze that data so to solve urban planning puzzles in some of Chicago’s most distressed neighborhoods and eliminate the old method of trial and error.

“Right now we look around and look for examples where something has worked or appeared to work,” said Keith Besserud, an architect at Skidmore, Owings and Merrill's Blackbox Studio and part of the new center. “We live in a city, so we think we understand it, but it’s really not seeing the forest for the trees, we really don’t understand it.”

Besserud said urban planners have theories but lack evidence to know for sure when greater density could improve a neighborhood, how increased access to public transportation could reduce unemployment and other fundamental questions.

“We’re going to try to break down some of the really tough problems we’ve never been able to solve,” Besserud said. “The issue in general is the field of urban design has been inadequately served by computational tools.”

In the past, policy makers would make educated guesses. Catlett hopes the work of the center will better predict such needs using computer models, and the data is only now available to answer some fundamental questions about cities.

…(emphasis added)

Some city services may be improved by increased data, such as staging ambulances near high density shooting locations based upon past experience.

That isn’t the same as “planning” to reduce the incidence of unemployment or crime by urban planning.

If you doubt that statement, consider the vast sums of economic data available for the past century.

Despite that array of data, there are no universally acclaimed “truths” or “policies” for economic planning.

The temptation to say that “more data,” “better data,” “better integration of data,” etc. will solve problem X is ever present.

Avoid disappointing your topic map customers.

Make sure a problem is one data can help solve before treating it like one.

I first saw this in a tweet by Tim O’Reilly.

December 24, 2012

24 Christmas Gifts from is.R

Filed under: Data Analysis,R — Patrick Durusau @ 2:56 pm

24 Christmas Gifts from is.R by David Smith.

From the post:

The is.R blog has been on a roll in December with their Advent CalendaR feature: daily tips about R to unwrap each day leading up to Christmas. If you haven't been following it, start with today's post and scroll down. Sadly there isn't a tag to collect all these great posts together, but here are a few highlights:

A new to me blog, is.R, a great idea to copy for Christmas next year (posts on the Advent calendar), and high quality posts to enjoy!

Now that really is a bundle of Christmas joy!

Coursera’s Data Analysis with R course starts Jan 22

Filed under: CS Lectures,Data Analysis,R — Patrick Durusau @ 2:48 pm

Coursera’s Data Analysis with R course starts Jan 22 by David Smith.

From the post:

Following on from Coursera’s popular course introducing the R language, a new course on data analysis with R starts on January 22. The simply-titled Data Analysis course will provide practically-oriented instruction on how to plan, carry out, and communicate analyses of real data sets with R.

See also: Computing for Data Analysis course, which starts January 2nd.

Being sober by January 2nd is going to be a challenge but worth the effort. 😉

December 6, 2012

Advanced Data Analysis from an Elementary Point of View

Filed under: Data Analysis,Mathematics,Statistics — Patrick Durusau @ 11:35 am

Advanced Data Analysis from an Elementary Point of View by Cosma Rohilla Shalizi. (UPDATE: 2014 draft)

From the Introduction:

These are the notes for 36-402, Advanced Data Analysis, at Carnegie Mellon. If you are not enrolled in the class, you should know that it’s the methodological capstone of the core statistics sequence taken by our undergraduate majors (usually in their third year), and by students from a range of other departments. By this point, they have taken classes in introductory statistics and data analysis, probability theory, mathematical statistics, and modern linear regression (“401”). This class does not presume that you have learned but forgotten the material from the pre-requisites; it presumes that you know that material and can go beyond it. The class also presumes a firm grasp on linear algebra and multivariable calculus, and that you can read and write simple functions in R. If you are lacking in any of these areas, now would be an excellent time to leave.

36-402 is a class in statistical methodology: its aim is to get students to understand something of the range of modern1 methods of data analysis, and of the considerations which go into choosing the right method for the job at hand (rather than distorting the problem to fit the methods the student happens to know). Statistical theory is kept to a minimum, and largely introduced as needed.

[Footnote 1] Just as an undergraduate “modern physics” course aims to bring the student up to about 1930 (more specifically, to 1926), this class aims to bring the student up to about 1990.

A very recent introduction to data analysis. Shalizi includes a list of concepts in the introduction that are best mastered before tackling this material.

According to footnote 1, when you have mastered this material, you have another twenty-two years to make up in general and on your problem in particular.

Still, knowing it cold will put you ahead of a lot of data analysis you are going to encounter.

I first saw this in a tweet by Gene Golovchinsky.

November 23, 2012

Tera-scale Astronomical Data Analysis and Visualization

Filed under: Astroinformatics,BigData,Data Analysis,Visualization — Patrick Durusau @ 11:27 am

Tera-scale Astronomical Data Analysis and Visualization by A. H. Hassan, C. J. Fluke, D. G. Barnes, V. A. Kilborn.

Abstract:

We present a high-performance, graphics processing unit (GPU)-based framework for the efficient analysis and visualization of (nearly) terabyte (TB)-sized 3-dimensional images. Using a cluster of 96 GPUs, we demonstrate for a 0.5 TB image: (1) volume rendering using an arbitrary transfer function at 7–10 frames per second; (2) computation of basic global image statistics such as the mean intensity and standard deviation in 1.7 s; (3) evaluation of the image histogram in 4 s; and (4) evaluation of the global image median intensity in just 45 s. Our measured results correspond to a raw computational throughput approaching one teravoxel per second, and are 10–100 times faster than the best possible performance with traditional single-node, multi-core CPU implementations. A scalability analysis shows the framework will scale well to images sized 1 TB and beyond. Other parallel data analysis algorithms can be added to the framework with relative ease, and accordingly, we present our framework as a possible solution to the image analysis and visualization requirements of next-generation telescopes, including the forthcoming Square Kilometre Array pathfinder radiotelescopes.

Looks like the original “big data” folks (astronomy) are moving up to analysis of near-terabyte-sized images.

A glimpse of data and techniques that are rapidly approaching.

I first saw this in a tweet by Stefano Bertolo.

November 11, 2012

Introducing Wakari

Filed under: Data Analysis,Programming,Python — Patrick Durusau @ 1:30 pm

Introducing Wakari by Paddy Mullen.

From the post:

We are proud to introduce Wakari, our hosted Python data analysis environment.

We believe that programmers, scientists, and analysts should spend their time writing code, not working to setup a system. Data should be shareable, and analysis should be repeatable. We built Wakari to achieve these goals.

Sample Use Cases

We think Wakari will be useful for many people in all types of industries. Here are just three of the many use cases that Wakari will help with.

Learning python

If you want to learn Python, Wakari is the perfect environment. Wakari makes it easy to start writing code immediately, without needing to install any software on your own computer. You will be able to show instructors your code and get feedback as to where you’re getting hung up.

Academia

If you’re an academic frustrated by setting up computing environments and annoyed that your colleagues can’t easily run your code, Wakari is made for you. Wakari handles all of the problems related to setting up a Python scientific computing environment. Because Wakari builds on Anaconda, useful libraries like SciKit, mpi4py and NumPy are right at your fingertips without compilation gymnastics.

Since you run code on our servers through a web browser, it is easy for your colleagues to re-run your code to repeat your analysis, or try out variations on their own. At Continuum, we understand that reproducibility is an important part of the scientific process: your results must be consistent for reviewers and colleagues.

Finance

(graphic omitted)

For users who work in finance, Wakari lets you avoid the drudgery of emailing Excel files to share analysis, data, and visuals. Since data feeds are integrated into the Python environment, it is effortless to import financial data into your coding environment. When it is time to share results, you can email colleagues a URL that links to running code. Interactive charts are easy to create and share from Python. Since Wakari is built on top of Anaconda, great libraries like NumPy, Scipy, Matplotlib, and Pandas are already installed. Wakari includes support for Anaconda’s multiple environments, so you can easily change between versions of Python (including Python 3.3!) and versions of fundamental libraries.

Interesting in part because Wakari further blurs the distinction between “your” computer and the “host.”

If you are performing analysis on data (assuming a high speed connection), does it really matter if “your” computer is running the analysis or simply displaying the results from some remote host?

Not a completely new concept for those of you who remember desktops that booted from servers.

Interesting as well as a model for how authoring aids for topic maps could be delivered (or at least their results) to topic map authors.

Want a concordance of text at Y location? Enter the URI. Want other NLP routines? Choose from this list. Separate and apart from any authoring engine. (It’s called modularity.)

November 10, 2012

MDP – Modular toolkit for Data Processing

Filed under: Data Analysis,Python — Patrick Durusau @ 1:36 pm

MDP – Modular toolkit for Data Processing

From the webpage:

Modular toolkit for Data Processing (MDP) is a Python data processing framework.

From the user’s perspective, MDP is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feed-forward network architectures.

From the scientific developer’s perspective, MDP is a modular framework, which can easily be expanded. The implementation of new algorithms is easy and intuitive. The newly implemented units are then automatically integrated with the rest of the library.

The base of available algorithms is steadily increasing and includes signal processing methods (Principal Component Analysis, Independent Component Analysis, Slow Feature Analysis), manifold learning methods ([Hessian] Locally Linear Embedding), several classifiers, probabilistic methods (Factor Analysis, RBM), data pre-processing methods, and many others.
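As a rough illustration of combining units into processing sequences, an MDP flow might look like the sketch below. It is written from memory of the documented API, so treat the exact class and method names as assumptions and check the MDP documentation:

# Rough sketch of an MDP processing sequence (PCA followed by ICA); class and
# method names are assumptions drawn from the MDP docs, so verify before use.
import numpy as np
import mdp

x = np.random.random((1000, 20))          # 1000 observations, 20 variables

flow = mdp.Flow([
    mdp.nodes.PCANode(output_dim=5),      # reduce to 5 principal components
    mdp.nodes.FastICANode(),              # then unmix them with ICA
])
flow.train(x)                             # train each node in sequence
y = flow.execute(x)                       # push data through the whole flow
print(y.shape)                            # (1000, 5)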

If you are using Python, you might want to give MDP a try.

I first saw this in a tweet by Chris@SocialTexture.

Fantasy Analytics

Filed under: Analytics,Data Analysis,Marketing,Users — Patrick Durusau @ 12:52 pm

Fantasy Analytics by Jeff Jonas.

From the post:

Sometimes it just amazes me what people think is computable given their actual observation space. At times you have to look them in the eye and tell them they are living in fantasyland.

Jeff’s post will have you rolling on the floor!

Except that you can think of several industry and government IT projects that would fit seamlessly into his narrative.

The TSA doesn’t need “bomb” written on the outside of your carry-on luggage. They have “observers” who are watching passengers to identify terrorists. Their score so far? 0.

Which means really clever terrorists are eluding these brooding “observers.”

The explanation could not be, after spending $millions on training, salaries, etc., that the concept of observers spotting terrorists is absurd.

They might recognize a suicide vest but most TSA employees can do that.

I am printing out Jeff’s post to keep on my desk.

To share with clients who are asking for absurd things.

If they don’t “get it,” I can thank them for their time and move on to more intelligent clients.

Who will complain less about being specific, appreciate the results and be good references for future business.

I first saw this in a tweet by Jeffrey Carr.

October 29, 2012

Exploring Data in Engineering, the Sciences, and Medicine

Filed under: Data Analysis,R — Patrick Durusau @ 3:08 pm

Exploring Data in Engineering, the Sciences, and Medicine by Ronald Pearson.

From the description:

The recent dramatic rise in the number of public datasets available free from the Internet, coupled with the evolution of the Open Source software movement, which makes powerful analysis packages like R freely available, have greatly increased both the range of opportunities for exploratory data analysis and the variety of tools that support this type of analysis.

This book will provide a thorough introduction to a useful subset of these analysis tools, illustrating what they are, what they do, and when and how they fail. Specific topics covered include descriptive characterizations like summary statistics (mean, median, standard deviation, MAD scale estimate), graphical techniques like boxplots and nonparametric density estimates, various forms of regression modeling (standard linear regression models, logistic regression, and highly robust techniques like least trimmed squares), and the recognition and treatment of important data anomalies like outliers and missing data. The unique combination of topics presented in this book separate it from any other book of its kind.

Intended for use as an introductory textbook for an exploratory data analysis course or as self-study companion for professionals and graduate students, this book assumes familiarity with calculus and linear algebra, though no previous exposure to probability or statistics is required. Both simulation-based and real data examples are included, as are end-of-chapter exercises and both R code and datasets.
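One of the techniques listed above, the MAD scale estimate, fits in a few lines. A hedged sketch of mine (in Python, where the book works in R) of flagging outliers with the median/MAD rule:

# Not from the book (which uses R): flag outliers with the median and the MAD
# scale estimate instead of the mean and standard deviation.
import numpy as np

def mad_outliers(x, threshold=3.5):
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = 1.4826 * np.median(np.abs(x - med))   # scaled to match sigma for normal data
    robust_z = (x - med) / mad
    return np.abs(robust_z) > threshold

data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 25.0])  # one obvious outlier
print(mad_outliers(data))   # only the 25.0 is flagged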

I encountered this while reading Characterizing a new dataset by the same author.

If you think of topic maps as a means to capture the results of exploring data sets, explorations by different explorers, possibly for different reasons, then the results of data exploration become grist for a topic map mill.

There are no reader reviews at Amazon but I would be happy to correct that. 😉

October 15, 2012

Exploring Splunk: Search Processing Language (SPL) Primer and Cookbook

Filed under: Data Analysis,Searching,Splunk — Patrick Durusau @ 2:35 pm

Exploring Splunk: Search Processing Language (SPL) Primer and Cookbook by David Carraso.

From the webpage:

Splunk is probably the single most powerful tool for searching and exploring data you will ever encounter. Exploring Splunk provides an introduction to Splunk — a basic understanding of Splunk’s most important parts, combined with solutions to real-world problems.

Part I: Exploring Splunk

  • Chapter 1 tells you what Splunk is and how it can help you.
  • Chapter 2 discusses how to download Splunk and get started.
  • Chapter 3 discusses the search user interface and searching with Splunk.
  • Chapter 4 covers the most commonly used search commands.
  • Chapter 5 explains how to visualize and enrich your data with knowledge.

Part II: Solution Recipes

  • Chapter 6 covers the most common monitoring and alerting solutions.
  • Chapter 7 covers the most common transaction solutions.
  • Chapter 8 covers the most common lookup table solutions.

My Transaction Searching: Unifying Field Names post is based on an excerpt from this book.

You can download the book in ePub, pdf or Kindle versions or order a hardcopy.

Documentation that captures the interest of a reader.

Not documentation that warns them the software is going to be painful, even if beneficial in the long term.

Most projects could benefit from using “Exploring Splunk” as a model for introductory documentation.

September 25, 2012

Coursera’s free online R course starts today

Filed under: Data Analysis,R — Patrick Durusau @ 3:31 pm

Coursera’s free online R course starts today by David Smith.

From the post:

Coursera offers a number of on-line courses, all available for free and taught by experts in their fields. Today, the course Computing for Data Analysis begins. Taught by Johns Hopkins Biostatistics professor (and co-author of the Simply Statistics blog) Roger Peng, the course will teach you how to program in R and use the language for data analysis. Here’s a brief introduction to the course:

(video omitted)

The course will run for the next 4 weeks, with a workload of 3-5 hours per week. You can sign up at the link below.

Coursera: Computing for Data Analysis

A day late but you can still register (I just did).

September 15, 2012

Wrapping Up TimesOpen: Sockets and Streams

Filed under: Data Analysis,Data Streams,node-js,Stream Analytics — Patrick Durusau @ 10:41 am

Wrapping Up TimesOpen: Sockets and Streams by Joe Fiore.

From the post:

This past Wednesday night, more than 80 developers came to the Times building for the second TimesOpen event of 2012, “Sockets and Streams.”

If you were one of the 80 developers, good for you! The rest of us will have to wait for the videos.

Links to the slides are given but a little larger helping of explanation would be useful.

Data streams have semantic diversity, just like static data, only with less time to deal with it.

Ups the semantic integration bar.

Are you ready?

August 16, 2012

Mining of Massive Datasets [Revised – Mining Large Graphs Added]

Filed under: BigData,Data Analysis,Data Mining — Patrick Durusau @ 7:04 pm

Mining of Massive Datasets by Anand Rajaraman and Jeff Ullman.

Version 1.0 errata frozen as of June 4, 2012.

Version 1.1 adds Jure Leskovec as a co-author and adds a chapter on mining large graphs.

Both versions can be downloaded as chapters or as entire text.

August 13, 2012

Using an In-Memory Data Grid for Near Real-Time Data Analysis

Filed under: Data Analysis,MapReduce — Patrick Durusau @ 3:19 pm

Using an In-Memory Data Grid for Near Real-Time Data Analysis by Dr. William Bain, ScaleOut Software, Inc.

Vendor content, so the usual disclaimers apply, but this may signal an important if subtle shift in computing environments.

From the post:

Introduction

In today’s competitive world, businesses need to make fast decisions to respond to changing market conditions and to maintain a competitive edge. The explosion of data that must be analyzed to find trends or hidden insights intensifies this challenge. Both the private and public sectors are turning to parallel computing techniques, such as “map/reduce” to quickly sift through large data volumes.

In some cases, it is practical to analyze huge sets of historical, disk-based data over the course of minutes or hours using batch processing platforms such as Hadoop. For example, risk modeling to optimize the handling of insurance claims potentially needs to analyze billions of records and tens of terabytes of data. However, many applications need to continuously analyze relatively small but fast-changing data sets measured in the hundreds of gigabytes and reaching into terabytes. Examples include clickstream data to optimize online promotions, stock trading data to implement trading strategies, machine log data to tune manufacturing processes, smart grid data, and many more.

Over the last several years, in-memory data grids (IMDGs) have proven their value in storing fast-changing application data and scaling application performance. More recently, IMDGs have integrated map/reduce analytics into the grid to achieve powerful, easy-to-use analysis and enable near real-time decision making. For example, the following diagram illustrates an IMDG used to store and analyze incoming streams of market and news data to help generate alerts and strategies for optimizing financial operations. This article explains how using an IMDG with integrated map/reduce capabilities can simplify data analysis and provide important competitive advantages.
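To keep the map/reduce vocabulary concrete before the commentary below, here is a tiny in-memory sketch (plain Python, not ScaleOut’s product or Hadoop) that maps clickstream events to partial counts and reduces them into per-promotion totals:

# Conceptual map/reduce over in-memory data; plain Python, not an IMDG product.
from collections import Counter
from functools import reduce

clicks = [
    {"user": "a", "promotion": "spring-sale"},
    {"user": "b", "promotion": "spring-sale"},
    {"user": "c", "promotion": "clearance"},
]

def map_click(event):
    """Map step: one click event becomes a partial count."""
    return Counter({event["promotion"]: 1})

def merge(left, right):
    """Reduce step: merge two partial counts."""
    return left + right

promo_counts = reduce(merge, map(map_click, clicks), Counter())
print(promo_counts.most_common())   # [('spring-sale', 2), ('clearance', 1)]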

Lowering the complexity of map/reduce, increasing operation speed (no file I/O), and enabling easier parallelism are all good things.

But they are differences in degree, not in kind.

I find IMDGs interesting because of the potential to increase the complexity of relationships between data, including data that is the output of operations.

From the post:

For example, an e-commerce Web site may need to monitor online shopping carts to see which products are selling.

Yawn.

That is probably a serious technical/data issue for Walmart or Home Depot, but it is a difference in degree. You could do the same operations with a shoebox and paper receipts, although that would take a while.

Consider the beginning of something a bit more imaginative: What if sales at stores were treated differently than online shopping carts (due to delivery factors) and models were built using weather forecasts three to five days out, time of year, local holidays and festivals? Multiple relationships between different data nodes.

That is just a back of an envelope sketch and I am sure successful retailers do even more than what I have suggested.

Complex relationships between data elements are almost at our fingertips.

Are you still counting shopping cart items?

July 27, 2012

Anaconda: Scalable Python Computing

Filed under: Anaconda,Data Analysis,Machine Learning,Python,Statistics — Patrick Durusau @ 10:19 am

Anaconda: Scalable Python Computing

Easy, Scalable Distributed Data Analysis

Anaconda is a distribution that combines the most popular Python packages for data analysis, statistics, and machine learning. It has several tools for a variety of types of cluster computations, including MapReduce batch jobs, interactive parallelism, and MPI.

All of the packages in Anaconda are built, tested, and supported by Continuum. Having a unified runtime for distributed data analysis makes it easier for the broader community to share code, examples, and best practices — without getting tangled in a mess of versions and dependencies.

Good way to avoid dependency issues!

On scaling, I am reminded of a developer who designed a Python application to require upgrading for “heavy” use. Much to their disappointment, Python scaled under “heavy” use with no need for an upgrade. 😉

I saw this in Christophe Lalanne’s Bag of Tweets for July 2012.

July 18, 2012

Computing for Data Analysis (Coursera course – R)

Filed under: Data Analysis,R — Patrick Durusau @ 11:21 am

Computing for Data Analysis by Roger D. Peng.

Description:

In this course you will learn how to program in R and how to use R for effective data analysis. You will learn how to install and configure software necessary for a statistical programming environment, discuss generic programming language concepts as they are implemented in a high-level statistical language. The course covers practical issues in statistical computing which includes programming in R, reading data into R, creating informative data graphics, accessing R packages, creating R packages with documentation, writing R functions, debugging, and organizing and commenting R code. Topics in statistical data analysis and optimization will provide working examples.

Readings:

The volume by Chambers looks comprehensive (500 or so pages) enough to be sufficient for the course.

Next Session: 24 September 2012 (4 weeks)
Workload: 3-5 hours per week

June 30, 2012

Guide to Intelligent Data Analysis

Filed under: Data Analysis — Patrick Durusau @ 6:48 pm

Guide to Intelligent Data Analysis: How to Intelligently Make Sense of Real Data, by Berthold, M.R., Borgelt, C., Höppner, F., and Klawonn, F. Texts in Computer Science, Springer Verlag, 1st edition, 2010. ISBN 978-1-84882-259-7.

Review snippet posted to book’s website:

“The clear and complete exposition of arguments, along with the attention to formalization and the balanced number of bibliographic references, make this book a bright introduction to intelligent data analysis. It is an excellent choice for graduate or advanced undergraduate courses, as well as for researchers and professionals who want get acquainted with this field of study. … Overall, the authors hit their target producing a textbook that aids in understanding the basic processes, methods, and issues for intelligent data analysis.” (Corrado Mencar, ACM Computing Reviews, April, 2011)

In some sense dated by not including the very latest improvements in the Hadoop ecosystem, but all the more valuable for not focusing on ephemera. Rather, it focuses on the principles of data analysis that are broadly applicable across data sets and tools.

The website includes slides and bibliographic references for use in teaching these materials.

I first saw this at KDNuggets.

