Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

February 28, 2012

The R-Podcast

Filed under: R,Statistics — Patrick Durusau @ 8:41 pm

The R-Podcast

From the about page:

Whether you have experience with commercial statistical software such as SAS or SPSS and want to learn R, or are getting into statistical computing for the first time, the R-Podcast will provide you with valuable information and advice that will help you to tap into the power of R. Our intent is to start with the basic concepts that can be a struggle for those new to R and statistical computing. We will give practical advice on how to take advantage of R’s capabilities to accomplish innovative and robust data analyses. Along the way we will highlight the additional tools and packages that greatly enhance the experience of using R, and highlight resources that can help people become experts with R. While this podcast is not meant to be a series of lectures on statistics, we will use freely and publicly available data sets to illustrate both basic statistical analyses as well as state-of-the-art algorithms to show how powerful and robust R can be for analyzing today’s explosion of data. In addition to the audio podcast, we will also produce screencasts for hands-on demonstrations for those topics that are best explained via video.

Your host:

The host of the R-Podcast is Eric Nantz, a statistician working in the life sciences industry who has been using R since 2004. Eric quickly developed a passion for using and learning more about R due in large part to the brilliant and exciting R community and its free and open source heritage, much like his favorite operating system Linux. Currently he is running Linux Mint and Ubuntu on many of his computers at home, each running R of course!

The site hosts podcasts and resources about R.

Only two podcasts so far but it sounds like an interesting project.

February 27, 2012

Multivariate Statistical Analysis: Old School

Filed under: Biplots,Multivariate Statistics,Statistics — Patrick Durusau @ 8:24 pm

Multivariate Statistical Analysis: Old School by John Marden.

From the preface:

The goal of this text is to give the reader a thorough grounding in old-school multivariate statistical analysis. The emphasis is on multivariate normal modeling and inference, both theory and implementation. Linear models form a central theme of the book. Several chapters are devoted to developing the basic models, including multivariate regression and analysis of variance, and especially the “both-sides models” (i.e., generalized multivariate analysis of variance models), which allow modeling relationships among individuals as well as variables. Growth curve and repeated measure models are special cases.

The linear models are concerned with means. Inference on covariance matrices covers testing equality of several covariance matrices, testing independence and conditional independence of (blocks of) variables, factor analysis, and some symmetry models. Principal components, though mainly a graphical/exploratory technique, also lends itself to some modeling.

Classification and clustering are related areas. Both attempt to categorize individuals. Classification tries to classify individuals based upon a previous sample of observed individuals and their categories. In clustering, there is no observed categorization, nor often even knowledge of how many categories there are. These must be estimated from the data.

Other useful multivariate techniques include biplots, multidimensional scaling, and canonical correlations.

The bulk of the results here are mathematically justified, but I have tried to arrange the material so that the reader can learn the basic concepts and techniques while plunging as much or as little as desired into the details of the proofs.

Practically all the calculations and graphics in the examples are implemented using the statistical computing environment R [R Development Core Team, 2010]. Throughout the notes we have scattered some of the actual R code we used. Many of the data sets and original R functions can be found in the file http://www.istics.net/r/multivariateOldSchool.r. For others we refer to available R packages.

This is “old school.” A preface that contains useful information and outlines what the reader may find? Definitely “old school.”
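
The preface mentions that R code is scattered throughout the notes. For anyone who wants a taste before downloading the book's functions, here is a minimal sketch of my own (not code from the text) fitting the kind of multivariate linear model the book centers on, using nothing but base R:

```r
# My own toy example, not code from Marden's text: a multivariate
# regression with two response variables and a MANOVA-style test.
set.seed(42)
n  <- 50
x  <- rnorm(n)
y1 <-  1 + 2.0 * x + rnorm(n)   # first response
y2 <- -1 + 0.5 * x + rnorm(n)   # second response
Y  <- cbind(y1, y2)

fit <- lm(Y ~ x)                # multivariate linear model (class "mlm")
summary(fit)                    # per-response coefficient tables
anova(fit, test = "Wilks")      # multivariate test of the x term
```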

Found thanks to: Christophe Lalanne’s A bag of tweets / Feb 2012.

Biplots in Practice

Filed under: Biplots,Statistics — Patrick Durusau @ 8:23 pm

Biplots in Practice by Michael Greenacre.

I was rather disappointed in the pricing information for the monographs on biplots cited in Christophe Lalanne’s Biplots. Particularly since most users would be new to biplots and reluctant to invest that kind of money in a monograph.

With a little searching I came across this volume by Michael Greenacre, which is described as follows:

Biplots in Practice is a comprehensive introduction to one of the most useful and versatile methods of multivariate data visualization: the biplot. The biplot extends the idea of a simple scatterplot of two variables to the case of many variables, with the objective of visualizing the maximum possible amount of information in the data. Research data are typically presented in the form of a rectangular table and the biplot takes its name from the fact that it visualizes the rows and the columns of this table in a common space. This book explains the specific interpretation of the biplot in many different areas of multivariate analysis, notably regression, generalized linear modelling, principal component analysis, log-ratio analysis, various forms of correspondence analysis and discriminant analysis. It includes applications in many different fields of the social and natural sciences, and provides three detailed case studies documenting how the biplot reveals structure in large complex data sets in genomics (where thousands of variables are commonly encountered), in social survey research (where many categorical variables are studied simultaneously) and ecological research (where relationships between two sets of variables are investigated).

It is available online as well as a print publication.

The R code and other supplemental materials are available at this site.
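
If you want to see a biplot before spending money on any monograph, base R will draw one from a principal components fit. A minimal sketch on a built-in data set:

```r
# Principal component biplot of the iris measurements: observations
# (rows) and variables (columns) displayed in the same plot.
pca <- prcomp(iris[, 1:4], scale. = TRUE)  # PCA on standardized variables
biplot(pca, cex = 0.6)                     # scores and loadings together
summary(pca)                               # variance explained per component
```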

In terms of promoting biplots, I think this is a step in the right direction.

February 22, 2012

Eurostat

Filed under: Data,Dataset,Government Data,Statistics — Patrick Durusau @ 4:48 pm

Eurostat

From the “about” page:

Eurostat’s mission: to be the leading provider of high quality statistics on Europe.

Eurostat is the statistical office of the European Union situated in Luxembourg. Its task is to provide the European Union with statistics at European level that enable comparisons between countries and regions.

This is a key task. Democratic societies do not function properly without a solid basis of reliable and objective statistics. On one hand, decision-makers at EU level, in Member States, in local government and in business need statistics to make those decisions. On the other hand, the public and media need statistics for an accurate picture of contemporary society and to evaluate the performance of politicians and others. Of course, national statistics are still important for national purposes in Member States whereas EU statistics are essential for decisions and evaluation at European level.

Statistics can answer many questions. Is society heading in the direction promised by politicians? Is unemployment up or down? Are there more CO2 emissions compared to ten years ago? How many women go to work? How is your country’s economy performing compared to other EU Member States?

International statistics are a way of getting to know your neighbours in Member States and countries outside the EU. They are an important, objective and down-to-earth way of measuring how we all live.

I have seen Eurostat mentioned, usually negatively, by data aggregation services. I visited Eurostat today and found it quite useful.

For the non-data professional, there are graphs and other visualizations of popular data.

For the data professional, there are bulk downloads of data and other technical information.
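
The bulk files I looked at are tab-separated and gzip-compressed, so they load into R with nothing exotic. A hedged sketch (the file name below is only a placeholder for whichever table you download):

```r
# Read a Eurostat bulk-download table. "some_table.tsv.gz" is a
# placeholder; substitute the actual file you downloaded.
dat <- read.delim(gzfile("some_table.tsv.gz"), check.names = FALSE)

str(dat)    # inspect the structure; the first column usually packs several dimensions
head(dat)   # remaining columns are typically time periods
```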

I am sure there is room for improvement, but specific feedback is required to make that happen. (It has been my experience that positive, specific feedback works best. Find something nice to say and then suggest a change to improve the outcome.)

January 18, 2012

Statistics 110: Introduction to Probability

Filed under: Mathematics,Statistics — Patrick Durusau @ 7:52 pm

Statistics 110: Introduction to Probability by Joseph Blitzstein.

Description:

Statistics 110 (Introduction to Probability), taught at Harvard University by Joe Blitzstein in Fall 2011. Lecture videos, homework, review material, practice exams, and a large collection of practice problems with detailed solutions are provided. This course is an introduction to probability as a language and set of tools for understanding statistics, science, risk, and randomness. The ideas and methods are useful in statistics, science, philosophy, engineering, economics, finance, and everyday life. Topics include the following. Basics: sample spaces and events, conditional probability, Bayes’ Theorem. Random variables and their distributions: cumulative distribution functions, moment generating functions, expectation, variance, covariance, correlation, conditional expectation. Univariate distributions: Normal, t, Binomial, Negative Binomial, Poisson, Beta, Gamma. Multivariate distributions: joint, conditional, and marginal distributions, independence, transformations, Multinomial, Multivariate Normal. Limit theorems: law of large numbers, central limit theorem. Markov chains: transition probabilities, stationary distributions, reversibility, convergence.

Like Michael Heise, I haven’t watched the lectures but I would appreciate hearing comments from anyone who does.

Particularly in an election year, when people are going to be using (mostly abusing) statistics to influence your vote in city, county (parish in Louisiana), state, and federal elections.
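
If you do dip into the lectures, the limit theorems in the course description are easy to watch happen in a few lines of R. A small sketch of the law of large numbers and the central limit theorem:

```r
set.seed(110)

# Law of large numbers: the running mean of die rolls settles near 3.5.
rolls <- sample(1:6, 10000, replace = TRUE)
running_mean <- cumsum(rolls) / seq_along(rolls)
plot(running_mean, type = "l", xlab = "Number of rolls", ylab = "Running mean")
abline(h = 3.5, lty = 2)

# Central limit theorem: means of skewed (exponential) samples look normal.
sample_means <- replicate(5000, mean(rexp(30, rate = 1)))
hist(sample_means, breaks = 40, main = "Means of exponential samples")
```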

First seen at Statistics via iTunes by Michael Heise.

January 12, 2012

Probably Overthinking It

Filed under: Mathematics,Statistics — Patrick Durusau @ 7:28 pm

Probably Overthinking It: A blog by Allen Downey about statistics and probability.

If your work has any aspect of statistics/probability about it, you probably need to be reading this blog.

I commend it to topic mappers because claims about data are often expressed as statistics.

Not to mention that the results of statistics are subjects themselves, which you may wish to include in your topic map.

January 7, 2012

Statistical Rules of Thumb, Part III – Always Visualize the Data

Filed under: Data,Marketing,Statistics,Visualization — Patrick Durusau @ 4:05 pm

Statistical Rules of Thumb, Part III – Always Visualize the Data

From the post:

As I perused Statistical Rules of Thumb again, as I do from time to time, I came across this gem. (note: I live in CA, so get no money from these amazon links).

Van Belle uses the term “Graph” rather than “Visualize”, but it is the same idea. The point is to visualize in addition to computing summary statistics. Summaries are useful, but can be deceiving; any time you summarize data you will lose some information unless the distributions are well behaved. The scatterplot, histogram, box and whiskers plot, etc. can reveal ways the summaries can fool you. I’ve seen these as well, especially variables with outliers or that are bi- or tri-modal.
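
Van Belle's point is easy to demonstrate with the anscombe data set that ships with R: four x/y pairs with nearly identical correlations but very different scatterplots. A quick sketch:

```r
# Anscombe's quartet: nearly identical summary statistics, very different pictures.
data(anscombe)

# The four correlations are almost the same...
sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))

# ...but the plots tell four different stories.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(op)
```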

What techniques do you use in visualizing topic maps? Such as hiding topics or associations? Or coloring schemes that appear to work better than others? Or do you integrate the information delivered by the topic map with other visualizations? Such as street maps, blueprints or floor plans?

Post seen at: Data Mining and Predictive Analytics

January 5, 2012

Baltimore gun offenders and where academics don’t live

Filed under: Data Analysis,Geographic Data,Statistics — Patrick Durusau @ 4:06 pm

Baltimore gun offenders and where academics don’t live

An interesting plotting of the residential addresses (not crime locations) of gun offenders. You need to see the post to observe how stark the “island” of academics appears on the map.

Illustration of non-causation, unless you want to contend that the presence of academics in a neighborhood drives out gun offenders. Which would argue in favor of more employment and wider residential patterns for academics. I would favor that but suspect that is personal bias.

A cross between this map and a map of gun offenses would be a good guide for housing prospects in Baltimore.

What other data would be useful for such a map? Education, libraries, fire protection, other crime rates…. Easy enough since there are geographic boundaries as the binding points but “summing up” information as you zoom out might be interesting.

That is to say, crime statistics are reported on a police district basis, and as you zoom out you want information from multiple districts merged and re-sorted. Or you have overlapping districts for water, electricity, police, fire, etc. A geographic grid becomes your starting place, but only a starting place.
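
A toy sketch of that roll-up in R, with invented district names and counts (nothing here is real Baltimore data), just to show the re-aggregation step:

```r
# Invented example: roll district-level counts up to a coarser region level
# as the viewer "zooms out". None of these figures are real.
crime <- data.frame(
  district     = c("NE", "NW", "SE", "SW"),
  region       = c("North", "North", "South", "South"),
  gun_offenses = c(120, 95, 210, 180)
)

aggregate(gun_offenses ~ region, data = crime, FUN = sum)
```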

December 28, 2011

400 Free Online Courses from Top Universities

Filed under: CS Lectures,Mathematics,Statistics — Patrick Durusau @ 9:37 pm

400 Free Online Courses from Top Universities

Just in case hard core math/cs stuff isn’t your cup of tea or you want to write topic maps about some other area of study, this may be a resource for you.

Oddly enough (?), every listing of free courses seems to be different from other listings of free courses.

If you happen to run across seminar lectures (graduate school) on Ancient or Medieval philosophy, drop me a line. Or even better, on individual figures.

I first saw this linked on John Johnson’s Realizations in Biostatistics. John was pointing to the statistics/math courses but there is a wealth of other material as well.

December 21, 2011

UseR! 2011 slides and videos – on one page

Filed under: Conferences,R,Statistics — Patrick Durusau @ 7:21 pm

UseR! 2011 slides and videos – on one page

From the post:

I was recently reminded that the wonderful team at Warwick University made sure to put online many of the slides (and some videos) of talks from the recent useR 2011 conference. You can browse through the talks by going between the timetables (where it will be the most updated, if more slides will be added later), but I thought it might be more convenient for some of you to have the links to all the talks (with slides/videos) in one place.

I am grateful for all of the wonderful people who put their time in making such an amazing event (organizers, speakers, attendees), and also for the many speakers who made sure to share their talk/slides online for all of us to reference. I hope to see this open-slides trend continue in the upcoming useR conferences…

Just in case you get a new R book over the holidays or even if you don’t, this is an amazing set of presentations. From business forecasting and medical imaging to social networks and modeling galaxies, something for everyone.

This looks like a very entertaining conference. Will watch for the announcement of next year’s conference.

December 16, 2011

Detecting Novel Associations in Large Data Sets

Filed under: Bioinformatics,Data Mining,Statistics — Patrick Durusau @ 8:23 am

Detecting Novel Associations in Large Data Sets by David N. Reshef, Yakir A. Reshef, Hilary K. Finucane, Sharon R. Grossman, Gilean McVean, Peter J. Turnbaugh, Eric S. Lander, Michael Mitzenmacher, Pardis C. Sabeti.

Abstract:

Identifying interesting relationships between pairs of variables in large data sets is increasingly important. Here, we present a measure of dependence for two-variable relationships: the maximal information coefficient (MIC). MIC captures a wide range of associations both functional and not, and for functional relationships provides a score that roughly equals the coefficient of determination (R2) of the data relative to the regression function. MIC belongs to a larger class of maximal information-based nonparametric exploration (MINE) statistics for identifying and classifying relationships. We apply MIC and MINE to data sets in global health, gene expression, major-league baseball, and the human gut microbiota and identify known and novel relationships.

Lay version: Tool detects patterns hidden in vast data sets by Haley Bridger.

Data and software: http://exploredata.net/.

From the article:

Imagine a data set with hundreds of variables, which may contain important, undiscovered relationships. There are tens of thousands of variable pairs—far too many to examine manually. If you do not already know what kinds of relationships to search for, how do you efficiently identify the important ones? Data sets of this size are increasingly common in fields as varied as genomics, physics, political science, and economics, making this question an important and growing challenge (1, 2).

One way to begin exploring a large data set is to search for pairs of variables that are closely associated. To do this, we could calculate some measure of dependence for each pair, rank the pairs by their scores, and examine the top-scoring pairs. For this strategy to work, the statistic we use to measure dependence should have two heuristic properties: generality and equitability.

By generality, we mean that with sufficient sample size the statistic should capture a wide range of interesting associations, not limited to specific function types (such as linear, exponential, or periodic), or even to all functional relationships (3). The latter condition is desirable because not only do relationships take many functional forms, but many important relationships—for example, a superposition of functions—are not well modeled by a function (4–7).

By equitability, we mean that the statistic should give similar scores to equally noisy relationships of different types. For example, we do not want noisy linear relationships to drive strong sinusoidal relationships from the top of the list. Equitability is difficult to formalize for associations in general but has a clear interpretation in the basic case of functional relationships: An equitable statistic should give similar scores to functional relationships with similar R2 values (given sufficient sample size).

Here, we describe an exploratory data analysis tool, the maximal information coefficient (MIC), that satisfies these two heuristic properties. We establish MIC’s generality through proofs, show its equitability on functional relationships through simulations, and observe that this translates into intuitively equitable behavior on more general associations. Furthermore, we illustrate that MIC gives rise to a larger family of statistics, which we refer to as MINE, or maximal information-based nonparametric exploration. MINE statistics can be used not only to identify interesting associations, but also to characterize them according to properties such as nonlinearity and monotonicity. We demonstrate the application of MIC and MINE to data sets in health, baseball, genomics, and the human microbiota. (footnotes omitted)

As you can imagine the line:

MINE statistics can be used not only to identify interesting associations, but also to characterize them according to properties such as nonlinearity and monotonicity.

caught my eye.

I usually don’t post until the evening but this looks very important. I wanted everyone to have a chance to grab the data and software before the weekend.
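
To see why the equitability property matters, here is a toy comparison of my own in R, using plain Pearson correlation rather than the authors' MIC. An ordinary measure scores a noisy linear relationship highly while nearly missing an equally noisy sinusoidal one:

```r
# Toy illustration (not MIC): Pearson's r is not "equitable" -- it rewards
# the linear relationship and nearly misses the sinusoidal one, even though
# both carry the same amount of added noise.
set.seed(1)
n <- 1000
x <- runif(n, 0, 10)
y_linear <- x + rnorm(n, sd = 1)
y_sine   <- 4 * sin(x) + rnorm(n, sd = 1)

cor(x, y_linear)   # strong score for the linear relationship
cor(x, y_sine)     # near zero, despite a strong (nonlinear) association
```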

New acronyms:

MIC – maximal information coefficient

MINE – maximal information-based nonparametric exploration

Good thing they chose acronyms we would not be likely to confuse with other usages. 😉

Full citation:

Science 16 December 2011:
Vol. 334 no. 6062 pp. 1518-1524
DOI: 10.1126/science.1205438

December 7, 2011

SP-Sem-MRL 2012

Filed under: Conferences,Parsing,Semantics,Statistics — Patrick Durusau @ 8:13 pm

ACL 2012 Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages (SP-Sem-MRL 2012)

Important dates:

Submission deadline: March 31, 2012 (PDT, GMT-8)
Notification to authors: April 21, 2012
Camera ready copy: May 5, 2012
Workshop: TBD, during the ACL 2012 workshop period (July 12-14, 2012)

From the website:

Morphologically Rich Languages (MRLs) are languages in which grammatical relations such as Subject, Predicate, Object, etc., are indicated morphologically (e.g. through inflection) instead of positionally (as in, e.g. English), and the position of words and phrases in the sentence may vary substantially. The tight connection between the morphology of words and the grammatical relations between them, and the looser connection between the position and grouping of words to their syntactic roles, pose serious challenges for syntactic and semantic processing. Furthermore, since grammatical relations provide the interface to compositional semantics, morpho-syntactic phenomena may significantly complicate processing the syntax–semantics interface. In statistical parsing, which has been a cornerstone of research in NLP and has seen great advances due to the widespread availability of syntactically annotated corpora, English parsing performance has reached a high plateau in certain genres, which is however not always indicative of parsing performance in MRLs, dependency-based and constituency-based alike. Semantic processing of natural language has similarly seen much progress in recent years. However, as in parsing, the bulk of the work has concentrated on English, and MRLs may present processing challenges that the community is as of yet unaware of, and which current semantic processing technologies may have difficulty coping with. These challenges may lurk in areas where parses may be used as input, such as semantic role labeling, distributional semantics, paraphrasing and textual entailments, or where inadequate pre-processing of morphological variation hurts parsing and semantic tasks alike.

This joint workshop aims to build upon the first and second SPMRL workshops (at NAACL-HLT 2010 and IWPT 2011, respectively) while extending the overall scope to include semantic processing where MRLs pose challenges for algorithms or models initially designed to process English. In particular, we seek to explore the use of newly available syntactically and/or semantically annotated corpora, or data sets for semantic evaluation that can contribute to our understanding of the difficulty that such phenomena pose. One goal of this workshop is to encourage cross-fertilization among researchers working on different languages and among those working on different levels of processing. Of particular interest is work addressing the lexical sparseness and out-of-vocabulary (OOV) issues that occur in both syntactic and semantic processing.

The exploration of non-English languages will replicate many of the outstanding entity recognition/data integration problems experienced in English. Considering that there are massive economic markets that speak non-English languages, the first to make progress on such issues will have a commercial advantage. How much of one I suspect depends on how well your software works in a non-English language.

November 25, 2011

StatTrek

Filed under: Statistics — Patrick Durusau @ 4:22 pm

StatTrek

I saw this site referenced by the Analysis Factor when discussing calculation of the binomial distribution. Like the writer there, I just fell in love with the name.

There is a fair amount of advertising but that isn’t going to hurt you. Besides, the site has a number of useful resources.
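
The binomial calculation that sent me there is a one-liner in R, if you would rather skip the online calculator. For example, the chance of exactly three heads in ten fair coin flips, and of at most three:

```r
dbinom(3, size = 10, prob = 0.5)   # P(X = 3) in 10 fair coin flips
pbinom(3, size = 10, prob = 0.5)   # P(X <= 3)
```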

November 15, 2011

Computational Statistics: An Introduction

Filed under: Computational Statistics,Statistics — Patrick Durusau @ 7:57 pm

Computational Statistics: An Introduction by James E. Gentle, Wolfgang Härdle, Yuichi Mori.

I suspect this to be:

Handbook of Computational Statistics (Volume I) Concepts and Methods
Gentle, James E.; Härdle, Wolfgang; Mori, Yuichi (Eds.)
2004, XII, 1070 p. 236 illus., Hardcover
ISBN: 3-540-40464-3
Language: English
Publisher: Springer-Verlag New York

(source: http://www.rmi.nus.edu.sg/csf/webpages/)

But there is no date or other information about the book on the webpages that I could find.

If it is the 2004 edition (which seems likely), I doubt the basics of computational statistics have changed all that much. The only way to know for sure would be to get a copy of the second edition, if one is issued, and compare the two.

If you make a comparison or have other information about this resource, please post a response.

I checked at the Springer site and they no longer list this work in the series. There is a nice selection of > $300 (US) books in the series if you are interested.

Data mining is going in the direction of greater reliance on computational analysis so a firm grounding in statistics (and their limitations) is a good skill set to have.

November 8, 2011

Statistical Learning Part III

Filed under: Statistical Learning,Statistics — Patrick Durusau @ 7:45 pm

Statistical Learning Part III by Steve Miller.

From the post:

I finally got around to cleaning up my home office the other day. The biggest challenge was putting away all the loose books in such a way that I can quickly retrieve them when needed.

In the clutter I found two copies of “The Elements of Statistical Learning” by Trevor Hastie, Robert Tibshirani and Jerome Friedman – one I purchased two years ago and the other I received at a recent Statistical Learning and Data Mining (SLDM III) seminar taught by the first two authors. ESL is quite popular in the predictive modeling world, often referred to by aficionados as “the book”, “the SL book” or the “big yellow book” in reverence to its status as the SL bible.

Hastie, Tibshirani and Friedman are Professors of Statistics at Stanford University, the top-rated stats department in the country. For over 20 years, the three have been leaders in the field of statistical learning and prediction that sits between traditional statistical modeling and data mining algorithms from computer science. I was introduced to their work when I took the SLDM course three years ago.

Interesting discussion of statistical learning with Q/A session at the end.

November 5, 2011

Statsmodels

Filed under: Python,Statistics — Patrick Durusau @ 6:44 pm

Statsmodels

From the webpage:

scikits.statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration. An extensive list of result statistics is available for each estimator. The results are tested against existing statistical packages to ensure that they are correct. The package is released under the open source Simplified BSD (2-clause) license.

November 2, 2011

GPUStats

Filed under: CUDA,Parallel Programming,Statistics — Patrick Durusau @ 6:25 pm

GPUStats

If you need to access an NVIDIA CUDA interface for statistical calculations, GPUStats may be of assistance.

From the webpage:

gpustats is a PyCUDA-based library implementing functionality similar to that present in scipy.stats. It implements a simple framework for specifying new CUDA kernels and extending existing ones. Here is a (partial) list of target functionality:

  • Probability density functions (pdfs). These are intended to speed up likelihood calculations in particular in Bayesian inference applications, such as in PyMC
  • Random variable generation using CURAND

October 31, 2011

Statistics with R

Filed under: R,Statistics — Patrick Durusau @ 7:31 pm

Statistics with R

From the webpage:

Here are the notes I took while discovering and using the statistical environment R. However, I do not claim any competence in the domains I tackle: I hope you will find those notes useful, but keep your eyes open — errors and bad advice are still lurking in those pages…

Another statistics with R guide. If you use more than one while learning R, I would appreciate your comparison/comments.

October 12, 2011

Statistical Computing

Filed under: R,Statistics — Patrick Durusau @ 4:37 pm

Statistical Computing

From the webpage:

Description

Computational data analysis is an essential part of modern statistics. Competent statisticians must not just be able to run existing programs, but to understand the principles on which they work. They must also be able to read, modify and write code, so that they can assemble the computational tools needed to solve their data-analysis problems, rather than distorting problems to fit tools provided by others. This class is an introduction to programming, targeted at statistics majors with minimal programming knowledge, which will give them the skills to grasp how statistical software works, tweak it to suit their needs, recombine existing pieces of code, and when needed create their own programs.

Students will learn the core ideas of programming — functions, objects, data structures, flow control, input and output, debugging, logical design and abstraction — through writing code to assist in numerical and graphical statistical analyses. Students will in particular learn how to write maintainable code, and to test code for correctness. They will then learn how to set up stochastic simulations, how to parallelize data analyses, how to employ numerical optimization algorithms and diagnose their limitations, and how to work with and filter large data sets. Since code is also an important form of communication among scientists, students will learn how to comment and organize code.

The class will be taught in the R language.

Statistics this Fall at Carnegie Mellon. Lectures, Homework, Labs.

Whether you simply missed statistics or it has been a while, this looks very good.
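
To give a flavor of the kind of code the description has in mind, here is a small sketch of my own (not course material): a short, testable function wrapped around a stochastic simulation.

```r
# My own sketch, not course material: a small function plus a crude check,
# the sort of "write it, then test it" habit the syllabus describes.
estimate_pi <- function(n_points) {
  # Monte Carlo estimate of pi: fraction of random points inside the unit circle.
  x <- runif(n_points)
  y <- runif(n_points)
  4 * mean(x^2 + y^2 <= 1)
}

est <- estimate_pi(1e5)
stopifnot(abs(est - pi) < 0.05)   # crude sanity check on the estimate
est
```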

Top 50 Statistics Blogs

Filed under: Mathematics,Statistics — Patrick Durusau @ 4:36 pm

Top 50 Statistics Blogs

From the post:

Statistics is a branch of mathematics that deals with the interpretation of data. Statisticians work in a wide variety of fields in both the private and the public sectors. They are teachers, consultants, watchdogs, journalists, designers, programmers, and by and large, ordinary people like you and me. And some of them blog.

In searching for the top statistics blogs on the web we only considered blogs that have been active in 2011. In deciding which ones to include in our (admittedly unscientific) list of the 50 best statistics blogs we considered a range of factors, including visual appeal/aesthetics, frequency of posts, and accessibility to non-specialists. Our goal is to highlight blogs that students and prospective students will find useful and interesting in their exploration of the field.

I’m not quite sure of the reason for the explanation of statistics at the head of a list of the top 50 statistics blogs but it isn’t a serious defect.

(I first saw this at www.r-bloggers.org.)

September 6, 2011

Electronic Statistics Textbook

Filed under: Mathematics,Statistics — Patrick Durusau @ 7:02 pm

Electronic Statistics Textbook

From the website:

The only Internet Resource about Statistics Recommended by Encyclopedia Britannica

StatSoft has freely provided the Electronic Statistics Textbook as a public service for more than 12 years now.

This Textbook offers training in the understanding and application of statistics. The material was developed at the StatSoft R&D department based on many years of teaching undergraduate and graduate statistics courses and covers a wide variety of applications, including laboratory research (biomedical, agricultural, etc.), business statistics, credit scoring, forecasting, social science statistics and survey research, data mining, engineering and quality control applications, and many others.

The Electronic Textbook begins with an overview of the relevant elementary (pivotal) concepts and continues with a more in depth exploration of specific areas of statistics, organized by “modules” and accessible by buttons, representing classes of analytic techniques. A glossary of statistical terms and a list of references for further study are included.

Proper citation
(Electronic Version): StatSoft, Inc. (2011). Electronic Statistics Textbook. Tulsa, OK: StatSoft. WEB: http://www.statsoft.com/textbook/. (Printed Version): Hill, T. & Lewicki, P. (2007). STATISTICS: Methods and Applications. StatSoft, Tulsa, OK.

This is going to get a bookmark for sure!

July 12, 2011

MADlib goes beta!

Filed under: Data Analysis,SQL,Statistics — Patrick Durusau @ 7:08 pm

MADlib goes beta! Serious in-database analytics

From the post:

MADlib is an open-source statistical analytics package for SQL that I kicked off last year with friends at EMC-Greenplum. Last Friday we saw it graduate from alpha to the first beta release version, 0.20beta. Hats off to the MADlib team!

Forget your previous associations with low-tech SQL analytics, including so-called “business intelligence”, “olap”, “data cubes” and the like. This is the real deal: statistical and machine learning methods running at scale within the database, massively parallel, close to the data. Much of the code is written in SQL (a language that doesn’t get enough credit as a basis for parallel statistics), with key extensions in C/C++ for performance, and the occasional Python glue code. The suite of methods in the beta includes:

  • standard statistical methods like multi-variate linear and logistic regressions,
  • supervised learning methods including support-vector machines, naive Bayes, and decision trees
  • unsupervised methods including k-means clustering, association rules and Latent Dirichlet Allocation
  • descriptive statistics and data profiling, including one-pass Flajolet-Martin and CountMin sketch methods (my personal contributions to the library) to compute distinct counts, range-counts, quantiles, various types of histograms, and frequent-value identification
  • statistical support routines including an efficient sparse vector library and array operations, and conjugate gradient optimization.

Kudos to EMC:

And hats off to EMC-Greenplum for putting significant development resources behind this open-source effort. I started this project at Greenplum before they were acquired, and have been happy to see EMC embrace it and push it further.

Not every acquisition has that happy result.

June 3, 2011

Red-R – Pipeline Visual Editor for Doing Stats With R

Filed under: R,Statistics — Patrick Durusau @ 2:31 pm

Red-R – Pipeline Visual Editor for Doing Stats With R

Overview of Red-R, a visual editor for R.

From the post:

To ease my way into R, I’ve started using R-Studio, an in-development IDE. But the other day, I was also tipped off about Red-R, a visual programming environment for R that seems to be built around the same tooling as the Orange data analysis tool I wrote about last year.

It’s still pretty ropey at the moment (on a Mac at least), but works enough to be going on with…

The metaphor is based on pipeline processing of data, chaining together functional blocks with wires in the order you want the functions to be executed. Getting data in is currently from a file (it would be nice to see hooks into online datasources supported too), with a range of options for getting the data into the environment in a structured way:….

How would you visualize topic map processing as topics, etc., encounter constraints or are formed by queries? What of items that are discarded or not selected? Thinking of something along the lines of interactive creation/destruction of topics along with merging of the same.

May 18, 2011

Practical Machine Learning

Filed under: Algorithms,Machine Learning,Statistical Learning,Statistics — Patrick Durusau @ 6:45 pm

Practical Machine Learning, by Michael Jordan (UC Berkeley).

From the course webpage:

This course introduces core statistical machine learning algorithms in a (relatively) non-mathematical way, emphasizing applied problem-solving. The prerequisites are light; some prior exposure to basic probability and to linear algebra will suffice.

This is the Michael Jordan who gave a Posner Lecture at the 24th Annual Conference on Neural Information Processing Systems 2010.

March 19, 2011

Think Stats: Probability and Statistics for Programmers

Filed under: Python,Statistics — Patrick Durusau @ 5:57 pm

Think Stats: Probability and Statistics for Programmers by Allen B. Downey

From the website:

Think Stats is an introduction to Probability and Statistics for Python programmers.

If you have basic skills in Python, you can use them to learn concepts in probability and statistics. This new book emphasizes simple techniques you can use to explore real data sets and answer interesting statistical questions.

Important not only for data set exploration, preparatory to building topic maps, but also for evaluating the statistics offered by others.

Recalling that figures don’t lie but all liars can figure.

March 4, 2011

Third Cross Validated Journal Club

Filed under: Data Mining,Statistics — Patrick Durusau @ 6:08 am

Third Cross Validated Journal Club

From the posting:

  • CVJC is a whole day meeting on chat where we discuss some paper and its theoretical/practical surroundings.
  • As mentioned above, the event is whole-day (00:00-23:59 UTC), but there are three meet-up sessions at 1:00, 9:00 and 16:00 UTC at which most talking takes place; they are spread over the day to put at least one CVJC session in reach regardless of time zone.
  • The paper must be OpenAccess or a (p)reprint suggested previously on a meta thread like this one and selected in voting.
  • I would try to invite the author (it worked last time).

See the posting for the proposal for the next Cross Validated meeting date and discussion material.

I am thinking something like this could be of interest to the topic maps community.

Cross Validated

Filed under: Data Mining,Statistics,Visualization — Patrick Durusau @ 5:58 am

Cross Validated

From the website:

This is a collaboratively edited question and answer site for statisticians, data analysts, data miners and data visualization experts. It’s 100% free, no registration required.

This is one of a series of such Q/A sites that I am going to be listing as of possible interest to the topic maps community.

January 22, 2011

40 Fascinating Blogs for the Ultimate Statistics Geek – Post

Filed under: Data Mining,Statistics — Patrick Durusau @ 1:29 pm

40 Fascinating Blogs for the Ultimate Statistics Geek

A varied collection of blogs on statistics.

Whether for data mining, modeling, or interpreting the data mining/modeling of others, you are going to need statistics.

Blogs are not a replacement for a good statistics book and a copy of Mathematica, but they are a place to start.

December 30, 2010

The Joy of Stats

Filed under: Data Analysis,Statistics,Visualization — Patrick Durusau @ 7:56 am

The Joy Of Stats Available In Its Entirety

I am not sure that “…statistics are the sexiest subject around…” but if anyone could make it appear to be so, it would be Rosling.

Highly recommended for an entertaining account of statistics and data visualization.

You won’t learn the latest details but you will be left with an enthusiasm for incorporating such techniques in your next topic map.

BTW, does anyone know of a video editor/producer who would be interested in volunteering to film/produce The Joy of Topic Maps?

(I suppose the script would have to be written first. 😉 )
