## Archive for the ‘Statistics’ Category

### Harvard Stat 221

Friday, May 10th, 2013

Harvard Stat 221 “Statistical Computing and Visualization” by Sergiy Nesterko.

From the post:

Stat 221 is Statistical Computing and Visualization. It’s a graduate class on analyzing data without losing scientific rigor, and communicating your work. Topics span the full cycle of a data-driven project including project setup, design, implementation, and creating interactive user experiences to communicate ideas and results. We covered current theory and philosophy of building models for data, computational methods, and tools such as d3js, parallel computing with MPI, R.

See Sergiy’s post for the lecture slides from this course.

### Povcalnet – World Bank Poverty Stats

Sunday, May 5th, 2013

I’m surprised some Republican in the U.S. House or Senate isn’t citing Povcalnet as evidence there is no poverty in the United States.

The trick of course is in how you define “poverty.”

The World Bank uses $1, $1.25 and $2.00 a day as poverty lines. While there is widespread global hunger and disease, is income sufficient to participate in the global economy really the best measure of poverty? If the documentaries are to be believed, there are tribes of Indians who live in the rain forests of Brazil, quite healthily, without any form of money at all. They are not buying iPods with foreign music to replace their own, but that isn’t being impoverished. Is it?

There is the related issue that someone else is classifying people as impoverished. I wonder how they would classify themselves?

Statistics could be made more transparent through the use of topic maps.

### SemStats 2013

Thursday, May 2nd, 2013

First International Workshop on Semantic Statistics (SemStats 2013)

Deadline for paper submission: Friday, 12 July 2013, 23:59 (Hawaii time)
Notification of acceptance/rejection: Friday, 9 August 2013
Deadline for camera-ready version: Friday, 30 August 2013

From the call for papers:

The goal of this workshop is to explore and strengthen the relationship between the Semantic Web and statistical communities, to provide better access to the data held by statistical offices. It will focus on ways in which statisticians can use Semantic Web technologies and standards in order to formalize, publish, document and link their data and metadata.

The statistical community has recently shown an interest in the Semantic Web. In particular, initiatives have been launched to develop semantic vocabularies representing statistical classifications and discovery metadata. Tools are also being created by statistical organizations to support the publication of dimensional data conforming to the Data Cube specification, now in Last Call at W3C. But statisticians see challenges in the Semantic Web: how can data and concepts be linked in a statistically rigorous fashion? How can we avoid fuzzy semantics leading to wrong analyses? How can we preserve data confidentiality?
The workshop will also cover the question of how to apply statistical methods or treatments to linked data, and how to develop new methods and tools for this purpose. Except for visualisation techniques and tools, this question is relatively unexplored, but the subject will obviously grow in importance in the near future.

An unfortunate emphasis on linked data before understanding the problem of imbuing statistical data with semantics.

Studying the needs of the statistical community for semantics, and to what degree, would be more likely to yield useful requirements. From requirements, one could then proceed to find appropriate solutions.

As opposed to arriving solution in hand, with saws, pry bars, shoe horns and similar tools for affixing the solution to any problem.

### Does statistics have an ontology? Does it need one? (draft 2)

Tuesday, April 16th, 2013

From the post:

Chance, rational beliefs, decision, uncertainty, probability, error probabilities, truth, random sampling, resampling, opinion, expectations. These are some of the concepts we bandy about by giving various interpretations to mathematical statistics, to statistical theory, and to probabilistic models. But are they real? The question of “ontology” asks about such things, and given the “Ontology and Methodology” conference here at Virginia Tech (May 4, 5), I’d like to get your thoughts (for possible inclusion in a Mayo-Spanos presentation).* Also, please consider attending**.

Interestingly, I noticed the posts that have garnered the most comments have touched on philosophical questions of the nature of entities and processes behind statistical idealizations (e.g., http://errorstatistics.com/2012/10/18/query/).

The post and ensuing comments offer much to consider.

From my perspective, if assumptions, ontological and otherwise, go unstated, the results are opaque.
You can accept them because they fit your prior opinion or how you wanted the results to be, or reject them as not fitting your prior opinion or desired result.

### Probability and Statistics Cookbook

Friday, April 5th, 2013

Probability and Statistics Cookbook by Matthias Vallentin.

From the webpage:

The cookbook contains a succinct representation of various topics in probability theory and statistics. It provides a comprehensive reference reduced to the mathematical essence, rather than aiming for elaborate explanations.

When Matthias says “succinct,” he is quite serious. But by the time you master the twenty-seven pages of this “cookbook,” you will have a very good grounding in probability and statistics.

### 100 Savvy Sites on Statistics and Quantitative Analysis

Wednesday, April 3rd, 2013

100 Savvy Sites on Statistics and Quantitative Analysis

From the post:

Nate Silver’s unprecedentedly accurate prediction of state-by-state election results in the most recent presidential race was a watershed moment for the public awareness of statistics. While data gathering and analysis has become a massive industry in the past decade, it hasn’t always been as well covered in the press or as publicly accessible as it is now. With more and more of our daily interactions being mediated through computers and the internet, it is easier than ever to gather detailed quantitative data, do statistical analysis on that data, and derive valuable information and predictions from it. Knowledge of statistics and quantitative analysis techniques is more valuable than ever. From biostatisticians to politicians and economists, people in every field are using statistics to further their careers and knowledge. These sites are some of the most useful, informative, and comprehensive on the web covering stats and quantitative analysis.

Covers everything from Comprehensive Statistics Sites and Big Data to Data Visualization and Sports Stats.

Fire up your alternative to Google Reader!
I first saw this at 100 Savvy Sites on Statistics and Quantitative Analysis by Vincent Granville.

### An Applet for the Investigation of Simpson’s Paradox

Tuesday, April 2nd, 2013

An Applet for the Investigation of Simpson’s Paradox by Kady Schneiter and Jürgen Symanzik. (Journal of Statistics Education, Volume 21, Number 1 (2013))

Simpson’s paradox is best illustrated by the University of California, Berkeley sex discrimination case. Taken in the aggregate, admissions to the graduate school appeared to greatly favor men. Taken by department, no department discriminated against women and most favored admission of women.

Same data, different level of examination. That is Simpson’s paradox.

Abstract:

This article describes an applet that facilitates investigation of Simpson’s Paradox in the context of a number of real and hypothetical data sets. The applet builds on the Baker-Kramer graphical representation for Simpson’s Paradox. The implementation and use of the applet are explained. This is followed by a description of how the applet has been used in an introductory statistics class and a discussion of student responses to the applet.

From Wikipedia on Simpson’s Paradox:

In probability and statistics, Simpson’s paradox, or the Yule–Simpson effect, is a paradox in which a trend that appears in different groups of data disappears when these groups are combined, and the reverse trend appears for the aggregate data. This result is often encountered in social-science and medical-science statistics,[1] and is particularly confounding when frequency data are unduly given causal interpretations.[2] Simpson’s Paradox disappears when causal relations are brought into consideration.

A cautionary tale about the need to understand data sets and how combining them may impact outcomes of statistical analysis.
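The admissions reversal is easy to reproduce with a small table of invented numbers (these figures are illustrative only, not the actual Berkeley data):

```python
# Hypothetical admissions data: two departments, (admitted, applied) by sex.
# Within each department women are admitted at a rate at least as high as
# men's, yet in the aggregate men appear to be favored.
data = {
    "Dept A": {"men": (80, 100), "women": (9, 10)},
    "Dept B": {"men": (2, 10), "women": (20, 100)},
}

def rate(admitted, applied):
    return admitted / applied

# Per-department comparison: women do at least as well in every department.
for dept, groups in data.items():
    m = rate(*groups["men"])
    w = rate(*groups["women"])
    print(f"{dept}: men {m:.0%}, women {w:.0%}")

# Aggregate comparison: the trend reverses.
men_total = [sum(x) for x in zip(*(g["men"] for g in data.values()))]
women_total = [sum(x) for x in zip(*(g["women"] for g in data.values()))]
print(f"Overall: men {rate(*men_total):.0%}, women {rate(*women_total):.0%}")
# Overall: men 75%, women 26% -- same data, different level of examination.
```

The reversal happens because the department that admits almost everyone receives mostly male applicants, while the department that admits almost no one receives mostly female applicants.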
### Journal of Statistics Education

Tuesday, April 2nd, 2013

Journal of Statistics Education

From the mission statement:

The Journal of Statistics Education (JSE) disseminates knowledge for the improvement of statistics education at all levels, including elementary, secondary, post-secondary, post-graduate, continuing, and workplace education. It is distributed electronically and, in accord with its broad focus, publishes articles that enhance the exchange of a diversity of interesting and useful information among educators, practitioners, and researchers around the world. The intended audience includes anyone who teaches statistics, as well as those interested in research on statistical and probabilistic reasoning. All submissions are rigorously refereed using a double-blind peer review process.

Manuscripts submitted to the journal should be relevant to the mission of JSE. Possible topics for manuscripts include, but are not restricted to: curricular reform in statistics, the use of cooperative learning and projects, innovative methods of instruction, assessment, and research (including case studies) on students’ understanding of probability and statistics, research on the teaching of statistics, attitudes and beliefs about statistics, creative and tested ideas (including experiments and demonstrations) for teaching probability and statistics topics, the use of computers and other media in teaching, statistical literacy, and distance education. Articles that provide a scholarly overview of the literature on a particular topic are also of interest. Reviews of software, books, and other teaching materials will also be considered, provided these reviews describe actual experiences using the materials.
In addition, JSE also features departments called “Teaching Bits: A Resource for Teachers of Statistics” and “Datasets and Stories.” “Teaching Bits” summarizes interesting current events and research that can be used as examples in the statistics classroom, as well as pertinent items from the education literature. The “Datasets and Stories” department not only identifies interesting datasets and describes their useful pedagogical features, but enables instructors to download the datasets for further analysis or dissemination to students.

Associated with the Journal of Statistics Education is the JSE Information Service. The JSE Information Service provides a source of information for teachers of statistics that includes the archives of EDSTAT-L (an electronic discussion list on statistics education), information about the International Association for Statistical Education, and links to many other statistics education sources.

If you are going to talk about big data, of necessity you are also going to talk about statistics.

A very good free online resource on statistics.

### Using R For Statistical Analysis – Two Useful Videos

Saturday, March 30th, 2013

Using R For Statistical Analysis – Two Useful Videos by Bruce Berriman.

Bruce has uncovered two interesting videos on using R:

An Introduction to R for Data Mining by Joseph Rickert. (Recording of the webinar by the same name.)

Bruce has additional links that will be useful with the videos.

Enjoy!

### The Artful Business of Data Mining…

Friday, March 29th, 2013

David Coallier has two presentations under that general title:

• Distributed Schema-less Document-Based Databases
• Computational Statistics with Open Source Tools

Neither is a “…death by powerpoint…” presentation where the speaker reads text you can read for yourself. Which is good, except that with minimal slides you get an occasional example and names of software/techniques, but you have to fill in a lot of context.
A pointer to videos of either of these presentations would be greatly appreciated!

### RDF Data Cube Vocabulary [Last Call ends 08 April 2013]

Tuesday, March 12th, 2013

RDF Data Cube Vocabulary

Abstract:

There are many situations where it would be useful to be able to publish multi-dimensional data, such as statistics, on the web in such a way that it can be linked to related data sets and concepts. The Data Cube vocabulary provides a means to do this using the W3C RDF (Resource Description Framework) standard. The model underpinning the Data Cube vocabulary is compatible with the cube model that underlies SDMX (Statistical Data and Metadata eXchange), an ISO standard for exchanging and sharing statistical data and metadata among organizations. The Data Cube vocabulary is a core foundation which supports extension vocabularies to enable publication of other aspects of statistical data flows or other multi-dimensional data sets.

If you have comments, now would be a good time to finish them up for submission.

I first saw this in a tweet by Sandro Hawke.

### R Bootcamp Materials!

Monday, February 25th, 2013

R Bootcamp Materials! by Jared Knowles.

From the post:

To train new employees at the Wisconsin Department of Public Instruction, I have developed a 2-3 day series of training modules on how to get work done in R. These modules cover everything from setting up and installing R and RStudio to creating reproducible analyses using the knitr package. There are also some experimental modules for introductions to basic computer programming, and a refresher course on statistics. I hope to improve both of these over time. I am happy to announce that all of these materials are available online, for free.

The bootcamp covers the following topics:

1. Introduction to R: History of R, R as a programming language, and features of R.
2. Getting Data In: How to import data into R, manipulate, and manage multiple data objects.
3. Sorting and Reshaping Data: Long to wide, wide to long, and everything in between!
4. Cleaning Education Data: Includes material from the Strategic Data Project about how to implement common business rules in processing administrative data.
5. Regression and Basic Analytics in R: Using school mean test scores to do OLS regression and regression diagnostics — a real world example.
6. Visualizing Data: Harness the power of R’s data visualization packages to make compelling and informative visualizations.
7. Exporting Your Work: Learn the knitr package, and how to export graphics and create PDF reports.
8. Advanced Topics: A potpourri of advanced features in R (by request).
9. A Statistics Refresher: With interactive examples using shiny.
10. Programming Principles: Tips and pointers about writing code. (Needs work)

The best part is, all of the materials are available online and free of charge! (Check out the R Bootcamp page). They are constantly evolving. We have done two R Bootcamps so far, and hope to do more. Each time the materials get a little better.

The R Bootcamp page enables you to download all the materials or view the modules separately.

If you already know R, pass it on.

### VOStat: A Statistical Web Service… [Open Government, Are You Listening?]

Monday, February 18th, 2013

VOStat: A Statistical Web Service for Astronomers

From the post:

VOStat is a simple statistical web service that lets you analyze your data without the hassle of downloading or installing any software. VOStat provides interactive statistical analysis of astronomical tabular datasets. It is integrated into the suite of analysis and visualization tools associated with the Virtual Observatory (VO) through the SAMP communication system.
A user supplies VOStat with a dataset and chooses among ~60 statistical functions, including data transformations, plots and summaries, density estimation, one- and two-sample hypothesis tests, global and local regressions, multivariate analysis and clustering, spatial analysis, directional statistics, survival analysis, and time series analysis.

VOStat was developed by the Center for Astrostatistics (Pennsylvania State University).

The astronomical community has data sets that dwarf any open government data set, and they have ~60 statistical functions. Whereas in open government data, dumping data files to public access is considered being open?

The technology to do better already exists. So, what is your explanation for defining openness as “data dumps to the web?”

PS: Have you ever thought about creating a data interface that holds mappings between data sets, such as a topic map would produce? Would papering over agency differences in terminology assist users in taking advantage of their data sets? (Subject to disclosure that is happening.) Would you call that a “TMStat: A Topic Map Statistical Web Service?” (Disclosure of the basis for mapping is what distinguishes a topic map statistical web service from a fixed mapping between undefined column headers in different tables.)

### NBA Stats Like Never Before [No RDF/Linked Data/Topic Maps In Sight]

Saturday, February 16th, 2013

NBA Stats Like Never Before by Timo Elliott.

From the post:

The National Basketball Association today unveiled a new site for fans of game statistics: NBA.com/stats, powered by SAP Analytics technology. The multi-year marketing partnership between SAP and the NBA was announced six months ago:

“We are constantly researching new and emerging technologies in an effort to provide our fans with new ways to connect with our game,” said NBA Commissioner David Stern.
“SAP is a leader in providing innovative software solutions and an ideal partner to provide a dynamic and comprehensive statistical offering as fans interact with NBA basketball on a global basis.”

“SAP is honored to partner with the NBA, one of the world’s most respected sports organizations,” said Bill McDermott, co-CEO, SAP. “Through SAP HANA, fans will be able to experience the NBA as never before. This is a slam dunk for SAP, the NBA and the many fans who will now have access to unprecedented insight and analysis.”

The free database contains every box score of every game played since the league’s inception in 1946, including graphical displays of players’ shooting tendencies.

To the average fan, NBA.com/stats delivers information that is of immediate interest to them, not their computers.

Another way to think about it: Computers don’t make purchasing decisions, users do.

Something to think about when deciding on your next semantic technology.

### Outlier Analysis

Sunday, January 13th, 2013

Outlier Analysis by Charu Aggarwal (Springer, January 2013). Post by Gregory Piatetsky.

From the post:

This is an authored text book on outlier analysis. The book can be considered a first comprehensive text book in this area from a data mining and computer science perspective. Most of the earlier books in outlier detection were written from a statistical perspective, and precede the emergence of the data mining field over the last 15-20 years. Each chapter contains carefully organized content on the topic, case studies, extensive bibliographic notes and the future direction of research in this field. Thus, the book can also be used as a reference aid. Emphasis was placed on simplifying the content, so that the material is relatively easy to assimilate. The book assumes relatively little prior background, other than a very basic understanding of probability and statistical concepts.
Therefore, in spite of its deep coverage, it can also provide a good introduction for the beginner. The book includes exercises as well, so that it can be used as a teaching aid.

Table of Contents and Introduction. Includes exercises and a 500+ reference bibliography.

Definitely a volume for the short reading list.

Caveat: As an outlier by any measure, my opinions here may be biased.

### Missing-Data Imputation

Saturday, December 29th, 2012

From the post:

Ben points us to a new book, Flexible Imputation of Missing Data. It’s excellent and I highly recommend it. Definitely worth the $89.95. Van Buuren’s book is great even if you don’t end up using the algorithm described in the book (I actually like their approach but I do think there are some limitations with their particular implementation, which is one reason we’re developing our own package); he supplies lots of intuition, examples, and graphs.

Steve Newcomb makes the point that data is dirty. Always.

Stef van Buuren suggests that data may be missing and requires imputation.

Together that means dirty data may be missing and requires imputation.

Imputed or not, data is no more reliable than we are. Use with caution.
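For readers who have never seen imputation at all, a toy single-imputation sketch (mean filling, with invented numbers) shows the basic move. Van Buuren's book argues for the far more careful approach of multiple imputation, which draws several plausible values to reflect the uncertainty this sketch ignores:

```python
# Toy single imputation: fill missing values (None) with the mean of the
# observed values. This is the simplest, and most fragile, strategy; it
# understates variance, which is exactly why multiple imputation exists.
def mean_impute(values):
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]

incomes = [42.0, None, 38.0, None, 50.0]  # hypothetical data with gaps
print(mean_impute(incomes))
```

Every imputed value here is the same number, so any downstream analysis will look more certain than the data warrants. Use with caution indeed.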

### Analyzing the Enron Data…

Saturday, December 29th, 2012

From the post:

I’ve been using the Enron Dataset for a couple of projects now, and I figured that it would be interesting to see if I could glean some information out of the data. One can of course simply read the Wikipedia article, but that would be too easy and not as much fun.

My focus in this analysis is on the “what” and the “who”, i.e., what are the important ideas in this corpus and who are the principal players. For that I did the following:

• Extracted the words from Lucene’s inverted index into (term, docID, freq) triples. Using this, I construct a frequency distribution of words in the corpus. Looking at the most frequent words gives us an idea of what is being discussed.
• Extract the email (from, {to, cc, bcc}) pairs from MongoDB. Using this, I piggyback on Scalding’s PageRank implementation to produce a list of emails by page rank. This gives us an idea of the “important” players.
• Using the triples extracted from Lucene, construct tuples of (docID, termvector), then cluster the documents using KMeans. This gives us an idea of the spread of ideas in the corpus. Originally, the idea was to use Mahout for the clustering, but I ended up using Weka instead.

I also wanted to get more familiar with Scalding beyond the basic stuff I did before, so I used that where I would have used Hadoop previously. The rest of the code is in Scala as usual.
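The author works in Scala with Scalding, Lucene and MongoDB; as a rough feel for the first two steps, here is a toy Python sketch with invented triples and an invented email graph (the data, names and the naive PageRank loop are hypothetical stand-ins, not the post's code):

```python
from collections import Counter, defaultdict

# Hypothetical (term, docID, freq) triples, standing in for those
# extracted from Lucene's inverted index.
triples = [
    ("gas", 1, 3), ("trading", 1, 5), ("gas", 2, 2),
    ("meeting", 2, 1), ("trading", 3, 4),
]

# Step 1: frequency distribution of terms across the corpus.
term_freq = Counter()
for term, doc_id, freq in triples:
    term_freq[term] += freq
print(term_freq.most_common(2))  # the "what": most discussed terms

# Step 2: naive PageRank over a hypothetical (from, to) email graph.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
nodes = {n for e in edges for n in e}
out_links = defaultdict(list)
for src, dst in edges:
    out_links[src].append(dst)

damping = 0.85
rank = {n: 1.0 / len(nodes) for n in nodes}
for _ in range(50):  # power iteration to approximate convergence
    new = {n: (1 - damping) / len(nodes) for n in nodes}
    for src, dsts in out_links.items():
        share = damping * rank[src] / len(dsts)
        for dst in dsts:
            new[dst] += share
    rank = new

print(max(rank, key=rank.get))  # the "who": highest-ranked correspondent
```

The real pipeline distributes both steps (Scalding for PageRank, Weka for the KMeans clustering), but the shape of the computation is the same.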

Good practice for discovery of the players and main ideas when the “fiscal cliff” document set “leaks,” as you know it will.

Relationships between players and their self-serving recountings versus the data set will make an interesting topic map.

### Analyzing Categorical Data

Saturday, December 29th, 2012

Analyzing Categorical Data by Jeffrey S. Simonoff.

Mentioned in My Intro to Multiple Classification… but thought it merited a more prominent mention.

From the webpage:

Welcome to the web site for the book Analyzing Categorical Data, published by Springer-Verlag in July 2003 as part of the Springer Texts in Statistics series. This site allows access to the data sets used in the book, S-PLUS/R and SAS code to perform the analyses in the book, some general information on statistical software for analyzing categorical data, and an errata list. I would be very happy to receive comments on this site, and on the book itself.

Data sets, code to duplicate the analysis in the book and other information at this site.
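As a taste of the kind of analysis the book covers, here is a chi-square test of independence on an invented 2x2 contingency table, computed from first principles rather than with the book's S-PLUS/R or SAS code:

```python
# Chi-square test of independence on a hypothetical 2x2 table.
# Rows: treatment / control; columns: improved / not improved.
table = [[30, 10],
         [20, 40]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
chi2 = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand
        chi2 += (observed - expected) ** 2 / expected

print(round(chi2, 2))  # 16.67: far beyond the 1-df critical value of 3.84
```

A statistic this large (with 1 degree of freedom) would lead us to reject independence of the row and column classifications.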

### Advanced Data Analysis from an Elementary Point of View

Thursday, December 6th, 2012

Advanced Data Analysis from an Elementary Point of View by Cosma Rohilla Shalizi.

From the Introduction:

These are the notes for 36-402, Advanced Data Analysis, at Carnegie Mellon. If you are not enrolled in the class, you should know that it’s the methodological capstone of the core statistics sequence taken by our undergraduate majors (usually in their third year), and by students from a range of other departments. By this point, they have taken classes in introductory statistics and data analysis, probability theory, mathematical statistics, and modern linear regression (“401”). This class does not presume that you have learned but forgotten the material from the pre-requisites; it presumes that you know that material and can go beyond it. The class also presumes a firm grasp on linear algebra and multivariable calculus, and that you can read and write simple functions in R. If you are lacking in any of these areas, now would be an excellent time to leave.

36-402 is a class in statistical methodology: its aim is to get students to understand something of the range of modern[1] methods of data analysis, and of the considerations which go into choosing the right method for the job at hand (rather than distorting the problem to fit the methods the student happens to know). Statistical theory is kept to a minimum, and largely introduced as needed.

[Footnote 1] Just as an undergraduate “modern physics” course aims to bring the student up to about 1930 (more specifically, to 1926), this class aims to bring the student up to about 1990.

A very recent introduction to data analysis. Shalizi includes a list of concepts in the introduction that had best be mastered before tackling this material.

According to footnote 1, when you have mastered this material, you have another twenty-two years to make up in general and on your problem in particular.

Still, knowing it cold will put you ahead of a lot of data analysis you are going to encounter.

I first saw this in a tweet by Gene Golovchinsky.

### The Elements of Statistical Learning (2nd ed.)

Wednesday, December 5th, 2012

The Elements of Statistical Learning (2nd ed.) by Trevor Hastie, Robert Tibshirani and Jerome Friedman. (PDF)

The authors note in the preface to the first edition:

The field of Statistics is constantly challenged by the problems that science and industry brings to its door. In the early days, these problems often came from agricultural and industrial experiments and were relatively small in scope. With the advent of computers and the information age, statistical problems have exploded both in size and complexity. Challenges in the areas of data storage, organization and searching have led to the new field of “data mining”; statistical and computational problems in biology and medicine have created “bioinformatics.” Vast amounts of data are being generated in many fields, and the statistician’s job is to make sense of it all: to extract important patterns and trends, and understand “what the data says.” We call this learning from data.

I’m sympathetic to that sentiment but with the caveat that it is our semantic expectations of the data that give it any meaning to be “learned.”

Data isn’t lurking outside our door with “meaning” captured separate and apart from us. Our fancy otherwise obscures our role in the origin of “meaning” that we attach to data. In part to bolster the claim that the “facts/data say….”

It is us who take up the gauge for our mute friends, facts/data, and make claims on their behalf.

If we recognized those as our claims, perhaps we would be more willing to listen to the claims of others. Perhaps.

I first saw this in a tweet by Michael Conover.

### Pro Tips for Grad Students in Statistics/Biostatistics [Multi-Part]

Saturday, November 17th, 2012

A recounting of “pro-tips” on becoming a practicing applied statistician.

You may nod along or think some of them are “obvious.” Ask yourself, how many of these tips, adapted to your field, did you put into practice in the last week/month?

Don’t feel bad, I’m right there with you. But trying to do better.

Pro Tips for Grad Students in Statistics/Biostatistics (Part 1)

Pro Tips for Grad Students in Statistics/Biostatistics (Part 2)

Pro-tips for graduate students (Part 3)

Pro-tips for graduate students (Part 4)

Bonus question: What pro-tips would you give to students who want to pursue semantic technologies, including topic maps?

### #FF00FF 2012

Sunday, November 4th, 2012

#FF00FF 2012 – An FAQ for the 2012 US Presidential Election by Peter Norvig.

From the webpage:

This is an FAQ (Frequently Asked Questions list) for the 2012 United States Presidential Election. I need to disclose up front that I support President Obama. However, with the exception of the very last question, this FAQ is designed as a collection of factual information (such as the latest poll results) and of analysis that is as objective as possible. Why did I do this? To educate interested readers. My ambitions are not as grandiose as Sam Wang of the Princeton Election Consortium, who wrote:

When I started doing the Meta-Analysis of State Polls in 2004, I thought it would be a useful tool to get rid of media noise about individual polls. … Space would be opened up for discussion of what really mattered in the campaign – or even discussion of policies. To my disappointment, this has not happened. Maybe it just takes time. Or perhaps polling nerds need to get a few more races right. Let’s see if we move the ball forward for Team Geek on Tuesday.

Peter Norvig is the Director of Research at Google.

This being the election week in the United States, I thought it might be of general interest.

But I posted it for another reason as well. Even if you like the “presentation” of data in this document, is this the best way to store the data in this document?

If you instinctively say yes, how do I point to the line for Vermont in the State-by-State Forecasts table?

Surely if you are going to store data you had some notion of how to get parts of it out again. Yes? Or do I have to get the entire data store, this document, out again to find that information?

That sounds remarkably lame.

I am not questioning the use of a document to present information, +1! to that.

My question is about the suitability of a document for storing data, storage implying some means of retrieval.

### Data Mining Book Review: Dance with Chance

Sunday, November 4th, 2012

Data Mining Book Review: Dance with Chance by Sandro Saitta.

From the post:

If you ever worked on time series prediction (forecasting), you should read Dance with Chance. It is written by a statistician, a psychologist and a decision scientist (Makridakis, Hogarth and Gaba). As is the case in The Numerati or Super Crunchers, the authors explain complex notions to a non-expert audience. I find the book really interesting and provocative.

The main concept of Dance with Chance is the “illusion of control”: thinking you control a future event or situation that is in fact mainly due to chance. This is the opposite of fatalism (when you think you have no control, although you have). The book teaches how to avoid being fooled by this illusion of control. This is very interesting reading for any data miner, particularly anyone involved with forecasting. The book contains dozens of examples of the limitations of forecasting techniques. For example, it explains the issues of forecasting the stock market and when predictions are due to chance. The authors use a brilliant mix of statistics and psychology to prove their point.

From the review this sounds like an interesting read.

Forecasting can be useful but being aware of its limitations is as well.

### Think Bayes: Bayesian Statistics Made Simple

Thursday, October 11th, 2012

Think Bayes: Bayesian Statistics Made Simple by Allen B. Downey.

Think Bayes is an introduction to Bayesian statistics using computational methods. This version of the book is a rough draft. I am making this draft available for comments, but it comes with the warning that it is probably full of errors.

Allen has written free books on Python, statistics, complexity and now Bayesian statistics.

If you don’t know his books, good opportunity to give them a try.
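To see what “Bayesian statistics using computational methods” looks like in miniature, here is a grid-based posterior update for a coin's bias, in the spirit of the book's approach but not code from the book:

```python
# Estimate a coin's bias p(heads) from observed flips using a discrete
# grid of hypotheses: start from a uniform prior, multiply by the
# likelihood of each flip, and renormalize.
hypotheses = [i / 100 for i in range(101)]             # candidate values of p
prior = {h: 1 / len(hypotheses) for h in hypotheses}   # uniform prior

def update(dist, heads):
    """One Bayes update: multiply by the likelihood, then renormalize."""
    posterior = {h: p * (h if heads else 1 - h) for h, p in dist.items()}
    total = sum(posterior.values())
    return {h: p / total for h, p in posterior.items()}

dist = prior
for flip in [True, True, True, False]:  # observe 3 heads, 1 tail
    dist = update(dist, flip)

# Posterior mean; analytically this is Beta(4, 2), whose mean is 4/6.
mean = sum(h * p for h, p in dist.items())
print(round(mean, 3))
```

The grid is crude, but the pattern (represent a distribution as numbers, update it with data) is the computational method the book builds on.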

### The 13 Steps to Running Any Statistical Model (Webinar)

Tuesday, October 9th, 2012

The 13 Steps to Running Any Statistical Model

Webinar:

Date: December 5, 2012

Time: 3pm Eastern Time UTC -4 (2pm Central, 1pm Mountain, 12pm Pacific)

From the post:

All statistical modeling–whether ANOVA, Multiple Regression, Poisson Regression, Multilevel Model–is about understanding the relationship between independent and dependent variables. The content differs, but as a data analyst, you need to follow the same 13 steps to complete your modeling.

This webinar will give you an overview of these 13 steps:

• what they are
• why each one is important
• the general order in which to do them
• on which steps the different types of modeling differ and where they’re the same

Having a road map for the steps to take will make your modeling more efficient and keep you on track.
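The 13 steps themselves are the webinar's content, but the relationship it describes between independent and dependent variables can be illustrated with the simplest of models, a closed-form least squares fit (a stdlib-only sketch):

```python
def simple_ols(xs, ys):
    """Fit y = a + b*x by ordinary least squares (closed-form solution)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx          # slope: change in y per unit change in x
    a = mean_y - b * mean_x  # intercept
    return a, b

# Data generated from the exact line y = 2 + 3x recovers its coefficients.
xs = [1, 2, 3, 4, 5]
ys = [2 + 3 * x for x in xs]
a, b = simple_ols(xs, ys)
print(a, b)
```

ANOVA, Poisson regression and multilevel models dress this idea up differently, but the dependent-on-independent structure is the same.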

Whether the model is the point of your analysis or you are using statistical model to discover subjects, this could be useful.

### Stan (Bayesian Inference) [update]

Sunday, October 7th, 2012

Stan

From the webpage:

Stan is an open-source package for obtaining Bayesian inference using the No-U-Turn sampler, a variant of Hamiltonian Monte Carlo.

I first reported on a presentation: Stan: A (Bayesian) Directed Graphical Model Compiler last January when Stan was unreleased.

Following a link from Christophe Lalanne’s A bag of tweets / September 2012, I find the released version of the software!

Very cool!
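Stan's No-U-Turn sampler is well beyond a blog post, but its ancestor, random-walk Metropolis, fits in a few lines of Python and conveys the idea of sampling from a posterior (a toy sketch, nothing like Stan's implementation):

```python
import math
import random

random.seed(1)

def metropolis(log_post, start, steps, scale=1.0):
    """Random-walk Metropolis: a far simpler relative of Stan's
    Hamiltonian Monte Carlo / No-U-Turn sampler."""
    x = start
    lp = log_post(x)
    samples = []
    for _ in range(steps):
        proposal = x + random.gauss(0.0, scale)
        lp_new = log_post(proposal)
        # Accept with probability min(1, posterior ratio).
        if math.log(random.random()) < lp_new - lp:
            x, lp = proposal, lp_new
        samples.append(x)
    return samples

# Target: standard normal log-density (up to an additive constant).
samples = metropolis(lambda x: -0.5 * x * x, start=0.0, steps=20000)
mean = sum(samples) / len(samples)
print(round(mean, 2))
```

HMC/NUTS earns its keep when the posterior has many correlated dimensions, where a random walk like this one mixes hopelessly slowly.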

### Revisiting “Ranking the popularity of programming languages”: creating tiers

Sunday, October 7th, 2012

From the post:

In a post on dataists almost two years ago, John Myles White and I posed the question: “How would you rank the popularity of a programming language?”.

From the original post:

One way to do so is to count the number of projects using each language, and rank those with the most projects as being the most popular. Another might be to measure the size of a language’s “community,” and use that as a proxy for its popularity. Each has its advantages and disadvantages. Counting the number of projects is perhaps the “purest” measure of a language’s popularity, but it may overweight languages based on their legacy or use in production systems. Likewise, measuring community size can provide insight into the breadth of applications for a language, but it can be difficult to distinguish among languages with a vocal minority versus those that actually have large communities.

So, we spent an evening at Princeton hacking around on Github and StackOverflow to get data on the number of projects and questions tagged, per programming language, respectively. The result was a scatter plot showing the linear relationship between these two measures. As with any post comparing programming languages, it was great bait for the Internet masses to poke holes in, and since then Stephen O’Grady at Redmonk has been re-running the analysis to show changes in the relative position of languages over time.

Today I am giving a talk at Monktoberfest on the importance of pursuing good questions in data science. As an example, I wanted to revisit the problem of ranking programming languages. For a long time I have been unsatisfied with the outcome of the original post, because the chart does not really address the original question about ranking.

I would not downplay the importance of Drew’s descriptive analysis.

Until you can describe something, it is really difficult to explain it.
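To see why combining the two measures is a judgment call, here is a toy ranking with made-up counts (not the dataists data); putting both measures on a log scale keeps either one from dominating:

```python
import math

# Hypothetical counts, for illustration only:
# (GitHub projects, StackOverflow questions tagged) per language.
counts = {
    "JavaScript": (320000, 410000),
    "Python":     (280000, 360000),
    "R":          (40000, 55000),
    "Fortran":    (3000, 4000),
}

def popularity_score(projects, questions):
    """Combine the two measures on a log scale so neither dominates."""
    return math.log10(projects) + math.log10(questions)

ranking = sorted(counts, key=lambda lang: popularity_score(*counts[lang]),
                 reverse=True)
print(ranking)
```

Every choice baked in here — the log transform, equal weights, these two data sources — is exactly the kind of assumption the original scatter plot left implicit.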

### It takes time: A remarkable example of delayed recognition

Saturday, October 6th, 2012

It takes time: A remarkable example of delayed recognition by Ben Van Calster. (Van Calster, B. (2012), It takes time: A remarkable example of delayed recognition. J. Am. Soc. Inf. Sci.. doi: 10.1002/asi.22732)

Abstract:

The way in which scientific publications are picked up by the research community can vary. Some articles become instantly cited, whereas others go unnoticed for some time before they are discovered or rediscovered. Papers with delayed recognition have also been labeled “sleeping beauties.” I briefly discuss an extreme case of a sleeping beauty. Peirce’s short note in Science in 1884 shows a remarkable increase in citations since around 2000. The note received less than 1 citation per year in the decades prior to 2000, 3.5 citations per year in the 2000s, and 10.4 in the 2010s. This increase was seen in several domains, most notably meteorology, medical prediction research, and economics. The paper outlines formulas to evaluate a binary prediction system for a binary outcome. This citation increase in various domains may be attributed to a widespread, growing research focus on mathematical prediction systems and the evaluation thereof. Several recently suggested evaluation measures essentially reinvented or extended Peirce’s 120-year-old ideas.

I would call your attention to the last line of the abstract:

Several recently suggested evaluation measures essentially reinvented or extended Peirce’s 120-year-old ideas.

I take that to mean that, with better curation of ideas, perhaps we would invent new ideas rather than reinvent old ones?

The paper ends:

To conclude, the simple ideas presented in Peirce’s note have been reinvented and rediscovered several decades or even more than a century later. It is fascinating that we arrive at ideas presented more than a century ago, and that Peirce’s ideas on the evaluation of predictions have come to the surface regularly across time and discipline. A saying, attributed to Ivan Pavlov, goes: “If you want new ideas, read old books.”

What old books are you going to read this weekend?

PS: Just curious. What search terms would you use, other than the author’s name and article title, to ensure that you could find this article again? What about information across the various fields cited in the article to find related information?
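For the curious, the evaluation measure commonly credited to Peirce's 1884 note — hit rate minus false alarm rate, now often called the Peirce skill score or true skill statistic — is easy to compute:

```python
def peirce_skill_score(tp, fp, fn, tn):
    """Peirce's measure for binary predictions of a binary outcome:
    hit rate minus false alarm rate (the 'true skill statistic')."""
    hit_rate = tp / (tp + fn)
    false_alarm_rate = fp / (fp + tn)
    return hit_rate - false_alarm_rate

# A forecaster with some skill scores above 0; always predicting
# "no event" scores exactly 0, however often it happens to be right.
skilled = peirce_skill_score(tp=80, fp=10, fn=20, tn=90)
no_skill = peirce_skill_score(tp=0, fp=0, fn=100, tn=100)
print(skilled, no_skill)
```

That insensitivity to base rates is why the measure keeps being rediscovered in meteorology, medical prediction and economics.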

### Working More Effectively With Statisticians

Sunday, September 23rd, 2012

Working More Effectively With Statisticians by Deborah M. Anderson. (Fall 2012 Newsletter of Society for Clinical Data Management, pages 5-8)

Abstract:

The role of the clinical trial biostatistician is to lend scientific expertise to the goal of demonstrating safety and efficacy of investigative treatments. Their success, and the outcome of the clinical trial, is predicated on adequate data quality, among other factors. Consequently, the clinical data manager plays a critical role in the statistical analysis of clinical trial data. In order to better fulfill this role, data managers must work together with the biostatisticians and be aligned in their understanding of data quality. This article proposes ten specific recommendations for data managers in order to facilitate more effective collaboration with biostatisticians.

See the article for the details but the recommendations are generally applicable to all data collection projects:

Recommendation #1: Communicate early and often with the biostatistician and provide frequent data extracts for review.

Recommendation #2: Employ caution when advising sites or interactive voice/web response (IVR/IWR) vendors on handling of randomization errors.

Recommendation #3: Collect the actual investigational treatment and dose group for each subject.

Recommendation #4: Think carefully and consult the biostatistician about the best way to structure investigational treatment exposure and accountability data.

Recommendation #5: Clarify in electronic data capture (EDC) specifications whether a question is only a “prompt” screen or whether the answer to the question will be collected explicitly in the database.

Recommendation #6: Recognize the most critical data items from a statistical analysis perspective and apply the highest quality standards to them.

Recommendation #7: Be alert to protocol deviations/violations (PDVs).

Recommendation #8: Plan for a database freeze and final review before database lock.

Recommendation #9: Archive a snapshot of the clinical database at key analysis milestones and at the end of the study.

Recommendation #10: Educate yourself about fundamental statistical principles whenever the opportunity arises.

I first saw this at John Johnson’s Data cleaning is harder than statistical analysis.

### Statistical Data Mining Tutorials

Monday, September 17th, 2012

Statistical Data Mining Tutorials by Andrew Moore.

From the post:

The following links point to a set of tutorials on many aspects of statistical data mining, including the foundations of probability, the foundations of statistical data analysis, and most of the classic machine learning and data mining algorithms.

These include classification algorithms such as decision trees, neural nets, Bayesian classifiers, Support Vector Machines and case-based (aka non-parametric) learning. They include regression algorithms such as multivariate polynomial regression, MARS, Locally Weighted Regression, GMDH and neural nets. And they include other data mining operations such as clustering (mixture models, k-means and hierarchical), Bayesian networks and Reinforcement Learning.

Perhaps a bit dated but not seriously so.

And one never knows when a slightly different explanation will make something obscure suddenly clear.
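As a sample of the clustering material Moore covers, k-means (Lloyd's algorithm) can be sketched in a few lines of Python (my illustration, not code from the tutorials):

```python
import random

random.seed(0)

def kmeans_1d(points, k, iterations=20):
    """Lloyd's algorithm for k-means clustering on 1-D data."""
    centers = random.sample(points, k)
    for _ in range(iterations):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Two well-separated clumps, around 0 and around 10.
points = [0.1, -0.2, 0.3, 0.0, 9.8, 10.1, 10.3, 9.9]
centers = kmeans_1d(points, k=2)
print(centers)
```

The assign-then-update loop is the whole algorithm; the tutorials' mixture models generalize it by making the assignments probabilistic.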