Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

April 2, 2013

An Applet for the Investigation of Simpson’s Paradox

Filed under: BigData,Mathematics,Statistics — Patrick Durusau @ 6:17 am

An Applet for the Investigation of Simpson’s Paradox by Kady Schneiter and Jürgen Symanzik. (Journal of Statistics Education, Volume 21, Number 1 (2013))

Simpson’s paradox is best illustrated by the University of California, Berkeley sex discrimination case. Taken in the aggregate, admissions to the graduate school appeared to greatly favor men. Taken department by department, no department appeared to discriminate against women and most slightly favored admitting them. Same data, different level of examination. That is Simpson’s paradox.

Abstract:

This article describes an applet that facilitates investigation of Simpson’s Paradox in the context of a number of real and hypothetical data sets. The applet builds on the Baker-Kramer graphical representation for Simpson’s Paradox. The implementation and use of the applet are explained. This is followed by a description of how the applet has been used in an introductory statistics class and a discussion of student responses to the applet.

From Wikipedia on Simpson’s Paradox:

In probability and statistics, Simpson’s paradox, or the Yule–Simpson effect, is a paradox in which a trend that appears in different groups of data disappears when these groups are combined, and the reverse trend appears for the aggregate data. This result is often encountered in social-science and medical-science statistics,[1] and is particularly confounding when frequency data are unduly given causal interpretations.[2] Simpson’s Paradox disappears when causal relations are brought into consideration.

A cautionary tale about the need to understand data sets and how combining them may impact outcomes of statistical analysis.
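If you want to see the aggregation effect without the applet, the Berkeley data ships with base R as UCBAdmissions. A minimal sketch (mine, not from the article):

  # The Berkeley admissions data is a 3-way table: Admit x Gender x Dept
  data(UCBAdmissions)

  # Aggregated over departments: admission rate by gender
  overall <- prop.table(margin.table(UCBAdmissions, c(1, 2)), margin = 2)
  round(overall["Admitted", ], 2)    # Male ~0.45, Female ~0.30

  # Department by department: admission rate by gender within each department
  by_dept <- apply(UCBAdmissions, 3, function(tab) prop.table(tab, margin = 2)["Admitted", ])
  round(by_dept, 2)                  # women do as well or better in most departments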

Journal of Statistics Education

Filed under: BigData,Mathematics,Statistics — Patrick Durusau @ 5:56 am

Journal of Statistics Education

From the mission statement:

The Journal of Statistics Education (JSE) disseminates knowledge for the improvement of statistics education at all levels, including elementary, secondary, post-secondary, post-graduate, continuing, and workplace education. It is distributed electronically and, in accord with its broad focus, publishes articles that enhance the exchange of a diversity of interesting and useful information among educators, practitioners, and researchers around the world. The intended audience includes anyone who teaches statistics, as well as those interested in research on statistical and probabilistic reasoning. All submissions are rigorously refereed using a double-blind peer review process.

Manuscripts submitted to the journal should be relevant to the mission of JSE. Possible topics for manuscripts include, but are not restricted to: curricular reform in statistics, the use of cooperative learning and projects, innovative methods of instruction, assessment, and research (including case studies) on students’ understanding of probability and statistics, research on the teaching of statistics, attitudes and beliefs about statistics, creative and tested ideas (including experiments and demonstrations) for teaching probability and statistics topics, the use of computers and other media in teaching, statistical literacy, and distance education. Articles that provide a scholarly overview of the literature on a particular topic are also of interest. Reviews of software, books, and other teaching materials will also be considered, provided these reviews describe actual experiences using the materials.

In addition JSE also features departments called “Teaching Bits: A Resource for Teachers of Statistics” and “Datasets and Stories.” “Teaching Bits” summarizes interesting current events and research that can be used as examples in the statistics classroom, as well as pertinent items from the education literature. The “Datasets and Stories” department not only identifies interesting datasets and describes their useful pedagogical features, but enables instructors to download the datasets for further analysis or dissemination to students.

Associated with the Journal of Statistics Education is the JSE Information Service. The JSE Information Service provides a source of information for teachers of statistics that includes the archives of EDSTAT-L (an electronic discussion list on statistics education), information about the International Association for Statistical Education, and links to many other statistics education sources.

If you are going to talk about big data, of necessity you are also going to talk about statistics.

A very good free online resource on statistics.

March 30, 2013

Using R For Statistical Analysis – Two Useful Videos

Filed under: Data Mining,R,Statistics — Patrick Durusau @ 6:29 pm

Using R For Statistical Analysis – Two Useful Videos by Bruce Berriman.

Bruce has uncovered two interesting videos on using R:

Introduction to R – A Brief Tutorial for R (Software for Statistical Analysis), and,

An Introduction to R for Data Mining by Joseph Rickert. (Recording of the webinar by the same name.)

Bruce has additional links that will be useful with the videos.

Enjoy!

March 29, 2013

The Artful Business of Data Mining…

Filed under: Data Mining,Software,Statistics — Patrick Durusau @ 8:25 am

David Coallier has two presentations under that general title:

Distributed Schema-less Document-Based Databases

and,

Computational Statistics with Open Source Tools

Neither is a “…death by powerpoint…” presentation where the speaker reads text you could read for yourself.

Which is good, except that with minimal slides you get an occasional example and the names of software and techniques, but you have to fill in a lot of context yourself.

A pointer to videos of either of these presentations would be greatly appreciated!

March 12, 2013

RDF Data Cube Vocabulary [Last Call ends 08 April 2013]

Filed under: Data Cubes,RDF,RDF Data Cube Vocabulary,Statistics — Patrick Durusau @ 1:57 pm

RDF Data Cube Vocabulary

Abstract:

There are many situations where it would be useful to be able to publish multi-dimensional data, such as statistics, on the web in such a way that it can be linked to related data sets and concepts. The Data Cube vocabulary provides a means to do this using the W3C RDF (Resource Description Framework) standard. The model underpinning the Data Cube vocabulary is compatible with the cube model that underlies SDMX (Statistical Data and Metadata eXchange), an ISO standard for exchanging and sharing statistical data and metadata among organizations. The Data Cube vocabulary is a core foundation which supports extension vocabularies to enable publication of other aspects of statistical data flows or other multi-dimensional data sets.

If you have comments, now would be a good time to finish them up for submission.

I first saw this in a tweet by Sandro Hawke.

February 25, 2013

R Bootcamp Materials!

Filed under: R,Statistics — Patrick Durusau @ 2:02 pm

R Bootcamp Materials! by Jared Knowles.

From the post:

To train new employees at the Wisconsin Department of Public Instruction, I have developed a 2-3 day series of training modules on how to get work done in R. These modules cover everything from setting up and installing R and RStudio to creating reproducible analyses using the knitr package. There are also some experimental modules for introductions to basic computer programming, and a refresher course on statistics. I hope to improve both of these over time. 

I am happy to announce that all of these materials are available online, for free.

​The bootcamp covers the following topics:

  1. Introduction to R: History of R, R as a programming language, and features of R.
  2. Getting Data In: How to import data into R, manipulate, and manage multiple data objects.
  3. Sorting and Reshaping Data: Long to wide, wide to long, and everything in between!
  4. Cleaning Education Data: Includes material from the Strategic Data Project about how to implement common business rules in processing administrative data.
  5. Regression and Basic Analytics in R: Using school mean test scores to do OLS regression and regression diagnostics — a real world example.
  6. Visualizing Data: Harness the power of R’s data visualization packages to make compelling and informative visualizations.
  7. Exporting Your Work: Learn the knitr package, and how to export graphics and create PDF reports.
  8. Advanced Topics: A potpourri of advanced features in R (by request).
  9. A Statistics Refresher: With interactive examples using shiny.
  10. Programming Principles: Tips and pointers about writing code. (Needs work)

The best part is, all of the materials are available online and free of charge! (Check out the R Bootcamp page.) They are constantly evolving. We have done two R Bootcamps so far, and hope to do more. Each time the materials get a little better.

The R Bootcamp page enables you to download all the materials or view the modules separately.
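As a small taste of module 3 (“Sorting and Reshaping Data”), here is a minimal long-to-wide/wide-to-long sketch using the reshape2 package (my example, not from the bootcamp materials):

  library(reshape2)   # dcast()/melt() for wide <-> long

  # Toy long-format data: one row per student per test
  scores_long <- data.frame(
    student = rep(c("s1", "s2", "s3"), each = 2),
    test    = rep(c("math", "reading"), times = 3),
    score   = c(61, 70, 55, 68, 72, 80)
  )

  # Long to wide: one row per student, one column per test
  scores_wide <- dcast(scores_long, student ~ test, value.var = "score")

  # And back again: wide to long
  melt(scores_wide, id.vars = "student",
       variable.name = "test", value.name = "score")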

If you already know R, pass it on.

February 18, 2013

VOStat: A Statistical Web Service… [Open Government, Are You Listening?]

Filed under: Astroinformatics,Statistics,Topic Maps,VOStat — Patrick Durusau @ 11:59 am

VOStat: A Statistical Web Service for Astronomers

From the post:

VOStat is a simple statistical web service that lets you analyze your data without the hassle of downloading or installing any software. VOStat provides interactive statistical analysis of astronomical tabular datasets. It is integrated into the suite of analysis and visualization tools associated with the Virtual Observatory (VO) through the SAMP communication system. A user supplies VOStat with a dataset and chooses among ~60 statistical functions, including data transformations, plots and summaries, density estimation, one- and two-sample hypothesis tests, global and local regressions, multivariate analysis and clustering, spatial analysis, directional statistics, survival analysis, and time series analysis. VOStat was developed by the Center for Astrostatistics (Pennsylvania State University).

The astronomical community has data sets that dwarf any open government data set, and yet it manages to offer ~60 statistical functions over them?

Whereas in open government circles, dumping data files to public access is considered “being open”?

The technology to do better already exists.

So, what is your explanation for defining openness as “data dumps to the web?”


PS: Have you ever thought about creating a data interface that holds mappings between data sets, such as a topic map would produce?

Would papering over agency differences in terminology assist users in taking advantage of their data sets? (Subject to disclosing that such mapping is happening.)

Would you call that a “TMStat: A Topic Map Statistical Web Service?”

(Disclosure of the basis for mapping being what distinguishes a topic map statistical web service from a fixed mapping between undefined column headers in different tables.)

February 16, 2013

NBA Stats Like Never Before [No RDF/Linked Data/Topic Maps In Sight]

Filed under: Design,Interface Research/Design,Linked Data,RDF,Statistics,Topic Maps — Patrick Durusau @ 4:47 pm

NBA Stats Like Never Before by Timo Elliott.

From the post:

The National Basketball Association today unveiled a new site for fans of game statistics: NBA.com/stats, powered by SAP Analytics technology. The multi-year marketing partnership between SAP and the NBA was announced six months ago:

“We are constantly researching new and emerging technologies in an effort to provide our fans with new ways to connect with our game,” said NBA Commissioner David Stern. “SAP is a leader in providing innovative software solutions and an ideal partner to provide a dynamic and comprehensive statistical offering as fans interact with NBA basketball on a global basis.”

“SAP is honored to partner with the NBA, one of the world’s most respected sports organizations,” said Bill McDermott, co-CEO, SAP. “Through SAP HANA, fans will be able to experience the NBA as never before. This is a slam dunk for SAP, the NBA and the many fans who will now have access to unprecedented insight and analysis.”

The free database contains every box score of every game played since the league’s inception in 1946, including graphical displays of players’ shooting tendencies.

To the average fan, NBA.com/stats delivers information that is of immediate interest to them, not to their computers.

Another way to think about it:

Computers don’t make purchasing decisions, users do.

Something to think about when deciding on your next semantic technology.

January 13, 2013

Outlier Analysis

Filed under: Data Analysis,Outlier Detection,Probability,Statistics — Patrick Durusau @ 8:15 pm

Outlier Analysis by Charu Aggarwal (Springer, January 2013). Post by Gregory Piatetsky.

From the post:

This is an authored text book on outlier analysis. The book can be considered a first comprehensive text book in this area from a data mining and computer science perspective. Most of the earlier books in outlier detection were written from a statistical perspective, and precede the emergence of the data mining field over the last 15-20 years.

Each chapter contains carefully organized content on the topic, case studies, extensive bibliographic notes and the future direction of research in this field. Thus, the book can also be used as a reference aid. Emphasis was placed on simplifying the content, so that the material is relatively easy to assimilate. The book assumes relatively little prior background, other than a very basic understanding of probability and statistical concepts. Therefore, in spite of its deep coverage, it can also provide a good introduction to the beginner. The book includes exercises as well, so that it can be used as a teaching aid.

Table of Contents and Introduction. Includes exercises and a 500+ reference bibliography.
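If you want a quick taste of the statistical end of the territory before committing to the book, here is a minimal sketch (my example, not the book’s) of flagging multivariate outliers by Mahalanobis distance:

  # Flag points far from the multivariate mean, as measured by Mahalanobis distance
  set.seed(42)
  x <- matrix(rnorm(200), ncol = 2)               # 100 well-behaved bivariate points
  x <- rbind(x, c(6, -6))                         # plant one obvious outlier

  d2     <- mahalanobis(x, colMeans(x), cov(x))   # squared distances from the center
  cutoff <- qchisq(0.999, df = ncol(x))           # chi-squared threshold
  which(d2 > cutoff)                              # the planted point (row 101) is flagged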

Definitely a volume for the short reading list.

Caveat: As an outlier by any measure, my opinions here may be biased. 😉

December 29, 2012

Missing-Data Imputation

Filed under: Bayesian Data Analysis,Statistics — Patrick Durusau @ 6:19 am

New book by Stef van Buuren on missing-data imputation looks really good! by Andrew Gelman.

From the post:

Ben points us to a new book, Flexible Imputation of Missing Data. It’s excellent and I highly recommend it. Definitely worth the $89.95. Van Buuren’s book is great even if you don’t end up using the algorithm described in the book (I actually like their approach but I do think there are some limitations with their particular implementation, which is one reason we’re developing our own package); he supplies lots of intuition, examples, and graphs.

Steve Newcomb makes the point that data is dirty. Always.

Stef van Buuren suggests that data may be missing and requires imputation.

Together that means dirty data may be missing and requires imputation.

😉

Imputed or not, data is no more reliable than we are. Use with caution.
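Van Buuren’s approach is implemented in his mice package for R. A minimal sketch of the workflow (impute, analyze each completed data set, pool the results), using the nhanes example data that ships with the package:

  library(mice)   # Stef van Buuren's multiple imputation package

  # nhanes ships with mice: 25 rows with missing bmi, hyp and chl values
  imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

  # Fit the same model to each of the five completed data sets, then pool
  fit <- with(imp, lm(chl ~ age + bmi))
  summary(pool(fit))

pool() combines the five sets of estimates using Rubin’s rules, so the extra uncertainty from imputation shows up in the standard errors.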

Analyzing the Enron Data…

Filed under: Clustering,PageRank,Statistics,Text Analytics,Text Mining — Patrick Durusau @ 6:07 am

Analyzing the Enron Data: Frequency Distribution, Page Rank and Document Clustering by Sujit Pal.

From the post:

I’ve been using the Enron Dataset for a couple of projects now, and I figured that it would be interesting to see if I could glean some information out of the data. One can of course simply read the Wikipedia article, but that would be too easy and not as much fun :-).

My focus on this analysis is on the “what” and the “who”, ie, what are the important ideas in this corpus and who are the principal players. For that I did the following:

  • Extracted the words from Lucene’s inverted index into (term, docID, freq) triples. Using this, I construct a frequency distribution of words in the corpus. Looking at the most frequent words gives us an idea of what is being discussed.
  • Extract the email (from, {to, cc, bcc}) pairs from MongoDB. Using this, I piggyback on Scalding’s PageRank implementation to produce a list of emails by page rank. This gives us an idea of the “important” players.
  • Using the triples extracted from Lucene, construct tuples of (docID, termvector), then cluster the documents using KMeans. This gives us an idea of the spread of ideas in the corpus. Originally, the idea was to use Mahout for the clustering, but I ended up using Weka instead.

I also wanted to get more familiar with Scalding beyond the basic stuff I did before, so I used that where I would have used Hadoop previously. The rest of the code is in Scala as usual.
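Sujit’s pipeline runs on Scalding/Scala, but the PageRank-over-senders idea is easy to prototype in R with igraph. A sketch using hypothetical (from, to) pairs, not the actual MongoDB extract:

  library(igraph)

  # Hypothetical sender/recipient pairs standing in for the mail-header extract
  edges <- data.frame(
    from = c("ken.lay@enron.com", "jeff.skilling@enron.com", "ken.lay@enron.com",
             "sherron.watkins@enron.com", "jeff.skilling@enron.com"),
    to   = c("jeff.skilling@enron.com", "ken.lay@enron.com", "sherron.watkins@enron.com",
             "ken.lay@enron.com", "sherron.watkins@enron.com")
  )

  g  <- graph_from_data_frame(edges, directed = TRUE)
  pr <- page_rank(g)$vector          # PageRank over the email graph
  sort(pr, decreasing = TRUE)        # "important" players first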

Good practice for discovery of the players and main ideas when the “fiscal cliff” document set “leaks,” as you know it will.

Relationships between players and their self-serving recountings versus the data set will make an interesting topic map.

Analyzing Categorical Data

Filed under: Categorical Data,Inference,Statistics — Patrick Durusau @ 5:57 am

Analyzing Categorical Data by Jeffrey S. Simonoff.

Mentioned in My Intro to Multiple Classification… but thought it merited a more prominent mention.

From the webpage:

Welcome to the web site for the book Analyzing Categorical Data, published by Springer-Verlag in July 2003 as part of the Springer Texts in Statistics series. This site allows access to the data sets used in the book, S-PLUS/R and SAS code to perform the analyses in the book, some general information on statistical software for analyzing categorical data, and an errata list. I would be very happy to receive comments on this site, and on the book itself.

Data sets, code to duplicate the analyses in the book, and other information are available at the site.
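If you just want a flavor of the territory the book covers, base R already has the raw ingredients: contingency tables, the chi-squared test, and loglinear models fit as Poisson GLMs. A small sketch (mine, not one of the book’s case studies):

  # Hair colour vs. eye colour, a base R contingency table (summed over sex)
  tab <- margin.table(HairEyeColor, c(1, 2))

  chisq.test(tab)                    # classical test of independence

  # The same question as a loglinear model: independence = no interaction term
  df  <- as.data.frame(tab)
  fit <- glm(Freq ~ Hair + Eye, family = poisson, data = df)
  pchisq(deviance(fit), df.residual(fit), lower.tail = FALSE)   # lack-of-fit p-value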

December 6, 2012

Advanced Data Analysis from an Elementary Point of View

Filed under: Data Analysis,Mathematics,Statistics — Patrick Durusau @ 11:35 am

Advanced Data Analysis from an Elementary Point of View by Cosma Rohilla Shalizi. (UPDATE: 2014 draft.)

From the Introduction:

These are the notes for 36-402, Advanced Data Analysis, at Carnegie Mellon. If you are not enrolled in the class, you should know that it’s the methodological capstone of the core statistics sequence taken by our undergraduate majors (usually in their third year), and by students from a range of other departments. By this point, they have taken classes in introductory statistics and data analysis, probability theory, mathematical statistics, and modern linear regression (“401”). This class does not presume that you have learned but forgotten the material from the pre-requisites; it presumes that you know that material and can go beyond it. The class also presumes a firm grasp on linear algebra and multivariable calculus, and that you can read and write simple functions in R. If you are lacking in any of these areas, now would be an excellent time to leave.

36-402 is a class in statistical methodology: its aim is to get students to understand something of the range of modern1 methods of data analysis, and of the considerations which go into choosing the right method for the job at hand (rather than distorting the problem to fit the methods the student happens to know). Statistical theory is kept to a minimum, and largely introduced as needed.

[Footnote 1] Just as an undergraduate “modern physics” course aims to bring the student up to about 1930 (more specifically, to 1926), this class aims to bring the student up to about 1990.

A very recent introduction to data analysis. Shalizi includes a list of concepts in the introduction that are best mastered before tackling this material.

According to footnote 1, when you have mastered this material, you have another twenty-two years to make up in general and on your problem in particular.

Still, knowing it cold will put you ahead of a lot of data analysis you are going to encounter.

I first saw this in a tweet by Gene Golovchinsky.

December 5, 2012

The Elements of Statistical Learning (2nd ed.)

Filed under: Machine Learning,Mathematics,Statistical Learning,Statistics — Patrick Durusau @ 6:50 am

The Elements of Statistical Learning (2nd ed.) by Trevor Hastie, Robert Tibshirani and Jerome Friedman. (PDF)

The authors note in the preface to the first edition:

The field of Statistics is constantly challenged by the problems that science and industry brings to its door. In the early days, these problems often came from agricultural and industrial experiments and were relatively small in scope. With the advent of computers and the information age, statistical problems have exploded both in size and complexity. Challenges in the areas of data storage, organization and searching have led to the new field of “data mining”; statistical and computational problems in biology and medicine have created “bioinformatics.” Vast amounts of data are being generated in many fields, and the statistician’s job is to make sense of it all: to extract important patterns and trends, and understand “what the data says.” We call this learning from data.

I’m sympathetic to that sentiment but with the caveat that it is our semantic expectations of the data that give it any meaning to be “learned.”

Data isn’t lurking outside our door with “meaning” captured separate and apart from us. Fancying otherwise obscures our role in the origin of the “meaning” we attach to data, in part to bolster the claim that the “facts/data say….”

It is we who take up the gauntlet for our mute friends, facts/data, and make claims on their behalf.

If we recognized those as our claims, perhaps we would be more willing to listen to the claims of others. Perhaps.

I first saw this in a tweet by Michael Conover.

November 17, 2012

Pro Tips for Grad Students in Statistics/Biostatistics [Multi-Part]

Filed under: Biostatistics,Statistics,Students — Patrick Durusau @ 2:34 pm

A recounting of “pro-tips” on becoming a practicing applied statistician.

You may nod along or think some of them are “obvious.” Ask yourself, how many of these tips, adapted to your field, did you put into practice in the last week/month?

Don’t feel bad, I’m right there with you. But trying to do better.

Pro Tips for Grad Students in Statistics/Biostatistics (Part 1)

Pro Tips for Grad Students in Statistics/Biostatistics (Part 2)

Pro-tips for graduate students (Part 3)

Pro-tips for graduate students (Part 4)

Bonus question: What pro-tips would you give to students who want to pursue semantic technologies, including topic maps?

November 4, 2012

#FF00FF 2012

Filed under: Government,Statistics — Patrick Durusau @ 9:16 pm

#FF00FF 2012 – An FAQ for the 2012 US Presidential Election by Peter Norvig.

From the webpage:

This is an FAQ (Frequently Asked Questions list) for the 2012 United States Presidential Election. I need to disclose up front that I support President Obama. However, with the exception of the very last question, this FAQ is designed as a collection of factual information (such as the latest poll results) and of analysis that is as objective as possible. Why did I do this? To educate interested readers. My ambitions are not as grandiose as Sam Wang of the Princeton Election Consortium, who wrote: “When I started doing the Meta-Analysis of State Polls in 2004, I thought it would be a useful tool to get rid of media noise about individual polls. … Space would be opened up for discussion of what really mattered in the campaign – or even discussion of policies. To my disappointment, this has not happened. Maybe it just takes time. Or perhaps polling nerds need to get a few more races right. Let’s see if we move the ball forward for Team Geek on Tuesday.”

Peter Norvig is the Director of Research at Google.

This being the election week in the United States, I thought it might be of general interest.

But I posted it for another reason as well. Even if you like the “presentation” of data in this document, is a document the best way to store that data?

If you instinctively say yes, how do I point to the line for Vermont in the State-by-State Forecasts table?

Surely if you are going to store data, you have some notion of how to get parts of it out again. Yes? Or do I have to retrieve the entire data store, this document, to find that information?

That sounds remarkably lame.

I am not questioning the use of a document to present information, +1! to that.

My question is about the suitability of a document for storing data, storage implying some means of retrieval.

Data Mining Book Review: Dance with Chance

Filed under: Prediction,Statistics — Patrick Durusau @ 8:40 pm

Data Mining Book Review: Dance with Chance by Sandro Saitta.

From the post:

If you ever worked on time series prediction (forecasting), you should read Dance with Chance. It is written by a statistician, a psychologist and a decision scientist (Makridakis, Hogarth and Gaba). As is the case in The Numerati or Super Crunchers, the authors explain complex notions to a non-expert audience. I find the book really interesting and provocative.

The main concept of Dance with Chance is the “illusion of control”. It is when you think you control a future event or situation that is in fact mainly due to chance. This is the opposite of fatalism (when you think you have no control, although you have). The book teaches how to avoid being fooled by this illusion of control. This is very interesting reading for any data miner, particularly anyone involved with forecasting. The book contains dozens of examples of the limitations of forecasting techniques. For example, it explains the issues of forecasting the stock market and when predictions are due to chance. The authors use a brilliant mix of statistics and psychology to prove their point.

From the review this sounds like an interesting read.

Forecasting can be useful but being aware of its limitations is as well.

October 11, 2012

Think Bayes: Bayesian Statistics Made Simple

Filed under: Bayesian Data Analysis,Bayesian Models,Mathematics,Statistics — Patrick Durusau @ 3:24 pm

Think Bayes: Bayesian Statistics Made Simple by Allen B. Downey.

Think Bayes is an introduction to Bayesian statistics using computational methods. This version of the book is a rough draft. I am making this draft available for comments, but it comes with the warning that it is probably full of errors.

Allen has written free books on Python, statistics, complexity and now Bayesian statistics.

If you don’t know his books, good opportunity to give them a try.
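Downey works in Python, but the core move of the book (turn a prior and a likelihood into a posterior, numerically) fits in a few lines of R. A sketch for estimating a coin’s bias from hypothetical counts:

  heads <- 140; tails <- 110           # hypothetical coin-flip data

  # 1. Conjugate shortcut: Beta(1,1) prior -> Beta(1 + heads, 1 + tails) posterior
  qbeta(c(0.025, 0.5, 0.975), 1 + heads, 1 + tails)   # posterior median and 95% interval

  # 2. Grid approximation, the computational approach Downey teaches
  theta      <- seq(0, 1, length.out = 1001)
  prior      <- rep(1, length(theta))
  likelihood <- dbinom(heads, heads + tails, theta)
  posterior  <- prior * likelihood
  posterior  <- posterior / sum(posterior)
  theta[which.max(posterior)]          # posterior mode, 0.56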

October 9, 2012

The 13 Steps to Running Any Statistical Model (Webinar)

Filed under: Statistics — Patrick Durusau @ 1:46 pm

The 13 Steps to Running Any Statistical Model

Webinar:

Date: December 5, 2012

Time: 3pm Eastern Time UTC -4 (2pm Central, 1pm Mountain, 12pm Pacific)

From the post:

All statistical modeling–whether ANOVA, Multiple Regression, Poisson Regression, Multilevel Model–is about understanding the relationship between independent and dependent variables. The content differs, but as a data analyst, you need to follow the same 13 steps to complete your modeling.

This webinar will give you an overview of these 13 steps:

  • what they are
  • why each one is important
  • the general order in which to do them
  • on which steps the different types of modeling differ and where they’re the same

Having a road map for the steps to take will make your modeling more efficient and keep you on track.

Whether the model is the point of your analysis or you are using a statistical model to discover subjects, this could be useful.
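The announcement’s point (the content differs, the steps don’t) is easy to see in R, where ANOVA, regression, Poisson regression and multilevel models all go through the same fit, diagnose, interpret cycle. A toy sketch (mine, not the webinar’s 13 steps):

  library(lme4)   # for the multilevel example

  fit_anova   <- aov(yield ~ N + P, data = npk)                              # ANOVA
  fit_ols     <- lm(Fertility ~ Agriculture + Education, data = swiss)       # multiple regression
  fit_poisson <- glm(count ~ spray, family = poisson, data = InsectSprays)   # Poisson regression
  fit_mlm     <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy) # multilevel model

  # The downstream steps are the same regardless of model type
  lapply(list(fit_anova, fit_ols, fit_poisson, fit_mlm), summary)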

October 7, 2012

Stan (Bayesian Inference) [update]

Filed under: Bayesian Data Analysis,Bayesian Models,Statistics — Patrick Durusau @ 4:17 pm

Stan

From the webpage:

Stan is an open-source package for obtaining Bayesian inference using the No-U-Turn sampler, a variant of Hamiltonian Monte Carlo.

I first reported on a presentation: Stan: A (Bayesian) Directed Graphical Model Compiler last January when Stan was unreleased.

Following a link from Christophe Lalanne’s A bag of tweets / September 2012, I find the released version of the software!

Very cool!
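For the curious, here is the smallest possible taste via the rstan interface, estimating a Bernoulli probability from hypothetical coin flips. (This uses current Stan syntax, which has drifted since the 2012 release announced here.)

  library(rstan)   # R interface to Stan

  model_code <- "
  data {
    int<lower=0> N;
    array[N] int<lower=0, upper=1> y;
  }
  parameters {
    real<lower=0, upper=1> theta;
  }
  model {
    theta ~ beta(1, 1);
    y ~ bernoulli(theta);   // sampled with NUTS, no conjugacy required
  }
  "

  fit <- stan(model_code = model_code, chains = 2, iter = 1000,
              data = list(N = 10, y = c(1, 0, 1, 1, 0, 1, 1, 0, 1, 1)))
  print(fit)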

Revisiting “Ranking the popularity of programming languages”: creating tiers

Filed under: Data Mining,Graphics,Statistics,Visualization — Patrick Durusau @ 4:05 pm

Revisiting “Ranking the popularity of programming languages”: creating tiers by Drew Conway.

From the post:

In a post on dataists almost two years ago, John Myles White and I posed the question: “How would you rank the popularity of a programming language?”.

From the original post:

One way to do so is to count the number of projects using each language, and rank those with the most projects as being the most popular. Another might be to measure the size of a language’s “community,” and use that as a proxy for its popularity. Each has its advantages and disadvantages. Counting the number of projects is perhaps the “purest” measure of a language’s popularity, but it may overweight languages based on their legacy or use in production systems. Likewise, measuring community size can provide insight into the breadth of applications for a language, but it can be difficult to distinguish among languages with a vocal minority versus those that actually have large communities.

So, we spent an evening at Princeton hacking around on Github and StackOverflow to get data on the number of projects and questions tagged, per programming language, respectively. The result was a scatter plot showing the linear relationship between these two measures. As with any post comparing programming languages, it was great bait for the Internet masses to poke holes in, and since then Stephen O’Grady at Redmonk has been re-running the analysis to show changes in the relative position of languages over time.

Today I am giving a talk at Monktoberfest on the importance of pursuing good questions in data science. As an example, I wanted to revisit the problem of ranking programming languages. For a long time I have been unsatisfied with the outcome of the original post, because the chart does not really address the original question about ranking.

I would not downplay the importance of Drew’s descriptive analysis.

Until you can describe something, it is really difficult to explain it. 😉

October 6, 2012

It takes time: A remarkable example of delayed recognition

Filed under: Marketing,Peirce,Statistics — Patrick Durusau @ 6:27 pm

It takes time: A remarkable example of delayed recognition by Ben Van Calster. (Van Calster, B. (2012), It takes time: A remarkable example of delayed recognition. J. Am. Soc. Inf. Sci. doi: 10.1002/asi.22732)

Abstract:

The way in which scientific publications are picked up by the research community can vary. Some articles become instantly cited, whereas others go unnoticed for some time before they are discovered or rediscovered. Papers with delayed recognition have also been labeled “sleeping beauties.” I briefly discuss an extreme case of a sleeping beauty. Peirce’s short note in Science in 1884 shows a remarkable increase in citations since around 2000. The note received less than 1 citation per year in the decades prior to 2000, 3.5 citations per year in the 2000s, and 10.4 in the 2010s. This increase was seen in several domains, most notably meteorology, medical prediction research, and economics. The paper outlines formulas to evaluate a binary prediction system for a binary outcome. This citation increase in various domains may be attributed to a widespread, growing research focus on mathematical prediction systems and the evaluation thereof. Several recently suggested evaluation measures essentially reinvented or extended Peirce’s 120-year-old ideas.

I would call your attention to the last line of the abstract:

Several recently suggested evaluation measures essentially reinvented or extended Peirce’s 120-year-old ideas.

I take that to mean that, with better curation of ideas, perhaps we would invent different ideas?

The paper ends:

To conclude, the simple ideas presented in Peirce’s note have been reinvented and rediscovered several decades or even more than a century later. It is fascinating that we arrive at ideas presented more than a century ago, and that Peirce’s ideas on the evaluation of predictions have come to the surface regularly across time and discipline. A saying, attributed to Ivan Pavlov, goes: “If you want new ideas, read old books.”

What old books are you going to read this weekend?

PS: Just curious. What search terms would you use, other than the author’s name and article title, to ensure that you could find this article again? And what would lead you from the various fields cited in the article to related information?
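For the record, the 1884 measure itself is disarmingly simple: hit rate minus false-alarm rate (today it travels under names like the true skill statistic or Hanssen-Kuipers discriminant). A sketch in R:

  # Peirce's 1884 score for a binary prediction of a binary outcome:
  # hit rate minus false-alarm rate (a.k.a. the true skill statistic)
  peirce_score <- function(pred, truth) {
    hit_rate         <- mean(pred[truth == 1] == 1)
    false_alarm_rate <- mean(pred[truth == 0] == 1)
    hit_rate - false_alarm_rate
  }

  # Toy example: a predictor that does better than chance
  truth <- c(1, 1, 1, 1, 0, 0, 0, 0)
  pred  <- c(1, 1, 1, 0, 0, 0, 1, 0)
  peirce_score(pred, truth)   # 0.75 - 0.25 = 0.5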

September 23, 2012

Working More Effectively With Statisticians

Filed under: Bioinformatics,Biomedical,Data Quality,Statistics — Patrick Durusau @ 10:33 am

Working More Effectively With Statisticians by Deborah M. Anderson. (Fall 2012 Newsletter of the Society for Clinical Data Management, pages 5-8)

Abstract:

The role of the clinical trial biostatistician is to lend scientific expertise to the goal of demonstrating safety and efficacy of investigative treatments. Their success, and the outcome of the clinical trial, is predicated on adequate data quality, among other factors. Consequently, the clinical data manager plays a critical role in the statistical analysis of clinical trial data. In order to better fulfill this role, data managers must work together with the biostatisticians and be aligned in their understanding of data quality. This article proposes ten specific recommendations for data managers in order to facilitate more effective collaboration with biostatisticians.

See the article for the details but the recommendations are generally applicable to all data collection projects:

Recommendation #1: Communicate early and often with the biostatistician and provide frequent data extracts for review.

Recommendation #2: Employ caution when advising sites or interactive voice/web response (IVR/IVW) vendors on handling of randomization errors.

Recommendation #3: Collect the actual investigational treatment and dose group for each subject.

Recommendation #4: Think carefully and consult the biostatistician about the best way to structure investigational treatment exposure and accountability data.

Recommendation #5: Clarify in electronic data capture (EDC) specifications whether a question is only a “prompt” screen or whether the answer to the question will be collected explicitly in the database.

Recommendation #6: Recognize the most critical data items from a statistical analysis perspective and apply the highest quality standards to them.

Recommendation #7: Be alert to protocol deviations/violations (PDVs).

Recommendation #8: Plan for a database freeze and final review before database lock.

Recommendation #9: Archive a snapshot of the clinical database at key analysis milestones and at the end of the study.

Recommendation #10: Educate yourself about fundamental statistical principles whenever the opportunity arises.

I first saw this at John Johnson’s Data cleaning is harder than statistical analysis.

September 17, 2012

Statistical Data Mining Tutorials

Filed under: Data Mining,Statistics — Patrick Durusau @ 6:25 pm

Statistical Data Mining Tutorials by Andrew Moore.

From the post:

The following links point to a set of tutorials on many aspects of statistical data mining, including the foundations of probability, the foundations of statistical data analysis, and most of the classic machine learning and data mining algorithms.

These include classification algorithms such as decision trees, neural nets, Bayesian classifiers, Support Vector Machines and case-based (aka non-parametric) learning. They include regression algorithms such as multivariate polynomial regression, MARS, Locally Weighted Regression, GMDH and neural nets. And they include other data mining operations such as clustering (mixture models, k-means and hierarchical), Bayesian networks and Reinforcement Learning.

Perhaps a bit dated but not seriously so.

And one never knows when a slightly different explanation will make something obscure suddenly clear.

Probability and Statistics Cookbook

Filed under: Mathematics,Probability,Statistics — Patrick Durusau @ 6:16 pm

Probability and Statistics Cookbook by Matthias Vallentin.

From the webpage:

The cookbook contains a succinct representation of various topics in probability theory and statistics. It provides a comprehensive reference reduced to the mathematical essence, rather than aiming for elaborate explanations.

A very summary presentation, so it is better as a quick-reminder type of resource.

I was particularly impressed by the univariate distribution relationships map on the last page.

In that regard, you may want to look at John D. Cook’s Diagram of distribution relationships and the links therein.

September 15, 2012

Nonparametric Techniques – Webinar [Think Movie Ratings]

Filed under: Nonparametric,Recommendation,Statistics — Patrick Durusau @ 2:30 pm

Overview of Nonparametric Techniques with Elaine Eisenbeisz.

Date: October 3, 2012

Time: 3pm Eastern Time UTC -4 (2pm Central, 1pm Mountain, 12pm Pacific)

From the description:

A distribution of data which is not normal does not mean it is abnormal. There are many data analysis techniques which do not require the assumption of normality.

This webinar will provide information on when it is best to use nonparametric alternatives and provides information on suggested tests to use in lieu of:

  • Independent samples and paired t-tests
  • Analysis of variance techniques
  • Pearson’s Product Moment Correlation
  • Repeated measures designs

A description of nonparametric techniques for use with count data and contingency tables will also be provided.

Movie ratings, a ranked population, are appropriate for nonparametric methods.

You just thought you didn’t know anything about nonparametric methods. 😉

Applicable to all ranked populations (can you say recommendation?).

While you wait for the webinar, try some of the references from Wikipedia: Nonparametric Statistics.
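Or start from this rough mapping (mine, not the webinar’s) of the parametric tests listed above to their usual nonparametric stand-ins in base R, using built-in and toy data:

  # Independent samples and paired t-tests -> rank-based Wilcoxon tests
  wilcox.test(mpg ~ am, data = mtcars)                               # two independent samples
  wilcox.test(sleep$extra[1:10], sleep$extra[11:20], paired = TRUE)  # paired samples

  # One-way analysis of variance -> Kruskal-Wallis
  kruskal.test(count ~ spray, data = InsectSprays)

  # Pearson's product moment correlation -> Spearman's rank correlation
  cor.test(mtcars$mpg, mtcars$wt, method = "spearman")

  # Repeated measures designs -> Friedman test (toy matrix: rows = subjects, cols = conditions)
  rm_scores <- matrix(c(5, 7, 6,
                        4, 8, 5,
                        6, 9, 7,
                        5, 6, 8), ncol = 3, byrow = TRUE)
  friedman.test(rm_scores)

  # Count data and contingency tables -> chi-squared or Fisher's exact test
  fisher.test(margin.table(UCBAdmissions, c(1, 2)))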

September 13, 2012

Prison Polling [If You Don’t Ask, You Won’t Know]

Filed under: Data,Design,Statistics — Patrick Durusau @ 9:44 am

Prison Polling by Carl Bialik.

From the post:

My print column examines the argument of a book out this week that major federal surveys are missing an important part of the population by not polling prisoners.

“We’re missing 1% of the population,” said Becky Pettit, a University of Washington sociologist and author of the book, “Invisible Men.” “People might say, ‘That’s not a big deal.’” But it is for some groups, she writes — particularly young black men. And for young black men, especially those without a high-school diploma, official statistics paint a rosier picture than reality on factors such as employment and voter turnout.

“Because many surveys skip institutionalized populations, and because we incarcerate lots of people, especially young black men with low levels of education, certain statistics can look rosier than if we included” prisoners in surveys, said Jason Schnittker, a sociologist at the University of Pennsylvania. “Whether you regard the impact as ‘massive’ depends on your perspective. The problem of incarceration tends to get swept under the rug in lots of different ways, rendering the issue invisible.”

A reminder that assumptions are cooked into data long before it reaches us for analysis.

If we don’t ask questions about data collection, we may be passing on results that don’t serve the best interests of our clients.

So for population data, ask (among other things):

  • Who was included/excluded?
  • How were the included selected?
  • On what basis were people excluded?
  • Where are the survey questions?
  • By what means were the questions asked? (phone, web, in person)
  • Time of day of survey?

and I am sure there are others.

Don’t be impressed by protests that your questions are irrelevant or the source has already “accounted” for that issue.

Right.

When someone protests you don’t need to know, you know where to push. Trust me on that one.

September 12, 2012

Wikipedia is dominated by male editors

Filed under: Graphics,Statistics,Visualization — Patrick Durusau @ 7:27 pm

Wikipedia is dominated by male editors by Nathan Yau.

From the post:

After he saw a New York Times article on the gender gap among Wikipedia contributors (The contributor base is only 13 percent women), Santiago Ortiz plotted articles by number of men versus number of women who edited. It’s interactive, so you can mouse over dots to see what article each represents, and you can zoom in for closer look in the bottom left.

This graphic merits wide circulation.

There isn’t a recipe for how to make such an effective graphic, other than perhaps to have studied equally effective graphics.

I will try to hunt down an example I saw many years ago that plotted population versus representation at the United Nations. If I can find it, you can draw your own conclusions about it.

In the meantime, if you spot graphics/visualizations that are clearly a cut above others, please share.

August 15, 2012

The Statistical Sleuth (second edition) in R

Filed under: R,Statistics — Patrick Durusau @ 7:59 pm

The Statistical Sleuth (second edition) in R by Nick Horton.

For those of you who teach, or are interested in seeing an illustrated series of analyses, there is a new compendium of files to help describe how to fit models for the extended case studies in the Second Edition of the Statistical Sleuth: A Course in Methods of Data Analysis (2002), the excellent text by Fred Ramsey and Dan Schafer. If you are using this book, or would like to see straightforward ways to undertake analyses in R for intro and intermediate statistics courses, these may be of interest.

This originally appeared at SAS and R.

August 11, 2012

Confusing Statistical Term #7: GLM

Filed under: Names,Statistics — Patrick Durusau @ 3:43 pm

Confusing Statistical Term #7: GLM by Karen Grace-Martin.

From the post:

Like some of the other terms in our list–level and beta–GLM has two different meanings.

It’s a little different than the others, though, because it’s an abbreviation for two different terms:

General Linear Model and Generalized Linear Model.

It’s extra confusing because their names are so similar on top of having the same abbreviation.

And, oh yeah, Generalized Linear Models are an extension of General Linear Models.

And neither should be confused with Generalized Linear Mixed Models, abbreviated GLMM.

Naturally.

So what’s the difference? And does it really matter?

As you probably have guessed, yes.

You will need a reading knowledge of statistics to really appreciate the post. If you don’t have such knowledge, now would be a good time to pick it up.
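In R terms, the distinction looks something like this (my sketch, not Karen’s post):

  library(lme4)   # glmer() for the mixed-model case; the cbpp data ships with lme4

  # General Linear Model: continuous response, normal errors
  lm(weight ~ Time + Diet, data = ChickWeight)

  # Generalized Linear Model: same linear-predictor idea, non-normal response
  glm(cbind(incidence, size - incidence) ~ period,
      family = binomial, data = cbpp)

  # Generalized Linear Mixed Model (GLMM): a GLM plus random effects
  glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
        family = binomial, data = cbpp)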

Statistics are a way of summarizing information about subjects. You can rely on the judgements of others on such summaries or you can have your own.
