Archive for the ‘Statistics’ Category

Statistics vs. Machine Learning Dictionary (flat text vs. topic map)

Saturday, December 16th, 2017

Data science terminology (UBC Master of Data Science)

From the webpage:

About this document

This document is intended to help students navigate the large amount of jargon, terminology, and acronyms encountered in the MDS program and beyond. There is also an accompanying blog post.

Stat-ML dictionary

This section covers terms that have different meanings in different contexts, specifically statistics vs. machine learning (ML).
… (emphasis in original)

Gasp! You don’t mean that the same words have different meanings in machine learning and statistics!

Even more shocking, some words/acronyms have the same meaning!

Never fear, a human reader can use this document to distinguish the usages.

Automated processors, not so much.

If these terms were treated as occurrences of topics, where the topics had the respective scopes of statistics and machine-learning, then for any scoped document, an enhanced view with the correct definition for the unsteady reader could be supplied.

Static markup of legacy documents is not required as annotations can be added as a document is streamed to a reader. Opening the potential, of course, for different annotations depending upon the skill and interest of the reader.
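A minimal sketch of that idea, assuming a small hand-built vocabulary rather than a real topic map engine. The names `SCOPED_TERMS`, `annotate` and `annotate_stream` are hypothetical, and the two definitions of "bias" are illustrative, not copied from the UBC document:

```python
import re

# Hypothetical scoped vocabulary: term -> {scope: definition}.
# The definitions are illustrative, not taken from the UBC document.
SCOPED_TERMS = {
    "bias": {
        "statistics": "expected value of an estimator minus the true value",
        "machine-learning": "the intercept (constant) term of a model",
    },
}


def annotate(text, scope, terms=SCOPED_TERMS):
    """Append the scope-appropriate definition after each known term."""

    def add_note(match):
        word = match.group(0)
        definition = terms[word.lower()].get(scope)
        # Leave the word untouched when no definition exists for this scope.
        return f"{word} [{definition}]" if definition else word

    pattern = r"\b(" + "|".join(map(re.escape, terms)) + r")\b"
    return re.sub(pattern, add_note, text, flags=re.IGNORECASE)


def annotate_stream(lines, scope):
    """Annotate lines lazily, as they are streamed to a reader."""
    for line in lines:
        yield annotate(line, scope)
```

The source text is never modified. The same lines can be streamed with scope "statistics" for one reader and "machine-learning" for another.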

If for each term/subject, more properties than the scope of statistics or machine-learning or both were supplied, users of the topic map could search on those properties to match terms not included here. For example: which type of bias (in statistics) does “bias” mean in your paper? A casually written Wikipedia article reports twelve and with refinement, the number could be higher.

Flat text is far easier to write than a topic map but tasks every reader with re-discovering the distinctions already known to the author of the document.

Imagine your office’s, department’s, or agency’s vocabulary and its definitions captured and then used to annotate internal or external documentation for your staff.

Instead of a very new staffer asking (hopefully), what do we mean by (your common term), the definition appears with a mouse-over in a document.

Are you capturing the soft knowledge of your staff?

Statistical Functions for XQuery 3.1 (see OpenFormula)

Saturday, June 24th, 2017

simple-statsxq by Tim Thompson.

From the webpage:

Loosely inspired by the JavaScript simple-statistics project. The goal of this module is to provide a basic set of statistical functions for doing data analysis in XQuery 3.1.

Functions are intended to be implementation-agnostic.

Unit tests were written using the unit testing module of BaseX.

OpenFormula (Open Document Format for Office Applications (OpenDocument)) defines eighty-seven (87) statistical functions.

There are fifty-five (55) financial functions defined by OpenFormula, just in case you are interested.

If You Can’t See The Data, The Statistics Are False

Saturday, June 10th, 2017

The headline, If You Can’t See The Data, The Statistics Are False is my one line summary of 73.6% of all Statistics are Made Up – How to Interpret Analyst Reports by Mark Suster.

You should read Suster’s post in full, if for no other reason than his accounts of how statistics are created, that’s right, created, for reports:

But all of the data projections were so different so I decided to call some of the research companies and ask how they derived their data. I got the analyst who wrote one of the reports on the phone and asked how he got his projections. He must have been about 24. He said, literally, I sh*t you not, “well, my report was due and I didn’t have much time. My boss told me to look at the growth rate average over the past 3 years an increase it by 2% because mobile penetration is increasing.” There you go. As scientific as that.

I called another agency. They were more scientific. They had interviewed telecom operators, handset manufacturers and corporate buyers. They had come up with a CAGR (compounded annual growth rate) that was 3% higher that the other report, which in a few years makes a huge difference. I grilled the analyst a bit. I said, “So you interviewed the people to get a plausible story line and then just did a simple estimation of the numbers going forward?”

“Yes. Pretty much”

Write down the name of your favorite business magazine.

How many stories have you enjoyed over the past six months with “scientific” statistics like those?

Suster has five common tips for being a more informed consumer of data. All of which require effort on your part.

I have only one, which requires only reading on your part:

Can you see the data for the statistic? By that I mean is the original data, along with who collected it, how it was collected, when it was collected, etc., available to the reader?

If not, the statistic is either false or inflated.

The test I suggest is applicable at the point where you encounter the statistic. It puts the burden on the author who wants their statistic to be credited, to empower the user to evaluate their statistic.

Imagine the data analyst story where the growth rate statistic had this footnote:

1. Averaged growth rate over past three (3) years and added 2% at direction of management.

It reports the same statistic but also warns the reader the result is a management fantasy. Might be right, might be wrong.
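To see how much a footnote like that matters, here is a small sketch of the arithmetic. All inputs are made up for illustration (they are not from Suster's post), and it assumes "2%" and "3% higher" mean percentage points added to the annual rate:

```python
def project(base, annual_rate, years):
    """Compound `base` forward at `annual_rate` for `years` years."""
    return base * (1 + annual_rate) ** years


# Hypothetical inputs, for illustration only (not from Suster's post).
past_growth = [0.10, 0.12, 0.14]   # growth in each of the last three years
base_market = 100.0                # market size today, arbitrary units

average = sum(past_growth) / len(past_growth)
padded = average + 0.02   # "added 2% at direction of management"
higher = padded + 0.03    # a second report whose CAGR is 3 points higher

for label, rate in [("average", average), ("padded", padded), ("higher", higher)]:
    size = project(base_market, rate, 5)
    print(f"{label:8s} {rate:.2%} -> {size:.1f} after 5 years")
```

Small changes to the rate compound, so the three projections drift further apart every year. Without the footnote, a reader cannot tell which of them is in front of them.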

Patronize publications with statistics + underlying data. Authors and publishers will get the idea soon enough.

John Carlisle Hunts Bad Science (you can too!)

Tuesday, June 6th, 2017

Carlisle’s statistics bombshell names and shames rigged clinical trials by Leonid Schneider.

From the post:

John Carlisle is a British anaesthesiologist, who works in a seaside Torbay Hospital near Exeter, at the English Channel. Despite not being a professor or in academia at all, he is a legend in medical research, because his amazing statistics skills and his fearlessness to use them exposed scientific fraud of several of his esteemed anaesthesiologist colleagues and professors: the retraction record holder Yoshitaka Fujii and his partner Yuhji Saitoh, as well as Scott Reuben and Joachim Boldt. This method needs no access to the original data: the number presented in the published paper suffice to check if they are actually real. Carlisle was fortunate also to have the support of his journal, Anaesthesia, when evidence of data manipulations in their clinical trials was found using his methodology. Now, the editor Carlisle dropped a major bomb by exposing many likely rigged clinical trial publications not only in his own Anaesthesia, but in five more anaesthesiology journals and two “general” ones, the stellar medical research outlets NEJM and JAMA. The clinical trials exposed in the latter for their unrealistic statistics are therefore from various fields of medicine, not just anaesthesiology. The medical publishing scandal caused by Carlisle now is perfect, and the elite journals had no choice but to announce investigations which they even intend to coordinate. Time will show how seriously their effort is meant.

Carlisle’s bombshell paper “Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals” was published today in Anaesthesia, Carlisle 2017, DOI: 10.1111/anae.13962. It is accompanied by an explanatory editorial, Loadsman & McCulloch 2017, doi: 10.1111/anae.13938. A Guardian article written by Stephen Buranyi provides the details. There is also another, earlier editorial in Anaesthesia, which explains Carlisle’s methodology rather well (Pandit, 2012).

… (emphasis in original)

Cutting to the chase, Carlisle found 90 papers with statistical patterns unlikely to occur by chance in 5,087 clinical trials.

There is a wealth of science papers to be investigated. Sarah Boon, in 21st Century Science Overload (2016), points out that there are 2.5 million new scientific papers published every year, in 28,100 active scholarly peer-reviewed journals (2014).

Since Carlisle has done eight (8) journals, that leaves ~28,092 for your review. 😉

Happy hunting!

PS: I can easily imagine an exercise along these lines being the final project for a data mining curriculum. You?
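As a starting point for such an exercise, here is a rough sketch in the spirit of the approach, not Carlisle's actual procedure. It assumes that in a properly randomised trial, p-values for baseline differences between groups should look roughly uniform on (0, 1), and it uses a normal approximation so that only published summary numbers are needed. The function names are hypothetical:

```python
import math


def baseline_p_value(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Two-sided p-value for a difference in baseline means.

    Uses a normal approximation (reasonable for large groups), so only
    the summary numbers printed in a paper are needed.
    """
    standard_error = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    z = (mean_a - mean_b) / standard_error
    return math.erfc(abs(z) / math.sqrt(2))


def distance_from_uniform(p_values):
    """Kolmogorov-Smirnov distance between the p-values and Uniform(0, 1)."""
    ordered = sorted(p_values)
    n = len(ordered)
    return max(
        max((i + 1) / n - p, p - i / n) for i, p in enumerate(ordered)
    )
```

A set of trials whose baseline p-values pile up near 1 (groups suspiciously alike) or near 0 (groups suspiciously different) gives a large distance. That is a reason to look closer, not proof of fraud.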

How to Spot Visualization Lies

Monday, May 8th, 2017

How to Spot Visualization Lies : Keep your eyes open by Nathan Yau.

From the post:

It used to be that we’d see a poorly made graph or a data design goof, laugh it up a bit, and then carry on. At some point though — during this past year especially — it grew more difficult to distinguish a visualization snafu from bias and deliberate misinformation.

Of course, lying with statistics has been a thing for a long time, but charts tend to spread far and wide these days. There’s a lot of them. Some don’t tell the truth. Maybe you glance at it and that’s it, but a simple message sticks and builds. Before you know it, Leonardo DiCaprio spins a top on a table and no one cares if it falls or continues to rotate.

So it’s all the more important now to quickly decide if a graph is telling the truth. This a guide to help you spot the visualization lies.

Warning: Your blind acceptance/enjoyment of news graphics may be diminished by this post. You have been warned.

Beautifully illustrated as always.

Perhaps Nathan will produce a double-sided, laminated version to keep by your TV chair. A great graduation present!

Geometry of Redistricting: Summer School (Apply February 15 – March 31, 2017)

Monday, January 30th, 2017

Geometry of Redistricting: Summer School

From the webpage:

A 5-day summer school will be offered at Tufts University from August 7-11, 2017, with the principal purpose of training mathematicians to be expert witnesses for court cases on redistricting and gerrymandering.

Topics covered in the summer school will include:

  • the legal history of the Voting Rights Act and its subsequent renewals, extensions, and clarifications;
  • an explanation of “traditional districting principles,” especially compactness;
  • a course in metric geometry and mathematical ideas for perimeter-free compactness;
  • basic rudiments of GIS and the technical side of how shapefiles work;
  • training on being an expert witness;
  • ideas for incorporating voting and civil rights into mathematics teaching.

Some of the sessions in the summer school will be open to the public, and others will be limited to official participants. Partial funding for participants’ expenses will be available. The summer school is aimed at, but not limited to, people with doctoral training in mathematics. Preference will be given to those who can stay for the full week.

An application form will be posted on this website, and applications will be accepted from February 15 – March 31. Please contact to be added to the mailing list.

If you don’t have doctoral training in mathematics, consider the resources at: Gerrymandering and the shape of fairness, which self-describes as:

This site is devoted to the Metric Geometry and Gerrymandering Group run by Moon Duchin on understanding apportionment, districting, and gerrymandering as problems at the intersection of law, civil rights, and mathematics (particularly metric geometry).
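For readers new to compactness, here is a sketch of one widely used score, Polsby-Popper, which compares a district's area to the area of a circle with the same perimeter. It is a perimeter-based baseline, not one of the perimeter-free ideas the summer school mentions, and it assumes the district is a simple polygon in planar coordinates (real shapefiles need projection first):

```python
import math


def polygon_area_and_perimeter(vertices):
    """Area (shoelace formula) and perimeter of a simple polygon.

    `vertices` is a list of (x, y) pairs in planar coordinates, in order.
    """
    area2 = 0.0
    perimeter = 0.0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        area2 += x1 * y2 - x2 * y1
        perimeter += math.hypot(x2 - x1, y2 - y1)
    return abs(area2) / 2, perimeter


def polsby_popper(vertices):
    """Polsby-Popper score: 4 * pi * area / perimeter**2, in (0, 1]."""
    area, perimeter = polygon_area_and_perimeter(vertices)
    return 4 * math.pi * area / perimeter**2
```

A circle scores 1 and a unit square scores pi/4. A long, thin sliver with the same area scores close to 0.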

Do you need a reminder that the mid-term congressional elections in 2018 aren’t far away?


Three More Reasons To Learn R

Friday, January 6th, 2017

Three reasons to learn R today by David Smith.

From the post:

If you're just getting started with data science, the Sharp Sight Labs blog argues that R is the best data science language to learn today.

The blog post gives several detailed reasons, but the main arguments are:

  1. R is an extremely popular (arguably the most popular) data programming language, and ranks highly in several popularity surveys.
  2. Learning R is a great way of learning data science, with many R-based books and resources for probability, frequentist and Bayesian statistics, data visualization, machine learning and more.
  3. Python is another excellent language for data science, but with R it's easier to learn the foundations.

Once you've learned the basics, Sharp Sight also argues that R is also a great data science language to master, even though it's an old language compared to some of the newer alternatives. Every tool has a shelf life, but R isn't going anywhere and learning R gives you a foundation beyond the language itself.

If you want to get started with R, Sharp Sight labs offers a data science crash course. You might also want to check out the Introduction to R for Data Science course on EdX.

Sharp Sight Labs: Why R is the best data science language to learn today, and Why you should master R (even if it might eventually become obsolete)

If you need more reasons to learn R:

  • Unlike Facebook, R isn’t a sinkhole of non-testable propositions.
  • Unlike Instagram, R is rarely NSFW.
  • Unlike Twitter, R is a marketable skill.

Glad to hear you are learning R!

Q&A Cathy O’Neil…

Wednesday, January 4th, 2017

Q&A Cathy O’Neil, author of ‘Weapons of Math Destruction,’ on the dark side of big data by Christine Zhang.

From the post:

Cathy O’Neil calls herself a data skeptic. A former hedge fund analyst with a PhD in mathematics from Harvard University, the Occupy Wall Street activist left finance after witnessing the damage wrought by faulty math in the wake of the housing crash.

In her latest book, “Weapons of Math Destruction,” O’Neil warns that the statistical models hailed by big data evangelists as the solution to today’s societal problems, like which teachers to fire or which criminals to give longer prison terms, can codify biases and exacerbate inequalities. “Models are opinions embedded in mathematics,” she writes.

Great interview that hits enough high points to leave you wanting to learn more about Cathy and her analysis.

On that score, try:

Read her mathbabe blog.

Follow @mathbabedotorg.

Read Weapons of math destruction : how big data increases inequality and threatens democracy.

Try her new business: ORCAA [O’Neil Risk Consulting and Algorithmic Auditing].

From the ORCAA homepage:

ORCAA’s mission is two-fold. First, it is to help companies and organizations that rely on time and cost-saving algorithms to get ahead of this wave, to understand and plan for their litigation and reputation risk, and most importantly to use algorithms fairly.

The second half of ORCAA’s mission is this: to develop rigorous methodology and tools, and to set rigorous standards for the new field of algorithmic auditing.

There are bright-line cases (sentencing, housing, hiring discrimination) where “fair” has a binding legal meaning, and legal liability for not being “fair.”

Outside such areas, the search for “fairness” seems quixotic. Clients are entitled to their definitions of “fair” in those areas.

How to weigh a dog with a ruler? [Or Price a US Representative?]

Wednesday, December 14th, 2016

How to weigh a dog with a ruler? (looking for translators)

From the post:

We are working on a series of comic books that introduce statistical thinking and could be used as activity booklets in primary schools. Stories are built around adventures of siblings: Beta (skilled mathematician) and Bit (data hacker).

What is the connection between these comic books and R? All plots are created with ggplot2.

The first story (How to weigh a dog with a ruler?) is translated to English, Polish and Czech. If you would like to help us to translate this story to your native language, just write to me (przemyslaw.biecek at gmail) or create an issue on GitHub. It’s just 8 pages long, translations are available on Creative Commons BY-ND licence.

The key is to chart animals by their height as against their weight.
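A sketch of that idea, assuming a straight-line relationship and using made-up heights and weights (the comic's own data and plots are not reproduced here). `fit_line` is a hypothetical helper doing ordinary least squares:

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up (height in cm, weight in kg) pairs, purely for illustration.
heights = [20, 30, 40, 50, 60]
weights = [4, 9, 15, 22, 30]

slope, intercept = fit_line(heights, weights)

# "Weigh" a new dog using only a ruler reading of its height.
ruler_reading = 45
estimate = slope * ruler_reading + intercept
print(f"estimated weight: {estimate:.2f} kg")
```

The same fit would work with years of service on one axis and price on the other, if someone ran the data.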

Pricing US Representatives is likely to follow a similar relationship where their price goes up by years of service in Congress.

I haven’t run the data but such a chart would keep “people” (includes corporations in the US) from paying too much or offering too little. To the embarrassment of all concerned.

Trump Wins! Trump Wins! A Diversity Lesson For Data Scientists

Wednesday, November 9th, 2016

Here’s Every Major Poll That Got Donald Trump’s Election Win Wrong by Brian Flood.

From the post:

When Donald Trump shocked the world to become the president-elect on Tuesday night, the biggest loser wasn’t his opponent Hillary Clinton, it was the polling industry that tricked America into thinking we’d be celebrating the first female president right about now.

The polls, which Trump has been calling inaccurate and rigged for months, made it seem like Clinton was a lock to occupy the White House come January.

Nate Silver’s FiveThirtyEight is supposed to specialize in data-based journalism, but the site reported on Tuesday morning that Clinton had a 71.4 percent chance of winning the election. The site was wrong about the outcome in major battleground states including Florida, North Carolina and Pennsylvania, and Trump obviously won the election in addition to the individual states that were supposed to vote Clinton. Silver wasn’t the only pollster to botch the 2016 election.

Trump’s victory should teach would-be data scientists this important lesson:

Diversity is important in designing data collection

Some of the reasons given for the failure of prediction in this election:

  1. People without regular voting records voted.
  2. People polled weren’t honest about their intended choices.
  3. Pollsters weren’t looking for a large, angry segment of the population.

All of which can be traced back to a lack of imagination/diversity in the preparation of the polling instruments.

Ironic, isn’t it?

Strive for diversity, including people whose ideas you find distasteful.

Such as vocal Trump supporters. (Substitute your favorite villain.)

Predicting American Politics

Saturday, September 3rd, 2016

Presidential Election Predictions 2016 (an ASA competition) by Jo Hardin.

From the post:

In this election year, the American Statistical Association (ASA) has put together a competition for students to predict the exact percentages for the winner of the 2016 presidential election. They are offering cash prizes for the entry that gets closest to the national vote percentage and that best predicts the winners for each state and the District of Columbia. For more details see:

To get you started, I’ve written an analysis of data scraped from The analysis uses weighted means and a formula for the standard error (SE) of a weighted mean. For your analysis, you might consider a similar analysis on the state data (what assumptions would you make for a new weight function?). Or you might try some kind of model – either a generalized linear model or a Bayesian analysis with an informed prior. The world is your oyster!
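For anyone starting a prediction pool, here is one way a weighted mean and its standard error might be computed. This is a sketch under stated assumptions, not necessarily the formula Hardin uses: polls are treated as independent simple random samples, so poll i has variance p_i(1 - p_i)/n_i, and weights default to sample sizes. The function name is hypothetical:

```python
import math


def weighted_mean_and_se(proportions, sample_sizes, weights=None):
    """Weighted mean of poll proportions and its standard error.

    Assumes independent polls, each a simple random sample, so poll i has
    variance p_i * (1 - p_i) / n_i. Weights default to the sample sizes.
    """
    if weights is None:
        weights = sample_sizes
    total = sum(weights)
    mean = sum(w * p for w, p in zip(weights, proportions)) / total
    variance = sum(
        (w / total) ** 2 * p * (1 - p) / n
        for w, p, n in zip(weights, proportions, sample_sizes)
    )
    return mean, math.sqrt(variance)
```

Passing a different `weights` list (for example, weights that decay with the age of the poll) changes both the mean and the standard error, which is where the choice of weight function starts to matter.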

Interesting contest but it is limited to high school and college students. Separate prizes, one for high school and one for college, $200.00 each. Oh, plus ASA memberships and a 2016 Election Prediction t-shirt.

For adults in the audience, strike up a prediction pool by state and/or for the nation.

The Ethics of Data Analytics

Sunday, August 21st, 2016

The Ethics of Data Analytics by Kaiser Fung.

Twenty-one slides on ethics by Kaiser Fung, author of: Junk Charts (data visualization blog), and Big Data, Plainly Spoken (comments on media use of statistics).

Fung challenges you to reach your own ethical decisions and acknowledges there are a number of guides to such decision making.

Unfortunately, Fung does not include professional responsibility requirements, such as the now out-dated Canon 7 of the ABA Model Code Of Professional Responsibility:

A Lawyer Should Represent a Client Zealously Within the Bounds of the Law

That canon has a much storied history, which is capably summarized in Whatever Happened To ‘Zealous Advocacy’? by Paul C. Sanders.

In what became known as Queen Caroline’s Case, the House of Lords sought to dissolve the marriage of King George IV to Queen Caroline on the grounds of her adultery, effectively removing her as queen of England.

Queen Caroline was represented by Lord Brougham, who had evidence of a secret prior marriage by King George IV to a Catholic (which was illegal), Mrs Fitzherbert.


Brougham’s speech is worth your reading in full but the portion most often cited for zealous defense reads as follows:

I once before took leave to remind your lordships — which was unnecessary, but there are many whom it may be needful to remind — that an advocate, by the sacred duty of his connection with his client, knows, in the discharge of that office, but one person in the world, that client and none other. To save that client by all expedient means — to protect that client at all hazards and costs to all others, and among others to himself — is the highest and most unquestioned of his duties; and he must not regard the alarm, the suffering, the torment, the destruction, which he may bring upon any other; nay, separating even the duties of a patriot from those of an advocate, he must go on reckless of the consequences, if his fate it should unhappily be, to involve his country in confusion for his client.

The name Mrs. Fitzherbert never slips Lord Brougham’s lips but the House of Lords has been warned that may not remain the case, should it choose to proceed. The House of Lords did grant the divorce but didn’t enforce it. Saving face, one supposes. Queen Caroline died less than a month after the coronation of George IV.

For data analysis, cybersecurity, or any of the other topics I touch on in this blog, I take the last line of Lord Brougham’s speech:

To save that client by all expedient means — to protect that client at all hazards and costs to all others, and among others to himself — is the highest and most unquestioned of his duties; and he must not regard the alarm, the suffering, the torment, the destruction, which he may bring upon any other; nay, separating even the duties of a patriot from those of an advocate, he must go on reckless of the consequences, if his fate it should unhappily be, to involve his country in confusion for his client.

as the height of professionalism.

Post-engagement of course.

If ethics are your concern, have that discussion with your prospective client before you are hired.

Otherwise, clients have goals and the task of a professional is how to achieve them. Nothing more.

Developing Expert p-Hacking Skills

Saturday, July 2nd, 2016

Introducing the p-hacker app: Train your expert p-hacking skills by Ned Bicare.

Ned’s p-hacker app will be welcomed by everyone who publishes where p-values are accepted.

Publishers should require authors and reviewers to submit six p-hacker app results along with any draft that contains, or is a review of, p-values.

The p-hacker app results won’t improve a draft and/or review, but when compared to the draft, will improve the publication in which it might have appeared.

From the post:

My dear fellow scientists!

“If you torture the data long enough, it will confess.”

This aphorism, attributed to Ronald Coase, sometimes has been used in a disrespective manner, as if it was wrong to do creative data analysis.

In fact, the art of creative data analysis has experienced despicable attacks over the last years. A small but annoyingly persistent group of second-stringers tries to denigrate our scientific achievements. They drag psychological science through the mire.

These people propagate stupid method repetitions; and what was once one of the supreme disciplines of scientific investigation – a creative data analysis of a data set – has been crippled to conducting an empty-headed step-by-step pre-registered analysis plan. (Come on: If I lay out the full analysis plan in a pre-registration, even an undergrad student can do the final analysis, right? Is that really the high-level scientific work we were trained for so hard?).

They broadcast in an annoying frequency that p-hacking leads to more significant results, and that researcher who use p-hacking have higher chances of getting things published.

What are the consequence of these findings? The answer is clear. Everybody should be equipped with these powerful tools of research enhancement!

The art of creative data analysis

Some researchers describe a performance-oriented data analysis as “data-dependent analysis”. We go one step further, and call this technique data-optimal analysis (DOA), as our goal is to produce the optimal, most significant outcome from a data set.

I developed an online app that allows to practice creative data analysis and how to polish your p-values. It’s primarily aimed at young researchers who do not have our level of expertise yet, but I guess even old hands might learn one or two new tricks! It’s called “The p-hacker” (please note that ‘hacker’ is meant in a very positive way here. You should think of the cool hackers who fight for world peace). You can use the app in teaching, or to practice p-hacking yourself.

Please test the app, and give me feedback! You can also send it to colleagues:
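The claim that p-hacking produces more significant results is easy to check by simulation. The sketch below is not the p-hacker app. It assumes one particular trick (measure several independent outcomes with no true effect, report only the best p-value) and a z-test with known standard deviation. All names and parameter values are hypothetical:

```python
import math
import random


def p_value(sample):
    """Two-sided z-test of mean 0, for data with known standard deviation 1."""
    z = sum(sample) / math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(outcomes_tried, n=30, studies=2000, alpha=0.05, seed=1):
    """Share of null studies declared 'significant' when the analyst
    measures `outcomes_tried` independent outcomes and reports the best one."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(studies):
        best = min(
            p_value([rng.gauss(0, 1) for _ in range(n)])
            for _ in range(outcomes_tried)
        )
        hits += best < alpha
    return hits / studies


honest = false_positive_rate(outcomes_tried=1)
hacked = false_positive_rate(outcomes_tried=5)
print(f"one outcome: {honest:.3f}, best of five: {hacked:.3f}")
```

With a single outcome the rate should sit near the nominal 5%. With the best of five independent outcomes, theory gives 1 - 0.95^5, roughly 23%, even though there is no effect to find.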


Integrated R labs for high school students

Tuesday, June 28th, 2016

Integrated R labs for high school students by Amelia McNamara.

From the webpage:

Amelia McNamara, James Molyneux, Terri Johnson

This looks like a very promising approach for capturing the interests of high school students in statistics and R.

From the larger project, Mobilize, curriculum page:

Mobilize centers its curricula around participatory sensing campaigns in which students use their mobile devices to collect and share data about their communities and their lives, and to analyze these data to gain a greater understanding about their world. Mobilize breaks barriers by teaching students to apply concepts and practices from computer science and statistics in order to learn science and mathematics. Mobilize is dynamic: each class collects its own data, and each class has the opportunity to make unique discoveries. We use mobile devices not as gimmicks to capture students’ attention, but as legitimate tools that bring scientific enquiry into our everyday lives.

Mobilize comprises four key curricula: Introduction to Data Science (IDS), Algebra I, Biology, and Mobilize Prime, all focused on preparing students to live in a data-driven world. The Mobilize curricula are a unique blend of computational and statistical thinking subject matter content that teaches students to think critically about and with data. The Mobilize curricula utilize innovative mobile technology to enhance math and science classroom learning. Mobilize brings “Big Data” into the classroom in the form of participatory sensing, a hands-on method in which students use mobile devices to collect data about their lives and community, then use Mobilize Visualization tools to analyze and interpret the data.

I like the approach of having students collect and process their own data. If they learn to question their own data and processes, hopefully they will ask questions about data processing results presented as “facts.” (Since 2016 is a presidential election year in the United States, questioning claimed data results is especially important.)


Ten Simple Rules for Effective Statistical Practice

Sunday, June 12th, 2016

Ten Simple Rules for Effective Statistical Practice by Robert E. Kass, Brian S. Caffo, Marie Davidian, Xiao-Li Meng, Bin Yu, Nancy Reid (Citation: Kass RE, Caffo BS, Davidian M, Meng X-L, Yu B, Reid N (2016) Ten Simple Rules for Effective Statistical Practice. PLoS Comput Biol 12(6): e1004961. doi:10.1371/journal.pcbi.1004961)

From the post:

Several months ago, Phil Bourne, the initiator and frequent author of the wildly successful and incredibly useful “Ten Simple Rules” series, suggested that some statisticians put together a Ten Simple Rules article related to statistics. (One of the rules for writing a PLOS Ten Simple Rules article is to be Phil Bourne [1]. In lieu of that, we hope effusive praise for Phil will suffice.)

I started to copy out the “ten simple rules,” sans the commentary but that would be a disservice to my readers.

Nodding past a ten bullet point listing isn’t going to make your statistics more effective.

Re-write the commentary on all ten rules to apply them to every project. The focusing of the rules on your work will result in specific advice and examples for your field.

Who knows? Perhaps you will be writing a Ten Simple Rules article in your specific field, sans Phil Bourne as a co-author. (Do be sure and cite Phil.)

PS: For the curious: Ten Simple Rules for Writing a PLOS Ten Simple Rules Article by Harriet Dashnow, Andrew Lonsdale, Philip E. Bourne.

Statistical Learning with Sparsity: The Lasso and Generalizations (Free Book!)

Wednesday, January 6th, 2016

Statistical Learning with Sparsity: The Lasso and Generalizations by Trevor Hastie, Robert Tibshirani, and Martin Wainwright.

From the introduction:

I never keep a scorecard or the batting averages. I hate statistics. What I got to know, I keep in my head.

This is a quote from baseball pitcher Dizzy Dean, who played in the major leagues from 1930 to 1947.

How the world has changed in the 75 or so years since that time! Now large quantities of data are collected and mined in nearly every area of science, entertainment, business, and industry. Medical scientists study the genomes of patients to choose the best treatments, to learn the underlying causes of their disease. Online movie and book stores study customer ratings to recommend or sell them new movies or books. Social networks mine information about members and their friends to try to enhance their online experience. And yes, most major league baseball teams have statisticians who collect and analyze detailed information on batters and pitchers to help team managers and players make better decisions.

Thus the world is awash with data. But as Rutherford D. Roger (and others) has said:

We are drowning in information and starving for knowledge.

There is a crucial need to sort through this mass of information, and pare it down to its bare essentials. For this process to be successful, we need to hope that the world is not as complex as it might be. For example, we hope that not all of the 30,000 or so genes in the human body are directly involved in the process that leads to the development of cancer. Or that the ratings by a customer on perhaps 50 or 100 different movies are enough to give us a good idea of their tastes. Or that the success of a left-handed pitcher against left-handed batters will be fairly consistent for different batters. This points to an underlying assumption of simplicity. One form of simplicity is sparsity, the central theme of this book. Loosely speaking, a sparse statistical model is one in which only a relatively small number of parameters (or predictors) play an important role. In this book we study methods that exploit sparsity to help recover the underlying signal in a set of data.
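A small sketch of how the lasso produces that kind of sparsity. It assumes the simplest textbook setting, orthonormal predictors, where the lasso solution is the least-squares coefficient pushed toward zero by the penalty and set to exactly zero if it is small enough (soft-thresholding). The coefficients below are made up:

```python
def soft_threshold(value, penalty):
    """Shrink `value` toward zero by `penalty`; return exactly zero
    when its magnitude does not exceed the penalty."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


# Hypothetical least-squares coefficients, for illustration only.
least_squares = [3.0, -0.4, 0.1, -2.5, 0.3]
sparse = [soft_threshold(b, penalty=0.5) for b in least_squares]
print(sparse)  # [2.5, 0.0, 0.0, -2.0, 0.0]
```

Three of the five coefficients drop out entirely, leaving a model in which only a small number of predictors play a role. General (non-orthonormal) problems need an iterative solver, which this sketch does not attempt.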

The delightful style of the authors had me going until they said:

…we need to hope that the world is not as complex as it might be.

What? “…not as complex as it might be”?

Law school and academia both train you to look for complexity so “…not as complex as it might be” is as close to apostasy as any statement I can imagine. 😉 (At least I can say I am honest about my prejudices. Some of them at any rate.)

Not for the mathematically faint of heart but it may certainly be a counter to the intelligence communities’ mania about collecting every scrap of data.

Finding a needle in a smaller haystack could be less costly and more effective. Both of those principles run counter to well established government customs but there are those in government who wish to be effective. (Article of faith on my part.)

I first saw this in a tweet by Chris Diehl.

Playboy Exposed [Complete Archive]

Wednesday, December 30th, 2015

Playboy Exposed by Univision’s Data Visualization Unit.

From the post:

The first time Pamela Anderson got naked for a Playboy cover, with a straw hat covering her inner thighs, she was barely 22 years old. It was 1989 and the magazine was starting to favor displaying young blondes on its covers.

On Friday, December 11, 2015, a quarter century later, the popular American model, now 48, graced the historical last nude edition of the magazine, which lost the battle for undress and decided to cover up its women in order to survive.

Univision Noticias analyzed all the covers published in the US, starting with Playboy’s first issue in December 1953, to study the cover models’ physical attributes: hair and skin color, height, age and body measurements. With these statistics, a model of the prototype woman for each decade emerged. It can be viewed in this interactive special.

I’ve heard people say they bought Playboy magazine for the short stories but this is my first time to hear of someone just looking at the covers. 😉

The possibilities for analysis of Playboy and its contents are nearly endless.

Consider the history of “party jokes” or “Playboy Advisor,” not to mention the cartoons in every issue.

I did check the Playboy Store but wasn’t able to find a DVD set with all the issues.

You can subscribe to Playboy Archive for $8.00 a month and access every issue from the first issue to the current one.

I don’t have a subscription so I am not sure how you would do the OCR to capture the jokes.

Everything You Know About Latency Is Wrong

Thursday, December 24th, 2015

Everything You Know About Latency Is Wrong by Tyler Treat.

From the post:

Okay, maybe not everything you know about latency is wrong. But now that I have your attention, we can talk about why the tools and methodologies you use to measure and reason about latency are likely horribly flawed. In fact, they’re not just flawed, they’re probably lying to your face.

When I went to Strange Loop in September, I attended a workshop called “Understanding Latency and Application Responsiveness” by Gil Tene. Gil is the CTO of Azul Systems, which is most renowned for its C4 pauseless garbage collector and associated Zing Java runtime. While the workshop was four and a half hours long, Gil also gave a 40-minute talk called “How NOT to Measure Latency” which was basically an abbreviated, less interactive version of the workshop. If you ever get the opportunity to see Gil speak or attend his workshop, I recommend you do. At the very least, do yourself a favor and watch one of his recorded talks or find his slide decks online.

The remainder of this post is primarily a summarization of that talk. You may not get anything out of it that you wouldn’t get out of the talk, but I think it can be helpful to absorb some of these ideas in written form. Plus, for my own benefit, writing about them helps solidify it in my head.

Great post, not only for the discussion of latency but for two extensions to the admonition (Moon is a Harsh Mistress) “Always cut cards:”

  • Always understand the nature of your data.
  • Always understand the nature of your methodology.

If you fail at either of those, the results presented to you or that you present to others may or may not be true, false or irrelevant.

Treat’s post is just one example in a vast sea of data and methodologies which are just as misleading if not more so.

If you need motivation to put in the work, how’s your comfort level with being embarrassed in public? Like someone demonstrating your numbers are BS.

Estimating “known unknowns”

Saturday, December 12th, 2015

Estimating “known unknowns” by Nick Berry.

From the post:

There’s a famous quote from former Secretary of Defense Donald Rumsfeld:

“ … there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don’t know we don’t know.”

I write this blog. I’m an engineer. Whilst I do my best and try to proof read, often mistakes creep in. I know there are probably mistakes in just about everything I write! How would I go about estimating the number of errors?

The idea for this article came from a book I recently read by Paul J. Nahin, entitled Duelling Idiots and Other Probability Puzzlers (In turn, referencing earlier work by the eminent mathematician George Pólya).

Proof Reading

Imagine I write a (non-trivially short) document and give it to two proof readers to check. These two readers (independently) proof read the manuscript looking for errors, highlighting each one they find.

Just like me, these proof readers are not perfect. They, also, are not going to find all the errors in the document.

Because they work independently, there is a chance that reader #1 will find some errors that reader #2 does not (and vice versa), and there could be errors that are found by both readers. What we are trying to do is get an estimate for the number of unseen errors (errors detected by neither of the proof readers).*

*An alternate way of thinking of this is to get an estimate for the total number of errors in the document (from which we can subtract the distinct number of errors found to give an estimate to the number of unseen errors).

A highly entertaining post on estimating “known unknowns,” such as the number of errors in a paper that has been proofed by two independent proof readers.
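The two-reader setup can be sketched with the standard capture-recapture (Lincoln-Petersen) estimate: if reader #1 finds A errors, reader #2 finds B, and C are found by both, the total is roughly A × B / C. I am assuming this is the estimate the post arrives at. The function name and the counts below are hypothetical.

```python
def estimate_errors(found_by_first, found_by_second, found_by_both):
    """Estimate total and unseen errors from two independent proof readers.

    Uses the capture-recapture estimate: total ~ A * B / C.
    """
    if found_by_both == 0:
        raise ValueError("no overlap between readers; the estimate is undefined")
    total = found_by_first * found_by_second / found_by_both
    # Errors seen by at least one reader (avoid double-counting the overlap).
    distinct_found = found_by_first + found_by_second - found_by_both
    unseen = total - distinct_found
    return total, unseen


# Hypothetical counts: reader #1 finds 20, reader #2 finds 15, 10 are common.
total, unseen = estimate_errors(20, 15, 10)
print(total)   # estimated total errors in the document
print(unseen)  # estimated errors that neither reader caught
```

With these made-up counts the readers saw 25 distinct errors out of an estimated 30, leaving about 5 unseen. The estimate leans entirely on the readers being independent, which is worth keeping in mind before quoting it.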

Of more than passing interest to me because I am involved in a New Testament Greek Lexicon project that is an XML encoding of a 500+ page Greek lexicon.

The working text is in XML, but not every feature of the original lexicon was captured in markup and even if that were true, we would still want to improve upon features offered by the lexicon. All of which depend upon the correctness of the original markup.

You will find Nick’s analysis interesting and more than that, memorable. Just in case you are asked about “estimating ‘known unknowns’” in a data science interview.

Only Rumsfeld could tell you how to estimate “unknown unknowns.” I think it goes: “Watch me pull a number out of my ….”


I found this post by following another post at this site, which was cited by Data Science Renee.

What’s the significance of 0.05 significance?

Tuesday, November 24th, 2015

What’s the significance of 0.05 significance? by Carl Anderson.

From the post:

Why do we tend to use a statistical significance level of 0.05? When I teach statistics or mentor colleagues brushing up, I often get the sense that a statistical significance level of α = 0.05 is viewed as some hard and fast threshold, a publishable / not publishable step function. I’ve seen grad students finish up an empirical experiment and groan to find that p = 0.052. Depressed, they head for the pub. I’ve seen the same grad students extend their experiment just long enough for statistical variation to swing in their favor to obtain p = 0.049. Happy, they head for the pub.

Clearly, 0.05 is not the only significance level used. 0.1, 0.01 and some smaller values are common too. This is partly related to field. In my experience, the ecological literature and other fields that are often plagued by small sample sizes are more likely to use 0.1. Engineering and manufacturing where larger samples are easier to obtain tend to use 0.01. Most people in most fields, however, use 0.05. It is indeed the default value in most statistical software applications.

This “standard” 0.05 level is typically associated with Sir R. A. Fisher, a brilliant biologist and statistician that pioneered many areas of statistics, including ANOVA and experimental design. However, the true origins make for a much richer story.

One of the best history/explanations of 0.05 significance I have ever read. Highly recommended!

In part because in the retelling of this story Carl includes references that will allow you to trace the story in even greater detail.

What is dogma today, 0.05 significance, started as a convention among scientists, without theory, without empirical proof, without any of the gatekeepers associated with scientific publishing today.

Over time 0.05 significance has proved its utility. The question for you is what other dogmas of today rely on the chance practices of yesteryear?

I first saw this in a tweet by Kirk Borne.

…Whether My Wife is Pregnant or Not

Friday, November 6th, 2015

A Bayesian Model to Calculate Whether My Wife is Pregnant or Not by Rasmus Bååth.

From the post:

On the 21st of February, 2015, my wife had not had her period for 33 days, and as we were trying to conceive, this was good news! An average period is around a month, and if you are a couple trying to go triple, then a missing period is a good sign something is going on. But at 33 days, this was not yet a missing period, just a late one, so how good news was it? Pretty good, really good, or just meh?

To get at this I developed a simple Bayesian model that, given the number of days since your last period and your history of period onsets, calculates the probability that you are going to be pregnant this period cycle. In this post I will describe what data I used, the priors I used, the model assumptions, and how to fit it in R using importance sampling. And finally I show you why the result of the model really didn’t matter in the end. Also I’ll give you a handy script if you want to calculate this for yourself. 🙂

I first saw this post in a tweet by Neil Saunders who commented:

One of the clearest Bayesian methods articles I’ve read (I struggle with most of them)

I agree with Neil’s assessment and suspect you will as well.

Unlike most Bayesian methods articles, this one also has a happy ending.

You should start following Rasmus Bååth on Twitter.

Statistical Reporting Errors in Psychology (1985–2013) [1 in 8]

Tuesday, October 27th, 2015

Do you remember your parents complaining about how far the latest psychology report departed from their reality?

Turns out there may be a scientific reason why those reports were as far off as your parents thought (or not).

The prevalence of statistical reporting errors in psychology (1985–2013) by Michèle B. Nuijten, Chris H. J. Hartgerink, Marcel A. L. M. van Assen, Sacha Epskamp, Jelte M. Wicherts, reports:

This study documents reporting errors in a sample of over 250,000 p-values reported in eight major psychology journals from 1985 until 2013, using the new R package “statcheck.” statcheck retrieved null-hypothesis significance testing (NHST) results from over half of the articles from this period. In line with earlier research, we found that half of all published psychology papers that use NHST contained at least one p-value that was inconsistent with its test statistic and degrees of freedom. One in eight papers contained a grossly inconsistent p-value that may have affected the statistical conclusion. In contrast to earlier findings, we found that the average prevalence of inconsistent p-values has been stable over the years or has declined. The prevalence of gross inconsistencies was higher in p-values reported as significant than in p-values reported as nonsignificant. This could indicate a systematic bias in favor of significant results. Possible solutions for the high prevalence of reporting inconsistencies could be to encourage sharing data, to let co-authors check results in a so-called “co-pilot model,” and to use statcheck to flag possible inconsistencies in one’s own manuscript or during the review process.

This is an open access article so dig in for all the details discovered by the authors.

The R package statcheck: Extract Statistics from Articles and Recompute P Values is quite amazing. The manual for statcheck should have you up and running in short order.
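The core idea, recomputing a p-value from the reported test statistic and comparing it to the reported p-value, can be sketched in a few lines. This is not statcheck’s code or API. It is a simplified Python illustration that handles only a two-sided z-test, and the function names, tolerance rule and example values are my assumptions.

```python
import math


def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def check_reported_p(z, reported_p, alpha=0.05, decimals=3):
    """Compare a reported p-value with the one recomputed from z.

    Returns (recomputed, consistent, gross). "gross" means the mismatch
    also flips the significance decision at the given alpha.
    """
    recomputed = two_sided_p_from_z(z)
    # Allow for the reported value having been rounded to `decimals` places.
    consistent = abs(recomputed - reported_p) <= 0.5 * 10 ** (-decimals)
    gross = (not consistent) and ((recomputed < alpha) != (reported_p < alpha))
    return recomputed, consistent, gross


# Hypothetical reported results, not taken from the paper.
print(check_reported_p(2.20, 0.028))  # matches after rounding
print(check_reported_p(1.70, 0.04))   # reported as significant, recomputed is not
```

The real package works across several test types (the abstract mentions test statistics and degrees of freedom), so treat this as the shape of the check rather than a substitute for it.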

I did puzzle over the proposed solutions:

Possible solutions for the high prevalence of reporting inconsistencies could be to encourage sharing data, to let co-authors check results in a so-called “co-pilot model,” and to use statcheck to flag possible inconsistencies in one’s own manuscript or during the review process.

All of those are good suggestions but we already have the much valued process of “peer review” and the value-add of both non-profit and commercial publishers. Surely those weighty contributions to the process of review and publication should be enough to quell this “…systematic bias in favor of significant results.”

Unless, of course, dependence on “peer review” and the value-add of publishers for article quality is entirely misplaced. Yes?

What area with “p-values reported as significant” will fall to statcheck next?

Some key Win-Vector serial data science articles

Wednesday, October 7th, 2015

Some key Win-Vector serial data science articles by John Mount.

From the post:

As readers have surely noticed the Win-Vector LLC blog isn’t a stream of short notes, but instead a collection of long technical articles. It is the only way we can properly treat topics of consequence.

  • Statistics to English translation.

    This series tries to find vibrant applications and explanations of standard good statistical practices, to make them more approachable to the non statistician.

  • Statistics as it should be.

    This series tries to cover cutting edge machine learning techniques, and then adapt and explain them in traditional statistical terms.

  • R as it is.

    This series tries to teach the statistical programming language R “warts and all” so we can see it as the versatile and powerful data science tool that it is.

More than enough reasons to start haunting the Win-Vector LLC blog on a regular basis.

Perhaps an inspiration to do more long-form posts as well.

Math for Journalists Made Easy:…

Wednesday, May 20th, 2015

Math for Journalists Made Easy: Understanding and Using Numbers and Statistics – Sign up now for new MOOC

From the post:

Journalists who squirm at the thought of data calculation, analysis and statistics can arm themselves with new reporting tools during the new Massive Open Online Course (MOOC) from the Knight Center for Journalism in the Americas: “Math for Journalists Made Easy: Understanding and Using Numbers and Statistics” will be taught from June 1 to 28, 2015.

Click here to sign up and to learn more about this free online course.

“Math is crucial to things we do every day. From covering budgets to covering crime, we need to understand numbers and statistics,” said course instructor Jennifer LaFleur, senior editor for data journalism for the Center for Investigative Reporting, one of the instructors of the MOOC.

Two other instructors will be teaching this MOOC: Brant Houston, a veteran investigative journalist who is a professor and the Knight Chair in Journalism at the University of Illinois; and freelance journalist Greg Ferenstein, who specializes in the use of numbers and statistics in news stories.

The three instructors will teach journalists “how to be critical about numbers, statistics and research and to avoid being improperly swayed by biased researchers.” The course will also prepare journalists to relay numbers and statistics in ways that are easy for the average reader to understand.

“It is true that many of us became journalists because sometime in our lives we wanted to escape from mathematics, but it is also true that it has never been so important for journalists to overcome any fear or intimidation to learn about numbers and statistics,” said professor Rosental Alves, founder and director of the Knight Center. “There is no way to escape from math anymore, as we are nowadays surrounded by data and we need at least some basic knowledge and tools to understand the numbers.”

The MOOC will be taught over a period of four weeks, from June 1 to 28. Each week focuses on a particular topic taught by a different instructor. The lessons feature video lectures and are accompanied by readings, quizzes and discussion forums.

This looks excellent.

I will be looking forward to very tough questions of government and corporate statistical reports from anyone who takes this course.

“The ultimate goal is evidence-based data analysis”

Monday, May 4th, 2015

Statistics: P values are just the tip of the iceberg by Jeffrey T. Leek & Roger D. Peng.

From the summary:

Ridding science of shoddy statistics will require scrutiny of every step, not merely the last one, say Jeffrey T. Leek and Roger D. Peng.

From the post:


Leek and Peng are right but I would shy away from ever claiming “…evidence-based data analysis.”

You can disclose the choices you make at every stage of the data pipeline but the result isn’t “…evidence-based data analysis.”

I say that because “…evidence-based data analysis” implies that whatever the result, human agency wasn’t a factor in it. On the contrary, an ineffable part of human judgement is a part of every data analysis.

The purpose of documenting the details of each step is to enable discussion and debate about the choices made in the process.

Just as I object to politicians wrapping themselves in national flags, I equally object to anyone wrapping themselves in “evidence/facts” as though they and only they possess them.

Selection bias and bombers

Monday, April 13th, 2015

Selection bias and bombers

John D. Cook didn’t just recently start having interesting opinions! This is a post from 2008 that starts:

During WWII, statistician Abraham Wald was asked to help the British decide where to add armor to their bombers. After analyzing the records, he recommended adding more armor to the places where there was no damage!

A great story of how the best evidence may not be right in front of us.


Teaching and Learning Data Visualization: Ideas and Assignments

Sunday, March 15th, 2015

Teaching and Learning Data Visualization: Ideas and Assignments by Deborah Nolan, Jamis Perrett.


This article discusses how to make statistical graphics a more prominent element of the undergraduate statistics curricula. The focus is on several different types of assignments that exemplify how to incorporate graphics into a course in a pedagogically meaningful way. These assignments include having students deconstruct and reconstruct plots, copy masterful graphs, create one-minute visual revelations, convert tables into `pictures’, and develop interactive visualizations with, e.g., the virtual earth as a plotting canvas. In addition to describing the goals and details of each assignment, we also discuss the broader topic of graphics and key concepts that we think warrant inclusion in the statistics curricula. We advocate that more attention needs to be paid to this fundamental field of statistics at all levels, from introductory undergraduate through graduate level courses. With the rapid rise of tools to visualize data, e.g., Google trends, GapMinder, ManyEyes, and Tableau, and the increased use of graphics in the media, understanding the principles of good statistical graphics, and having the ability to create informative visualizations is an ever more important aspect of statistics education.

You will find a number of ideas in this paper to use in teaching and learning visualization.

I understand that visualizing a table can, with the proper techniques, display relationships that are otherwise difficult to notice.

On the other hand, due to our limited abilities to distinguish colors, graphs can conceal information that would otherwise be apparent from a table.

Not an objection to visualizing tables but a caution that details can get lost in visualization as well as being highlighted for the viewer.

Speaking of Numbers and Big Data Disruption

Thursday, March 12th, 2015

Survey: Big Data is Disrupting Business as Usual by George Leopold.

From the post:

Sixty-four percent of the enterprises surveyed said big data is beginning to change the traditional boundaries of their businesses, allowing more agile providers to grab market share. More than half of those surveyed said they are facing greater competition from “data-enabled startups” while 27 percent reported competition from new players from other industries.

Hence, enterprises slow to embrace data analytics are now fretting over their very survival, EMC and the consulting firm argued.

Those fears are expected to drive investment in big data over the next three years, with 54 percent of respondents saying they plan to increase investment in big data tools. Among those who have already made big data investments, 61 percent said data analytics are already driving company revenues. The fruits of these big data efforts are proving as valuable as existing products and services, the survey found.

That sounds important, except they never say how business is being disrupted? Seems like that would be an important point to make. Yes?

And note the 61% who “…said data analytics are already driving company revenues…” are “…among those who have already made big data investments….” Was that ten people? Twenty? And who after making a major investment is going to say that it sucks?

The survey itself sounds suspect if you read the end of the post:

Capgemini said its big data report is based on an online survey conducted in August 2014 of more than 1,000 senior executives across nine industries in ten global markets. Survey author FreshMinds also conducted follow-up interviews with some respondents.

I think there is a reason that Gallup and those sort of folks don’t do online surveys. It has something to do with accuracy if I recall correctly. 😉

Making Statistics Files Accessible

Sunday, March 8th, 2015

Making Statistics Files Accessible by Evan Miller.

From the post:

There’s little in modern society more frustrating than receiving a file from someone and realizing you’ll need to buy a jillion-dollar piece of software in order to open it. It’s like, someone just gave you a piece of birthday cake, but you’re only allowed to eat that cake with a platinum fork encrusted with diamonds, and also the fork requires you to enter a serial number before you can use it.

Wizard often receives praise for its clean statistics interface and beautiful design, but I’m just as proud of another part of the software that doesn’t receive much attention, ironically for the very reason that it works so smoothly: the data importers. Over the last couple of years I’ve put a lot of effort into understanding and picking apart various popular file formats; and as a result, Wizard can slurp down Excel, Numbers, SPSS, Stata, and SAS files like it was a bowl of spaghetti at a Shoney’s restaurant.

Of course, there are a lot of edge cases and idiosyncrasies in binary files, and it takes a lot of mental effort to keep track of all the peculiarities; and to be honest I’d rather spend that effort making a better interface instead of bashing my head against a wall over some binary flag field that I really, honestly have no interest in learning more about. So today I’m happy to announce that the file importers are about to get even smoother, and at the same time, I’ll be able to put more of my attention on the core product rather than worrying about file format issues.

The astute reader will ask: how will a feature that starts receiving less attention from me get better? It’s simple: I’ve open-sourced Wizard’s core routines for reading SAS, Stata, and SPSS files, and as of today, these routines are available to anyone who uses R — quite a big audience, which means that many more people will be available to help me diagnose and fix issues with the file importers.

In case you don’t recognize the Wizard software, there’s a reason the site has “mac” in its name: 😉

7 Traps to Avoid Being Fooled by Statistical Randomness

Monday, February 16th, 2015

7 Traps to Avoid Being Fooled by Statistical Randomness by Kirk Borne.

From the post:

Randomness is all around us. Its existence sends fear into the hearts of predictive analytics specialists everywhere — if a process is truly random, then it is not predictable, in the analytic sense of that term. Randomness refers to the absence of patterns, order, coherence, and predictability in a system.

Unfortunately, we are often fooled by random events whenever apparent order emerges in the system. In moments of statistical weakness, some folks even develop theories to explain such “ordered” patterns. However, if the events are truly random, then any correlation is purely coincidental and not causal. I remember learning in graduate school a simple joke about erroneous scientific data analysis related to this concept: “Two points in a monotonic sequence display a tendency. Three points in a monotonic sequence display a trend. Four points in a monotonic sequence define a theory.” The message was clear — beware of apparent order in a random process, and don’t be tricked into developing a theory to explain random data.

Suppose I have a fair coin (with a head or a tail being equally likely to appear when I toss the coin). Of the following 3 sequences (each representing 12 sequential tosses of the fair coin), which sequence corresponds to a bogus sequence (i.e., a sequence that I manually typed on the computer)?




(d) None of the above.

In each case, a coin toss of head is listed as “H”, and a coin toss of tail is listed as “T”.

The answer is “(d) None of the Above.”

None of the above sequences was generated manually. They were all actual subsequences extracted from a larger sequence of random coin tosses. I admit that I selected these 3 subsequences non-randomly (which induces a statistical bias known as a selection effect) in order to try to fool you. The small-numbers phenomenon is evident here — it corresponds to the fact that when only 12 coin tosses are considered, the occurrence of any “improbable result” may lead us (incorrectly) to believe that it is statistically significant. Conversely, if we saw answer (b) continuing for dozens of more coin tosses (nothing but Tails, all the way down), then that would be truly significant.

Great post on randomness where Kirk references a fun example using Nobel Prize winners with various statistical “facts” for your amusement.
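Kirk’s point about subsequences pulled from a larger sequence can be checked with a little arithmetic. The sketch below computes, exactly, the probability that a run of tails of a given length shows up somewhere in n fair tosses. The function name and the choice of 1,000 tosses are mine, not Kirk’s.

```python
from fractions import Fraction


def prob_run_of_tails(n_tosses, run_length):
    """Probability that n fair tosses contain at least one run of run_length tails."""
    half = Fraction(1, 2)
    # state[j] = probability that no qualifying run has occurred yet
    # and the sequence currently ends in exactly j tails.
    state = [Fraction(0)] * run_length
    state[0] = Fraction(1)
    for _ in range(n_tosses):
        new_state = [Fraction(0)] * run_length
        for trailing, p in enumerate(state):
            new_state[0] += p * half  # heads resets the run
            if trailing + 1 < run_length:
                new_state[trailing + 1] += p * half  # tails extends the run
            # Otherwise the run is complete and this mass leaves the "no run" states.
        state = new_state
    return 1 - sum(state)


print(prob_run_of_tails(12, 12))           # exactly 12 tosses, all tails
print(float(prob_run_of_tails(1000, 12)))  # the same run somewhere in 1,000 tosses
```

Twelve tails in exactly twelve tosses is a 1 in 4,096 event, but the chance of finding such a run somewhere grows quickly with the length of the sequence you get to pick from. That is the selection effect Kirk admits to.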

Kirk suggests a reading pack for partial avoidance of this issue in your work:

  1. Fooled By Randomness, by Nassim Nicholas Taleb.
  2. The Flaw of Averages, by Sam L. Savage.
  3. The Drunkard’s Walk – How Randomness Rules Our Lives, by Leonard Mlodinow.

I wonder if you could get Amazon to create an also-bought-with package of those three books? Something you could buy for your friends in big data and intelligence work. 😉

Interesting that I saw this just after posting Structuredness coefficient to find patterns and associations. The call on “likely” or “unlikely” comes down to human agency. Yes?