Estimating “known unknowns” by Nick Berry.
From the post:
There’s a famous quote from former Secretary of Defense Donald Rumsfeld:
“ … there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don’t know we don’t know.”
I write this blog. I’m an engineer. Whilst I do my best and try to proof read, often mistakes creep in. I know there are probably mistakes in just about everything I write! How would I go about estimating the number of errors?
The idea for this article came from a book I recently read by Paul J. Nahin, entitled Duelling Idiots and Other Probability Puzzlers (In turn, referencing earlier work by the eminent mathematician George Pólya).
Proof Reading
Imagine I write a (non-trivially short) document and give it to two proof readers to check. These two readers (independently) proof read the manuscript looking for errors, highlighting each one they find.
Just like me, these proof readers are not perfect. They, also, are not going to find all the errors in the document.
Because they work independently, there is a chance that reader #1 will find some errors that reader #2 does not (and vice versa), and there could be errors that are found by both readers. What we are trying to do is get an estimate for the number of unseen errors (errors detected by neither of the proof readers).*
…
*An alternate way of thinking of this is to get an estimate for the total number of errors in the document (from which we can subtract the distinct number of errors found to give an estimate of the number of unseen errors).
…
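The excerpt stops before the estimate itself. As a minimal sketch, assuming the standard two-reader argument (the capture-recapture, or Lincoln-Petersen, estimate, which may differ in detail from Nick's derivation): if reader #1 finds A errors, reader #2 finds B, and C are found by both, then independence suggests the total is roughly A × B / C, because the share of reader #1's errors that reader #2 also caught (C / A) approximates reader #2's overall hit rate (B / total). The function name and the use of sets of error identifiers below are my own illustration, not anything from the post.

```python
def estimate_errors(found_by_1, found_by_2):
    """Estimate total and unseen errors from two independent proof readers.

    found_by_1, found_by_2: sets of error identifiers (hypothetical; any
    hashable label that lets us tell when both readers flagged the same error).
    Returns (estimated_total, estimated_unseen).
    """
    a = len(found_by_1)               # errors found by reader #1
    b = len(found_by_2)               # errors found by reader #2
    c = len(found_by_1 & found_by_2)  # errors found by both
    if c == 0:
        # With no overlap the estimate A * B / C is undefined.
        raise ValueError("no errors in common; cannot estimate the total")
    estimated_total = a * b / c
    distinct_found = a + b - c        # errors found by at least one reader
    return estimated_total, estimated_total - distinct_found


# Reader #1 flags errors 0..19, reader #2 flags errors 10..24 (10 in common).
total, unseen = estimate_errors(set(range(20)), set(range(10, 25)))
print(total, unseen)
```

The estimate gets shaky when the overlap is small, and it assumes every error is equally easy to spot, which is rarely true of real proof reading.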
A highly entertaining post on estimating “known unknowns,” such as the number of errors in a paper that has been proofed by two independent proof readers.
Of more than passing interest to me because I am involved in a New Testament Greek Lexicon project that is an XML encoding of a 500+ page Greek lexicon.
The working text is in XML, but not every feature of the original lexicon was captured in markup, and even if every feature had been, we would still want to improve upon the features the lexicon offers. All of that depends upon the correctness of the original markup.
You will find Nick’s analysis interesting and more than that, memorable. Just in case you are asked about “estimating ‘known unknowns’” in a data science interview.
Only Rumsfeld could tell you how to estimate “unknown unknowns.” I think it goes: “Watch me pull a number out of my ….”
😉
I found this post by following another post at this site, which was cited by Data Science Renee.