Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane by Ralph Losey.
From the post:
Day One of the search project ended when I completed review of the initial 1,507 machine-selected documents and initiated the machine learning. I mentioned in the Day One narrative that I would explain why the sample size was that high. I will begin with that explanation and then, with the help of William Webber, go deeper into math and statistical sampling than ever before. I will also give you the big picture of my review plan and search philosophy: its hybrid and multimodal. Some search experts disagree with my philosophy. They think I do not go far enough to fully embrace machine coding. They are wrong. I will explain why and rant on in defense of humanity. Only then will I conclude with the Day Two narrative.
More than you are probably going to want to know about sample sizes and their calculation but persevere until you get to the defense of humanity stuff. It is all quite good.
If I had to add a comment on the defense of humanity rant, it would be that machines have a flat view of documents and not the richly textured one of a human reader. While true that machines can rapidly compare document without tiring, they will miss an executive referring to a secretary as his “cupcake.” A reference that would jump out at a human reader. Same text, different result.
Perhaps because in one case the text is being scanned for tokens and in the other case it is being read.