Day One of a Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron by Ralph Losey.
The start of a series of posts by a lawyer on predictive coding and searching the Enron emails. A legal perspective is important enough that I will post a note about each post in this series as it appears.
A couple of preliminary notes:
I am sure this is the first time that Ralph has used predictive coding with the Enron emails. On the other hand, I would not accept “…this is the first time for X…” claims from any vendor or service organization. 😉
You can see other examples of processing the Enron emails at:
- Getting Started on Hadoop Presentation, Hadoop, Python.
- The Data Lifecycle, Part One: Avroizing the Enron Emails, from Russell Jurney’s series on analyzing the Enron emails.
- Using MapReduce to process the Enron email dataset by Kevin Brownell and Jud Porter.
- Parsing the Enron email dataset using Tika and Hadoop
And that is just a “lite” scan. There are numerous other projects that use the Enron email collection.
I wonder if that is because we are naturally nosey?
From the post:
This is the first in a series of narrative descriptions of a legal search project using predictive coding. Follow along while I search for evidence of involuntary employee terminations in a haystack of 699,082 Enron emails and attachments.
Joys and Risks of Being First
To the best of my knowledge, this writing project is another first. I do not think anyone has ever previously written a blow-by-blow, detailed description of a large legal search and review project of any kind, much less a predictive coding project. Experts on predictive coding speak only from a mile high perspective; never from the trenches (you can speculate why). That has been my practice here, until now, and also my practice when speaking about predictive coding on panels or in various types of conferences, workshops, and classes.
There are many good reasons for this, including the main one that lawyers cannot talk about their client’s business or information. That is why in order to do this I had to run an academic project and search and review the Enron data. Many people could do the same. In fact, each year the TREC Legal Track participants do similar search projects of Enron data. But still, no one has taken the time to describe the details of their search, not even the spacey TRECkies (sorry Jason).
A search project like this takes an enormous amount of time. In fact, to my knowledge (Maura, please correct me if I’m wrong), no Legal Track TRECkies have ever recorded and reported the time that they put into the project, although there are rumors. In my narrative I will report the amount of time that I put into the project on a day-by-day basis, and also, sometimes, on a per task basis. I am a lawyer. I live by the clock and have done so for thirty-two years. Time is important to me, even non-money time like this. There is also a not-insignificant amount of time it takes to write up a narrative like this. I did not attempt to record that.
There is one final reason this has never been attempted before, and it is not trivial: the risks involved. Any narrator who publicly describes their search efforts assumes the risk of criticism from Monday morning quarterbacks about how the sausage was made. I get that. I think I can handle the inevitable criticism. A quote that Jason R. Baron turned me on to a couple of years ago helps, the famous line from Theodore Roosevelt in his Man in the Arena speech at the Sorbonne:
It is not the critic who counts: not the man who points out how the strong man stumbles or where the doer of deeds could have done better. The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood, who strives valiantly, who errs and comes up short again and again, because there is no effort without error or shortcoming, but who knows the great enthusiasms, the great devotions, who spends himself for a worthy cause; who, at the best, knows, in the end, the triumph of high achievement, and who, at the worst, if he fails, at least he fails while daring greatly, so that his place shall never be with those cold and timid souls who knew neither victory nor defeat.
I know this narrative is no high achievement, but we all do what we can, and this seems within my marginal capacities.