Archive for the ‘Email’ Category

Email Address Vacuuming – Infoga

Wednesday, January 31st, 2018

Infoga – Email Information Gathering

From the post:

Infoga is a tool for gathering e-mail accounts information (ip,hostname,country,…) from different public sources (search engines, pgp key servers). Is a really simple tool, but very effective for the early stages of a penetration test or just to know the visibility of your company in the Internet.

It’s not COMINT:

COMINT or communications intelligence is intelligence gained through the interception of foreign communications, excluding open radio and television broadcasts. It is a subset of signals intelligence, or SIGINT, with the latter being understood as comprising COMINT and ELINT, electronic intelligence derived from non-communication electronic signals such as radar. (COMINT (Communications Intelligence))

as practiced by the NSA, but that doesn’t keep it from being useful.

Not gathering useless data means a smaller haystack and a greater chance of finding needles.

Other focused information mining tools you would recommend?

OnionShare – Safely Sharing Email Leaks – 394 Days To Mid-terms

Sunday, October 8th, 2017

FiveThirtyEight concludes Clinton’s leaked emails had some impact on the 2016 presidential election, but can’t say how much. How Much Did WikiLeaks Hurt Hillary Clinton?

Had the leaked emails been less boring and inconsequential, had they been “smoking gun” sort of emails, their impact could have been substantial.

The lesson being the impact of campaign/candidate/party emails is impossible to judge until they have been leaked. Even then the impact may be uncertain.

“Leaked emails” presumes someone has leaked the emails, which, in light of the 2016 presidential election, is a near certainty for the 2018 congressional mid-term elections.

Should you find yourself in possession of leaked emails, you may want a way to share them with others. My preference is for public posting without edits or deletions, but not everyone shares my confidence in the public.

One way to share files securely and anonymously with specific people is OnionShare.

From the wiki page:

What is OnionShare?

OnionShare lets you securely and anonymously share files of any size. It works by starting a web server, making it accessible as a Tor onion service, and generating an unguessable URL to access and download the files. It doesn’t require setting up a server on the internet somewhere or using a third party filesharing service. You host the file on your own computer and use a Tor onion service to make it temporarily accessible over the internet. The other user just needs to use Tor Browser to download the file from you.

How to Use

http://asxmi4q6i7pajg2b.onion/egg-cain. This is the secret URL that can be used to download the file you’re sharing.

Send this URL to the person you’re sending the files to. If the files you’re sending aren’t secret, you can use normal means of sending the URL, like by emailing it, or sending it in a Facebook or Twitter private message. If you’re sending secret files then it’s important to send this URL securely.

The person who is receiving the files doesn’t need OnionShare. All they need is to open the URL you send them in Tor Browser to be able to download the file.
(emphasis in original)

Download OnionShare 1.1. Versions are available for Windows, Mac OS X, with instructions for Ubuntu, Fedora and other flavors of Linux.

Caveat: If you are sending a secret URL to leaked emails or other leaked data, use ordinary mail, no return address, standard envelope from a package of them you discard, on the back of a blank counter deposit slip, with letters from a newspaper, taped in the correct order, sent to the intended recipient. (No licking, it leaves trace DNA.)

Those are the obvious security points about delivering a secret URL. Take that as a starting point.

PS: I would never contact the person chosen for sharing about shared emails. They can be verified separate and apart from you as the source. Every additional contact puts you in increased danger of becoming part of a public story. What they don’t know, they can’t tell.

99% of UK Law Firms Ripe For Email Fraud

Thursday, September 21st, 2017

The actual title of the report is: Addressing Cyber Risks Identified in the SRA Risk Outlook Report 2016/17. Yawn. Not exactly an attention grabber.

The report does have this nifty graphic:

The Panama Papers originated from a law firm.

Have you ever wondered what the top 100 law firms in the UK must be hiding?

Or any of the other 10,325 law firms operating in the UK? (Total number of law firms: 10,425.)

If hackers feasting on financial fraud develop a sense of public duty, radical transparency will not be far behind.

Conclusive Reason To NOT Use Gmail

Thursday, April 20th, 2017

Using an email service, Gmail for example, that tracks (and presumably reads) your incoming and outgoing mail is poor security judgement.

Following a California magistrate ruling on 19 April 2017, it’s suicidal.

Shaun Nichols covers the details in Nuh-un, Google, you WILL hand over emails stored on foreign servers, says US judge.

But the only part of the decision that should interest you reads:

The court denies Google’s motion to quash the warrant for content that it stores outside the United States and orders it to produce all content responsive to the search warrant that is retrievable from the United States, regardless of the data’s actual location.

Beeler takes heart from the dissents in In the Matter of a Warrant to Search a Certain E-Mail Account Controlled & Maintained by Microsoft Corp., 829 F.3d 197 (2d Cir. 2016), reh’g denied en banc, No. 14-2985, 2017 WL 362765 (2d Cir. Jan. 24, 2017), to find if data isn’t intentionally stored outside the US, and can be accessed from within the US, then it’s subject to a warrant under 18 U.S.C. § 2703(a), the Stored Communications Act (“SCA”).

I have a simpler perspective: Do you want to risk fortune and freedom on a how many angels can dance on the head of 18 U.S.C. § 2703(a), the Stored Communications Act (“SCA”), question?

If your answer is no, don’t use Gmail. Or any other service where data can be accessed from the United States, for 18 U.S.C. § 2703(a), or from other jurisdictions with similar statutes.

For that matter, prudent users restrict themselves to Tor based mail services and always use strong encryption.

Almost any communication can be taken as a crime or step in a conspiracy by a prosecutor inclined to do so.

The only partially safe haven is silence. (Where encryption and/or inability to link you to the encrypted communication = silence.)

9,477 DKIM Verified Clinton/Podesta Emails (of 39,878 total (today))

Monday, October 31st, 2016

Still working on the email graph and at the same time, managed to catch up on the Clinton/Podesta drops by Michael Best, @NatSecGeek, at least for a few hours.

DKIM-verified-podesta-1-24.txt.gz is a sub-set of 9,477 emails that have been verified by their DKIM keys.

The statements in or data attached to those emails may still be false. DKIM verification only validates the email being the same as when it left the email server, nothing more.

DKIM-complete-podesta-1-24.txt.gz is the full set of Podesta emails to date, some 39,878, with their DKIM results of either True or False.

Both files have these fields:

ID – 1| Verified – 2| Date – 3| From – 4| To – 5| Subject – 6| Message-Id – 7
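A minimal sketch of reading those seven fields, assuming the files are pipe-delimited with one email per line, no header row, and no stray pipes inside subjects (none of which I have stated above, so check against the actual files). The field names, the `read_records` helper and the sample rows are made up for illustration.

```python
import csv
import io

# Hypothetical names for the seven fields listed above, in order.
FIELDS = ["id", "verified", "date", "from", "to", "subject", "message_id"]


def read_records(text):
    """Yield one dict per well-formed line of pipe-delimited text."""
    reader = csv.reader(io.StringIO(text), delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(FIELDS):
            # Wrong field count (blank line, or a pipe inside a subject):
            # skip it here; a real script should log it for inspection.
            continue
        record = dict(zip(FIELDS, (cell.strip() for cell in row)))
        # The DKIM result is recorded as the text "True" or "False".
        record["verified"] = record["verified"] == "True"
        yield record


# Invented sample rows, not taken from the Podesta files.
SAMPLE = (
    "1|True|2008-11-05|a@example.com|b@example.com|Hello|<msg-1@example.com>\n"
    "2|False|2008-11-06|c@example.com|d@example.com|Re: Hello|<msg-2@example.com>\n"
)

records = list(read_records(SAMPLE))
verified = [r for r in records if r["verified"]]
print(len(verified), "of", len(records), "verified")
```

For the real files, `gzip.open(path, "rt")` from the standard library will hand back text you can pass through the same reader.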

Question: Have you seen any news reports that mention emails being “verified” in their reporting?

Emails in the complete set may be as accurate as those in the verified set, but I would think verification is newsworthy in and of itself.


Parsing Emails With Python, A Quick Tip

Monday, October 31st, 2016

While some stuff runs in the background, a quick tip on parsing email with Python.

I got the following error message from Python:

Traceback (most recent call last):
  File "", line 20, in
    date = dateutil.parser.parse(msg['date'])
  File "/usr/lib/python2.7/dist-packages/dateutil/", line 697, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/usr/lib/python2.7/dist-packages/dateutil/", line 301, in parse
    res = self._parse(timestr, **kwargs)
  File "/usr/lib/python2.7/dist-packages/dateutil/", line 349, in _parse
    l = _timelex.split(timestr)
  File "/usr/lib/python2.7/dist-packages/dateutil/", line 143, in split
    return list(cls(s))
  File "/usr/lib/python2.7/dist-packages/dateutil/", line 137, in next
    token = self.get_token()
  File "/usr/lib/python2.7/dist-packages/dateutil/", line 68, in get_token
    nextchar =
AttributeError: 'NoneType' object has no attribute 'read'

I have edited the email header in question but it reproduces the original error:

Received: by with SMTP id w14cs34683wfw;
Wed, 5 Nov 2008 08:11:39 -0800 (PST)
Received: by with SMTP id r1mr728791wad.136.1225901498795;
Wed, 05 Nov 2008 08:11:38 -0800 (PST)
Received: from ( [])
by with ESMTP id m26si29354pof.3.2008.;
Wed, 05 Nov 2008 08:11:38 -0800 (PST)
Received-SPF: pass ( domain of designates
Received: from ([])
by with comcast
id bUBY1a0010b6N64A9UBeJl; Wed, 05 Nov 2008 16:11:38 +0000
Received: from ([])
by with comcast
id bUAV1a00L2JMgtY8PUAV7G; Wed, 05 Nov 2008 16:10:30 +0000
X-Authority-Analysis: v=1.0 c=1 a=1Ht49J2nGmlg0oY3xr8A:9
a=8nxvWDfACCTtBObdks-tTUtrMyYA:4 a=OA_lqj45gZcA:10 a=diNjy0DT58-4uIkuavEA:9
a=e0_VUgpf8QEu0XMU188OmzzKrzoA:4 a=37WNUvjkh6kA:10
Received: from [] by;
Wed, 05 Nov 2008 16:10:28 +0000

To: “Podesta” ,
CC: “Denis McDonough OFA” ,”,,
Subject: DOD leadership – immediate attention
Date: Wed, 05 Nov 2008 16:10:28 +0000
Message-Id: <110520081610.3048.4911C574000C2E2100000BE82216>
X-Mailer: AT&T Message Center Version 1 (Oct 30 2007)
X-Authenticated-Sender: c2V3YWxsY29ucm95QGNvbWNhc3QubmV0
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="NextPart_Webmail_9m3u9jl4l_3048_1225901428_0"

Content-Type: text/plain
Content-Transfer-Encoding: 8bit

I’m comparing “Date” to similar emails and getting no joy.

Absence is hard to notice, but once you know the rule, it’s obvious:

RFC822: Standard for ARPA Internet Text Messages says in part:

3. Lexical Analysis of Messages

3.1 General Description

A message consists of header fields and, optionally, a body. The body is simply a sequence of lines containing ASCII characters. It is separated from the headers by a null line (i.e., a line with nothing preceding the CRLF). (emphasis added)

Yep, the blank line I introduced, while removing an errant double-quote on a line by itself, created the start of the body of the message.

Meaning that my Python script failed to find the “Date:” field and returned what someone thought would be a useful error message.

When you get errors parsing emails with Python (and I assume in other languages), check the format of your messages!
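A minimal sketch of the same failure, assuming Python 3 and the standard library email package rather than the Python 2.7 and dateutil combination in the traceback above. The two sample messages and the `safe_date` helper are invented for illustration. The point is only that a stray blank line ends the headers, so the lookup for “Date” comes back as None, and it is cheaper to check for that than to decode the parser’s complaint.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Well-formed: all headers, then one blank line, then the body.
GOOD = (
    "To: someone@example.com\n"
    "Subject: DOD leadership\n"
    "Date: Wed, 05 Nov 2008 16:10:28 +0000\n"
    "\n"
    "Body text.\n"
)

# Broken: a stray blank line after "To:" ends the header block early,
# so "Subject:" and "Date:" are treated as part of the body.
BROKEN = (
    "To: someone@example.com\n"
    "\n"
    "Subject: DOD leadership\n"
    "Date: Wed, 05 Nov 2008 16:10:28 +0000\n"
    "\n"
    "Body text.\n"
)


def safe_date(raw_message):
    """Return the Date header as a datetime, or None if there is no Date header."""
    msg = message_from_string(raw_message)
    value = msg["Date"]
    if value is None:
        # Missing header: report that plainly instead of passing None
        # to a date parser.
        return None
    return parsedate_to_datetime(value)


print(safe_date(GOOD))
print(safe_date(BROKEN))
# The "lost" headers are sitting in the body of the broken message.
print("Date:" in message_from_string(BROKEN).get_payload())
```

If the second call returns None on a message you believe has a date, look for a blank line in the headers before looking anywhere else.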

RFC822 has an appendix of parsing rules and a few examples.

Suggested listings of the most common email/email header format errors?

Email Write Order

Friday, April 8th, 2016

Email Write Order by David Sparks.

This is more of a reminder to me than you but I pass it along in case you are interested.

From the post:

I’ve always had a gripe with email application developers concerning the way they want us to write emails. When you go to write an email, the tab order is all out of whack.

The default write order starts out with you selecting the recipient for your message, which makes enough sense, but then everything goes off the rails. Next, it wants you to type in the subject line for a message you haven’t written yet. Because you haven’t written the message, there is a bit of mental friction between us getting our thoughts together and making a cogent subject line at that time, so we skip it or just leave it with whatever the mail client added (e.g., “re: re: re: re: re: That Thing”).

Next, the application wants you to write the body of your message. Rarely does the application even prompt you to add an attachment, which means about half the time you’ll forget to add an attachment. Because the default write order is all out of whack, so are the messages we often send using it. It makes a lot more sense to add attachments next and then write the body of the message before filling out the subject line and sending. I’ve got an alternative write order that makes a lot more sense.

Suggestion: Try David’s alternative order for the next twenty or so emails you send.

Whether you bite on his $9.99 Email Field Guide or not, I think you will find the alternate order a useful exercise.


Data Modeling – FoundationDB

Saturday, February 15th, 2014

Data Modeling – FoundationDB

From the webpage:

FoundationDB’s core provides a simple data model coupled with powerful transactions. This combination allows building richer data models and libraries that inherit the scalability, performance, and integrity of the database. The goal of data modeling is to design a mapping of data to keys and values that enables effective storage and retrieval. Good decisions will yield an extensible, efficient abstraction. This document covers the fundamentals of data modeling with FoundationDB.

Great preparation for these tutorials using the tuple layer of FoundationDB:

The Class Scheduling tutorial introduces the fundamental concepts needed to design and build a simple application using FoundationDB, beginning with basic interaction with the database and walking through a few simple data modeling techniques.

The Enron Email Corpus tutorial introduces approaches to loading data in FoundationDB and further illustrates data modeling techniques using a well-known, publicly available data set.

The Managing Large Values and Blobs tutorial discusses approaches to working with large data objects in FoundationDB. It introduces the blob layer and illustrates its use to build a simple file library.

The Lightweight Query Language tutorial discusses a layer that allows Datalog to be used as an interactive query language for FoundationDB. It describes both the FoundationDB binding and the use of the query language itself.


Enron, Email, Kiji, Hive, YARN, Tez (Jan. 7th, DC)

Monday, January 6th, 2014

Exploring Enron Email Dataset with Kiji and Hive; Apache YARN and Apache Tez Hadoop-DC.

Tuesday, January 7, 2014 6:00 PM to 9:30 PM
Neustar (Room: Neuview) 21575 Ridgetop Circle, Sterling, VA

From the webpage:

Exploring Enron Email Dataset with Kiji and Hive

Lee Sheng, WibiData

Apache Hive is a data warehousing system for large volumes of data stored in Hadoop that provides SQL based access for exploring datasets. KijiSchema provides evolvable schemas of primitive and compound types on top of HBase. The integration between these provides the best aspects of both worlds (ad hoc SQL based querying on top of datasets using evolvable schemas containing complex objects). This talk will present an examples of queries utilizing this integration to do exploratory analysis of the Enron email corpus. Delving into topics such as email responder pairs and sentiment analysis can expose many of the interesting points in the rise and fall of Enron.

Apache YARN & Apache Tez

Tom McCuch Technical Director, Hortonworks

Apache Hadoop has become synonymous with Big Data and powers large scale data processing across some of the biggest companies in the world. Hadoop 2 is the next generation release of Hadoop and marks a pivotal point in its maturity with YARN – the new Hadoop compute framework. YARN – Yet Another Resource Negotiator – is a complete re-architecture of the Hadoop compute stack with a clean separation between platform and application. This opens up Hadoop data processing to new applications that can be executed IN Hadoop instead of outside Hadoop, thus improving efficiency, performance, data sharing and lowering operation costs. The Big Data ecosystem is already converging on YARN with new applications like Apache Tez being written specifically for YARN. Apache Tez aims to provide high performance and efficiency out of the box, across the spectrum of low latency queries and heavy-weight batch processing. The talk will provide a brief overview of key Hadoop 2 innovations, focusing in on YARN and Tez – covering architecture, motivational use cases and future roadmap. Finally, the impact of YARN on the Hadoop community will be demonstrated through running interactive queries with both Hive on Tez and with Hive on MapReduce, and comparing their performance side-by-side on the same Hadoop 2 cluster.

When I saw the low tomorrow in DC is going to be 16F and the high 21F, I thought I should pass this along.

Does anyone have a very large set of phone metadata that is public?

Thinking rather than grinding over Enron’s stumbles, again, phone metadata could be hands-on training for a variety of careers. 😉

Looking forward to seeing videos of these presentations!

Email Indexing Using Cloudera Search [Stepping Beyond “Hello World”]

Thursday, September 26th, 2013

Email Indexing Using Cloudera Search by Jeff Shmain

From the post:

Why would any company be interested in searching through its vast trove of email? A better question is: Why wouldn’t everybody be interested?

Email has become the most widespread method of communication we have, so there is much value to be extracted by making all emails searchable and readily available for further analysis. Some common use cases that involve email analysis are fraud detection, customer sentiment and churn, lawsuit prevention, and that’s just the tip of the iceberg. Each and every company can extract tremendous value based on its own business needs.

A little over a year ago we described how to archive and index emails using HDFS and Apache Solr. However, at that time, searching and analyzing emails were still relatively cumbersome and technically challenging tasks. We have come a long way in document indexing automation since then — especially with the recent introduction of Cloudera Search, it is now easier than ever to extract value from the corpus of available information.

In this post, you’ll learn how to set up Apache Flume for near-real-time indexing and MapReduce for batch indexing of email documents. Note that although this post focuses on email data, there is no reason why the same concepts could not be applied to instant messages, voice transcripts, or any other data (both structured and unstructured).

If you want a beyond “Hello World” introduction to: Flume, Solr, Cloudera Morphlines, HDFS, Hue’s Search application, and Cloudera Search, this is the post for you.

With the added advantage that you can apply the basic principles in this post as you expand your knowledge of the Hadoop ecosystem.

Targeting Phishing Victims

Friday, July 26th, 2013

Profile of Likely E-mail Phishing Victims Emerges in Human Factors/Ergonomics Research

From the webpage:

The author of a paper to be presented at the upcoming 2013 International Human Factors and Ergonomics Society Annual Meeting has described behavioral, cognitive, and perceptual attributes of e-mail users who are vulnerable to phishing attacks. Phishing is the use of fraudulent e-mail correspondence to obtain passwords and credit card information, or to send viruses.

In “Keeping Up With the Joneses: Assessing Phishing Susceptibility in an E-mail Task,” Kyung Wha Hong, Christopher M. Kelley, Rucha Tembe, Emergson Murphy-Hill, and Christopher B. Mayhorn, discovered that people who were overconfident, introverted, or women were less able to accurately distinguish between legitimate and phishing e-mails. She had participants complete a personality survey and then asked them to scan through both legitimate and phishing e-mails and either delete suspicious or spam e-mails, leave legitimate e-mails as is, or mark e-mails that required actions or responses as “important.”

“The results showed a disconnect between confidence and actual skill, as the majority of participants were not only susceptible to attacks but also overconfident in their ability to protect themselves,” says Hong. Although 89% of the participants indicted they were confident in their ability to identify malicious e-mails, 92% of them misclassified phishing e-mails. Almost 52% in the study misclassified more than half the phishing e-mails, and 54% deleted at least one authentic e-mail.

I would say that “behavioral, cognitive, and perceptual attributes” are a basis for identifying users. Or at least a certain type of users as a class.

Or to put it another way, a class of users is just as much a subject for discussion in a topic map as any user individually.

It may be more important, whether targeting users for exploitation or for protection, to treat them as a class than as individuals.

BTW, these attributes don’t sound amenable to IRI identifiers or binary assignment choices.

Gmail Email analysis with Neo4j – and spreadsheets

Thursday, April 25th, 2013

Gmail Email analysis with Neo4j – and spreadsheets by Rik Van Bruggen.

From the post:

A bunch of different graphistas have pointed out to me in recent months that there is something funny about Graphs and email. Specifically, about graphs and email analysis. From my work in previous years at security companies, I know that Email Forensics is actually big business. Figuring out who emails whom, about what topics, with what frequency, at what times – is important. Especially when the proverbial sh*t hits the fan and fraud comes to light – like in the Enron case. How do I get insight into email traffic? How do I know what was communicated to who? And how do I get that insight, without spending a true fortune?

An important demonstration that sophisticated data analysis may originate with fairly pedestrian authoring tools.

For the Enron emails, see: Enron Email Dataset. Reported to be 0.5M messages, approximately 423Mb, tarred and gzipped.

The topic map question is what to do with separate graphs of:

  • Enron emails,
  • Enron corporate structure,
  • Social relationships between Enron employees and others,
  • Documents of other types interchanged or read inside of Enron,
  • Travel and expense records, and,
  • Phone logs inside Enron?

Graphs of any single data set can be interesting.

Merging graphs of inter-related data sets can be powerful.
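A toy sketch of what merging buys you, assuming (and it is a big assumption) that the separate graphs already use the same identifier for the same person. Deciding when two identifiers name the same subject is the hard part and is not shown. The names, relationship labels and the `merge_graphs` helper are all hypothetical.

```python
from collections import defaultdict


def merge_graphs(*graphs):
    """Merge labeled edge lists into one mapping of (source, target) -> labels."""
    merged = defaultdict(set)
    for label, edges in graphs:
        for source, target in edges:
            # The same pair of nodes seen in several data sets
            # accumulates one label per data set.
            merged[(source, target)].add(label)
    return dict(merged)


# Invented data: three small graphs over the same (hypothetical) people.
email_graph = ("emailed", [("alice", "bob"), ("bob", "carol")])
org_graph = ("reports_to", [("bob", "alice"), ("carol", "alice")])
phone_graph = ("phoned", [("alice", "bob")])

merged = merge_graphs(email_graph, org_graph, phone_graph)

# A question no single graph can answer: who both emailed and phoned whom?
both = sorted(
    pair for pair, labels in merged.items() if {"emailed", "phoned"} <= labels
)
print(both)
```

Keeping the source label on every edge means the merged graph never loses track of which data set a relationship came from.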

Day Nine of a Predictive Coding Narrative: A scary search…

Wednesday, August 8th, 2012

Day Nine of a Predictive Coding Narrative: A scary search for false-negatives, a comparison of my CAR with the Griswold’s, and a moral dilemma by Ralph Losey.

From the post:

In this sixth installment I continue my description, this time covering day nine of the project. Here I do a quality control review of a random sample to evaluate my decision in day eight to close the search.

Ninth Day of Review (4 Hours)

I began by generating a random sample of 1,065 documents from the entire null set (95% +/- 3%) of all documents not reviewed. I was going to review this sample as a quality control test of the adequacy of my search and review project. I would personally review all of them to see if any were False Negatives, in other words, relevant documents, and if relevant, whether any were especially significant or Highly Relevant.

I was looking to see if there were any documents left on the table that should have been produced. Remember that I had already personally reviewed all of the documents that the computer had predicted were like to be relevant (51% probability). I considered the upcoming random sample review of the excluded documents to be a good way to check the accuracy of reliance on the computer’s predictions of relevance.

I know it is not the only way, and there are other quality control measures that could be followed, but this one makes the most sense to me. Readers are invited to leave comments on the adequacy of this method and other methods that could be employed instead. I have yet to see a good discussion of this issue, so maybe we can have one here.

I can appreciate Ralph’s apprehension at a hindsight review of decisions already made. In legal proceedings, decisions are made and they move forward. Some judgements/mistakes can be corrected, others are simply case history.
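For anyone wondering where a sample of about that size comes from, here is a rough sketch using the textbook sample-size formula for a proportion at 95% confidence with a 3% margin, plus the usual finite population correction. This is my arithmetic, not necessarily how Ralph or his software arrived at 1,065. The `sample_size` helper is hypothetical, and the population figure below is the 699,082 total from the series, used only as a stand-in because the size of the null set is not given in the quoted text.

```python
import math


def sample_size(margin, z=1.96, p=0.5, population=None):
    """Sample size for estimating a proportion.

    margin: half-width of the confidence interval (0.03 for +/- 3%)
    z: z-score for the confidence level (1.96 is the usual value for 95%)
    p: assumed proportion; 0.5 is the most conservative choice
    population: if given, apply the finite population correction
    """
    n = (z ** 2) * p * (1 - p) / (margin ** 2)
    if population is not None:
        # The correction shrinks the sample slightly for a finite collection.
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


print(sample_size(0.03))
# 699,082 is the whole collection, an assumed stand-in for the null set.
print(sample_size(0.03, population=699082))
```

The uncorrected figure comes out a little above the number Ralph quotes, and the corrected figure lands within a document or two of it, which suggests (but does not prove) that a correction of this kind was applied.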

Days Seven and Eight of a Predictive Coding Narrative [Re-Use of Analysis?]

Wednesday, August 8th, 2012

Days Seven and Eight of a Predictive Coding Narrative: Where I have another hybrid mind-meld and discover that the computer does not know God by Ralph Losey.

From the post:

In this fifth installment I will continue my description, this time covering days seven and eight of the project. As the title indicates, progress continues and I have another hybrid mind-meld moment. I also discover that the computer does not recognize the significance of references to God in an email. This makes sense logically, but is unexpected and kind of funny when encountered in a document review.

Ralph discovered new terms to use for training as the analysis of the documents progressed.

While Ralph captures those for his use, my question would be how to capture what he learned for re-use?

As in re-use by other parties, perhaps in other litigation.

Thinking of reducing the cost of discovery by sharing analysis of data sets, rather than every discovery process starting at ground zero.

Days Five and Six of a Predictive Coding Narrative

Friday, July 27th, 2012

Days Five and Six of a Predictive Coding Narrative: Deep into the weeds and a computer mind-meld moment by Ralph Losey.

From the post:

This is my fourth in a series of narrative descriptions of an academic search project of 699,082 Enron emails and attachments. It started as a predictive coding training exercise that I created for Jackson Lewis attorneys. The goal was to find evidence concerning involuntary employee terminations. The first day of search is described in Day One of a Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron. The second day is described in Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane. The third and fourth days are described in Days Three and Four of a Predictive Coding Narrative: Where I find that the computer is free to disagree.

In this fourth installment I continue to describe what I did in days five and six of the project. In this narrative I go deep into the weeds and describe the details of multimodal search. Near the end of day six I have an affirming hybrid multimodal mind-meld moment, which I try to describe. I conclude by sharing some helpful advice I received from Joseph White, one of Kroll Ontrack’s (KO) experts on predictive coding and KO’s Inview software. Before I launch into the narrative, a brief word about vendor experts. Don’t worry, it is not going to be a commercial for my favorite vendors; more like a warning based on hard experience.

You will learn a lot about predictive analytics and e-discovery from this series of posts, but the most important paragraphs I have read thus far are:

When talking to the experts, be sure that you understand what they say to you, and never just nod in agreement when you do not really get it. I have been learning and working with new computer software of all kinds for over thirty years, and am not at all afraid to say that I do not understand or follow something.

Often you cannot follow because the explanation is so poor. For instance, often the words I hear from vendor tech experts are too filled with company specific jargon. If what you are being told makes no sense to you, then say so. Keep asking questions until it does. Do not be afraid of looking foolish. You need to be able to explain this. Repeat back to them what you do understand in your own words until they agree that you have got it right. Do not just be a parrot. Take the time to understand. The vendor experts will respect you for the questions, and so will your clients. It is a great way to learn, especially when it is coupled with hands-on experience.

Insisting that experts explain until you understand what is being said will help you avoid costly mistakes and make you more sympathetic to a client’s questions when you are the expert.

The technology and software for predictive coding will change beyond recognition in a few short years.

Demanding and giving explanations that “explain” is a skill that will last a lifetime.

Days Three and Four of a Predictive Coding Narrative: Where I find that the computer is free to disagree

Friday, July 27th, 2012

Days Three and Four of a Predictive Coding Narrative: Where I find that the computer is free to disagree by Ralph Losey.

From the post:

This is the third in a series of detailed descriptions of a legal search project. The project was an academic training exercise for Jackson Lewis e-discovery liaisons conducted in May and June 2012. I searched a set of 699,082 Enron emails and attachments for possible evidence pertaining to involuntary employee terminations. The first day of search is described in Day One of a Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron. The second day is described in Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane.

The description of day-two was short, but it was preceded by a long explanation of my review plan and search philosophy, along with a rant in favor of humanity and against over-dependence on computer intelligence. Here I will just stick to the facts of what I did in days three and four of my search using Kroll Ontrack’s (KO) Inview software.

Interesting description of where Ralph and the computer disagree on relevant/irrelevant judgement on documents.

Unless I just missed it, Ralph is only told by the software what rating a document was given, not why the software arrived at that rating. Yes?

If you knew what terms drove a particular rating, it would be interesting to “comment out” those terms in a document to see the impact on its relevance rating.

Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane

Friday, July 13th, 2012

Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane by Ralph Losey.

From the post:

Day One of the search project ended when I completed review of the initial 1,507 machine-selected documents and initiated the machine learning. I mentioned in the Day One narrative that I would explain why the sample size was that high. I will begin with that explanation and then, with the help of William Webber, go deeper into math and statistical sampling than ever before. I will also give you the big picture of my review plan and search philosophy: its hybrid and multimodal. Some search experts disagree with my philosophy. They think I do not go far enough to fully embrace machine coding. They are wrong. I will explain why and rant on in defense of humanity. Only then will I conclude with the Day Two narrative.

More than you are probably going to want to know about sample sizes and their calculation but persevere until you get to the defense of humanity stuff. It is all quite good.

If I had to add a comment on the defense of humanity rant, it would be that machines have a flat view of documents and not the richly textured one of a human reader. While it is true that machines can rapidly compare documents without tiring, they will miss an executive referring to a secretary as his “cupcake.” A reference that would jump out at a human reader. Same text, different result.

Perhaps because in one case the text is being scanned for tokens and in the other case it is being read.

Day One of a Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron

Friday, July 13th, 2012

Day One of a Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron by Ralph Losey.

The start of a series of posts on predictive coding and searching of the Enron emails by a lawyer. A legal perspective is important enough that I will be posting a note about each post in this series as they occur.

A couple of preliminary notes:

I am sure this is the first time that Ralph has used predictive coding with the Enron emails. On the other hand, I would not take “…this is the first time for X…” sort of claims from any vendor or service organization. 😉

You can see other examples of processing the Enron emails at:

And that is just a “lite” scan. There are numerous other projects that use the Enron email collection.

I wonder if that is because we are naturally nosey?

From the post:

This is the first in a series of narrative descriptions of a legal search project using predictive coding. Follow along while I search for evidence of involuntary employee terminations in a haystack of 699,082 Enron emails and attachments.

Joys and Risks of Being First

To the best of my knowledge, this writing project is another first. I do not think anyone has ever previously written a blow-by-blow, detailed description of a large legal search and review project of any kind, much less a predictive coding project. Experts on predictive coding speak only from a mile high perspective; never from the trenches (you can speculate why). That has been my practice here, until now, and also my practice when speaking about predictive coding on panels or in various types of conferences, workshops, and classes.

There are many good reasons for this, including the main one that lawyers cannot talk about their client’s business or information. That is why in order to do this I had to run an academic project and search and review the Enron data. Many people could do the same. In fact, each year the TREC Legal Track participants do similar search projects of Enron data. But still, no one has taken the time to describe the details of their search, not even the spacey TRECkies (sorry Jason).

A search project like this takes an enormous amount of time. In fact, to my knowledge (Maura, please correct me if I’m wrong), no Legal Track TRECkies have ever recorded and reported the time that they put into the project, although there are rumors. In my narrative I will report the amount of time that I put into the project on a day-by-day basis, and also, sometimes, on a per task basis. I am a lawyer. I live by the clock and have done so for thirty-two years. Time is important to me, even non-money time like this. There is also a not-insignificant amount of time it takes to write it up a narrative like this. I did not attempt to record that.

There is one final reason this has never been attempted before, and it is not trivial: the risks involved. Any narrator who publicly describes their search efforts assumes the risk of criticism from monday morning quarterbacks about how the sausage was made. I get that. I think I can handle the inevitable criticism. A quote that Jason R. Baron turned me on to a couple of years ago helps, the famous line from Theodore Roosevelt in his Man in the Arena speech at the Sorbonne:

It is not the critic who counts: not the man who points out how the strong man stumbles or where the doer of deeds could have done better. The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood, who strives valiantly, who errs and comes up short again and again, because there is no effort without error or shortcoming, but who knows the great enthusiasms, the great devotions, who spends himself for a worthy cause; who, at the best, knows, in the end, the triumph of high achievement, and who, at the worst, if he fails, at least he fails while daring greatly, so that his place shall never be with those cold and timid souls who knew neither victory nor defeat.

I know this narrative is no high achievement, but we all do what we can, and this seems within my marginal capacities.

Reverse engineering targeted emails from 2012 Campaign

Thursday, May 31st, 2012

Reverse engineering targeted emails from 2012 Campaign

Nathan Yau writes:

After noticing the Obama campaign was sending variations of an email to voters, ProPublica identified six distinct types with certain demographics and showed the differences. It was called the Message Machine. Now ProPublica is taking it a step further, hoping to dissect every email from all 2012 campaigns.

Fewer emails than in e-discovery or email archives.

Same or different tools/techniques?