Exploring the Enron Spreadsheet/Email Archive

I forgot to say yesterday that if you cite the work of Felienne Hermans and Emerson Murphy-Hill on the Enron archive, use this citation:

  author    = {Felienne Hermans and
               Emerson Murphy-Hill},
  title     = {Enron's Spreadsheets and Related Emails: A Dataset and Analysis},
  booktitle = {37th International Conference on Software Engineering, {ICSE} '15},
  note      = {to appear}

A couple of interesting tidbits from this morning.

Non-Matching Spreadsheet Names

If you look at:

(local)/84_JUDY_TOWNSEND_000_1_1.PST/townsend-j/JTOWNSE (Non-Privileged)/Inbox/_1687004514.eml

You will find that David.Jones@ENRON.com (sender) sent an email with Tport Max Rates Calculations 10-27-01.xls attached to fletcjv@NU.COM and cc:ed “Concannon” and “Townsend.” (Potential subjects in bold.)

I selected this completely at random, save for looking for an email that uses the word “spreadsheet.”

If you look in the spreadsheet archive, you will not find “Tport Max Rates Calculations 10-27-01.xls,” at least not by that name. You will find: “judy_townsend__17745__Tport Max Rates Calculations 10-27-01.xlsx.”
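If you want to go from an archive file name back to the attachment name in the email, something like the following might do. It is only a sketch: the pattern (owner, two underscores, a numeric id, two underscores, the original name, then .xlsx) is my guess from the single example above, and the function names are made up for illustration.

```python
import re

# Assumed naming pattern, inferred from one example only:
#   <owner>__<numeric id>__<original attachment stem>.xlsx
ARCHIVE_NAME = re.compile(
    r"^(?P<owner>.+?)__(?P<id>\d+)__(?P<stem>.+)\.xlsx$", re.IGNORECASE
)


def split_archive_name(filename):
    """Return (owner, id, stem) for an archive file name, or None if it does not fit."""
    match = ARCHIVE_NAME.match(filename)
    if match is None:
        return None
    return match.group("owner"), int(match.group("id")), match.group("stem")


def matches_attachment(archive_filename, attachment_filename):
    """True if the archive file appears to be the converted form of the attachment."""
    parts = split_archive_name(archive_filename)
    if parts is None:
        return False
    # Compare stems only, since the archive copy is .xlsx and the attachment was .xls.
    attachment_stem = re.sub(r"\.xlsx?$", "", attachment_filename, flags=re.IGNORECASE)
    return parts[2].lower() == attachment_stem.lower()


if __name__ == "__main__":
    archive = "judy_townsend__17745__Tport Max Rates Calculations 10-27-01.xlsx"
    print(split_archive_name(archive))
    print(matches_attachment(archive, "Tport Max Rates Calculations 10-27-01.xls"))
```

Matching on the stem alone will produce false positives if two people attached different files with the same name, so the owner prefix is worth checking too.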

I don’t know when that conversion took place but thought it was worth noting. BTW, the spreadsheet archive has 15,871 .xlsx files and 58 .xls files. Michelle Lokay has 32 of the 58 .xls files, but they all appear to be duplicated by files with the .xlsx extension.

Given the small number, I suspect an anomaly in a bulk conversion process. When I do group operations on the spreadsheets I will be using the .xlsx extension only to avoid duplicates.
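A rough way to check those counts and the duplication is sketched below. It works on a plain list of file names, so how you list the archive directory is up to you, and the function names are mine, not anything from the dataset.

```python
from collections import Counter
from pathlib import PurePath


def extension_counts(filenames):
    """Count files per lowercased extension, e.g. '.xlsx' and '.xls'."""
    return Counter(PurePath(name).suffix.lower() for name in filenames)


def xls_without_xlsx_twin(filenames):
    """Return the .xls files that have no same-named .xlsx file alongside them."""
    xlsx_stems = {
        PurePath(name).stem.lower()
        for name in filenames
        if PurePath(name).suffix.lower() == ".xlsx"
    }
    return [
        name
        for name in filenames
        if PurePath(name).suffix.lower() == ".xls"
        and PurePath(name).stem.lower() not in xlsx_stems
    ]


def xlsx_only(filenames):
    """Keep only the .xlsx files, which is the set used for group operations."""
    return [name for name in filenames if PurePath(name).suffix.lower() == ".xlsx"]


if __name__ == "__main__":
    sample = ["a.xlsx", "a.xls", "b.xls", "c.XLSX"]  # made-up names for illustration
    print(extension_counts(sample))
    print(xls_without_xlsx_twin(sample))
    print(xlsx_only(sample))
```

If xls_without_xlsx_twin comes back empty on the real archive, dropping the .xls files loses nothing.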

Dirty, Very Dirty Data

I was just randomly opening spreadsheets when I encountered this jewel:


Using rows to format column headers. There are worse examples, try:


No column headers at all! (On this tab.)

I am beginning to suspect that the conversion to .xlsx format was to enable the use of better tooling to explore the files that were originally .xls.

Be sure to register for Balisage 2015 if you want to see the outcome of all this running around!

Tomorrow I think we are going to have a conversation about indexing email with Solr. Having all 15K spreadsheets doesn’t tell me which ones were mentioned most often in email.
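Short of Solr, a crude first pass at "most often" is to count how many times each spreadsheet name shows up as an attachment. The sketch below uses only the Python standard library email package and takes raw message bytes, so reading the .eml files from disk is left to you. It counts attachments, not mentions in the message body, and the function names are hypothetical.

```python
from collections import Counter
from email import message_from_bytes, policy


def spreadsheet_attachments(raw_message):
    """Return the .xls/.xlsx attachment file names found in one raw email message."""
    message = message_from_bytes(raw_message, policy=policy.default)
    names = []
    for part in message.iter_attachments():
        name = part.get_filename()
        if name and name.lower().endswith((".xls", ".xlsx")):
            names.append(name)
    return names


def count_attachments(raw_messages):
    """Tally lowercased spreadsheet attachment names across many raw messages."""
    counts = Counter()
    for raw in raw_messages:
        counts.update(name.lower() for name in spreadsheet_attachments(raw))
    return counts
```

Calling count_attachments(...).most_common(20) on the bytes of every .eml file would give a first ranking to compare against whatever the index shows.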
