Despite Michael Best’s (@NatSecGeek) efforts at collecting the Podesta emails for convenient bulk download (Podesta Emails Zipped), the bulk downloads don’t appear to have attracted much attention: some 276 views as of today.
Many of us deeply appreciate Michael’s efforts and would like to see the press and others take fuller advantage of this remarkable resource.
To encourage you in that direction, what follows is a very dirty script that tests the DKIM signatures in the emails and writes selected fields from each message to a “|” delimited file.
#!/usr/bin/python

import dateutil.parser
import email
import dkim
import glob

output = open("verify.txt", 'w')
output.write("id|verified|date|from|to|subject|message-id\n")

for name in glob.glob('*.eml'):
    f = open(name, 'r')
    data = f.read()
    f.close()
    msg = email.message_from_string(data)
    # dkim.verify() returns True only when a valid DKIM signature is present
    verified = dkim.verify(data)
    date = dateutil.parser.parse(msg['date'])
    # Collapse internal whitespace; headers are often folded across lines
    msg_from = " ".join(str(msg['from']).split())
    msg_to = " ".join(str(msg['to']).split())
    msg_subject = " ".join(str(msg['subject']).split())
    msg_message_id = msg['message-id']
    output.write(name + '|' + str(verified) + '|' + str(date) +
                 '|' + msg_from + '|' + msg_to + '|' + msg_subject +
                 '|' + str(msg_message_id) + "\n")

output.close()
Download podesta-test.tar.gz and unpack it to a directory, then save/unpack test-clinton-script-24Oct2016.py.gz to the same directory and run:
python test-clinton-script-24Oct2016.py
Import the resulting verify.txt into Gnumeric and, with some formatting, your content should look like: test-clinton-24Oct2016.gnumeric.gz.
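If you would rather sanity-check verify.txt programmatically before importing it into a spreadsheet, here is a minimal sketch using Python’s csv module with a “|” delimiter. Only the column layout comes from the script above; the sample rows written below are made up purely for illustration:

```python
import csv

# Made-up sample rows in the same "|"-delimited layout the script writes
sample = (
    "id|verified|date|from|to|subject|message-id\n"
    "1.eml|True|2015-01-01 00:00:00|a@example.com|b@example.com|Hi|<1@x>\n"
    "2.eml|False|2015-01-02 00:00:00|c@example.com|d@example.com|Re: Hi|<2@x>\n"
)
with open("verify.txt", "w") as f:
    f.write(sample)

# DictReader keys each field by the header row, so columns can be
# referenced by name rather than position
with open("verify.txt") as f:
    reader = csv.DictReader(f, delimiter="|")
    rows = list(reader)

# Count how many messages passed DKIM verification
passed = sum(1 for row in rows if row["verified"] == "True")
print("%d of %d messages verified" % (passed, len(rows)))
```

Because the subject lines can contain commas but are unlikely to contain “|”, the pipe delimiter keeps this kind of round-trip simple.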
Verifying cryptographic signatures takes a while, even on this sample of 754 files, so don’t be impatient.
This script leaves much to be desired and as you can see, the results aren’t perfect by any means.
Comments and/or suggestions welcome!
This is just the first step in extracting information from this data set that could be used with similar data sets.
For example, if you want to graph this data, how are you going to construct IDs for the nodes, given the repetition of some nodes in the data set?
How are you going to model those relationships?
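One possible answer, sketched below with made-up addresses: normalize each From/To mailbox with email.utils.parseaddr and lowercase it, so the same correspondent always maps to the same node ID regardless of display-name or capitalization differences. The helper name and the integer-ID scheme are my own assumptions, not part of the script above:

```python
from email.utils import parseaddr

def node_id(header_value, registry):
    """Map a From/To header value to a stable node ID, reusing IDs for repeats."""
    # parseaddr splits '"John Podesta" <jp@example.com>' into (name, address)
    addr = parseaddr(header_value)[1].lower()
    if addr not in registry:
        registry[addr] = len(registry)  # assign the next integer ID
    return registry[addr]

nodes = {}
# The same address under different display names collapses to one node
a = node_id('"John Podesta" <john.podesta@example.com>', nodes)
b = node_id('John.Podesta@example.com', nodes)
c = node_id('someone.else@example.com', nodes)
print(a, b, c)  # a and b are the same node; c is distinct
```

Edges would then connect the From node to each To node per message, which is one simple way to start modeling the relationships.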
Bonus question: Is this output clean enough to run the script against the full data set, which is growing on a daily basis?