Despite Michael Best’s (@NatSecGeek) efforts at collecting the Podesta emails for convenient bulk download (Podesta Emails Zipped), the bulk downloads don’t appear to have attracted much attention: some 276 views as of today.
Many of us deeply appreciate Michael’s efforts and would like to see the press and others take fuller advantage of this remarkable resource.
To encourage you in that direction, what follows is a very dirty script that tests the DKIM signatures in the emails and writes selected fields from each message to a “|” delimited file.
#!/usr/bin/python

import dateutil.parser
import email
import dkim
import glob

output = open("verify.txt", 'w')
output.write("id|verified|date|from|to|subject|message-id\n")

for name in glob.glob('*.eml'):
    f = open(name, 'r')
    data = f.read()
    f.close()
    msg = email.message_from_string(data)
    # dkim.verify() returns True only when a valid DKIM signature is present
    verified = dkim.verify(data)
    date = dateutil.parser.parse(msg['date'])
    # Collapse internal whitespace; headers are often folded across lines
    msg_from = " ".join(str(msg['from']).split())
    msg_to = " ".join(str(msg['to']).split())
    msg_subject = " ".join(str(msg['subject']).split())
    msg_message_id = msg['message-id']
    output.write(name + '|' + str(verified) + '|' + str(date) +
                 '|' + msg_from + '|' + msg_to + '|' + msg_subject +
                 '|' + str(msg_message_id) + "\n")

output.close()
Download podesta-test.tar.gz and unpack it to a directory, then save/unpack test-clinton-script-24Oct2016.py.gz to the same directory and run:
python test-clinton-script-24Oct2016.py
Import the resulting verify.txt into Gnumeric and, with some formatting, your content should look like: test-clinton-24Oct2016.gnumeric.gz.
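If you would rather sanity-check verify.txt programmatically before importing it into a spreadsheet, here is a minimal sketch using Python’s csv module with a “|” delimiter. Only the column layout comes from the script above; the sample rows written below are made up purely for illustration:

```python
import csv

# Made-up sample rows in the same "|"-delimited layout the script writes
sample = (
    "id|verified|date|from|to|subject|message-id\n"
    "1.eml|True|2015-01-01 00:00:00|a@example.com|b@example.com|Hi|<1@x>\n"
    "2.eml|False|2015-01-02 00:00:00|c@example.com|d@example.com|Re: Hi|<2@x>\n"
)
with open("verify.txt", "w") as f:
    f.write(sample)

# DictReader keys each field by the header row, so columns can be
# referenced by name rather than position
with open("verify.txt") as f:
    reader = csv.DictReader(f, delimiter="|")
    rows = list(reader)

# Count how many messages passed DKIM verification
passed = sum(1 for row in rows if row["verified"] == "True")
print("%d of %d messages verified" % (passed, len(rows)))
```

Because the subject lines can contain commas but are unlikely to contain “|”, the pipe delimiter keeps this kind of round-trip simple.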
Verifying cryptographic signatures takes a while, even on this sample of 754 files, so don’t be impatient.
This script leaves much to be desired and as you can see, the results aren’t perfect by any means.
Comments and/or suggestions welcome!
This is just the first step in extracting information from this data set that could be used with similar data sets.
For example, if you want to graph this data, how are you going to construct IDs for the nodes, given the repetition of some nodes in the data set?
How are you going to model those relationships?
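One possible answer, sketched below with made-up addresses: normalize each From/To mailbox with email.utils.parseaddr and lowercase it, so the same correspondent always maps to the same node ID regardless of display-name or capitalization differences. The helper name and the integer-ID scheme are my own assumptions, not part of the script above:

```python
from email.utils import parseaddr

def node_id(header_value, registry):
    """Map a From/To header value to a stable node ID, reusing IDs for repeats."""
    # parseaddr splits '"John Podesta" <jp@example.com>' into (name, address)
    addr = parseaddr(header_value)[1].lower()
    if addr not in registry:
        registry[addr] = len(registry)  # assign the next integer ID
    return registry[addr]

nodes = {}
# The same address under different display names collapses to one node
a = node_id('"John Podesta" <john.podesta@example.com>', nodes)
b = node_id('John.Podesta@example.com', nodes)
c = node_id('someone.else@example.com', nodes)
print(a, b, c)  # a and b are the same node; c is distinct
```

Edges would then connect the From node to each To node per message, which is one simple way to start modeling the relationships.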
Bonus question: Is this output clean enough to run the script against the full data set, which is growing on a daily basis?