Finding “unknown string format” in 1.7 GB of files – Parsing Clinton/Podesta Emails

Testing my “dirty” script against Podesta Emails (1.7 GB), some 17,296 files, I got the following message:

Traceback (most recent call last):
File “test-clinton-script-24Oct2016.py”, line 20, in
date = dateutil.parser.parse(msg[‘date’])
File “/usr/lib/python2.7/dist-packages/dateutil/parser.py”, line 697, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File “/usr/lib/python2.7/dist-packages/dateutil/parser.py”, line 303, in parse
raise ValueError, “unknown string format”
ValueError: unknown string format

Now I have to find the file that broke the script.

Beginning Python programmers are laughing at this point because they know using:

for name in glob.glob('*.eml'):

is going to make finding the offending file difficult.

Why?

Consulting the programming oracle (Stack Overflow) on ordering of glob.glob in Python I learned:

By checking the source code of glob.glob you see that it internally calls os.listdir, described here:

http://docs.python.org/library/os.html?highlight=os.listdir#os.listdir

Key sentence: os.listdir(path) Return a list containing the names of the entries in the directory given by path. The list is in arbitrary order. It does not include the special entries ‘.’ and ‘..’ even if they are present in the directory.

Arbitrary order. 🙂

Interesting but not quite an actionable answer!

Take a look out:

Order is arbitrary, but you can sort them yourself

If you want sorted by name:

sorted(glob.glob('*.png'))

sorted by modification time:

import os
sorted(glob.glob('*.png'), key=os.path.getmtime)

sorted by size:

import os
sorted(glob.glob('*.png'), key=os.path.getsize)

etc.

So for ease in finding the offending file(s) I adjusted:

for name in glob.glob('*.eml'):

to:

for name in sorted(glob.glob('*.eml')):

Now I can tail the results file in question and the next file is where the script failed.

More on the files that failed in a separate post.

Comments are closed.