Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

October 3, 2016

“Just the texts, Ma’am, just the texts” – Colin Powell Emails Sans Attachments

Filed under: Colin Powell Emails,Government,Politics,Uncategorized — Patrick Durusau @ 7:55 pm

As I reported in Bulk Access to the Colin Powell Emails – Update, I was looking for a host for the complete Colin Powell emails at 2.5 GB, but I failed on that score.

I can’t say whether that result reflects a lack of interest in making the full emails easily available or whether I didn’t ask the right people. Please circulate my request when you have time.

In the meantime, I have been jumping from one “easy” solution to another, most of which involved parsing the .eml files.

But my requirement is to separate the attachments from the emails, quickly and easily, not to parse the .eml files in preparation for further processing.

How does a 22-character, command-line sed expression sound?

Do you know of an “easier” solution?

sed -i '/base64/,$d' *

My reasoning: the first attachment (in the event of multiple attachments) will include the string “base64”, so I pass a range expression that starts at the first line matching that string and ends at the last line of the message, “$”. The d command deletes every line in that range, and “-i” writes the result back to each file in place.
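The one-liner overwrites the originals and matches “base64” anywhere, including in a message body. A more cautious variation, offered only as a sketch, works on a copy and matches only the Content-Transfer-Encoding header line. The function name strip_attachments and both directory arguments are my own illustrative names, not anything from the original run.

```shell
#!/bin/sh
# Sketch only: strip_attachments and its two directory arguments are
# illustrative names, not part of the original post.
strip_attachments() {
    src=$1
    dest=$2
    # Header names are case-insensitive, so spell out both cases portably.
    pattern='^[Cc]ontent-[Tt]ransfer-[Ee]ncoding:[[:space:]]*[Bb][Aa][Ss][Ee]64'
    mkdir -p "$dest"
    for f in "$src"/*; do
        [ -f "$f" ] || continue
        # Delete from the first base64 header line through the end of the file,
        # writing the result to a copy instead of editing in place.
        sed "/$pattern/,\$d" "$f" > "$dest/$(basename "$f")"
    done
}
```

Called as strip_attachments emails emails-text, this leaves the source directory untouched. Like the one-liner, it keeps any MIME boundary and Content-Type lines that precede the matched header, and it will also cut a message whose own body is base64 encoded.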

There are far more sophisticated solutions to this problem but as crude as this may be, I have reduced the 2.5 GB archive file that includes all the emails and their attachments down to 63 megabytes.

Attachments are important too but my first steps were to make these and similar files more accessible.

Obtaining > 29K files through the drinking straw at DCLeaks, or waiting until I find a host for a consolidated 2.5 GB file, doesn’t make these files more accessible.

A 63 MB download of the Colin Powell Emails With No Attachments may.

Please feel free to mirror these files.

PS: One oddity I noticed in testing the download. With Chrome, the file size inflates to 294MB. With Mozilla, the file size is 65MB. Both unpack properly. Suggestions?
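One guess, and it is only a guess, is that one browser transparently decompresses the gzip layer during download and saves a plain tar file under the same name. Checking the first two bytes of each copy would tell the two cases apart. The helper name is_gzip is illustrative.

```shell
#!/bin/sh
# Sketch only: is_gzip is an illustrative helper name.
is_gzip() {
    # A gzip stream begins with the two magic bytes 0x1f 0x8b.
    magic=$(od -An -tx1 -N2 "$1" | tr -d ' \n')
    [ "$magic" = "1f8b" ]
}
```

Running is_gzip on each downloaded copy and comparing the results would either support or rule out that explanation.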

PPS: More sophisticated processing of the raw emails and other post-processing to follow.

September 27, 2016

Bulk Access to the Colin Powell Emails – Update

Filed under: Colin Powell Emails,Government,Journalism,News,Politics,Reporting — Patrick Durusau @ 7:31 pm

Still working on finding a host for the 2.5 GB tarred, gzipped archive of the Colin Powell emails.

As an alternative, working on splitting the attachments (the main source of bulk) from the emails themselves.

My thinking at this point is to produce a message-only version of the emails. Emails with attachments will have auto-generated links to the source emails at DCLeaks.com.

Other processing is planned for the message-only version of the emails.

Anyone interested in indexing the attachments? Generating lists of those with pointers shouldn’t be a problem.
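A rough sketch of such a list, assuming the emails are unpacked into a local directory: print the name of every file that carries a base64-encoded part. The function name list_with_attachments is illustrative, and the output is local file names only, since the link format at DCLeaks is not given here.

```shell
#!/bin/sh
# Sketch only: list_with_attachments and its directory argument are
# illustrative names.
list_with_attachments() {
    # -r recurses into the directory, -l prints only the names of matching
    # files, -i ignores case in the header name and value.
    grep -r -l -i '^Content-Transfer-Encoding:[[:space:]]*base64' "$1"
}
```

This flags any base64 part, so a message whose own body is base64 encoded would be listed as well.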

Hope to have more progress to report tomorrow!

September 26, 2016

Bulk Access to the Colin Powell Emails

Filed under: Colin Powell Emails,Government,Politics — Patrick Durusau @ 7:26 pm

The Colin Powell Email leak is important, but if you visit the DCLeaks page for Powell emails, June, July and August of 2014, this is what you find:

[Image: dc-leaks-search-460]

If you attempt to use the “search” box, you discover that your search is limited to June, July and August of 2014.

Then you remember the main page:

[Image: dcleaks-powell-contents-460]

Which means every search must be repeated thirteen (13) times to find all relevant emails.

The phone is ringing, your pager is going off, emails and IMs are piling up and you’re on deadline. How useful is this interface to you as a reporter?

Have your own methods for processing large leaks of documents?

Not relevant here because access to the Powell emails is one email at a time.

Put your drinking straw into a lake of 29,641 emails.

Best of luck with that drinking straw approach.

I’m suggesting a different approach.

What if someone automated that drinking straw and created a mirrored set of those 29,641 emails? Along with correcting the twelve (12) emails that choked a .eml to .mbox converter.
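With a mirror unpacked locally, the thirteen separate searches collapse into one command. A minimal sketch, assuming the mirror sits in a single directory tree; the function name search_mirror and its arguments are illustrative.

```shell
#!/bin/sh
# Sketch only: search_mirror and its arguments are illustrative names.
search_mirror() {
    dir=$1
    term=$2
    # -r recurses, -l lists matching files, -i ignores case,
    # -F treats the search term as a literal string, not a regex.
    grep -r -l -i -F -- "$term" "$dir" | sort
}
```

For example, search_mirror powell-emails cheney would list every file mentioning the term. Plain grep sees only the raw text, so message bodies stored as base64 or quoted-printable can be missed.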

Interested?

Hosting Request: The full data set runs 2.5 GB, which, if popular, is far more traffic than I can support.

Requirements for hosting:

  1. Distribute the file as delivered to you.
  2. Distribute the file for free.

If you are interested, drop me a line at: patrick@durusau.net.

Warning: I have not checked the files or their attachments for malware, hostile links, etc. Open untrusted files in VMs without network connections. At a minimum.

Test your interest against the emails for March-April of 2016: powell-sample.tar.gz. (roughly 108MB)

Manipulation, enhancement and analysis of samples and the full set to follow.

September 25, 2016

Colin Powell Email Files

Filed under: Colin Powell Emails,Government,Journalism,News,Politics,Reporting — Patrick Durusau @ 8:43 pm

DCLeaks.com posted on September 14, 2016, a set of emails to and from Colin Luther Powell.

From the homepage for those leaked emails:

Colin Luther Powell is an American statesman and a retired four-star general in the United States Army. He was the 65th United States Secretary of State, serving under U.S. President George W. Bush from 2001 to 2005, the first African American to serve in that position. During his military career, Powell also served as National Security Advisor (1987–1989), as Commander of the U.S. Army Forces Command (1989) and as Chairman of the Joint Chiefs of Staff (1989–1993), holding the latter position during the Persian Gulf War. Born in Harlem as the son of Jamaican immigrants, Powell was the first, and so far the only, African American to serve on the Joint Chiefs of Staff, and the first of two consecutive black office-holders to serve as U.S. Secretary of State.

The leaked emails start in June of 2014 and end in August of 2016.

Access to the emails is by browsing and/or full text searching.

Try your luck at finding Powell’s comments on Hillary Clinton or former Vice-President Cheney. Searching one chunk of emails at a time.

I appreciate and admire DCLeaks for taking the lead in posting this and similar materials. And I hope they continue to do so in the future.

However, the access offered reduces a good leak to a random trickle.

This series will use the Colin Powell emails to demonstrate better leaking practices.

Coming Monday, September 26, 2016 – Bulk Access to the Colin Powell Emails.
