Archive for the ‘Hillary Clinton’ Category

Marketing Advice For Shadow Brokers

Tuesday, May 16th, 2017

Shadow Brokers:

I read your post OH LORDY! Comey Wanna Cry Edition outlining your plans for:

In June, TheShadowBrokers is announcing “TheShadowBrokers Data Dump of the Month” service. TheShadowBrokers is launching new monthly subscription model. Is being like wine of month club. Each month peoples can be paying membership fee, then getting members only data dump each month. What members doing with data after is up to members.

TheShadowBrokers Monthly Data Dump could be being:

  • web browser, router, handset exploits and tools
  • select items from newer Ops Disks, including newer exploits for Windows 10
  • compromised network data from more SWIFT providers and Central banks
  • compromised network data from Russian, Chinese, Iranian, or North Korean nukes and missile programs

More details in June.

OR IF RESPONSIBLE PARTY IS BUYING ALL LOST DATA BEFORE IT IS BEING SOLD TO THEPEOPLES THEN THESHADOWBROKERS WILL HAVE NO MORE FINANCIAL INCENTIVES TO BE TAKING CONTINUED RISKS OF OPERATIONS AND WILL GO DARK PERMANENTLY YOU HAVING OUR PUBLIC BITCOIN ADDRESS
… (emphasis in original)

I don’t know your background in subscription marketing, but I don’t see Shadow Brokers as meeting the criteria for a successful subscription business. See 9 Keys to Building a Successful Subscription Business.

Unless you want to get into a vulnerability as commodity business, with its attendant needs for a large subscriber base, advertising, tech support, etc., with every service layer adding more exposure, I just don’t see it. The risk of exposure is too great and the investment before profit too large.

I don’t feel much better about a bulk purchase from a major government or spy agency. The likely buyers already have the same or similar data so don’t have an acquisition motive.

Moreover, likely buyers don’t trust the Shadow Brokers. As a one-time seller, Shadow Brokers could collect for the “lost data” and then release it for free in the wild.

You say that isn’t the plan of Shadow Brokers, but likely buyers are untrustworthy and expect the worst of others.

If I’m right and traditional subscription and/or direct sales models aren’t likely to work, that doesn’t mean that a sale of the “lost data” is impossible.

Consider the Wikileaks strategy with the Podesta emails.

The Podesta emails were replete with office chatter, backbiting remarks, and other trivia.

Despite the lack of intrinsic value, their importance was magnified by the release of small chunks of texts, each of which might include something important.

With each release, mainstream media outlets such as the New York Times, the Washington Post, and others went into a frenzy of coverage.

Those were non-technical data, so a similar strategy with the “lost data” will require supplemental, explanatory materials for the press.

Dumping one or two tasty morsels every Friday, for example, will extend media coverage, not to mention building public outrage that could, no guarantees, force one or more governments to pony up for the “lost data.”

Hard to say unless you try.

PS: For anyone who thinks this post runs afoul of “aiding hackers” prohibitions, you have failed to consider the most likely alternate identity of Shadow Brokers, that of the NSA itself.

Ask yourself:

Who wants real time surveillance of all networks? (NSA)

What will drive acceptance of real time surveillance of all networks? (Hint, ongoing and widespread data breaches.)

Who wants to drive adoption of Windows 10? (Assuming NSA agents wrote backdoors into the 50 to 60 million lines of code in Windows 10.)

Would a government that routinely assassinates people and overthrows other governments hesitate to put ringers to work at Microsoft? Or other companies?

Is suborning software verboten? (Your naiveté is shocking.)

Russian Comfort Food For Clinton Supporters

Saturday, December 24th, 2016

Just in time for Christmas, CrowdStrike published:

Use of Fancy Bear Android Malware in Tracking of Ukrainian Field Artillery Units

Anti-Russian Propaganda

The cover art reminds me of the “Better Dead than Red” propaganda from deep inside the Cold War.

crowdstrike-fancy-bear-460

Compare that with an anti-communist poster from 1953:

1953-poster-6241f-460

Anonymous, [ANTI-COMMUNIST POSTER SHOWING RUSSIAN SOLDIER AND JOSEPH STALIN STANDING OVER GRAVES IN FOREGROUND; CANNONS AND PEOPLE MARCHING TO SIBERIA IN BACKGROUND] (1953) courtesy of Library of Congress [LC-USZ62-117876].

Notice any similarities? Sixty-three years separate those two images and yet the person who produced the second one would immediately recognize the first one. And vice versa.

Apparently, Judy Woodruff, who interviewed Dmitri Alperovitch, co-founder of CrowdStrike, and Thomas Rid, a professor at King’s College London, for Security company releases new evidence of Russian role in DNC hack (PBS Fake News Hour), didn’t bother to look at the cover of the report covered by her “interview.”

Not to comment on Judy’s age, but the resemblance to 1950s and 1960s anti-communist propaganda would be obvious to anyone in her graduating class.

Evidence or Rather the Lack of Evidence

Leaving aside Judy’s complete failure to notice from its cover that this is anti-Russian propaganda, let’s compare the “evidence” Judy discusses with Alperovitch:

[Judy Woodruff]

Dmitri Alperovitch, let me start with you. What is this new information?

DMITRI ALPEROVITCH, CrowdStrike: Well, this is an interesting case we’ve uncovered actually all the way in Ukraine where Ukraine artillerymen were targeted by the same hackers who were called Fancy Bear, that targeted the DNC, but this time, they were targeting their cell phones to understand their location so that the Russian military and Russian artillery forces can actually target them in the open battle.

JUDY WOODRUFF: So, this is Russian military intelligence who got hold of information about the weapons, in essence, that the Ukrainian military was using, and was able to change it through malware?

DMITRI ALPEROVITCH: Yes, essentially, one Ukraine officer built this app for his Android phone that he gave out to his fellow officers to control the settings for the artillery pieces that they were using, and the Russians actually hacked that application, put their malware in it and that malware reported back the location of the person using the phone.

JUDY WOODRUFF: And so, what’s the connection between that and what happened to the Democratic National Committee?

DMITRI ALPEROVITCH: Well, the interesting is that it was the same variant of the same malicious code that we have seen at the DNC. This was a phone version. What we saw at the DNC was personal computers, but essentially, it was the same source used by this actor that we call Fancy Bear.

And when you think about, well, who would be interested in targeting Ukraine artillerymen in eastern Ukraine who has interest in hacking the Democratic Party, Russia government comes to find, but specifically, Russian military that would have operational over forces in the Ukraine and would target these artillerymen.

JUDY WOODRUFF: So, just quickly, in the sense, these are like cyber fingerprints? Is that what we’re talking about?

DMITRI ALPEROVITCH: Essentially the DNA of this malicious code that matches to the DNA that we saw at the DNC.

That may sound compelling, at least until you read the CrowdStrike report. Unlike Judy/PBS, I include a link so you can review it for yourself: Use of Fancy Bear Android Malware in Tracking of Ukrainian Field Artillery Units.

The report consists of a series of un-numbered pages, but in order:

Coverpage: (the anti-Russian artwork)

Key Points: Conclusions without evidence (1 page)

Background: Repetition of conclusions (1 page)

Timelines: No real relationship to the question of malware (2 pages)

Timeline of Events: Start of prose that might contain “evidence” (6 pages)

OK, let’s take:

the Russians actually hacked that application, put their malware in it and that malware reported back the location of the person using the phone.

as an example.

Contrary to his confidence in the interview, page 7 of the report says:


Crowdstrike has discovered indications that as early as 2015 FANCY BEAR likely developed X-Agent applications for the iOS environment, targeting “jailbroken” Apple mobile devices. The use of the X-Agent implant in the original Попр-Д30.apk application appears to be the first observed case of FANCY BEAR malware developed for the Android mobile platform. On 21 December 2014 the malicious variant of the Android application was first observed in limited public distribution on a Russian language, Ukrainian military forum. A late 2014 public release would place the development timeframe for this implant sometime between late-April 2013 and early December 2014.

I’m sorry, but do you see any evidence in “…indications…” and/or “likely developed…?”

It’s a different way of restating what you saw in the Key Points and Background, but otherwise, it’s simply repetition of Crowdstrike’s conclusions.

That’s ok if you already agree with Crowdstrike’s conclusions, I suppose, but it should be deeply unsatisfying for a news reporter.

Judy Woodruff should have said:

Imagined question from Woodruff:

I understand your report says Fancy Bear is connected with this malware but you don’t state any facts on which you base that conclusion. Is there another report our listeners can review for those details?

If you see that question in the transcript ping me. I missed it.

What About Calling the NSA?

If Woodruff had even a passing acquaintance with Clifford Stoll’s Cuckoo’s Egg (tracing a hacker from a Berkeley computer to a home in Germany), she could have asked:

Thirty years ago, Clifford Stoll wrote in the Cuckoo’s Egg about the tracking of a hacker from a computer in Berkeley to his home in Germany. Crowdstrike claims to have caught the hackers “red handed”.

The internet has grown more complicated in thirty years and tracking more difficult. Why didn’t Crowdstrike ask for help from the NSA in tracking those hackers?

I didn’t see that question being asked. Did you?

Tracking internet traffic is beyond the means of Crowdstrike, but several nation states are rumored to be sifting backbone traffic every day.

Factual Confusion and Catastrophe at CrowdStrike

The most appalling part of the Crowdstrike report is its admixture of alleged fact, speculation and wishful thinking.

Consider its assessment of the spread and effectiveness of the alleged malware (without more evidence, I would not even concede that it exists):

  1. CrowdStrike assesses that Попр-Д30.apk was potentially used through 2016 by at least one artillery unit operating in eastern Ukraine. (page 6)

  2. Open-source reporting indicates losses of almost 50% of equipment in the last 2 years of conflict amongst Ukrainian artillery forces and over 80% of D-30 howitzers were lost, far more than any other piece of Ukrainian artillery (page 8)

  3. A malware-infected Попр-Д30.apk application probably could not have provided all the necessary data required to directly facilitate the types of tactical strikes that occurred between July and August 2014. (page 8)

  4. The X-Agent Android variant does not exhibit a destructive function and does not interfere with the function of the original Попр-Д30.apk application. Therefore, CrowdStrike Intelligence has assessed that the likely role of this malware is strategic in nature. (page 9)

  5. Additionally, a study provided by the International Institute of Strategic Studies determined that the weapons platform bearing the highest losses between 2013 and 2016 was the D-30 towed howitzer.11 It is possible that the deployment of this malware infected application may have contributed to the high-loss nature of this platform. (page 9)

Judy Woodruff and her listeners don’t have to be military experts to realize that the claim of “one artillery unit” (#1) is hard to reconcile with the loss of “over 80% of D-30 howitzers” (#2). Nor do the claims that the malware was “strategic” in nature (#3, #4) sit well with the “high-loss” nature described in #5.

The so-called “report” by CrowdStrike is a repetition of conclusions drawn on evidence (alleged to exist), the nature and scope of which are concealed from the reader.

Conclusion

However badly Clinton supporters want to believe in Russian hacking of the DNC, this report offers nothing of the kind. It creates the illusion of evidence that deceives only those already primed to accept its conclusions.

Unless and until Crowdstrike releases real evidence, logs, malware (including prior malware and how it was obtained), etc., this must be filed under “fake news.”

Drip, Drip, Drip, Leaking At Wikileaks

Monday, November 7th, 2016

wikileaks-dnc-460

Two days before the U.S. Presidential election, Wikileaks released 8,200 emails from the Democratic National Committee (DNC). Which were in addition to its daily drip, drip, drip leaking of emails from John Podesta, Hillary Clinton’s campaign chair.

The New York Times, a sometimes collaborator with Wikileaks (The War Logs (NYT)), has sponsored a series of disorderly and nearly incoherent attacks on Wikileaks for these leaks.

The dominant theme in those attacks is that readers should not worry their shallow and insecure minds about social media but rely upon media outlets to clearly state any truth readers need to know.

I am not exaggerating. The exact language that appears in one such attack was:

…people rarely act like rational, civic-minded automatons. Instead, we are roiled by preconceptions and biases, and we usually do what feels easiest — we gorge on information that confirms our ideas, and we shun what does not.

Is that how you think of yourself? It is how the New York Times thinks about you.

There are legitimate criticisms concerning Wikileaks and its drip, drip, drip leaking but the Times manages to miss all of them.

For example, the daily drops of Podesta emails, selected on some “unknown to the public” criteria, prevented the creation of a coherent narrative by reporters and the public. The next day’s leak might contain some critical link, or not.

Reporters, curators and the public were teased with drips and drabs of information, which served to drive traffic to the Wikileaks site, traffic that serves no public interest.

If that sounds ungenerous, consider that, as the game draws to a close, Wikileaks has finally posted a link to the Podesta emails in bulk: https://file.wikileaks.org/file/podesta-emails/.

To be sure, some use has been made of the Podesta emails: my work and that of others on DKIM signatures (unacknowledged by Wikileaks when it “featured” such verification on email webpages), graphs, etc. But an early bulk release of the emails would have enabled much more.

For example:

  • Concordances of the emails and merging those with other sources
  • Connecting the dots to data that is public or known only to others
  • Entity recognition and linking in extant resources and news stories
  • Fitting the events, people, places into coherent narratives
  • Sentiment analysis
  • etc.

All of that lost because of the “Wikileaks look at me” strategy for releasing the Podesta emails.

I applaud Wikileaks obtaining and leaking data, including the Podesta emails, but a “look at me” strategy impairs the full exploration and use of leaked data.

Is that really the goal of Wikileaks?

PS: If you are interested in leaking without games or redaction, ping me. I’m interested in helping with such leaks.

Clinton/Podesta Map (through #30)

Saturday, November 5th, 2016

Charlie Grapski created Navigating Wikileaks: A Guide to the Podesta Emails.

podesta-map-grapski-460

The listing takes 365 pages to date, so this is just a tiny sample image.

I don’t have a legend for the row coloring but have tweeted to Charlie about the same.

Enjoy!

Does Verification Matter? Clinton/Podesta Emails Update

Wednesday, November 2nd, 2016

As of today, 10,357 DKIM Verified Clinton/Podesta Emails (of 43,526 total). That’s releases 1-26.

I ask “Does Verification Matter?” in the title to this post because of the seeming lack of interest in verification of emails in the media. Not that it would ever be a lead, but some mention of the verified/not status of an email seems warranted.

Every Clinton/Podesta story mentions Anthony Weiner’s interest in sharing his sexual insecurities, and nary a peep about the false Clinton/Obama/Clapper claims that emails have been altered. Easy enough to check. But no specifics are given or requested by the press.

Thanks to the Clinton/Podesta drops by Michael Best, @NatSecGeek, I have now uploaded:

DKIM-verified-podesta-1-26.txt.gz is a sub-set of 10,357 emails that have been verified by their DKIM keys.

The statements in or data attached to those emails may still be false. DKIM verification only validates that an email is the same as when it left the email server, nothing more.

DKIM-complete-podesta-1-26.txt.gz is the full set of Podesta emails to date, some 43,526, with their DKIM results of either True or False.

Both files have these fields:

ID – 1 | Verified – 2 | Date – 3 | From – 4 | To – 5 | Subject – 6 | Message-Id – 7
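As a quick check on files in that layout, the Verified column can be filtered and counted with awk. A minimal sketch with two made-up sample rows (the real files use the same “|” separator):

```shell
# Two illustrative rows in the seven-field layout described above.
printf '001 a.eml|False|2010-10-06|A|B|Subj1|id1@x\n' > sample.txt
printf '002 b.eml|True|2015-04-14|C|D|Subj2|id2@x\n' >> sample.txt

awk -F'|' '$2 == "True"' sample.txt           # print only DKIM-verified rows
awk -F'|' '$2 == "True"' sample.txt | wc -l   # count them: 1
```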

Enjoy!

PS: Perhaps verification doesn’t matter when the media repeats false and/or delusional statements of DNI Clapper in hopes of…, I don’t know what they are hoping for but I am hoping they are dishonest, not merely stupid.

How To DeDupe Clinton/Weiner/Abedin Emails….By Tomorrow

Tuesday, November 1st, 2016

The report by Halimah Abdullah, FBI Working to Winnow Through Emails From Anthony Weiner’s Laptop, casts serious doubt on the technical prowess of the FBI when it says:


Officials have been combing through the emails since Sunday night — using a program designed to find only the emails to and from Abedin within the time when Clinton was secretary of state. Agents will compare the latest batch of messages with those that have already been investigated to determine whether any classified information was sent from Clinton’s server.

This process will take some time, but officials tell NBC News that they hope that they will wrap up the winnowing process this week.

Since Sunday night?

Here’s how the FBI, using standard Unix tools, could have finished the “winnowing” in time for the Monday evening news cycle:

  1. Transform (if not already) all the emails into .eml format (to give you separate files for each email).
  2. Grep the resulting file set for emails that contain the Clinton email server by name or address.
  3. Save the result of #2 to a file and copy all those messages to a separate directory.
  4. Extract the digital signature from each of the copied messages (see below); save the digital signature + file name where found to the Abedin file.
  5. Extract the digital signatures from previously reviewed Clinton email server emails, save digital signatures only to the prior-Clinton-review file.
  6. Search for each digital signature in the Abedin file in the prior-Clinton-review file. If found, reviewed. If not found, new email.

The digital signatures are unique to each email and can therefore be used to dedupe or in this case, identify previously reviewed emails.
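The comparison in steps 4 through 6 reduces to a set difference on signature lists, which sort and comm handle directly. A minimal sketch with made-up signatures (real ones would come from the DKIM b= tags):

```shell
# reviewed.sigs: one signature per line, from the prior Clinton review.
# abedin.sigs:   signature + filename pairs from the new laptop set.
printf 'sigAAA\nsigBBB\n' > reviewed.sigs
printf 'sigAAA\tmail1.eml\nsigCCC\tmail2.eml\n' > abedin.sigs

cut -f1 abedin.sigs | sort -u > abedin.sorted
sort -u reviewed.sigs > reviewed.sorted
# Signatures in the new set but absent from the prior review: sigCCC
comm -23 abedin.sorted reviewed.sorted
```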

Here’s a DKIM example signature:

How can I read the DKIM header?

Here is an example DKIM signature (recorded as an RFC2822 header field) for the signed message:

DKIM-Signature: a=rsa-sha1; q=dns;
d=example.com;
i=user@eng.example.com;
s=jun2005.eng; c=relaxed/simple;
t=1117574938; x=1118006938;
h=from:to:subject:date;
b=dzdVyOfAKCdLXdJOc9G2q8LoXSlEniSb
av+yuU4zGeeruD00lszZVoG4ZHRNiYzR

Let’s take this piece by piece to see what it means. Each “tag” is associated with a value.

  • b = the actual digital signature of the contents (headers and body) of the mail message
  • bh = the body hash
  • d = the signing domain
  • s = the selector
  • v = the version
  • a = the signing algorithm
  • c = the canonicalization algorithm(s) for header and body
  • q = the default query method
  • l = the length of the canonicalized part of the body that has been signed
  • t = the signature timestamp
  • x = the expire time
  • h = the list of signed header fields, repeated for fields that occur multiple times

We can see from this email that:

  • The digital signature is dzdVyOfAKCdLXdJOc9G2q8LoXSlEniSb
    av+yuU4zGeeruD00lszZVoG4ZHRNiYzR
    .
    This signature is matched with the one stored at the sender’s domain.
  • The body hash is not listed.
  • The signing domain is example.com.
    This is the domain that sent (and signed) the message.
  • The selector is jun2005.eng.
  • The version is not listed.
  • The signing algorithm is rsa-sha1.
    This is the algorithm used to generate the signature.
  • The canonicalization algorithm(s) for header and body are relaxed/simple.
  • The default query method is DNS.
    This is the method used to look up the key on the signing domain.
  • The length of the canonicalized part of the body that has been signed is not listed.
    The signing domain can generate a key based on the entire body or only some portion of it. That portion would be listed here.
  • The signature timestamp is 1117574938.
    This is when it was signed.
  • The expire time is 1118006938.
    Because an already signed email can be reused to “fake” the signature, signatures are set to expire.
  • The list of signed header fields includes from:to:subject:date.
    This is the list of fields that have been “signed” to verify that they have not been modified.

From: What is DKIM? Everything You Need to Know About Digital Signatures by Geoff Phillips.
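Pulling the b= tag out of a header like the one above takes only standard text tools: unfold the header, split on “;”, and keep the b= value. A minimal sketch using the sample header quoted above:

```shell
# A folded DKIM-Signature header, as it might appear in an .eml file.
printf 'DKIM-Signature: a=rsa-sha1; q=dns;\n d=example.com;\n b=dzdVyOfAKCdLXdJOc9G2q8LoXSlEniSb\n av+yuU4zGeeruD00lszZVoG4ZHRNiYzR\n' > header.txt

tr -d '\n\t ' < header.txt |   # unfold the header, dropping whitespace
  tr ';' '\n' |                # one tag=value pair per line
  sed -n 's/^b=//p'            # keep only the signature itself
```

The output is the signature with its folding removed, ready for the comparison step above.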

Altogether now, to eliminate previously reviewed emails we need only compare:

dzdVyOfAKCdLXdJOc9G2q8LoXSlEniSbav+yuU4zGeeruD00lszZVoG4ZHRNiYzR (example, use digital signatures from Abedin file)

to the digital signatures in the prior-Clinton-review file.

Those that don’t match are new emails to review.

Why the news media hasn’t pressed the FBI on its extremely poor data processing performance is a mystery to me.

You?

PS: FBI field agents with data mining questions, I do off-your-books freelance consulting. Apologies but on-my-books for the tax man. If they don’t tell, neither will I.

9,477 DKIM Verified Clinton/Podesta Emails (of 39,878 total (today))

Monday, October 31st, 2016

Still working on the email graph and at the same time, managed to catch up on the Clinton/Podesta drops by Michael Best, @NatSecGeek, at least for a few hours.

DKIM-verified-podesta-1-24.txt.gz is a sub-set of 9,477 emails that have been verified by their DKIM keys.

The statements in or data attached to those emails may still be false. DKIM verification only validates that an email is the same as when it left the email server, nothing more.

DKIM-complete-podesta-1-24.txt.gz is the full set of Podesta emails to date, some 39,878, with their DKIM results of either True or False.

Both files have these fields:

ID – 1 | Verified – 2 | Date – 3 | From – 4 | To – 5 | Subject – 6 | Message-Id – 7

Question: Have you seen any news reports that mention emails being “verified” in their reporting?

Emails in the complete set may be as accurate as those in the verified set, but I would think verification is newsworthy in and of itself.

You?

Clinton/Podesta Emails 23 and 24, True or False? Cryptographically Speaking

Monday, October 31st, 2016

Catching up on the Clinton/Podesta email releases from Wikileaks, via Michael Best, NatSecGeek. Michael bundles the releases up and posts them at: Podesta emails (zipped).

For anyone coming late to the game, DKIM “verified” means that the DKIM signature on an email is valid for that email.

In lay person’s terms, that email has been proven by cryptography to have originated from a particular mail server and when it left that mail server, it read exactly as it does now, i.e., no changes by Russians or others.

What I have created are files that list the emails in the order they appear at Wikileaks, with the very next field being True or False on the verification issue.

Just because an email has “False” in the second column doesn’t mean it has been modified or falsified by the Russians.

DKIM signatures fail for all manner of reasons but when they pass, you have a guarantee the message is intact as sent.

For your research into these emails:

DKIM-complete-podesta-23.txt.gz

and

DKIM-complete-podesta-24.txt.gz.

For release 24, I did have to remove the DKIM signature on 39256 00010187.eml in order for the script to succeed. That is the only modification I made to either set of files.

Clinton/Podesta Emails – Towards A More Complete Graph (Part 3) New Dump!

Sunday, October 30th, 2016

As you may recall from Clinton/Podesta Emails – Towards A More Complete Graph (Part 2), I didn’t check whether “|” was in use in the extracted emails’ subject lines, so when I tried to create node lists based on “|” as a separator, it failed.

That happens. More than many people are willing to admit.

In the meantime, a new dump of emails has arrived, so I created the new DKIM-incomplete-podesta-1-22.txt.gz file. Which meant picking a new separator for the resulting file.

Advice: Check your proposed separator against the data file before using it. I forgot; you shouldn’t.

My new separator? |/|

Which I checked against the file to make sure there would be no conflicts.

The sed commands to remove < and > are the same as in Part 2.

Sigh, back to failure land, again.

Just as one sample:

awk 'FS="|/|" { print $7}' test.me

where test.me is:

9991 00013434.eml|/|False|/|2015-11-21 17:15:25-05:00|/|Eryn Sepp eryn.sepp@gmail.com|/|John Podesta john.podesta@gmail.com|/|Re: Nov 30 / Future Plans / Etc.!|/|8A6B3E93-DB21-4C0A-A548-DB343BD13A8C@gmail.com

returns:

Future Plans

I also checked that with gawk and nawk, with the same result.

All three are treating the first “/” in field 6 (by my count) as a separator, along with the second “/” in that field. The reason, which escaped me at first: awk interprets FS as an extended regular expression, and an unescaped “|” means alternation, so the pattern “|/|” matches a bare “/”.

To test that theory, what do you think { print $8 } will return?

You’re right!

Etc.!|

So with the “|/|” separator, I’m going to have at least 9 fields, possibly more, depending on how many “/” characters occur in the subject line.

🙁

That’s not going to work.
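Because awk reads FS (and -F) as an extended regular expression, one escape hatch, short of changing separators entirely, is to backslash-escape the pipes so “|/|” is matched literally. A minimal demonstration with a made-up line:

```shell
printf 'one|/|two|/|three/four\n' > fs-demo.txt

# With the pipes escaped, only the literal three-character "|/|"
# separates fields, so the "/" inside field 3 survives.
awk -F'\\|/\\|' '{print NF, $3}' fs-demo.txt   # prints: 3 three/four
```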

OK, so I toss the 10+ MB DKIM-complete-podesta-1-22.txt.gz into Emacs, whose regex treatment I trust, and change “|/|” to “@@@@@” and save that file as DKIM-complete-podesta-1-22-03.txt.

Another sanity check, which got us into all this trouble last time:

awk 'FS="@@@@@" { print $7}' podesta-1-22-03.txt | grep @ | wc -l

returns 36504, which plus the 16 files I culled as failures, equals 36520, the number of files in the Podesta 1-22 release.

Recall that all message-ids contain an @ sign, so arriving at the correct count of files gives us confidence the file is ready for further processing.

Apologies for it taking this much prose to go so little a distance.

Our fields (numbered for reference) are:

ID – 1 | Verified – 2 | Date – 3 | From – 4 | To – 5 | Subject – 6 | Message-Id – 7

Our first node for the node list (Clinton/Podesta Emails – Towards A More Complete Graph (Part 1)) was to capture the emails themselves.

Using Message-Id (field 7) as the identifier and Subject (field 6) as its label.

We are about to encounter another problem but let’s walk through it.

An example of what we are expecting:

CAC9z1zL9vdT+9FN7ea96r+Jjf2=gy1+821u_g6VsVjr8U2eLEg
@mail.gmail.com;”Knox Knotes”;
CAKM1B-9+LQBXr7dgE0pKke7YhQC2dZ2akkgmSbRFGHUx-0NNPg
@mail.gmail.com;”Re: Tomorrow”;

We have the Message-Id with a closing “;”, followed by the Subject, surrounded in double quote marks and also terminated by a “;”.

FYI: Mixing single and double quotes in awk is a real pain. I struggled with it but then was reminded I can declare variables:

-v dq='"'

which allows me to do this:

awk -v dq='"' 'BEGIN{FS="@@@@@"} { print $7 ";" dq $6 dq ";"}' podesta-1-22-03.txt

The awk variable trick will save you considerable puzzling over escape sequences and the like.

Ah, now we are to the problem I mentioned above.

In the part 1 post I mentioned that while:

CAC9z1zL9vdT+9FN7ea96r+Jjf2=gy1+821u_g6VsVjr8U2eLEg
@mail.gmail.com;”Knox Knotes”;
CAKM1B-9+LQBXr7dgE0pKke7YhQC2dZ2akkgmSbRFGHUx-0NNPg@mail.gmail.com;”Re: Tomorrow”;

works,

but having:

CAC9z1zL9vdT+9FN7ea96r+Jjf2=gy1+821u_g6VsVjr8U2eLEg
@mail.gmail.com;”Knox Knotes”;https://wikileaks.org/podesta-emails/emailid/9998;
CAKM1B-9+LQBXr7dgE0pKke7YhQC2dZ2akkgmSbRFGHUx-0NNPg@mail.gmail.com;”Re: Tomorrow”;https://wikileaks.org/podesta-emails/emailid/9999;

with Wikileaks links is more convenient for readers.

As you may recall, the last two lines read:

9998 00022160.eml@@@@@False@@@@@2015-06-23 23:01:55-05:00@@@@@Jerome Tatar jerry@TatarLawFirm.com@@@@@Jerome Tatar Jerome jerry@tatarlawfirm.com@@@@@Knox Knotes@@@@@CAC9z1zL9vdT+9FN7ea96r
+Jjf2=gy1+821u_g6VsVjr8U2eLEg@mail.gmail.com
9999 00013746.eml@@@@@False@@@@@2015-04-03 01:14:56-04:00@@@@@Eryn Sepp eryn.sepp@gmail.com@@@@@John Podesta john.podesta@gmail.com@@@@@Re: Tomorrow@@@@@CAKM1B-9+LQBXr7dgE0pKke7YhQC2dZ2akkgmSbRFGHUx-0NNPg@mail.gmail.com

Which means in addition to printing Message-Id and Subject as fields one and two, we need to split ID on the space and use the result to create the URL back to Wikileaks.
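That split-and-rebuild can happen inside the same awk command, using awk’s split() on the ID field. A minimal sketch against one sample record (with shortened, illustrative email addresses; note that BEGIN sets FS before the first record is read):

```shell
# One sample record in the "@@@@@"-separated layout.
printf '9999 00013746.eml@@@@@False@@@@@2015-04-03@@@@@Eryn Sepp e@x.com@@@@@John Podesta j@y.com@@@@@Re: Tomorrow@@@@@MSGID@mail.gmail.com\n' > sample.txt

awk -v dq='"' 'BEGIN{FS="@@@@@"} {
  split($1, id, " ")   # id[1] is the Wikileaks email number
  print $7 ";" dq $6 dq ";https://wikileaks.org/podesta-emails/emailid/" id[1] ";"
}' sample.txt
```

For the sample record, this emits the message-id, the quoted subject, and the Wikileaks link as one node line.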

It’s late so I am going to leave you with DKIM-incomplete-podesta-1-22.txt.gz. This is complete save for 16 files that failed to parse. Will repost tomorrow with those included.

I have the first node file script working and that will form the basis for the creation of the edge lists.

PS: Look forward to running awk files tomorrow. It makes a number of things easier.

Clinton/Podesta Emails – 1-22 – Progress Report

Saturday, October 29th, 2016

Michael Best has posted a bulk download of the Clinton/Podesta Emails, 1-22.

Thinking (foolishly) that I could quickly pick up from where I left off yesterday, Clinton/Podesta Emails – Towards A More Complete Graph (Part 2), I grabbed the latest bulk download and tossed the verification script against it.

In addition to the usual funnies of having to repair known defective files, again, I also underestimated how long verification of DKIM signatures takes on 36,000+ emails. Even on a fast box with plenty of memory.

At this point I have the latest release DKIM signatures parsed, but there are several files that fail for no discernible reason.

I’m going to have another go at it in the morning and should have part 3 of the graph work up tomorrow.

Apologies for the delay!

Clinton/Podesta Emails – Towards A More Complete Graph (Part 2)

Friday, October 28th, 2016

I assume you are starting with DKIM-complete-podesta-1-18.txt.gz.

If you are starting with another source, you will need different instructions. 😉

First, remember from Clinton/Podesta Emails – Towards A More Complete Graph (Part 1) that I wanted to delete all the < and > signs from the text.

That’s easy enough (uncompress the text first):

sed 's/<//g' DKIM-complete-podesta-1-18.txt > DKIM-complete-podesta-1-18-01.txt

followed by:

sed 's/>//g' DKIM-complete-podesta-1-18-01.txt > DKIM-complete-podesta-1-18-02.txt
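Both passes can also be collapsed into a single sed invocation with a character class; a minimal demonstration against one made-up line:

```shell
printf 'Joshua Dorner <jdorner@americanprogress.org>\n' > sample.txt
# [<>] matches either angle bracket, deleting both in one pass.
sed 's/[<>]//g' sample.txt   # prints: Joshua Dorner jdorner@americanprogress.org
```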

Here’s where we started:

001 00032251.eml|False|2010-10-06 18:29:52-04:00|Joshua Dorner <jdorner@americanprogress.org>|”‘bigcampaign@googlegroups.com'” <bigcampaign@googlegroups.com>|[big campaign] Follow-up Materials from Background Briefing on the Chamber’s Foreign Funding, fyi|<A28459BA2B4D5D49BED0238513058A7F012ADC1EF58F @CAPMAILBOX.americanprogresscenter.org>
002 00032146.eml|True|2015-04-14 18:19:46-04:00|Josh Schwerin <jschwerin@hillaryclinton.com>|hrcrapid <HRCrapid@googlegroups.com>|=?UTF-8?Q?NYT=3A_Hillary_Clinton=E2=80=99s_Chipotle_Order=3A_Above_Avera?= =?UTF-8?Q?ge?=|<CAPrY+5KJ=NG+Vs-khDVpe-L=bP5=qvPcZTS5FDam5LixueQsKA@mail.gmail.com>

Here’s the result after the first two sed scripts:

001 00032251.eml|False|2010-10-06 18:29:52-04:00|Joshua Dorner jdorner@americanprogress.org|”‘bigcampaign@googlegroups.com'” bigcampaign@googlegroups.com|[big campaign] Follow-up Materials from Background Briefing on the Chamber’s Foreign Funding, fyi|
A28459BA2B4D5D49BED0238513058A7F012ADC1EF58F
@CAPMAILBOX.americanprogresscenter.org
002 00032146.eml|True|2015-04-14 18:19:46-04:00|Josh Schwerin jschwerin@hillaryclinton.com|hrcrapid HRCrapid@googlegroups.com|=?UTF-8?Q?NYT=3A_Hillary_Clinton=E2=80=99s_Chipotle_Order=3A_Above_Avera?= =?UTF-8?Q?ge?=|CAPrY+5KJ=NG+Vs-khDVpe-L=bP5=qvPcZTS5FDam5LixueQsKA@mail.gmail.com

BTW, I increment the numbers of my result files, DKIM-complete-podesta-1-18-01.txt, DKIM-complete-podesta-1-18-02.txt, because when I don't, I end up running different sed commands against the same original file while expecting a cumulative result.

That spells disappointment and wasted effort hunting for problems that aren't problems. Number your result files.

The nodes and edges mentioned in Clinton/Podesta Emails – Towards A More Complete Graph (Part 1):

Nodes

  • Emails, message-id is ID and subject is label, make Wikileaks id into link
  • From/To, email addresses are ID and name is label
  • True/False, true/false as ID, True/False as labels
  • Date, truncated to 2015-07-24 (example), date as id and label

Edges

  • To/From – Edges with message-id (source) email-address (target) to/from (label)
  • Verify – Edges with message-id (source) true/false (target) verify (label)
  • Date – Edges with message-id (source) – date (target) date (label)

Am I missing anything? The longer I look at problems like this the more likely my thinking/modeling will change.

What follows is very crude, command line creation of the node and edge files. Something more elaborate could be scripted/written in any number of languages.
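As a sketch of that more elaborate route, here is one record in, node and edge rows out. The field order follows the delimited file; the helper name is mine:

```python
# Sketch: node and edge rows for one "|"-delimited record.
# Field order: id|verified|date|from|to|subject|message-id
def record_to_graph(line):
    fields = line.rstrip('\n').split('|')
    _, verified, date, _, _, subject, msg_id = fields[:7]
    day = date[:10]                       # truncate to YYYY-MM-DD
    nodes = [(msg_id, subject),           # email: message-id as ID, subject as label
             (verified.lower(), verified),
             (day, day)]
    edges = [(msg_id, verified.lower(), 'verify'),
             (msg_id, day, 'date')]
    return nodes, edges

nodes, edges = record_to_graph(
    '00045326.eml|True|2015-07-24 12:58:16-04:00|Oren Shur|John Podesta|'
    'FW: July IA Poll Results|168005c7@mail.gmail.com')
assert ('168005c7@mail.gmail.com', 'FW: July IA Poll Results') in nodes
assert ('168005c7@mail.gmail.com', '2015-07-24', 'date') in edges
```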

Our fields (numbered for reference) are:

ID – 1 | Verified – 2 | Date – 3 | From – 4 | To – 5 | Subject – 6 | Message-Id – 7

You don’t have to take my word for it, try this:

awk -F'|' '{ print $7 }' DKIM-complete-podesta-1-18-02.txt

The output prints to the console. Those all look like message-ids to me, well, with the exception of the one that reads ” 24 September.”

How much dirty data do you think is in the message-id field?

A crude starting assumption is that any message-id field without the “@” character is dirty.

Let’s try:

awk -F'|' '{ print $7 }' DKIM-complete-podesta-1-18-02.txt | grep -v @ | wc -l

Which means we extract the 7th field, search (grep) those results for the "@" sign (the -v switch prints only lines that DO NOT match), and count the surviving lines with wc -l.

Ready? Press return.

I get 594 “dirty” message-ids.
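If awk isn't handy, the same count is a short Python sketch over the delimited lines:

```python
# Count "dirty" message-id fields (no "@"), like awk | grep -v @ | wc -l.
def count_dirty(lines):
    return sum(1 for line in lines
               if '@' not in line.split('|')[6])

sample = ['a.eml|False|2010|from|to|subject|id@mail.gmail.com',
          'b.eml|True|2015|from|to|subject|Rolling Stone']
assert count_dirty(sample) == 1
```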

Here is a sampling:


Rolling Stone
TheHill
MSNBC, Jeff Weaver interview on Sanders health care plan and his Wall St. ad
Texas Tribune
20160120 Burlington, IA Organizing Event
start spreadin’ the news…..
Building Trades Union (Keystone XL)
Rubio Hits HRC On Cuba, Russia
2.19.2016
Sourcing Story
Drivers Licenses
Day 1
H1N1 Flu Shot Available

Those look an awful lot like subject lines to me. You?

I suspect those subject lines contained the separator character "|" before we extracted the data from the .eml files.

I’ve tried to repair the existing files but the cleaner solution is to return to the extraction script and the original email files.

More on that tomorrow!

Clinton/Podesta Emails – Towards A More Complete Graph (Part 1)

Friday, October 28th, 2016

Gephi is a great tool, but it’s only as good as its input.

The Gephi 8.2 email importer (missing in Gephi 9.*) is lossy, informationally speaking, as I have mentioned before.

Here’s a sample from the verification results on podesta-release-1-18:

9981 00045326.eml|False|2015-07-24 12:58:16-04:00|Oren Shur |John Podesta , Robby Mook , Joel Benenson , “Margolis, Jim” , Mandy Grunwald , David Binder , Teddy Goff , Jennifer Palmieri , Kristina Schake , Christina Reynolds , Katie Connolly , “Kaye, Anson” , Peter Brodnitz , “Rimel, John” , David Dixon , Rich Davis , Marlon Marshall , Michael Halle , Matt Paul , Elan Kriegel , Jake Sullivan |FW: July IA Poll Results|<168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com>

The Gephi 8.2 mail importer fails to create a node representing an email message.

I propose we cure that failure by taking the last field, here:

168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com

and the next to last field:

FW: July IA Poll Results

and putting them as id and label, respectively in a node list:

168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; “FW: July IA Poll Results”;

As part of the transformation, we need to remove the < and > signs around the message ID, then add a ; to mark the end of the ID field, put double quotes (" ") around the subject to use it as a label, and close the second field with another ;.

While we are talking about nodes, all the email addresses change from:

Oren Shur <oshur@hillaryclinton.com>

to:

oshur@hillaryclinton.com; “Oren Shur”;

which are ID and label of the node list, respectively.

I could remove the < and > characters as part of the extraction script but will use sed at the command line instead.

Reminder: Always work on a copy of your data, never the original.

Then we need to create an edge list, one that represents the relationships between the email (as node) to the sender and receivers of the email (also nodes). For this first iteration, I’m going to use labels on the edges to distinguish between senders and receivers.

Assuming my first row of the edges file reads:

Source; Target; Role (I did not use “Type” because I suspect that is a controlled term for Gephi.)

Then the first few edges would read:

168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; oshur@hillaryclinton.com; from;
168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; john.podesta@gmail.com; to;
168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; re47@hillaryclinton.com; to;
168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; jbenenson@bsgco.com; to;
168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; Jim.Margolis@gmmb.com; to;
….
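Emitting those rows takes only a few lines. A Python sketch, with the addresses assumed already extracted from the From:/To: headers:

```python
# Sketch: from/to edge rows for one email; addresses assumed already extracted.
def email_edges(msg_id, sender, recipients):
    edges = [(msg_id, sender, 'from')]
    edges += [(msg_id, addr, 'to') for addr in recipients]
    return edges

edges = email_edges('168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com',
                    'oshur@hillaryclinton.com',
                    ['john.podesta@gmail.com', 're47@hillaryclinton.com'])
assert edges[0][2] == 'from'
assert len(edges) == 3
```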

As you can see, this is going to be a “busy” graph! 😉

Filtering is going to play an important role in exploring this graph, so let’s add nodes that will help with that process.

I propose we add to the node list:

true; True
false; False

as id and labels.

Which means for the edge list we can have:

168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; true; verify;

Do you have an opinion on the order, source/target for true/false?

Thinking this will enable us to filter nodes that have not been verified or to include only those that have failed verification.

For experimental purposes, I think we need to rework the date field:

2015-07-24 12:58:16-04:00

I would truncate that to:

2015-07-24

and add such truncated dates to the node list:

2015-07-24; 2015-07-24;

as ID and label, respectively.

Then for the edge list:

168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; 2015-07-24; date;

Reasoning that we can filter to include/exclude nodes based on dates, which if you add enough email boxes, could help visualize the reaction to and propagation of emails.

Even assuming named participants in these emails have “deleted” their inboxes, there are always automatic backups. It’s just a question of persistence before the rest of this network can be fleshed out.

Oh, one last thing. You have probably noticed the Wikileaks "ID" that forms part of the filename?

9981 00045326.eml

The first part forms the end of a URL to link to the original post at Wikileaks.

Thus, in this example, 9981 becomes:

https://wikileaks.org/podesta-emails/emailid/9981

The general form being:

https://wikileaks.org/podesta-emails/emailid/(Wikileaks-ID)
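That mapping is mechanical; a Python sketch:

```python
def wikileaks_url(filename):
    # "9981 00045326.eml" -> "https://wikileaks.org/podesta-emails/emailid/9981"
    wl_id = filename.split()[0]
    return 'https://wikileaks.org/podesta-emails/emailid/' + wl_id

assert wikileaks_url('9981 00045326.eml') == 'https://wikileaks.org/podesta-emails/emailid/9981'
```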

For the convenience of readers/users, I want to modify my earlier proposal for the email node list entry from:

168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; “FW: July IA Poll Results”;

to:

168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; “FW: July IA Poll Results”; https://wikileaks.org/podesta-emails/emailid/9981;

Where the third field is “link.”

I am glossing over lots of relationships and subjects, but I'm not reluctant to throw it all away and start over.

Your investment in a model isn’t lost by tossing the model, you learn something with every model you build.

Scripting underway, a post on that experience and the node/edge lists to follow later today.

Podesta/Clinton Emails: Filtering by Email Address (Pimping Bill Clinton)

Thursday, October 27th, 2016

The Bill Clinton, Inc. story reminds me of:

Although I steadfastly resist imagining either Bill or Hillary in that video. Just won’t go there!

Where a graph display can make a difference is that instead of just the one email/memo from Bill’s pimp, we can rapidly survey all of the emails in which he appears, in any role.

[image: import-spigot-email-filter-460]

I ran that on Gephi 8.2 against podesta-release-1-18 but the results were:

Nodes 0, Edges 0.

Hmmm, there is something missing, possibly the CSV file?

I checked and podesta-release-1-18 has 393 emails where doug@presidentclinton.com appears.

I could try to find the “right” way to filter on email addresses, but for now, let’s take a dirty short-cut.
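The short-cut amounts to substring matching over the raw files. A Python sketch, with file contents shown inline for illustration:

```python
# Sketch: which emails mention an address? files maps name -> raw text.
def matching_files(files, needle):
    return sorted(name for name, text in files.items() if needle in text)

files = {'9981 00045326.eml': 'To: doug@presidentclinton.com',
         '0002 00032146.eml': 'To: hrcrapid@googlegroups.com'}
assert matching_files(files, 'doug@presidentclinton.com') == ['9981 00045326.eml']
```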

I created a directory to hold all emails with doug@presidentclinton.com and ran into all manner of difficulties because the file names are plagued with spaces!

So much so that I unintentionally (read “by mistake”) saved all the doug@presidentclinton.com posts from podesta-release-1-18 to a different folder than the ones from podesta-release-19.

🙁

Well, but there is a happy outcome and an illustration of yet another Gephi capability.

I built the first graph from the doug@presidentclinton.com posts from podesta-release-1-18 and then, with that graph open, imported the doug@presidentclinton.com posts from podesta-release-19 and appended those results to the open graph.

How cool is that!

Imagine doing that across data sets, assuming you paid close attention to identifiers, etc.

Sorry, back to the graphs, here is the random layout once the graphs were combined:

[image: doug-default-460]

Applying the Yifan Hu network layout:

[image: doug-network-yifan-hu-460]

I ran network statistics on network diameter and applied colors based on betweenness:

[image: doug-network-centrality-460]

And finally, adjusted the font and turned on the labels:

[image: doub-network-labels-460]

I have spent a fair amount of time just moving stuff about but imagine if you could interactively explore the emails, creating and trashing networks based on to:, from:, cc:, dates, content, etc.

The limits of Gephi imports were a major source of pain today.

I’m dodging those tomorrow in favor of creating node and adjacency tables with scripts.

PS: Don’t John Podesta and Doug Band look like two pimps in a pod? 😉

PPS: If you haven’t read the pimping Bill Clinton memo, you should. (I think it has some other official title.)

Clinton/Podesta 19, DKIM-verified-podesta-19.txt.gz, DKIM-complete-podesta-19.txt.gz

Wednesday, October 26th, 2016

Michael Best, @NatSecGeek, posted release 19 of the Clinton/Podesta emails at: https://archive.org/details/PodestaEmailszipped today.

A total of 1518 emails, zero (0) of which broke my script!

Three hundred and sixty-three were DKIM verified! DKIM-verified-podesta-19.txt.gz.

The full set of emails, verified and not: DKIM-complete-podesta-19.txt.gz.

I’m still pondering how to best organize the DKIM verified material for access.

I could segregate “verified” emails for indexing, so any “hits” from those searches would be from “verified” emails.

Ditto for indexing only attachments of “verified” emails.

What about a graph constructed solely from “verified” emails?

Or should I make verified a property of the emails as nodes? Reasoning that aside from exploring the email importation in Gephi 8.2, it would not be that much more difficult to build node and adjacency lists from the raw emails.

Thoughts/suggestions?

Serious request for help.

Like Gollum, I know what I keep in my pockets, but I have no idea what other people keep in theirs.

What would make this data useful to you?

No Frills Gephi (8.2) Import of Clinton/Podesta Emails (1-18)

Wednesday, October 26th, 2016

Using Gephi 8.2, you can create graphs of the Clinton/Podesta emails based on terms in subject lines or the body of the emails. You can interactively work with all 30K+ (as of today) emails and extract networks based on terms in the posts. No programming required. (Networks based on terms will appear tomorrow.)

If you have Gephi 8.2 (I can’t find the import spigot in 9.0 or 9.1), you can import the Clinton/Podesta Emails (1-18) for analysis as a network.

To save you the trouble of regressing to Gephi 8.2, I performed a no frills/default import and exported that file as podesta-1-18-network.gephi.gz.

Download and uncompress podesta-1-18-network.gephi.gz, then you can pick up at timemark 3.49.

Open the file (your location may differ):

[image: gephi-podesta-open-460]

Obligatory hair-ball graph visualization. 😉

[image: gephi-first-look-460]

Considerably less appealing than Jennifer Golbeck’s, but be patient!

First step, Layout -> Yifan Hu. My results:

[image: yifan-hu-first-460]

Second step, Network Diameter statistics (right side, run).

No visible impact on the graph, but now you can change the color and size of nodes in the graph. That is, they have attributes on which you can base the assignment of color and size.

Tutorial gotcha: Not one of Jennifer’s tutorials, but I was watching a Gephi tutorial that skipped the part about running statistics on the graph prior to assigning color and size. Or I just didn’t hear it. The menu options appear in the documentation, but you can’t access them unless and until you run network statistics or otherwise have attributes for assigning color and size. Run statistics first!

Next, assign colors based on betweenness centrality:

[image: gephi-betweenness-460]

The densest node is John Podesta, but if you remove his node, rerun the network statistics and re-layout the graph, here is part of what results:

[image: delete-central-node-460]

A no frills import of 31,819 emails results in a graph of 3,235 nodes and 11,831 edges.

That’s because nodes combine (merge, to you topic map readers) when they have the same identifier, and edges combine when they run between the same nodes.

Subject to correction, when that combining/merging occurs, the properties on the respective nodes/edges are accumulated.
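Subject to the same correction, that accumulation can be sketched as:

```python
# Sketch: merge node rows that share an id, accumulating their properties.
def merge_nodes(rows):
    merged = {}
    for node_id, props in rows:
        merged.setdefault(node_id, []).extend(props)
    return merged

rows = [('jdorner@americanprogress.org', ['label:Joshua Dorner']),
        ('jdorner@americanprogress.org', ['seen:2010-10-06'])]
assert merge_nodes(rows)['jdorner@americanprogress.org'] == \
    ['label:Joshua Dorner', 'seen:2010-10-06']
```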

Topic mappers already realize there are important subjects missing, some 31,819 of them. That is, the emails themselves don’t by default appear as nodes in the network.

Ian Robinson, Jim Webber & Emil Eifrem illustrate this lossy modeling in Graph Databases this way:

[image: graph-databases-lossy-460]

Modeling emails without the emails is rather lossy. 😉

Other nodes/subjects we might want:

  • Multiple to: emails – Is who was also addressed important?
  • Multiple cc: emails – Same question as with to:.
  • Date sent as properties? So evolution of network/emails can be modeled.
  • Capture “reply-to” for relationships between emails?

Other modeling concerns?

Bear in mind that we can suppress a large amount of the detail so you can interactively explore the graph and only zoom into/display data after finding interesting patterns.

Some helpful links:

https://archive.org/details/PodestaEmailszipped: The email collection as bulk download, thanks to Michael Best, @NatSecGeek.

https://github.com/gephi/gephi/releases: Where you can grab a copy of Gephi 8.2.

Clinton/Podesta 1-18, DKIM-verified-podesta-1-18.txt.zip, DKIM-complete-podesta-1-18.txt.zip

Tuesday, October 25th, 2016

After a long day of waiting for scripts to finish and re-running them to cross-check the results, I am happy to present:

DKIM-verified-podesta-1-18.txt.gz, which consists of the Podesta emails (7526) which returned true for a test of their DKIM signature.

The complete set of the results for all 31,819 emails, can be found in:

DKIM-complete-podesta-1-18.txt.gz.

An email that has been “verified” carries a cryptographic guarantee that it was sent exactly as it appears to you now.

An email that fails verification may be just as trustworthy, but its DKIM signature has failed for any number of reasons.

One of my motivations for classifying these emails is to enable the exploration of why DKIM verification failed on some of these emails.

Question: What would make this data more useful/accessible to journalists/bloggers?

I ask because dumping data and/or transformations of data can be useful, but it is synthesizing data into a coherent narrative that is the essence of journalism/reporting.

I would enjoy doing the first in hopes of furthering the second.

PS: More emails will be added to this data set as they become available.

Corrupt (fails with my script) files in Clinton/Podesta Emails (14 files out of 31,819)

Tuesday, October 25th, 2016

You may use some other definition of “file corruption” but that’s mine and I’m sticking to it.

😉

The following are all the files that failed against my script and the actions I took to proceed with parsing the files. Not today but I will make a sed script to correct these files as future accumulations of emails appear.

13544 00047141.eml

Date string parse failed:

Date: Wed, 17 Dec 2008 12:35:42 -0700 (GMT-07:00)

Deleted (GMT-07:00).

15431 00059196.eml

Date string parse failed:

Date: Tue, 22 Sep 2015 06:00:43 +0800 (GMT+08:00)

Deleted (GMT+08:00).

155 00049680.eml

Date string parse failed:

Date: Mon, 27 Jul 2015 03:29:35 +0000

Assuming, as the email reports, info@centerpeace.org was the sender and podesta@law.georgetown.edu was the intended receiver, then the offset from UT is clearly wrong (+0000).

Deleted +0000.

6793 00059195.eml

Date string parse fail:

Date: Tue, 22 Sep 2015 05:57:54 +0800 (GMT+08:00)

Deleted (GMT+08:00).
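The recurring (GMT±HH:MM) repair is mechanical. A Python sketch of my assumption about what the promised sed script would do:

```python
import re

def fix_date(value):
    # Drop a trailing "(GMT±HH:MM)" annotation that breaks date parsing.
    return re.sub(r'\s*\(GMT[+-]\d{2}:\d{2}\)\s*$', '', value)

assert fix_date('Wed, 17 Dec 2008 12:35:42 -0700 (GMT-07:00)') == \
    'Wed, 17 Dec 2008 12:35:42 -0700'
```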

9404 0015843.eml DKIM failure

All of the DKIM parse failures take the form:

Traceback (most recent call last):
  File "test-clinton-script-24Oct2016.py", line 18, in <module>
    verified = dkim.verify(data)
  File "/usr/lib/python2.7/dist-packages/dkim/__init__.py", line 604, in verify
    return d.verify(dnsfunc=dnsfunc)
  File "/usr/lib/python2.7/dist-packages/dkim/__init__.py", line 506, in verify
    validate_signature_fields(sig)
  File "/usr/lib/python2.7/dist-packages/dkim/__init__.py", line 181, in validate_signature_fields
    if int(sig[b'x']) < int(sig[b't']):
KeyError: 't'

I simply deleted the DKIM-Signature in question. Will go down that rabbit hole another day.

21960 00015764.eml

DKIM signature parse failure.

Deleted DKIM signature.

23177 00015850.eml

DKIM signature parse failure.

Deleted DKIM signature.

23728 00052706.eml

Invalid character in RFC822 header.

I discovered an errant ‘”‘ (double quote mark) at the start of a line.

Deleted the double quote mark.

And deleted ^M line endings.

25040 00015842.eml

DKIM signature parse failure.

Deleted DKIM signature.

26835 00015848.eml

DKIM signature parse failure.

Deleted DKIM signature.

28237 00015840.eml

DKIM signature parse failure.

Deleted DKIM signature.

29052 0001587.eml

DKIM signature parse failure.

Deleted DKIM signature.

29099 00015759.eml

DKIM signature parse failure.

Deleted DKIM signature.

29593 00015851.eml

DKIM signature parse failure.

Deleted DKIM signature.

Here’s an odd pattern for you: all nine (9) of the DKIM signature parse failures were on mail originating from:

From: Gene Karpinski

But there are approximately thirty-three (33) emails from Karpinski so it doesn’t fail every time.

The file numbers are based on the 1-18 distribution of Podesta emails created by Michael Best, @NatSecGeek, at: Podesta Emails (zipped).

Finding “unknown string format” in 1.7 GB of files – Parsing Clinton/Podesta Emails

Tuesday, October 25th, 2016

Testing my “dirty” script against Podesta Emails (1.7 GB), some 17,296 files, I got the following message:

Traceback (most recent call last):
  File "test-clinton-script-24Oct2016.py", line 20, in <module>
    date = dateutil.parser.parse(msg['date'])
  File "/usr/lib/python2.7/dist-packages/dateutil/parser.py", line 697, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/usr/lib/python2.7/dist-packages/dateutil/parser.py", line 303, in parse
    raise ValueError, "unknown string format"
ValueError: unknown string format

Now I have to find the file that broke the script.

Beginning Python programmers are laughing at this point because they know using:

for name in glob.glob('*.eml'):

is going to make finding the offending file difficult.

Why?

Consulting the programming oracle (Stack Overflow) on ordering of glob.glob in Python I learned:

By checking the source code of glob.glob you see that it internally calls os.listdir, described here:

http://docs.python.org/library/os.html?highlight=os.listdir#os.listdir

Key sentence: os.listdir(path) Return a list containing the names of the entries in the directory given by path. The list is in arbitrary order. It does not include the special entries ‘.’ and ‘..’ even if they are present in the directory.

Arbitrary order. 🙂

Interesting but not quite an actionable answer!

But take a look at this:

Order is arbitrary, but you can sort them yourself

If you want sorted by name:

sorted(glob.glob('*.png'))

sorted by modification time:

import os
sorted(glob.glob('*.png'), key=os.path.getmtime)

sorted by size:

import os
sorted(glob.glob('*.png'), key=os.path.getsize)

etc.

So for ease in finding the offending file(s) I adjusted:

for name in glob.glob('*.eml'):

to:

for name in sorted(glob.glob('*.eml')):

Now I can tail the results file in question and the next file is where the script failed.
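Another option is to trap the failure and report the offending files directly. A sketch, with a stand-in parser so the idea is testable:

```python
# Sketch: collect files whose date header fails to parse, instead of crashing.
def find_bad_files(names, parse):
    bad = []
    for name in sorted(names):
        try:
            parse(name)
        except ValueError:
            bad.append(name)
    return bad

def fake_parse(name):
    # Stand-in for dateutil.parser.parse on the file's Date: header.
    if name == 'b.eml':
        raise ValueError('unknown string format')

assert find_bad_files(['b.eml', 'a.eml'], fake_parse) == ['b.eml']
```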

More on the files that failed in a separate post.

Clinton/Podesta Emails, Dirty Data, Dirty Script For Testing

Monday, October 24th, 2016

Despite Michael Best’s (@NatSecGeek) efforts at collecting the Podesta emails for convenient bulk download, Podesta Emails Zipped, the bulk downloads don’t appear to have attracted much attention: some 276 views as of today.

Many of us deeply appreciate Michael’s efforts and would like to see the press and others taking fuller advantage of this remarkable resource.

To encourage you in that direction, what follows is a very dirty script for testing the DKIM signatures in the emails and extracting data from the emails for writing to a “|” delimited file.

#!/usr/bin/python

import dateutil.parser
import email
import dkim
import glob

output = open("verify.txt", 'w')

output.write("id|verified|date|from|to|subject|message-id \n")

for name in glob.glob('*.eml'):
    filename = name
    f = open(filename, 'r')
    data = f.read()
    msg = email.message_from_string(data)

    verified = dkim.verify(data)

    date = dateutil.parser.parse(msg['date'])

    msg_from = msg['from']
    msg_from1 = " ".join(msg_from.split())
    msg_to = str(msg['to'])
    msg_to1 = " ".join(msg_to.split())
    msg_subject = str(msg['subject'])
    msg_subject1 = " ".join(msg_subject.split())
    msg_message_id = msg['message-id']

    output.write(filename + '|' + str(verified) + '|' + str(date) +
                 '|' + msg_from1 + '|' + msg_to1 + '|' + msg_subject1 +
                 '|' + str(msg_message_id) + "\n")

output.close()

Download podesta-test.tar.gz, unpack that to a directory and then save/unpack test-clinton-script-24Oct2016.py.gz to the same directory, then:

python test-clinton-script-24Oct2016.py

Import that into Gnumeric and with some formatting, your content should look like: test-clinton-24Oct2016.gnumeric.gz.

Verifying cryptographic signatures takes a moment, even on this sample of 754 files so don’t be impatient.

This script leaves much to be desired and as you can see, the results aren’t perfect by any means.
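One improvement, suggested by the parse failures described in the posts above, is trapping per-message exceptions so one malformed email doesn’t kill the whole run. A sketch, with the verifier passed in so the idea is testable without the dkim library:

```python
# Sketch: trap per-message failures so one malformed email doesn't stop the run.
def safe_verify(data, verify):
    try:
        return bool(verify(data))
    except Exception:  # e.g. KeyError: 't' from a malformed DKIM-Signature
        return False

def boom(data):
    # Stand-in for dkim.verify raising on a broken signature header.
    raise KeyError('t')

assert safe_verify(b'raw message', lambda d: True) is True
assert safe_verify(b'raw message', boom) is False
```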

Comments and/or suggestions welcome!

This is just the first step in extracting information from this data set that could be used with similar data sets.

For example, if you want to graph this data, how are you going to construct IDs for the nodes, given the repetition of some nodes in the data set?

How are you going to model those relationships?

Bonus question: Is this output clean enough to run the script on the full data set? Which is increasing on a daily basis?

Validating Wikileaks Emails [Just The Facts]

Saturday, October 22nd, 2016

A factual basis for reporting on alleged “doctored” or “falsified” emails from Wikileaks has emerged.

Now to see if the organizations and individuals responsible for repeating those allegations, some 260,000 times, will put their doubts to the test.

You know where my money is riding.

If you want to verify the Podesta emails or other email leaks from Wikileaks, consult the following resources.

Yes, we can validate the Wikileaks emails by Robert Graham.

From the post:

Recently, WikiLeaks has released emails from Democrats. Many have repeatedly claimed that some of these emails are fake or have been modified, that there’s no way to validate each and every one of them as being true. Actually, there is, using a mechanism called DKIM.

DKIM is a system designed to stop spam. It works by verifying the sender of the email. Moreover, as a side effect, it verifies that the email has not been altered.

Hillary’s team uses “hillaryclinton.com”, which has DKIM enabled. Thus, we can verify whether some of these emails are true.

Recently, in response to a leaked email suggesting Donna Brazile gave Hillary’s team early access to debate questions, she defended herself by suggesting the email had been “doctored” or “falsified”. That’s not true. We can use DKIM to verify it.

Bob walks you through validating a raw email from Wikileaks with the DKIM verifier plugin for Thunderbird, and demonstrates that the same process can detect “doctored” or “falsified” emails.

Bob concludes:

I was just listening to ABC News about this story. It repeated Democrat talking points that the WikiLeaks emails weren’t validated. That’s a lie. This email in particular has been validated. I just did it, and shown you how you can validate it, too.

Btw, if you can forge an email that validates correctly as I’ve shown, I’ll give you 1-bitcoin. It’s the easiest way of solving arguments whether this really validates the email — if somebody tells you this blogpost is invalid, then tell them they can earn about $600 (current value of BTC) proving it. Otherwise, no.

BTW, Bob also points to:

Here’s Cryptographic Proof That Donna Brazile Is Wrong, WikiLeaks Emails Are Real by Luke Rosiak, which includes this Python code to verify the emails:

[image: clinton-python-email-460]

and,

Verifying Wikileaks DKIM-Signatures by teknotus, offers this manual approach for testing the signatures:

[image: clinton-sig-check-460]

But those are all one-off methods and there are thousands of emails.

But the post by teknotus goes on:

Preliminary results

I only got signature validation on some of the emails I tested initially but this doesn’t necessarily invalidate them as invisible changes to make them display correctly on different machines done automatically by browsers could be enough to break the signatures. Not all messages are signed. Etc. Many of the messages that failed were stuff like advertising where nobody would have incentive to break the signatures, so I think I can safely assume my test isn’t perfect. I decided at this point to try to validate as many messages as I could so that people researching these emails have any reference point to start from. Rather than download messages from wikileaks one at a time I found someone had already done that for the Podesta emails, and uploaded zip files to Archive.org.

Emails 1-4160
Emails 4161-5360
Emails 5361-7241
Emails 7242-9077
Emails 9078-11107

It only took me about 5 minutes to download all of them. Writing a script to test all of them was pretty straightforward. The program dkimverify just calls a python function to test a message. The tricky part is providing context, and making the results easy to search.

Automated testing of thousands of messages

It’s up on Github

Its main output is a spreadsheet with test results, and some metadata from the message being tested. Results Spreadsheet 1.5 Megs

It has some significant bugs at the moment. For example Unicode isn’t properly converted, and spreadsheet programs think the Unicode bits are formulas. I also had to trap a bunch of exceptions to keep the program from crashing.

Warning: I have difficulty opening the verify.xlsx file. In Calc, Excel and in a CSV converter. Teknotus reports it opens in LibreOffice Calc, which just failed to install on an older Ubuntu distribution. Sing out if you can successfully open the file.

Journalists: Are you going to validate Podesta emails that you cite? Or that others claim are false/modified?

Validating Wikileaks/Podesta Emails

Friday, October 21st, 2016

A quick heads up that Robert Graham is working on:

[image: dkim-01-460]

While we wait for that post to appear at Errata Security, you should also take a look at DomainKeys Identified Mail (DKIM).

From the homepage:

DomainKeys Identified Mail (DKIM) lets an organization take responsibility for a message that is in transit. The organization is a handler of the message, either as its originator or as an intermediary. Their reputation is the basis for evaluating whether to trust the message for further handling, such as delivery. Technically DKIM provides a method for validating a domain name identity that is associated with a message through cryptographic authentication.

In particular, review RFC 5585 DomainKeys Identified Mail (DKIM) Service Overview. T. Hansen, D. Crocker, P. Hallam-Baker. July 2009. (Format: TXT=54110 bytes) (Status: INFORMATIONAL) (DOI: 10.17487/RFC5585), which notes:


2.3. Establishing Message Validity

Though man-in-the-middle attacks are historically rare in email, it is nevertheless theoretically possible for a message to be modified during transit. An interesting side effect of the cryptographic method used by DKIM is that it is possible to be certain that a signed message (or, if l= is used, the signed portion of a message) has not been modified between the time of signing and the time of verifying. If it has been changed in any way, then the message will not be verified successfully with DKIM.

In a later tweet, Bob notes the “DKIM verifier” add-on for Thunderbird.

Any suggestions on scripting DKIM verification for the Podesta emails?

That level of validation may be unnecessary since after more than a week of “…may be altered…,” not one example of a modified email has surfaced.

Some media outlets will keep repeating the “…may be altered…” chant, along with attribution of the DNC hack to Russia.

Noise but it is a way to select candidates for elimination from your news feeds.

The Podesta Emails [In Bulk]

Wednesday, October 19th, 2016

Wikileaks has been posting:

The Podesta Emails, described as:

WikiLeaks series on deals involving Hillary Clinton campaign Chairman John Podesta. Mr Podesta is a long-term associate of the Clintons and was President Bill Clinton’s Chief of Staff from 1998 until 2001. Mr Podesta also owns the Podesta Group with his brother Tony, a major lobbying firm and is the Chair of the Center for American Progress (CAP), a Washington DC-based think tank.

long enough for them to be decried as “interference” with the U.S. presidential election.

You have two search options, basic:

[image: podesta-basic-search-460]

and, advanced:

[image: podesta-adv-search-460]

As handy as these search interfaces are, you cannot easily:

  • Analyze relationships between multiple senders and/or recipients of emails
  • Perform entity recognition across the emails as a corpus
  • Process the emails with other software
  • Integrate the emails with other data sources
  • etc., etc.

Michael Best, @NatSecGeek, is posting all the Podesta emails as they are released at: Podesta Emails (zipped).

As of Podesta Emails 13, there is approximately 2 GB of zipped email files available for downloading.

The search interfaces at Wikileaks may work for you, but if you want to get closer to the metal, you have Michael Best to thank for that opportunity!

Enjoy!

George Carlin’s Seven Dirty Words in Podesta Emails – Discovered 981 Unindexed Documents

Thursday, October 13th, 2016

While taking a break from serious crunching of the Podesta emails I discovered 981 unindexed documents at Wikileaks!

Try searching for Carlin’s seven dirty words at The Podesta Emails:

  • shit – 44
  • piss – 19
  • fuck – 13
  • cunt – 0
  • cocksucker – 0
  • motherfucker – 0 (?)
  • tits – 0

I put a ? after “motherfucker” because, working with the raw files, I show one (1) hit for “motherfucker” and one (1) hit for “motherfucking,” in separate emails.

For “motherfucker,” American Sniper–the movie, responded to by Chris Hedges – From:magazine@tikkun.org To: Podesta@Law.Georgetown.Edu

For “motherfucking,” H4A News Clips 5.31.15 – From/To: aphillips@hillaryclinton.com.

“Motherfucker” and “motherfucking” occur in text attachments to emails, which Wikileaks does not search.
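With a local copy of the raw files (for example, Michael Best's zipped archives), a recursive grep reaches the attachment text that the Wikileaks search skips. A sketch, assuming the extracted text sits under a hypothetical podesta-emails/ directory:

```shell
# List files containing the term (case-insensitively), then count them
grep -ril "motherfucker" podesta-emails/ | wc -l
```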

If you do a blank search for file attachments, Wikileaks reports there are 2427 file attachments.

Searching the Podesta emails at Wikileaks excludes the contents of 2427 files from your search results.

How significant is that?

Hmmm, 302 pdf, 501 docx, 167 doc, 12 xls, 9 xlsx – 981 documents excluded from your searches at Wikileaks.

For 9,011 emails, as of AM today, my local.

How comfortable are you with not searching those 981 documents? (Or additional documents that may follow?)
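To reproduce per-type counts like the ones above against a local copy, tally the attachment extensions. A sketch, assuming the attachments have been collected into a hypothetical attachments/ directory:

```shell
# Lowercase each file's final extension, then count occurrences of each
ls attachments/ | awk -F. '{print tolower($NF)}' | sort | uniq -c | sort -rn
```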

When 24 GB Of Physical RAM Pegs At 98% And Stays There

Sunday, October 9th, 2016

Don’t panic! It has a happy ending but I’m too tired to write it up for posting today.

Tune in tomorrow for lessons learned on FOIA answers that don’t set the information free.

Chasing File Names – Check My Work

Saturday, October 8th, 2016

I encountered a stream of tweets of which the following are typical:

guccifer2-0-tweets-cf-7z-460

Hmmm, is cf.7z a different set of files from ebd-cf.7z?

You could “eye-ball” the directory listings but that is tedious and error-prone.

Building on what we saw in Guccifer 2.0’s October 3rd 2016 Data Drop – Old News? (7 Duplicates out of 2085 files), let’s combine cf-7z-file-Sorted-Uniq.txt and ebd-cf-file-Sorted-Uniq.txt, and sort the combined list into cf-7z-and-ebd-cf-files-Sorted.txt.
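The combine-and-sort step can be done in one pipeline (file names as in the post):

```shell
# Merge the two sorted-unique lists and sort the combined list
cat cf-7z-file-Sorted-Uniq.txt ebd-cf-file-Sorted-Uniq.txt \
  | sort > cf-7z-and-ebd-cf-files-Sorted.txt
```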

Running

uniq -d cf-7z-and-ebd-cf-files-Sorted.txt | wc -l

(“-d” prints only duplicated lines; wc -l counts them) will give you 2177 duplicates. (The combined file is 4354 lines long.)

Running

uniq -u cf-7z-and-ebd-cf-files-Sorted.txt

(“-u” prints only unique lines) produces no output: there are no unique lines.

With experience, you will be able to check very large file archives for duplicates. In this particular case, despite circulating under different names, these two archives appear to contain the same files.

BTW, do you think a similar technique could be applied to spreadsheets?

DNC/DCCC/CF Excel Files, As Of October 7, 2016

Friday, October 7th, 2016

A continuation of my post Avoiding Viruses in DNC/DCCC/CF Excel Files.

Where Avoiding Viruses… focused on avoiding the hazards of Excel-borne viruses, this post focuses on preparing the DNC/DCCC/CF Excel files from Guccifer 2.0, as of October 7, 2016, for further analysis.

As I mentioned before, you could search through all 517 files to date, separately, using Excel. That thought doesn’t bring me any joy. You?

Instead, I’m proposing that we prepare the files to be concatenated together, resulting in one fairly large file, which we can then search and manipulate as one entity.

As a data cleanliness task, I prefer to prefix every line in every csv export with the name of its original file. That will enable us to extract lines that mention the same person across several files and still have a bread-crumb trail back to the original files.

Munging all the files together without such a step would leave us grepping across the collection or relying on some other search mechanism to trace a line back to its source. Why not plan on avoiding that hassle?

Given the number of files requiring prefixing, I suggest the following:

for f in *.csv*; do
  sed -i "s/^/$f,/" "$f"
done

This shell loop uses sed with the -i switch, which means sed changes each file in place (overwriting the original). In the substitution s/^/$f,/, the ^ anchors at the start of each line and $f inserts the file name plus a comma separator; the final $f names the file being processed. (If a file name contains / or &, the sed expression will need escaping.)

There are any number of ways to accomplish this task. Your community may use a different approach.
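One way to do the concatenation and compression itself (the output name echoes the archive mentioned in this post; gzip appends the .gz extension):

```shell
# Concatenate every prefixed CSV export into one file, then gzip it.
# The output has no .csv extension, so re-running won't swallow itself.
cat *.csv > guccifer2.0-all-spreadsheets-07October2016
gzip guccifer2.0-all-spreadsheets-07October2016
```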

The result of my efforts is: guccifer2.0-all-spreadsheets-07October2016.gz, which weighs in at 61 MB compressed and 231 MB uncompressed.

I did check and despite having variable row lengths, it does load in my oldish version of gnumeric. All 1030828 lines.

That’s not at all surprising for gnumeric, considering I’m running 24 GB of physical RAM. Your performance may vary. (It did hesitate loading it.)

There is much left to be done, such as deciding what padding is needed to even out all the rows. (I have ideas, suggestions?)

Tools to manipulate the CSV. I have a couple of stand-bys and a new one that I located while writing this post.

And, of course, once the CSV is cleaned up, what other means can we use to explore the data?

My focus will be on free and high-performance (amazing how often those are found together, Larry Ellison) tools that can be easily used for exploring vast seas of spreadsheet data.

Next post on these Excel files, Monday, October 10, 2016.


I am downloading the cf.7z Guccifer 2.0 drop as I write this update.

Watch for updates on the comprehensive file list and Excel files next Monday. October 8, 2016, 01:04 UTC.

Avoiding Viruses in DNC/DCCC/CF Excel Files

Friday, October 7th, 2016

I hope you haven’t opened any of the DNC/DCCC/CF Excel files outside of a VM. 517 Excel Files Led The Guccifer2.0 Parade (October 6, 2016)

Yes?

Files from trusted sources can contain viruses. Files from unknown or rogue sources even more so. However tempting (and easy) it is to open alleged purloined files on your desktop, even minimally security-conscious users will resist the temptation.

Warning: I did NOT scan the Excel files for viruses. The best way to avoid Excel viruses is to NOT open Excel files.

I used ssconvert, one of the utilities included with gnumeric, to bulk-convert the Excel files to csv format. (Comma-Separated Values is documented in RFC 4180.)

Tip: If you are looking for a high performance spreadsheet application, take a look at gnumeric.

Ssconvert relies on file extensions (although other options are available) so I started with:

ssconvert -S donors.xlsx donors.csv

The -S option takes care of workbooks with multiple worksheets. You need a later version of ssconvert than mine, 1.12.9-1 (2013), to convert the .xlsx files without warnings; the current version of gnumeric and ssconvert is 1.12.31 (August 2016).

I’m upgrading to Ubuntu 16.04 soon so it wasn’t worth the trouble trying to stuff a later version of gnumeric/ssconvert onto my present Ubuntu 14.04.

Despite the errors, the conversion appears to have worked properly:

donors-01-460

to its csv output:

donor-03-460

I don’t see any problems.

I’m checking a sampling of the other conversions as well.

BTW, do notice the confirmation of reports that commentators contacted donors who confirmed donating but could not recall the amounts.

Could be true. If you pay protection money often enough, I’m sure it’s hard to recall a specific payment.

Sorry, I got distracted.

So, only 516 files to go.

I don’t recommend you do:

ssconvert -S filename.xlsx filename.csv

516 times. That would be tedious and error-prone.

At least for Linux, I recommend:

for f in *.xls*; do
   ssconvert -S "$f" "$f.csv"
done

The *.xls* glob captures both .xls and .xlsx files; the loop invokes ssconvert -S on each file and saves the output under the original name plus the extension .csv.

The wc -l command reports 1030828 lines in the consolidated csv file for these spreadsheets.

That’s a lot of lines!

I have some suggestions on processing that file, see: DNC/DCCC/CF Excel Files, As Of October 7, 2016.

517 Excel Files Led The Guccifer2.0 Parade (October 6, 2016)

Thursday, October 6th, 2016

As of today, the data dumps by Guccifer2.0 have contained 517 Excel files.

The vehemence of posts dismissing these dumps makes me wonder two things:

  1. How many of the Excel files have these commentators reviewed?
  2. What might you find in the files that worries these commentators so?

I don’t know the answer to #1 and I won’t speculate on their diligence in examining these files. You can reach your own conclusions in that regard.

Nor can I give you an answer to #2, but I may be able to help you explore these spreadsheets.

The old-fashioned way, opening each file at one Excel file per minute (assuming normal Office performance ;-) ), would take longer than an eight-hour day to open them all.

You still must understand and compare the spreadsheets.

To make 517 Excel files more than a number, here’s a list of all the Guccifer2.0 released Excel files as of today: guccifer2.0-excel-files-sorted.txt.

(I do have an unfair advantage in that I am willing to share the files I generate, enabling you to check my statements for yourself. A personal preference for fact-based pleading as opposed to conclusory hand waving.)

If you think of each line in the spreadsheets as a record, this sounds like a record linkage problem. Except these records have no uniform number of fields, no shared headers, etc.

With record linkage, we would munge all the records into a single record format and then, and only then, match up records to see which ones have data about the same subjects.

Thinking about that, the number 517 looms large because all the formats must be reconciled to one master format, before we start getting useful comparisons.

I think we can do better than that.

First step, let’s consider how to create a master record set that keeps all the data as it exists now in the spreadsheets, but as a single file.

See you tomorrow!

Guccifer 2.0’s October 3rd 2016 Data Drop – Old News? (7 Duplicates out of 2085 files)

Wednesday, October 5th, 2016

However amusing the headline ‘Guccifer 2.0’ Is Bullshitting Us About His Alleged Clinton Foundation Hack may be, Lorenzo Franceschi-Bicchierai offers no factual evidence to support his claim,

… the hacker’s latest alleged feat appears to be a complete lie.

Or should I say that:

  • Clinton Foundation denies it has been hacked
  • The Hill whines about who is a donor where
  • The Daily Caller says, “nothing to see here, move along, move along”

hardly qualifies as anything I would rely on.

Checking the file names is one rough check for duplication.

First, you need a set of the file names for all the releases on Guccifer 2.0’s blog:

Relying on file names alone is iffy as the same “content” can be in files with different names, or different content in files with the same name. But this is a rough cut against thousands of documents, so file names it is.

So you can check my work, I saved a copy of the files listed at the blog in date order: guccifer2.0-File-List-By-Blog-Date.txt.

For combining files for use with uniq, you will need a sorted, uniq version of that file: guccifer2.0-File-List-Blog-Sorted-Uniq-lc-final.txt.

Next, there was a major dump of files under the file name 7dc58-ngp-van.7z, approximately 820 MB of files. (Not listed on the blog but from Guccifer 2.0.)

You can use your favorite tool set or grab a copy of: 7dc58-ngp-van-Sorted-Uniq-lc-final.txt.

You need to combine those file names with those from the blog to get a starting set of names for comparison against the alleged Clinton Foundation hack.

Combining those two file name lists together, sorting them and creating a unique list of file names results in: guccifer2.0-30Sept2016-Sorted-Unique.txt.

Follow the same process for ebd-cf.7z, the file that dropped on the 3rd of October 2016. Or grab: ebd-cf-file-Sorted-Uniq-lc-final.txt.

Next, combine guccifer2.0-30Sept2016-Sorted-Unique.txt (the files we knew about before the 3rd of October) with ebd-cf-file-Sorted-Uniq-lc-final.txt, and sort those file names, resulting in: guccifer2.0-30Sept2016-plus-ebd-cf-file-Sorted.txt.

The final step is to apply uniq -d to guccifer2.0-30Sept2016-plus-ebd-cf-file-Sorted.txt, which should give you the duplicate files, comparing the files in ebd-cf.7z to those known before September 30, 2016.
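Spelled out as commands (a sketch using the file names from this post):

```shell
# Merge the pre-drop list with the ebd-cf.7z list, sort, then print
# the names that appear in both (i.e., the duplicates).
cat guccifer2.0-30Sept2016-Sorted-Unique.txt ebd-cf-file-Sorted-Uniq-lc-final.txt \
  | sort > guccifer2.0-30Sept2016-plus-ebd-cf-file-Sorted.txt
uniq -d guccifer2.0-30Sept2016-plus-ebd-cf-file-Sorted.txt
```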

The results?

11-26-08 nfc members raised.xlsx
db1.mdb
donorsbymm.xlsx
donorsbymm_2.xlsx
netdem03-02.xlsx
thumbs.db
viewfecfiling.xls

Seven files out of 2085 doesn’t sound like a high degree of duplication.

At least not to me.

You?

PS: On the allegations about the Russians, you could ask the Communists in the State Department or try the Army General Staff. 😉 Some of McCarthy’s records are opening up if you need leads.

PPS: Use the final sorted, unique file list to check future releases by Guccifer 2.0. It might help you avoid bullshitting the public.

Value-Add Of Wikileaks Hillary Clinton Email Archive?

Monday, September 26th, 2016

I was checking Wikileaks today for any new document drops on Hillary Clinton, but only found:

WikiLeaks offers award for #LabourLeaks

Trade in Services Agreement

Assange Medical and Psychological Records

The lesson from the last item is to always seek asylum in a large embassy, preferably one with a pool. You can search at Embassies for which country’s embassy is located in which other country. I did not see an easy way to search by size or accommodations.

Oh, not finding any new data on Hillary Clinton, I checked the Hillary Clinton Email Archive at Wikileaks:

wikileaks-hillary-460

Compare that to the State Department FOIA server for Clinton_Email:

state-dept-hillary-460

Do you see a value-add to Wikileaks re-posting the State Department’s posting of Hillary’s emails?

If yes, please report in comments below the value-add you see. (Thanks in advance.)

If not, what do you think would be a helpful value-add to the Hillary Clinton emails? (Suggestions deeply appreciated.)