Archive for October, 2016

9,477 DKIM Verified Clinton/Podesta Emails (of 39,878 total (today))

Monday, October 31st, 2016

Still working on the email graph and at the same time, managed to catch up on the Clinton/Podesta drops by Michael Best, @NatSecGeek, at least for a few hours.

DKIM-verified-podesta-1-24.txt.gz is a sub-set of 9,477 emails that have been verified by their DKIM keys.

The statements in or data attached to those emails may still be false. DKIM verification only validates the email being the same as when it left the email server, nothing more.

DKIM-complete-podesta-1-24.txt.gz is the full set of Podesta emails to date, some 39,878, with their DKIM results of either True or False.

Both files have these fields:

ID – 1| Verified – 2| Date – 3| From – 4| To – 5| Subject – 6| Message-Id – 7

Question: Have you seen any news reports that mention emails being “verified” in their reporting?

Emails in the complete set may be as accurate as those in the verified set, but I would think verification is newsworthy in and of itself.


Parsing Emails With Python, A Quick Tip

Monday, October 31st, 2016

While some stuff runs in the background, a quick tip on parsing email with Python.

I got the following error message from Python:

Traceback (most recent call last):
File “”, line 20, in
date = dateutil.parser.parse(msg[‘date’])
File “/usr/lib/python2.7/dist-packages/dateutil/”, line 697, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File “/usr/lib/python2.7/dist-packages/dateutil/”, line 301, in parse
res = self._parse(timestr, **kwargs)
File “/usr/lib/python2.7/dist-packages/dateutil/”, line 349, in _parse
l = _timelex.split(timestr)
File “/usr/lib/python2.7/dist-packages/dateutil/”, line 143, in split
return list(cls(s))
File “/usr/lib/python2.7/dist-packages/dateutil/”, line 137, in next
token = self.get_token()
File “/usr/lib/python2.7/dist-packages/dateutil/”, line 68, in get_token
nextchar =
AttributeError: ‘NoneType’ object has no attribute ‘read’

I have edited the email header in question but it reproduces the original error:

Received: by with SMTP id w14cs34683wfw;
Wed, 5 Nov 2008 08:11:39 -0800 (PST)
Received: by with SMTP id r1mr728791wad.136.1225901498795;
Wed, 05 Nov 2008 08:11:38 -0800 (PST)
Received: from ( [])
by with ESMTP id m26si29354pof.3.2008.;
Wed, 05 Nov 2008 08:11:38 -0800 (PST)
Received-SPF: pass ( domain of designates
Received: from ([])
by with comcast
id bUBY1a0010b6N64A9UBeJl; Wed, 05 Nov 2008 16:11:38 +0000
Received: from ([])
by with comcast
id bUAV1a00L2JMgtY8PUAV7G; Wed, 05 Nov 2008 16:10:30 +0000
X-Authority-Analysis: v=1.0 c=1 a=1Ht49J2nGmlg0oY3xr8A:9
a=8nxvWDfACCTtBObdks-tTUtrMyYA:4 a=OA_lqj45gZcA:10 a=diNjy0DT58-4uIkuavEA:9
a=e0_VUgpf8QEu0XMU188OmzzKrzoA:4 a=37WNUvjkh6kA:10
Received: from [] by;
Wed, 05 Nov 2008 16:10:28 +0000

To: “Podesta” ,
CC: “Denis McDonough OFA” ,”,,
Subject: DOD leadership – immediate attention
Date: Wed, 05 Nov 2008 16:10:28 +0000
Message-Id: <110520081610.3048.4911C574000C2E2100000BE82216>
X-Mailer: AT&T Message Center Version 1 (Oct 30 2007)
X-Authenticated-Sender: c2V3YWxsY29ucm95QGNvbWNhc3QubmV0
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=”NextPart_Webmail_9m3u9jl4l_3048_1225901428_0″

Content-Type: text/plain
Content-Transfer-Encoding: 8bit

I’m comparing “Date” to similar emails and getting no joy.

Absence is hard to notice, but once you know the rule, it’s obvious:

RFC822: Standard for ARPA Internet Text Messages says in part:

3. Lexical Analysis of Messages

3.1 General Description

A message consists of header fields and, optionally, a body. The body is simply a sequence of lines containing ASCII characters. It is separated from the headers by a null line (i.e., a line with nothing preceding the CRLF). (emphasis added)

Yep, the blank line I introduced while removing an errant double-quote on a line by itself, created the start for the body of the message.

Meaning that my Python script failed to find the “Date:” field, passed None to dateutil, and got back what someone thought would be a useful error message.

When you get errors parsing emails with Python (and I assume in other languages), check the format of your messages!
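A minimal sketch of the failure, assuming Python 3 and the standard library email package rather than the original script (which is not shown in this post). The two sample messages are made up for illustration. The second has a stray blank line before the Date header, so the parser treats everything after it as body:

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Illustrative message, not one of the Podesta emails.
GOOD = (
    "Subject: DOD leadership\n"
    "Date: Wed, 05 Nov 2008 16:10:28 +0000\n"
    "MIME-Version: 1.0\n"
    "\n"
    "Body text.\n"
)

# Same message with a stray blank line inserted before the Date header.
BROKEN = GOOD.replace("Date:", "\nDate:", 1)


def parse_date(raw):
    """Return the Date header as a datetime, or None if the header is absent."""
    msg = message_from_string(raw)
    value = msg["Date"]  # None when the parser never saw a Date header
    if value is None:
        return None
    return parsedate_to_datetime(value)


good_date = parse_date(GOOD)
broken_date = parse_date(BROKEN)
print(good_date)
print(broken_date)
```

Checking for None before handing the value to a date parser turns an obscure AttributeError into an obvious "this message has no Date header" signal.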

RFC822 has an appendix of parsing rules and a few examples.

Suggested listings of the most common email/email header format errors?

Clinton/Podesta Emails 23 and 24, True or False? Cryptographically Speaking

Monday, October 31st, 2016

Catching up on the Clinton/Podesta email releases from Wikileaks, via Michael Best, NatSecGeek. Michael bundles the releases up and posts them at: Podesta emails (zipped).

For anyone coming late to the game, DKIM “verified” means that the DKIM signature on an email is valid for that email.

In lay person’s terms, that email has been proven by cryptography to have originated from a particular mail server and when it left that mail server, it read exactly as it does now, i.e., no changes by Russians or others.

What I have created are files that list the emails in the order they appear at Wikileaks, with the very next field being True or False on the verification issue.

Just because an email has “False” in the second column doesn’t mean it has been modified or falsified by the Russians.

DKIM signatures fail for all manner of reasons but when they pass, you have a guarantee the message is intact as sent.

For your research into these emails:




For release 24, I did have to remove the DKIM signature on 39256 00010187.eml in order for the script to succeed. That is the only modification I made to either set of files.

Clinton/Podesta Emails – Towards A More Complete Graph (Part 3) New Dump!

Sunday, October 30th, 2016

As you may recall from Clinton/Podesta Emails – Towards A More Complete Graph (Part 2), I didn’t check to see if “|” was in use as a separator in the extracted emails subject lines so when I tried to create node lists based on “|” as a separator, it failed.

That happens. More than many people are willing to admit.

In the meantime, a new dump of emails has arrived so I created the new DKIM-incomplete-podesta-1-22.txt.gz file. Which means picking a new separator to use for the resulting file.

Advice: Check your proposed separator against the data file before using it. I forgot, you shouldn’t.

My new separator? |/|

Which I checked against the file to make sure there would be no conflicts.
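One way to run that check, sketched in Python. The function name and the sample subject lines are hypothetical; in practice you would feed it the subject (or any other) field from every extracted email:

```python
def separator_conflicts(values, separator):
    """Return the values that already contain the proposed separator."""
    return [v for v in values if separator in v]


# Hypothetical sample data; the second subject contains a bare "|".
subjects = [
    "Re: Nov 30 / Future Plans / Etc.!",
    "NYT | Clinton's Chipotle Order",
    "FW: July IA Poll Results",
]

for candidate in ("|", "|/|", "@@@@@"):
    hits = separator_conflicts(subjects, candidate)
    print(repr(candidate), "conflicts:", len(hits))
```

A candidate with zero conflicts is safe for this data set only. Rerun the check whenever a new dump arrives.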

The sed commands to remove < and > are the same as in Part 2.

Sigh, back to failure land, again.

Just as one sample:

awk 'FS="|/|" { print $7}'

where is:

9991 00013434.eml|/|False|/|2015-11-21 17:15:25-05:00|/|Eryn Sepp|/|John Podesta|/|Re: Nov 30 / Future Plans / Etc.!|/|


Future Plans

I also checked that with gawk and nawk, with the same result.

For some unknown (to me) reason, all three are treating the first “/” in field 6 (by my count) as a separator, along with the second “/” in that field.

To test that theory, what do you think { print $8 } will return?

You’re right!


So with the “|/|” separator, I’m going to have up to at least 9 fields, perhaps more, varying depending on whether “/” characters occur in the subject line.


That’s not going to work.
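A likely explanation, offered as a reading of how awk is specified rather than something verified against this file: awk treats any FS longer than one character as a regular expression, and “|” is the alternation operator there, so “|/|” can end up matching a bare “/”. A literal split does not have that problem. A minimal Python sketch using the sample line above:

```python
line = ("9991 00013434.eml|/|False|/|2015-11-21 17:15:25-05:00|/|Eryn Sepp"
        "|/|John Podesta|/|Re: Nov 30 / Future Plans / Etc.!|/|")

# str.split treats the separator as a literal string, not a regex.
fields = line.split("|/|")

print(len(fields))
print(fields[5])
```

The subject survives intact as the sixth field. The seventh field is empty here only because the message-id is missing from the sample line.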

OK, so I toss the 10+ MB DKIM-complete-podesta-1-22.txt.gz into Emacs, whose regex treatment I trust, and change “|/|” to “@@@@@” and save that file as DKIM-complete-podesta-1-22-03.txt.

Another sanity check, which got us into all this trouble last time:

awk 'FS="@@@@@" { print $7}' podesta-1-22-03.txt | grep @ | wc -l

returns 36504, which plus the 16 files I culled as failures, equals 36520, the number of files in the Podesta 1-22 release.

Recall that all message-ids contain an @ sign, so getting the correct answer on the number of files gives us confidence the file is ready for further processing.

Apologies for it taking this much prose to go so little a distance.

Our fields (numbered for reference) are:

ID – 1| Verified – 2| Date – 3| From – 4| To – 5| Subject – 6| Message-Id – 7

Our first node for the node list (Clinton/Podesta Emails – Towards A More Complete Graph (Part 1)) was to capture the emails themselves.

Using Message-Id (field 7) as the identifier and Subject (field 6) as its label.

We are about to encounter another problem but let’s walk through it.

An example of what we are expecting:

CAC9z1zL9vdT+9FN7ea96r+Jjf2=gy1+821u_g6VsVjr8U2eLEg;”Knox Knotes”;
CAKM1B-9+LQBXr7dgE0pKke7YhQC2dZ2akkgmSbRFGHUx-0NNPg;”Re: Tomorrow”;

We have the Message-Id with a closing “;”, followed by the Subject, surrounded in double quote marks and also terminated by a “;”.

FYI: Mixing single and double quotes in awk is a real pain. I struggled with it but then was reminded I can declare variables:

-v dq='"'

which allows me to do this:

awk -v dq='"' 'FS="@@@@@" { print $7 ";" dq $6 dq ";"}' podesta-1-22-03.txt

The awk variable trick will save you considerable puzzling over escape sequences and the like.

Ah, now we are to the problem I mentioned above.

In the part 1 post I mentioned that while:

CAC9z1zL9vdT+9FN7ea96r+Jjf2=gy1+821u_g6VsVjr8U2eLEg;”Knox Knotes”;;”Re: Tomorrow”;


but having:

CAC9z1zL9vdT+9FN7ea96r+Jjf2=gy1+821u_g6VsVjr8U2eLEg;”Knox Knotes”;;;”Re: Tomorrow”;;

with Wikileaks links is more convenient for readers.

As you may recall, the last two lines read:

9998 00022160.eml@@@@@False@@@@@2015-06-23 23:01:55-05:00@@@@@Jerome Tatar Tatar Jerome Knotes@@@@@CAC9z1zL9vdT+9FN7ea96r
9999 00013746.eml@@@@@False@@@@@2015-04-03 01:14:56-04:00@@@@@Eryn Sepp Podesta

Which means in addition to printing Message-Id and Subject as fields one and two, we need to split ID on the space and use the result to create the URL back to Wikileaks.
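A rough Python sketch of that step. Loud assumptions: the URL pattern in BASE_URL is illustrative and should be checked against Wikileaks before use, the function names are hypothetical, and the sample record uses placeholder names and a placeholder message-id:

```python
# ASSUMPTION: illustrative URL pattern, not taken from this post.
BASE_URL = "https://wikileaks.org/podesta-emails/emailid/"


def wikileaks_link(id_field):
    """Turn an ID field such as '9998 00022160.eml' into a link."""
    number, _filename = id_field.split(" ", 1)
    return BASE_URL + str(int(number))  # int() drops any leading zeros


def node_with_link(record, sep="@@@@@"):
    """Return 'message-id;"subject";url;' for one seven-field record."""
    fields = record.split(sep)
    return '{};"{}";{};'.format(fields[6], fields[5], wikileaks_link(fields[0]))


# Hypothetical record: names and message-id are placeholders.
record = "@@@@@".join([
    "9998 00022160.eml", "False", "2015-06-23 23:01:55-05:00",
    "Sender Name", "Recipient Name", "Knox Knotes", "msg-9998@example.org",
])
print(node_with_link(record))
```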

It’s late so I am going to leave you with DKIM-incomplete-podesta-1-22.txt.gz. This is complete save for 16 files that failed to parse. Will repost tomorrow with those included.

I have the first node file script working and that will form the basis for the creation of the edge lists.

PS: Look forward to running awk files tomorrow. It makes a number of things easier.

Exploding the Dakota Access Pipeline Target List

Sunday, October 30th, 2016

Who Is Funding the Dakota Access Pipeline? Bank of America, HSBC, UBS, Goldman Sachs, Wells Fargo by Amy Goodman and Juan González.

Great discussion of the funding behind the North Dakota Pipeline project!

They point to two important graphics to share, the first from: Who’s Banking on the Dakota Access Pipeline?:

view this map on LittleSis

Thanks for the easy embedding! (Best viewed at view this map on LittleSis)

And, from Who’s Banking on the Dakota Access Pipeline? (Food & Water Watch):


The full scale Food & Water Watch image.

Both visualizations are great ways to see some of the culprits responsible for the Dakota Access Pipeline, but not all.

Tracking the funding from Bakken Dakota Access Pipeline back to, among others, Citibank, Credit Agricole, ING Bank, and Natixis, should be a clue as to the next step.

All of the sources of financing, Citibank, Credit Agricole, ING Bank, Natixis, etc., are owned, one way or another, by investors. Moreover, as abstract entities, they cannot act without the use of agents, both as staff and as contractors.

If you take the financing entities as nodes (the first visualization), those should explode into both investor/owners and staff/agents, who do their bidding.

My thinking is that Citibank, for example, is too large and diffuse a target for effective political, social or economic pressure, but the smaller the part, the greater the chance of having influence.

It’s true some nation states might be able to call Citibank to heel and if you can whistle one up, give it a shot. But while waiting on you to make your move, the rest of us should be looking for targets more within our reach.

That lesson, that the financiers explode into more manageable targets (don’t overlook their political allies and their extended target lists), holds equally for the staffs and agents of Sunoco Logistics, Energy Transfer Partners, Energy Transfer Equity (a sister partnership to Energy Transfer Pipeline), and Bakken Dakota Access Pipeline.

I have yet to see an abstract entity drive a bulldozer, move pipe, etc. Despite the popular fiction that a corporation is a person, it’s somebody on the ground violating the earth, poisoning the water, disturbing sacred ground, all for the benefit of some other natural person.

Corporations, being abstract entities, cannot feel pressure. Their staffs and contractors, on the other hand, don’t have that immunity.

It will be post-election in the US, but I’m interested in demonstrating and assisting in demonstrating, how to explode these target lists in both directions.

As always, your comments and suggestions are most welcome.

PS: One source for picking up names and entities would be:, which self-describes as: is owned and operated by Energy Media Group which is based in Fargo North Dakota. was brought to life to fill a gap in the way that news was brought to the people on this specific energy niche. We wanted to have a place where people could come for all the news related to the Bakken Shale. Energy Media group owns several hundred websites, most of which surround shale formations all over the world.Our sister site went live at the beginning of December and already has a huge following. In the coming months, will be a place not only for news, but for jobs and classifieds as well. Thank-you for visiting us today, and make sure to sign up for our news letter to get the latest updates and also Like us on Facebook.

To give you the full flavor of their coverage: Oil pipeline protester accused of terrorizing officer:

North Dakota authorities have issued an arrest warrant accusing a pipeline protester on horseback of charging at a police officer.

Mason Redwing, of Fort Thompson, South Dakota, is wanted on felony charges of terrorizing and reckless endangerment in the Sept. 28 incident near St. Anthony. He’s also wanted on a previous warrant for criminal trespass.

The Morton County Sheriff’s Office says the officer shouted a warning and pointed a shotgun loaded with non-lethal beanbag rounds to defuse the situation.

I’m on a horse and you have a shotgun. Who is it that is being terrorized?

Yeah, it’s that kind of reporting.

How To Use Data Science To Write And Sell More Books (Training Amazon)

Sunday, October 30th, 2016

From the description:

Chris Fox is the bestselling author of science fiction and dark fantasy, as well as non-fiction books for authors including Write to Market, 5000 words per hour and today we’re talking about his next book, Six Figure Author: Using data to sell books.

Show Notes

  • What Amazon data science, and machine learning, are and how authors can use them.
  • How Amazon differs from the other online book retailers and how authors can train Amazon to sell more books.
  • What to look for to find a voracious readership.
  • Strategically writing to market and how to know what readers are looking for.
  • On Amazon ads and when they are useful.
  • Tips on writing faster.
  • The future of writing, including virtual reality and AI help with story.

Joanna Penn of The Creative Penn interviews Chris Fox

Some of the highlights:

Training Amazon To Work For You

…What you want to do is figure out, with as much accuracy as possible, who your target audience is.

And when you start selling your book, the number of sales is not nearly as important as who you sell your book to, because each of those sales to Amazon represents a customer profile.

If you can convince them that people who voraciously read in your genre are going to love this book and you sell a couple of hundred copies to people like that, Amazon’s going to take it and run with it. You’ve now successfully trained them about who your audience is because you used good data and now they’re able to easily sell your book.

If, on the other hand, you and your mom buys a copy and your friend at the coffee shop buys a copy, and people who aren’t necessarily into that genre are all buying it, Amazon gets really lost and confused.

Easier said than done but how’s that for taking advantage of someone else’s machine learning?

Chris also has tips for not “polluting” your Amazon sales data.

Discovering and Writing to a Market

How do you find a sub-category or a smaller niche within the Amazon ecosystem? What are the things to look for in order to find a voracious readership?

Chris: What I do is I start looking at the rankings of the number 1, the number 20, 40, 60, 80 and 100 books. You can tell based on where those books are ranked, how many books in the genre are selling. If the number one book is ranked in the top 100 in the store and so is the 20th book, then you’ve found one of the hottest genres on Amazon.

If you find that by the time you get down to number 40, the rank is dropping off sharply, that suggests that not enough books are being produced in that genre and it might be a great place for you to jump in and make a name for yourself. (emphasis in original)

I know, I know, this is a tough one. Especially for me.

As I have pointed out here on multiple occasions, “terrorism” is largely a fiction of both government and media.

However, if you look at the top 100 paid sellers on terrorism at Amazon, the top fifty (50) don’t have a single title that looks like it denies terrorism is a problem.


Which I take to mean, in terms of selling books, services, or data, the “terrorism is coming for us all” gravy train is the profitable line.

Or at least to indulge in analysis on the basis of “…if the threat of terrorism is real…” and let readers supply their own answers to that question.

There are other valuable tips and asides, so watch the video or read the transcript: How To Use Data Science To Write And Sell More Books With Chris Fox.

PS: As of today, there are 292 podcasts by Joanna Penn.

Clinton/Podesta Emails – 1-22 – Progress Report

Saturday, October 29th, 2016

Michael Best has posted a bulk download of the Clinton/Podesta Emails, 1-22.

Thinking (foolishly) that I could quickly pick up from where I left off yesterday, Clinton/Podesta Emails – Towards A More Complete Graph (Part 2), I grabbed the latest bulk download and tossed the verification script against it.

In addition to the usual funnies of having to repair known defective files, again, I also underestimated how long verification of DKIM signatures takes on 36,000+ emails. Even on a fast box with plenty of memory.

At this point I have the latest release DKIM signatures parsed, but there are several files that fail for no discernible reason.

I’m going to have another go at it in the morning and should have part 3 of the graph work up tomorrow.

Apologies for the delay!

I Spy A Mirai Botnet

Saturday, October 29th, 2016

Rob Graham created telnetlogger, which he describes as follows:

This is a simple program to log login attempts on Telnet (port 23).

It’s designed to track the Mirai botnet. Right now (Oct 23, 2016) infected Mirai machines from around the world are trying to connect to Telnet on every IP address about once per minute. This program logs both which IP addresses are doing the attempts, and which passwords they are using.

I wrote it primarily because installing telnetd on a Raspberry Pi wasn’t sufficient. For some reason, the Mirai botnet doesn’t like the output from Telnet, and won’t try to login. So I needed something that produced the type of Telnet it was expecting. While I was at it, I also wrote some code to parse things and extract the usernames/passwords.


A handy, single purpose program that enables you to spy on Mirai botnets.

Rob has great notes on managing the output.

Perhaps you should publish the passwords you collect (internally) as fair warning to your users.

Or use them in an attempt to hack your own network, before someone else does.


PS: It compiles, etc., but even for the pleasure of spying on Mirai botnets, I’m not lowering my shields.

Digital Redlining At Facebook

Saturday, October 29th, 2016

“Redlining” has gone digital.

Facebook Lets Advertisers Exclude Users by Race by Julia Angwin and Terry Parris Jr. illustrates my point that improved technology isn’t making us better people, it’s enabling our bigotry to be practiced in new and more efficient ways.

Angwin and Parris write:

Imagine if, during the Jim Crow era, a newspaper offered advertisers the option of placing ads only in copies that went to white readers.

That’s basically what Facebook is doing nowadays.

The ubiquitous social network not only allows advertisers to target users by their interests or background, it also gives advertisers the ability to exclude specific groups it calls “Ethnic Affinities.” Ads that exclude people based on race, gender and other sensitive factors are prohibited by federal law in housing and employment.

It’s a great read, and Facebook points out that it wags its policy finger at use of:

…the targeting options for discrimination, harassment, disparagement or predatory advertising practices.

“We take a strong stand against advertisers misusing our platform: Our policies prohibit using our targeting options to discriminate, and they require compliance with the law,” said Steve Satterfield, privacy and public policy manager at Facebook. “We take prompt enforcement action when we determine that ads violate our policies.”

Bigots near and far are shaking in their boots, just thinking about the policy finger of Facebook.

In discussion of this modernized form of “redlining,” it may be helpful to know the origin of the term and its impact on society.

Here’s a handy synopsis of the practice:

The FHA also explicitly practiced a policy of “redlining” when determining which neighborhoods to approve mortgages in. Redlining is the practice of denying or limiting financial services to certain neighborhoods based on racial or ethnic composition without regard to the residents’ qualifications or creditworthiness. The term “redlining” refers to the practice of using a red line on a map to delineate the area where financial institutions would not invest (see residential security maps).

The FHA allowed personal and agency bias in favor of all white suburban subdivisions to affect the kinds of loans it guaranteed, as applicants in these subdivisions were generally considered better credit risks. In fact, according to James Loewen in his 2006 book Sundown Towns, FHA publications implied that different races should not share neighborhoods, and repeatedly listed neighborhood characteristics like “inharmonious racial or nationality groups” alongside such noxious disseminates as “smoke, odors, and fog.” One example of the harm done by the FHA is as follows:

In the late 1930’s, as Detroit grew outward, white families began to settle near a black enclave adjacent to Eight Mile Road. By 1940, the blacks were surrounded, but neither they nor the whites could get FHA insurance because of the proximity of an inharmonious racial group. So, in 1941, an enterprising white developer built a concrete wall between the white and black areas. The FHA appraisers then took another look and approved the mortgages on the white properties.

Yes, segregated housing was due in part to official U.S. (not Southern) government policies.

I live near Atlanta, GA. so here’s a portion of an actual “redlining” map:


You can see the full version here.

Racially segregated housing wasn’t a matter of chance or birds of a feather, it was official government policy. Public government policy. They lacked the moral sensitivity to be ashamed of their actions.

There are legitimate ad targeting decisions.

Showing me golf club ads is a lost cause. 😉 As with a number of similar items.

But when does race become a legitimate exclusion category? And for what products?

For more historical data on the Home Owners’ Loan Corporation and a multitude of maps, see: Digital HOLC Maps by LaDale Winling. You may also enjoy his main site: Urban Oasis.

Just so you know, redlining isn’t a racist practice of the distant past. Redlining, a/k/a, housing discrimination, is alive and well today.

Does a 50% discrimination rate in Boston (Mass.) sound like it remains a problem?

PS: New Clinton/Podesta posts are coming! I’m posting while my scripts run in the background. New 3.5 GB dump.

Schneider Electric Unity Pro Targeting Data

Saturday, October 29th, 2016

Major Vulnerability Found in Schneider Electric Unity Pro by Tom Spring should have Open Source Intelligence (OSINT) gurus in high gear.

From the post:

Schneider Electric is grappling with a critical vulnerability found in its flagship industrial controller management software called Unity Pro that allows hackers to remotely execute code on industrial networks.

The warning comes from Indegy, an industrial cybersecurity firm. Indegy discovered the vulnerability and issued a report on the flaw Tuesday. Mille Gandelsman, CTO of Indegy, called the vulnerability a “major concern” and urged anyone running Unity Pro software to update to the latest version. Unity Pro, which runs on Window-based PCs, is used for managing and programing millions of industrial controllers around the world.

“If the IP address of the Windows PC running the Unity Pro software is accessible to the internet, then anyone can exploit the software and run code on hardware,” Gandelsman told Threatpost. “This is the crown jewel of access. An attacker can do anything they want with the controllers themselves.

The flaw resides in a component of Unity Pro software named Unity Pro PLC Simulator, used to test industrial controllers, according to Indegy.

“This is what an attacker would want to have access to in order to impact the actual production process within an ICS physical environment. That includes the valves, turbines, centrifuges and smart meters. These are accessible from the engineering stations natively,” Gandelsman said. “With this type of access, an attacker can use it to change the recipe to drugs being manufactured by industrial control systems or turn off the power grid of a city.”
… (emphasis added)

How is Open Source Intelligence (OSINT) relevant?

Schneider Electric products are found in:

Afghanistan | Guatemala | Puerto Rico
Albania | Guinea | Qatar
Algeria | Guinea-Bissau | Reunion Island
Angola | Guyana | Romania
Antigua and Barbuda | Haïti | Russia
Argentina | Honduras | Rwanda
Armenia | Hong Kong | Saint Barthelemy
Australia | Hungary | Saint Lucia
Austria | Iceland | Saint Martin
Azerbaijan | India | Saint Pierre and Miquelon
Bahamas | Indonesia | Saint Vincent and the Grenadines
Bahrain | Iran | Samoa
Bangladesh | Iraq | Sao Tome and Principe
Barbados | Ireland | Saudi Arabia
Belarus | Israel | Senegal
Belgium | Italy | Serbia
Benin | Ivory Coast | Seychelles
Bermuda | Jamaica | Sierra Leone
Bhutan | Japan | Singapore
Bolivia | Jordan | Slovakia
Bosnia-Herzegovina | Kazakhstan | Slovenia
Botswana | Kenya | Solomon Islands
Brazil | Kosovo | Somalia
Brunei | Kuwait | South Africa
Bulgaria | Kyrgyzstan | South Korea
Burkina-Faso | Laos | Spain
Burundi | Latvia | Sri Lanka
Cambodia | Lebanon | Sudan
Cameroon | Liberia | Suriname
Canada | Libya | Swaziland
Cape Verde | Liechtenstein | Sweden
Cayman Islands | Lithuania | Switzerland
Central African Republic | Luxembourg | Taiwan
Chad | Macedonia | Tanzania
Chile | Madagascar | Thailand
China | Malawi | Togo
Colombia | Malaysia | Tonga
Comoros | Maldives | Trinidad and Tobago
Congo | Mali | Tunisia
Cook Islands | Malta | Turkey
Costa Rica | Martinique | Turkmenistan
Croatia | Mauritania | Turks and Caicos Islands
Cuba | Mauritius | Tuvalu
Cyprus | Mayotte | Uganda
Czech Republic | Mexico | Ukraine
Denmark | Moldova | United Arab Emirates
Djibouti | Monaco | United Kingdom
Dominican Republic | Mongolia | United States
DR of Congo | Montenegro | Uruguay
Ecuador | Montserrat | Uzbekistan
Egypt | Morocco | Vanuatu
El Salvador | Mozambique | Venezuela
Equatorial Guinea | Myanmar | Vietnam
Eritrea | Namibia | Virgin Islands
Estonia | Nepal | Wallis and Futuna
Ethiopia | Netherlands | Yemen
Fiji | New Caledonia | Zambia
Finland | New Zealand | Zimbabwe
France | Nicaragua |
French Guiana | Niger |
French Polynesia | Nigeria |
Gabon | Norway |
Gambia | Oman |
Georgia | Pakistan |
Germany | Peru |
Ghana | Philippines |
Greece | Poland |
Guadeloupe | Portugal |

Open Source Intelligence (OSINT) techniques can be used to identify and locate Schneider Electric Unity Pro installations, an important step in assessing their vulnerabilities.

Such techniques can provide actionable and valuable intelligence for planners, government officials, risk assessment and other purposes.

In the interest of “responsible disclosure” (read “reserved for paying customers”), I omit my suggestions on the best OSINT techniques for this particular use case.

PS: All versions of the Schneider Electric Unity Pro prior to its latest patch are vulnerable.

Clinton/Podesta Emails – Towards A More Complete Graph (Part 2)

Friday, October 28th, 2016

I assume you are starting with DKIM-complete-podesta-1-18.txt.gz.

If you are starting with another source, you will need different instructions. 😉

First, remember from Clinton/Podesta Emails – Towards A More Complete Graph (Part 1) that I wanted to delete all the < and > signs from the text.

That’s easy enough (uncompress the text first):

sed 's/<//g' DKIM-complete-podesta-1-18.txt > DKIM-complete-podesta-1-18-01.txt

followed by:

sed 's/>//g' DKIM-complete-podesta-1-18-01.txt > DKIM-complete-podesta-1-18-02.txt

Here’s where we started:

001 00032251.eml|False|2010-10-06 18:29:52-04:00|Joshua Dorner <>|”‘'” <>|[big campaign] Follow-up Materials from Background Briefing on the Chamber’s Foreign Funding, fyi|<A28459BA2B4D5D49BED0238513058A7F012ADC1EF58F>
002 00032146.eml|True|2015-04-14 18:19:46-04:00|Josh Schwerin <>|hrcrapid <>|=?UTF-8?Q?NYT=3A_Hillary_Clinton=E2=80=99s_Chipotle_Order=3A_Above_Avera?= =?UTF-8?Q?ge?=|<>

Here’s the result after the first two sed scripts:

001 00032251.eml|False|2010-10-06 18:29:52-04:00|Joshua Dorner|”‘'”|[big campaign] Follow-up Materials from Background Briefing on the Chamber’s Foreign Funding, fyi|
002 00032146.eml|True|2015-04-14 18:19:46-04:00|Josh Schwerin|hrcrapid|=?UTF-8?Q?NYT=3A_Hillary_Clinton=E2=80=99s_Chipotle_Order=3A_Above_Avera?= =?UTF-8?Q?ge?=|

BTW, I increment the numbers of my result files, DKIM-complete-podesta-1-18-01.txt, DKIM-complete-podesta-1-18-02.txt, because when I don’t, I run different sed commands on the same original file, expecting a cumulative result.

That’s spelled – disappointment and wasted effort looking for problems that aren’t problems. Number your result files.

The nodes and edges mentioned in Clinton/Podesta Emails – Towards A More Complete Graph (Part 1):


  • Emails, message-id is ID and subject is label, make Wikileaks id into link
  • From/To, email addresses are ID and name is label
  • True/False, true/false as ID, True/False as labels
  • Date, truncated to 2015-07-24 (example), date as id and label


  • To/From – Edges with message-id (source) email-address (target) to/from (label)
  • Verify – Edges with message-id (source) true/false (target) verify (label)
  • Date – Edges with message-id (source) – date (target) date (label)
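The edge list above can be sketched in Python as follows. This is an illustration under stated assumptions, not the script used for these posts: the record layout (a dict with these keys), the function name, and the example.org addresses and message-id are all hypothetical.

```python
def edges_for(record):
    """Build (source, target, label) edges for one parsed email record."""
    mid = record["message_id"]
    edges = [(mid, record["from"], "from")]
    edges += [(mid, addr, "to") for addr in record["to"]]
    edges.append((mid, str(record["verified"]).lower(), "verify"))
    edges.append((mid, record["date"][:10], "date"))  # keep only YYYY-MM-DD
    return edges


# Hypothetical record; addresses and message-id are placeholders.
record = {
    "message_id": "msg-0001@example.org",
    "verified": False,
    "date": "2015-07-24 12:58:16-04:00",
    "from": "sender@example.org",
    "to": ["a@example.org", "b@example.org"],
}

for edge in edges_for(record):
    print(";".join(edge) + ";")
```

Each email contributes one “from” edge, one “to” edge per recipient, one “verify” edge and one “date” edge, all sharing the message-id as source.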

Am I missing anything? The longer I look at problems like this the more likely my thinking/modeling will change.

What follows is very crude, command line creation of the node and edge files. Something more elaborate could be scripted/written in any number of languages.

Our fields (numbered for reference) are:

ID – 1| Verified – 2| Date – 3| From – 4| To – 5| Subject – 6| Message-Id – 7

You don’t have to take my word for it, try this:

awk 'FS="|" { print $7}' DKIM-complete-podesta-1-18-02.txt

The output prints to the console. Those all look like message-ids to me, well, with the exception of the one that reads ” 24 September.”

How much dirty data do you think is in the message-id field?

A crude starting assumption is that any message-id field without the “@” character is dirty.

Let’s try:

awk 'FS="|" { print $7}' DKIM-complete-podesta-1-18-02.txt | grep -v @ | wc -l

Which means we are going to extract the 7th field, search (grep) over those results for the “@” sign, where the -v switch means only print lines that DO NOT match, and we will count those lines with wc -l.

Ready? Press return.

I get 594 “dirty” message-ids.
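If you would rather not trust my awk, here is a minimal Python 3 sketch of the same count. The function name and the sample rows are mine, made up for illustration. It assumes the file is plain "|" delimited text and, like awk, treats a missing seventh field as empty.

```python
def count_dirty_message_ids(lines, sep="|", field=7):
    """Count rows whose message-id field (1-based `field`) lacks an "@"."""
    dirty = 0
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        # Like awk, treat a missing field as an empty string.
        value = fields[field - 1] if len(fields) >= field else ""
        if "@" not in value:
            dirty += 1
    return dirty


# Hypothetical rows: the second has a "|" inside its subject line,
# which pushes part of the subject into the seventh field.
sample = [
    "1 a.eml|True|2015-07-24|A <a@example.com>|B <b@example.com>|Hello|<id1@example.com>\n",
    "2 b.eml|False|2015-07-25|A <a@example.com>|B <b@example.com>|Rolling Stone | MSNBC|<id2@example.com>\n",
]
print(count_dirty_message_ids(sample))
```

To run it on the real file, pass an open file handle instead of the sample list. Note that a header row, if present, has no "@" and is counted as dirty by both this sketch and the awk/grep pipeline.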

Here is a sampling:

Rolling Stone
MSNBC, Jeff Weaver interview on Sanders health care plan and his Wall St. ad
Texas Tribune
20160120 Burlington, IA Organizing Event
start spreadin’ the news…..
Building Trades Union (Keystone XL)
Rubio Hits HRC On Cuba, Russia
Sourcing Story
Drivers Licenses
Day 1
H1N1 Flu Shot Available

Those look an awful lot like subject lines to me. You?

I suspect those subject lines had the separator character “|” in those lines, before we extracted the data from the .eml files.

I’ve tried to repair the existing files but the cleaner solution is to return to the extraction script and the original email files.
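One possible fix at the extraction step, sketched here in Python 3 with the standard csv module rather than the string concatenation my script uses: let the writer quote any field that contains the separator. This is an assumption about how the repair could go, not what I have run against the full set, and the function names and sample row are hypothetical.

```python
import csv
import io


def write_rows(rows, handle):
    """Write rows as "|" delimited text, quoting fields that contain "|"."""
    writer = csv.writer(handle, delimiter="|", quoting=csv.QUOTE_MINIMAL,
                        lineterminator="\n")
    for row in rows:
        writer.writerow(row)


def read_rows(handle):
    """Read rows back, honoring the quoting added by write_rows."""
    return list(csv.reader(handle, delimiter="|"))


# Hypothetical record whose subject contains the separator character.
rows = [["2 b.eml", "False", "2015-07-25", "a@example.com",
         "b@example.com", "Rolling Stone | MSNBC", "<id2@example.com>"]]

buffer = io.StringIO()
write_rows(rows, buffer)
buffer.seek(0)
round_trip = read_rows(buffer)
print(len(round_trip[0]))  # still seven fields
```

The trade-off is that awk and sed no longer see a naive "|" split, so downstream command line steps would need a CSV-aware tool or a separator that cannot occur in a subject line.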

More on that tomorrow!

Clinton/Podesta Emails – Towards A More Complete Graph (Part 1)

Friday, October 28th, 2016

Gephi is a great tool, but it’s only as good as its input.

The Gephi 8.2 email importer (missing in Gephi 9.*) is lossy, informationally speaking, as I have mentioned before.

Here’s a sample from the verification results on podesta-release-1-18:

9981 00045326.eml|False|2015-07-24 12:58:16-04:00|Oren Shur |John Podesta , Robby Mook , Joel Benenson , “Margolis, Jim” , Mandy Grunwald , David Binder , Teddy Goff , Jennifer Palmieri , Kristina Schake , Christina Reynolds , Katie Connolly , “Kaye, Anson” , Peter Brodnitz , “Rimel, John” , David Dixon , Rich Davis , Marlon Marshall , Michael Halle , Matt Paul , Elan Kriegel , Jake Sullivan |FW: July IA Poll Results|<>

The Gephi 8.2 mail importer fails to create a node representing an email message.

I propose we cure that failure by taking the last field, here:

and the next to last field:

FW: July IA Poll Results

and putting them as id and label, respectively in a node list:; “FW: July IA Poll Results”;

As part of the transformation, we need to remove the < and > signs around the message ID, then add a ; to mark the end of the ID field and put double quote ” “ around the subject to use it as a label. Then close the second field with another ;.

While we are talking about nodes, all the email addresses change from:

Oren Shur

to:; “Oren Shur”;

which are ID and label of the node list, respectively.
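For readers who prefer a script to sed, here is a rough Python 3 sketch of both node transformations. The function names, the message-id and the email address are hypothetical placeholders. It relies on email.utils.parseaddr from the standard library and does no escaping, so a subject containing a double quote or a semicolon would still need attention.

```python
from email.utils import parseaddr


def email_node(message_id, subject):
    """Return 'id; "label";' for an email, dropping the < and > around the id."""
    return '%s; "%s";' % (message_id.strip().strip("<>"), subject)


def address_node(header_value):
    """Return 'address; "name";' for a single From/To entry."""
    name, address = parseaddr(header_value)
    return '%s; "%s";' % (address, name)


# Hypothetical values for illustration only.
print(email_node("<abc123@mail.example.com>", "FW: July IA Poll Results"))
print(address_node("Oren Shur <oshur@example.com>"))
```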

I could remove the < and > characters as part of the extraction script but will use sed at the command line instead.

Reminder: Always work on a copy of your data, never the original.

Then we need to create an edge list, one that represents the relationships between the email (as node) to the sender and receivers of the email (also nodes). For this first iteration, I’m going to use labels on the edges to distinguish between senders and receivers.

Assuming my first row of the edges file reads:

Source; Target; Role (I did not use “Type” because I suspect that is a controlled term for Gephi.)

Then the first few edges would read:;>; from;;>; to;;; to;;; to;;; to;

As you can see, this is going to be a “busy” graph! 😉
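To make the edge list concrete, a minimal Python 3 sketch, assuming the Source; Target; Role layout above. The function name and all addresses are invented for illustration. It uses email.utils.getaddresses so that a quoted name with a comma in it, such as "Margolis, Jim", is not split into two recipients.

```python
from email.utils import getaddresses


def to_from_edges(message_id, from_header, to_header):
    """Return 'source; target; role;' lines linking an email to its people."""
    source = message_id.strip().strip("<>")
    edges = []
    for _name, address in getaddresses([from_header]):
        edges.append("%s; %s; from;" % (source, address))
    for _name, address in getaddresses([to_header]):
        edges.append("%s; %s; to;" % (source, address))
    return edges


# Hypothetical message-id and addresses.
for edge in to_from_edges(
        "<m1@example.com>",
        "Oren Shur <oshur@example.com>",
        'John Podesta <jp@example.com>, "Margolis, Jim" <jm@example.com>'):
    print(edge)
```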

Filtering is going to play an important role in exploring this graph, so let’s add nodes that will help with that process.

I propose we add to the node list:

true; True
false; False

as id and labels.

Which means for the edge list we can have:; true; verify;

Do you have an opinion on the order, source/target for true/false?

Thinking this will enable us to filter nodes that have not been verified or to include only those that have failed verification.

For experimental purposes, I think we need to rework the date field:

2015-07-24 12:58:16-04:00

I would truncate that to:


and add such truncated dates to the node list:

2015-07-24; 2015-07-24;

as ID and label, respectively.

Then for the edge list:; 2015-07-24; date;

Reasoning that we can filter to include/exclude nodes based on dates, which if you add enough email boxes, could help visualize the reaction to and propagation of emails.
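A small Python 3 sketch of the date handling, assuming the date field always looks like the example above (ISO style with a UTC offset). The function name and message-id are hypothetical. Note that truncating this way keeps the sender's local calendar date, not the UTC date.

```python
from datetime import datetime


def date_node_and_edge(message_id, date_string):
    """Return the date node line and the email-to-date edge line."""
    day = datetime.fromisoformat(date_string).date().isoformat()
    node = "%s; %s;" % (day, day)
    edge = "%s; %s; date;" % (message_id.strip().strip("<>"), day)
    return node, edge


# Hypothetical message-id, date taken from the sample row.
node, edge = date_node_and_edge("<m1@example.com>", "2015-07-24 12:58:16-04:00")
print(node)
print(edge)
```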

Even assuming named participants in these emails have “deleted” their inboxes, there are always automatic backups. It’s just a question of persistence before the rest of this network can be fleshed out.

Oh, one last thing. You have probably noticed the Wikileaks “ID” that forms part of the filename?

9981 00045326.eml

The first part forms the end of a URL to link to the original post at Wikileaks.

Thus, in this example, 9981 becomes:

The general form being:

For the convenience of readers/users, I want to modify my earlier proposal for the email node list entry from:; “FW: July IA Poll Results”;

to:; “FW: July IA Poll Results”;;

Where the third field is “link.”

I am eliding over lots of relationships and subjects but I’m not reluctant to throw it all away and start over.

Your investment in a model isn’t lost by tossing the model, you learn something with every model you build.

Scripting underway, a post on that experience and the node/edge lists to follow later today.

Podesta/Clinton Emails: Filtering by Email Address (Pimping Bill Clinton)

Thursday, October 27th, 2016

The Bill Clinton, Inc. story reminds me of:

Although I steadfastly resist imagining either Bill or Hillary in that video. Just won’t go there!

Where a graph display can make a difference is that instead of just the one email/memo from Bill’s pimp, we can rapidly survey all of the emails in which he appears, in any role.


I ran that on Gephi 8.2 against podesta-release-1-18 but the results were:

Nodes 0, Edges 0.

Hmmm, there is something missing, possibly the CSV file?

I checked and podesta-release-1-18 has 393 emails where appears.

Could try to find the “right” way to filter on email addresses but for now, let’s take a dirty short-cut.

I created a directory to hold all emails with and ran into all manner of difficulties because the file names are plagued with spaces!

So much so that I unintentionally (read “by mistake”) saved all the posts from podesta-release-1-18 to a different folder than the ones from podesta-release-19.


Well, but there is a happy outcome and an illustration of yet another Gephi capability.

I built the first graph from the posts from podesta-release-1-18 and then, with that graph open, imported the posts from podesta-release-19 and appended those results to the open graph.

How cool is that!

Imagine doing that across data sets, assuming you paid close attention to identifiers, etc.

Sorry, back to the graphs, here is the random layout once the graphs were combined:


Applying the Yifan Hu network layout:


I ran network statistics on network diameter and applied colors based on betweenness:


And finally, adjusted the font and turned on the labels:


I have spent a fair amount of time just moving stuff about but imagine if you could interactively explore the emails, creating and trashing networks based on to:, from:, cc:, dates, content, etc.

The limits of Gephi imports were a major source of pain today.

I’m dodging those tomorrow in favor of creating node and adjacency tables with scripts.

PS: Don’t John Podesta and Doug Band look like two pimps in a pod? 😉

PPS: If you haven’t read the pimping Bill Clinton memo. (I think it has some other official title.)

Another Day, Another Law To Ignore – Burner Drones Anyone?

Thursday, October 27th, 2016

Sweden bans cameras on drones, deeming it illegal surveillance by Lisa Vaas.

From the post:

Sweden last week banned the use of camera drones without a special permit, infuriating hobby flyers and an industry group but likely pleasing privacy campaigners.

Drone pilots will now have to show that there’s a legitimate benefit that outweighs the public’s right to privacy – and there are no exemptions for journalists, nor any guarantee that a license will be granted.

The cost of a license depends on variables such as the takeoff weight of the drone and whether it’s going to be flown further than the pilot can see, and none of the licenses are cheap. Costs range from an annual license fee of €1,200 right up to a maximum hourly fee of €36,000.

UAS Sweden (Unmanned Aerial System – SWEDEN) has objected to the ruling on the potential for loss of jobs.

The interests of the industry will be better met with development and advocacy of burner drones. Similar to a burner cellphone, it isn’t intended for recovery/re-use.

Burner drones are critical to reporting on government attacks like the one imminent on #NoDAPL camps (North Dakota).

Burner drones keep journalists beyond the reach of batons, tear gas and water cannon, all good things.

Just searching quickly, Airblock has the right idea but its capabilities are too limited to make an effective burner drone for journalists.

Something on that order, with a camera, longer range/duration, modular is good, especially if you can add on parts that “bite.”

Privacy advocates miss the fact there is no privacy in the face of modern government surveillance. Banning drones only reduces the ability of people to counter-spy upon their less than truthful governments.

In case you are interested, the administrative court ruling in question:

The organization of camera on a drone but not for the camera in a car


The Supreme Administrative Court has in two judgments found that a camera mounted on a drone requires a permit under camera surveillance law while a camera mounted behind the windscreen of a car or on a bicycle handlebar does not need permission.

Please ping me with notices of burner drone projects. Thanks!

Parsing JSON is a Minefield

Wednesday, October 26th, 2016

Parsing JSON is a Minefield by Nicolas Seriot.


JSON is the de facto standard when it comes to (un)serialising and exchanging data in web and mobile programming. But how well do you really know JSON? We’ll read the specifications and write test cases together. We’ll test common JSON libraries against our test cases. I’ll show that JSON is not the easy, idealised format as many do believe. Indeed, I did not find two libraries that exhibit the very same behaviour. Moreover, I found that edge cases and maliciously crafted payloads can cause bugs, crashes and denial of services, mainly because JSON libraries rely on specifications that have evolved over time and that let many details loosely specified or not specified at all.
(emphasis in original)

Or the summary (tweet) that caught my attention:

I published: Parsing JSON is a Minefield … in which I could not find two parsers that exhibited the same behaviour

Or consider this graphic, which in truth needs a larger format than even the original:


Don’t worry, you can’t read the original at its default resolution. I had to enlarge the view several times to get a legible display.

More suitable for a poster sized print.

Perhaps something to consider for Balisage 2017 as swag?

Excellent work and a warning against the current vogue of half-ass standardization in some circles.

“We know what we meant” is a sure sign of poor standards work.

Clinton/Podesta 19, DKIM-verified-podesta-19.txt.gz, DKIM-complete-podesta-19.txt.gz

Wednesday, October 26th, 2016

Michael Best, @NatSecGeek, posted release 19 of the Clinton/Podesta emails at: today.

A total of 1518 emails, zero (0) of which broke my script!

Three hundred and sixty-three were DKIM verified! DKIM-verified-podesta-19.txt.gz.

The full set of emails, verified and not: DKIM-complete-podesta-19.txt.gz.

I’m still pondering how to best organize the DKIM verified material for access.

I could segregate “verified” emails for indexing. So any “hits” from those searches are from “verified” emails?

Ditto for indexing only attachments of “verified” emails.

What about a graph constructed solely from “verified” emails?

Or should I make verified a property of the emails as nodes? Reasoning that aside from exploring the email importation in Gephi 8.2, it would not be that much more difficult to build node and adjacency lists from the raw emails.
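Segregating the verified emails is cheap to try either way. A minimal Python 3 sketch, assuming the "|" delimited layout of the complete file, where the second field is the string True or False. The function name and sample rows are hypothetical.

```python
def verified_rows(lines, sep="|"):
    """Yield only the rows whose second field (Verified) is "True"."""
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) > 1 and fields[1] == "True":
            yield line


# Hypothetical rows in the complete-file layout.
sample = [
    "id|verified|date|from|to|subject|message-id \n",
    "1 a.eml|True|2015-07-24|a@example.com|b@example.com|Hello|<id1@example.com>\n",
    "2 b.eml|False|2015-07-25|a@example.com|b@example.com|Bye|<id2@example.com>\n",
]
kept = list(verified_rows(sample))
print(len(kept))
```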


Serious request for help.

Like Gollum, I know what I keep in my pockets, but I have no idea what other people keep in theirs.

What would make this data useful to you?

No Frills Gephi (8.2) Import of Clinton/Podesta Emails (1-18)

Wednesday, October 26th, 2016

Using Gephi 8.2, you can create graphs of the Clinton/Podesta emails based on terms in subject lines or the body of the emails. You can interactively work with all 30K+ (as of today) emails and extract networks based on terms in the posts. No programming required. (Networks based on terms will appear tomorrow.)

If you have Gephi 8.2 (I can’t find the import spigot in 9.0 or 9.1), you can import the Clinton/Podesta Emails (1-18) for analysis as a network.

To save you the trouble of regressing to Gephi 8.2, I performed a no frills/default import and exported that file as podesta-1-18-network.gephi.gz.

Download and uncompress podesta-1-18-network.gephi.gz, then you can pick up at timemark 3.49.

Open the file (your location may differ):


Obligatory hair-ball graph visualization. 😉


Considerably less appealing than Jennifer Golbeck’s but be patient!

First step, Layout -> Yifan Hu. My results:


Second step, Network Diameter statistics (right side, run).

No visible impact on the graph, but now you can change the color and size of nodes in the graph. That is, they now have attributes on which you can base the assignment of color and size.

Tutorial gotcha: Not one of Jennifer’s tutorials but I was watching a Gephi tutorial that skipped the part about running statistics on the graph prior to assignment of color and size. Or I just didn’t hear it. The menu options appear in documentation but you can’t access them unless and until you run network statistics or have attributes for the assignment of color and size. Run statistics first!

Next, assign colors based on betweenness centrality:


The densest node is John Podesta, but if you remove his node, rerun the network statistics and re-layout the graph, here is part of what results:


A no frills import of 31,819 emails results in a graph of 3235 nodes and 11,831 edges.

That’s because nodes and edges combine (merge to you topic map readers) when they have the same identifier or, for edges, are between the same nodes.

Subject to correction, when that combining/merging occurs, the properties on the respective nodes/edges are accumulated.
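To see why thousands of emails collapse into far fewer nodes and edges, here is a toy Python 3 sketch of merging by identifier. It is an illustration of the idea only, not Gephi's importer. Counting repeats as an edge weight is my assumption about what "accumulated" could look like, and the names are made up.

```python
from collections import Counter


def merge_graph(pairs):
    """Merge (sender, recipient) pairs into unique nodes and weighted edges."""
    nodes = set()
    edges = Counter()
    for sender, recipient in pairs:
        nodes.update((sender, recipient))
        # Repeated sender/recipient pairs collapse into one edge with a count.
        edges[(sender, recipient)] += 1
    return nodes, edges


# Three hypothetical emails between three hypothetical addresses.
pairs = [("a@example.com", "b@example.com"),
         ("a@example.com", "b@example.com"),
         ("a@example.com", "c@example.com")]
nodes, edges = merge_graph(pairs)
print(len(nodes), len(edges))
```

Three emails in, two edges out, and no node anywhere for the emails themselves.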

Topic mappers already realize there are important subjects missing, some 31,819 of them. That is the emails themselves don’t by default appear as nodes in the network.

Ian Robinson, Jim Webber & Emil Eifrem illustrate this lossy modeling in Graph Databases this way:


Modeling emails without the emails is rather lossy. 😉

Other nodes/subjects we might want:

  • Multiple to: emails – Is who was also addressed important?
  • Multiple cc: emails – Same question as with to:.
  • Date sent as properties? So evolution of network/emails can be modeled.
  • Capture “reply-to” for relationships between emails?

Other modeling concerns?

Bear in mind that we can suppress a large amount of the detail so you can interactively explore the graph and only zoom into/display data after finding interesting patterns.

Some helpful links: The email collection as bulk download, thanks to Michael Best, @NatSecGeek. Where you can grab a copy of Gephi 8.2.

Spying On Government Oppression: Est. Oct. 25 – Nov. 4, 2016 – #NoDAPL camps (North Dakota)

Wednesday, October 26th, 2016

Police forces have suborned the FAA into declaring a no-fly zone in a seven (7) mile radius of #NoDAPL camps in North Dakota.

Truthful video of unprovoked violence against peaceful protesters may “interfere with the election,” and/or their continuance of this cultural/environment/social outrage.

The FAA posted this helpful map:


James Peach reports contact information in FAA Issues “No Fly Zone” Over Area of DAPL Protests at Standing Rock, ND:

FAA Regional Office
Contact: Laurie Suttmeier
Telephone: (701)-667-3224

“No Fly Zone” Notice –
FAA Temporary Flight Restrictions

Morton County Sheriff’s Department
Facebook Page: Morton County Sheriff’s Department on FB
Contact: Sheriff Kyle Kirchmeier
Telephone: (701) 667-3330

In case you are behind on this particular government crime against a sovereign people and its own citizens, catch up with: The fight over the Dakota Access Pipeline, explained by Brad Plumer or, Life in the Native American oil protest camps (BBC).

The deployment of government forces is fluid so drones are essential to effective resistance.

And to capture and stream government atrocities in real time.

Small wonder the FAA is a co-conspirator in this criminal enterprise.

Clinton/Podesta 1-18,,

Tuesday, October 25th, 2016

After a long day of waiting for scripts to finish and re-running them to cross-check the results, I am happy to present:

DKIM-verified-podesta-1-18.txt.gz, which consists of the Podesta emails (7526) which returned true for a test of their DKIM signature.

The complete set of the results for all 31,819 emails, can be found in:


An email that has been “verified” has a cryptographic guarantee that it was sent exactly as it appears to you now.

An email that fails verification, may be just as trustworthy, but its DKIM signature has failed for any number of reasons.

One of my motivations for classifying these emails is to enable the exploration of why DKIM verification failed on some of these emails.

Question: What would make this data more useful/accessible to journalists/bloggers?

I ask because, while dumping data and/or transformations of data can be useful, it is synthesizing data into a coherent narrative that is the essence of journalism/reporting.

I would enjoy doing the first in hopes of furthering the second.

PS: More emails will be added to this data set as they become available.

Corrupt (fails with my script) files in Clinton/Podesta Emails (14 files out of 31,819)

Tuesday, October 25th, 2016

You may use some other definition of “file corruption” but that’s mine and I’m sticking to it.


The following are all the files that failed against my script and the actions I took to proceed with parsing the files. Not today but I will make a sed script to correct these files as future accumulations of emails appear.

13544 00047141.eml

Date string parse failed:

Date: Wed, 17 Dec 2008 12:35:42 -0700 (GMT-07:00)

Deleted (GMT-07:00).

15431 00059196.eml

Date string parse failed:

Date: Tue, 22 Sep 2015 06:00:43 +0800 (GMT+08:00)

Deleted (GMT+08:00).

155 00049680.eml

Date string parse failed:

Date: Mon, 27 Jul 2015 03:29:35 +0000

Assuming, as the email reports, was the sender and was the intended receiver, then the offset from UT is clearly wrong (+0000).

Deleted +0000.

6793 00059195.eml

Date string parse failed:

Date: Tue, 22 Sep 2015 05:57:54 +0800 (GMT+08:00)

Deleted (GMT+08:00).
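Rather than editing the files by hand, the trailing parenthesized comment could be stripped in code before parsing. A minimal Python 3 sketch, assuming the comment always comes last in the header value. The function name is hypothetical, and I use the standard library's email.utils.parsedate_to_datetime here in place of dateutil so the example has no third-party dependency.

```python
import re
from email.utils import parsedate_to_datetime

# A parenthesized comment at the very end of the value, e.g. "(GMT-07:00)".
_TRAILING_COMMENT = re.compile(r"\s*\([^()]*\)\s*$")


def clean_date_header(value):
    """Remove a trailing parenthesized comment from a Date header value."""
    return _TRAILING_COMMENT.sub("", value).strip()


cleaned = clean_date_header("Wed, 17 Dec 2008 12:35:42 -0700 (GMT-07:00)")
parsed = parsedate_to_datetime(cleaned)
print(cleaned)
print(parsed.isoformat())
```

Values without a trailing comment pass through unchanged, so the cleaner can be applied to every Date header, not just the known failures.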

9404 0015843.eml DKIM failure

All of the DKIM parse failures take the form:

Traceback (most recent call last):
File “”, line 18, in
verified = dkim.verify(data)
File “/usr/lib/python2.7/dist-packages/dkim/”, line 604, in verify
return d.verify(dnsfunc=dnsfunc)
File “/usr/lib/python2.7/dist-packages/dkim/”, line 506, in verify
File “/usr/lib/python2.7/dist-packages/dkim/”, line 181, in validate_signature_fields
if int(sig[b'x']) < int(sig[b't']):
KeyError: 't'

I simply deleted the DKIM-Signature in question. Will go down that rabbit hole another day.
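An alternative to deleting signatures is to catch the failure and record it, so the email stays untouched and the problem shows up in the results file. A minimal Python 3 sketch: safe_verify is a hypothetical helper, and the verify argument stands in for dkim.verify so the example runs without the dkim package installed.

```python
def safe_verify(data, verify):
    """Return "True"/"False" from verify(data), or an error marker if it raises."""
    try:
        return "True" if verify(data) else "False"
    except Exception as exc:
        # Record the exception type instead of aborting the whole run.
        return "Error: %s" % type(exc).__name__


def broken_verify(data):
    """Stand-in that fails the way the traceback above does."""
    raise KeyError("t")


print(safe_verify(b"raw email bytes", broken_verify))
print(safe_verify(b"raw email bytes", lambda data: True))
```

In the extraction script the call would become safe_verify(data, dkim.verify), and the Verified field would then carry three possible values rather than two.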

21960 00015764.eml

DKIM signature parse failure.

Deleted DKIM signature.

23177 00015850.eml

DKIM signature parse failure.

Deleted DKIM signature.

23728 00052706.eml

Invalid character in RFC822 header.

I discovered an errant ‘”‘ (double quote mark) at the start of a line.

Deleted the double quote mark.

And deleted ^M line endings.

25040 00015842.eml

DKIM signature parse failure.

Deleted DKIM signature.

26835 00015848.eml

DKIM signature parse failure.

Deleted DKIM signature.

28237 00015840.eml

DKIM signature parse failure.

Deleted DKIM signature.

29052 0001587.eml

DKIM signature parse failure.

Deleted DKIM signature.

29099 00015759.eml

DKIM signature parse failure.

Deleted DKIM signature.

29593 00015851.eml

DKIM signature parse failure.

Deleted DKIM signature.

Here’s an odd pattern for you: all nine (9) of the failures to parse the DKIM signatures were on mail originating from:

From: Gene Karpinski

But there are approximately thirty-three (33) emails from Karpinski so it doesn’t fail every time.

The file numbers are based on the 1-18 distribution of Podesta emails created by Michael Best, @NatSecGeek, at: Podesta Emails (zipped).

Finding “unknown string format” in 1.7 GB of files – Parsing Clinton/Podesta Emails

Tuesday, October 25th, 2016

Testing my “dirty” script against Podesta Emails (1.7 GB), some 17,296 files, I got the following message:

Traceback (most recent call last):
File “”, line 20, in
date = dateutil.parser.parse(msg[‘date’])
File “/usr/lib/python2.7/dist-packages/dateutil/”, line 697, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File “/usr/lib/python2.7/dist-packages/dateutil/”, line 303, in parse
raise ValueError, “unknown string format”
ValueError: unknown string format

Now I have to find the file that broke the script.

Beginning Python programmers are laughing at this point because they know using:

for name in glob.glob('*.eml'):

is going to make finding the offending file difficult.


Consulting the programming oracle (Stack Overflow) on ordering of glob.glob in Python I learned:

By checking the source code of glob.glob you see that it internally calls os.listdir, described here:

Key sentence: os.listdir(path) Return a list containing the names of the entries in the directory given by path. The list is in arbitrary order. It does not include the special entries ‘.’ and ‘..’ even if they are present in the directory.

Arbitrary order. 🙂

Interesting but not quite an actionable answer!

Take a look at:

Order is arbitrary, but you can sort them yourself

If you want sorted by name:

sorted(glob.glob('*.png'))

sorted by modification time:

import os
sorted(glob.glob('*.png'), key=os.path.getmtime)

sorted by size:

import os
sorted(glob.glob('*.png'), key=os.path.getsize)


So for ease in finding the offending file(s) I adjusted:

for name in glob.glob('*.eml'):

to:

for name in sorted(glob.glob('*.eml')):

Now I can tail the results file in question and the next file is where the script failed.
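A less roundabout option is to have the script name the file that broke it. A minimal Python 3 sketch, where process_files and the stand-in process function are hypothetical and not part of my script:

```python
import sys


def process_files(names, process, log=sys.stderr):
    """Run process(name) on each file in sorted order, recording failures."""
    failures = []
    for name in sorted(names):
        try:
            process(name)
        except Exception as exc:
            # Keep going, but remember which file failed and why.
            failures.append((name, str(exc)))
            print("FAILED %s: %s" % (name, exc), file=log)
    return failures


def fake_process(name):
    """Stand-in for the real per-file parsing; fails on one file."""
    if name == "b.eml":
        raise ValueError("unknown string format")


failures = process_files(["c.eml", "b.eml", "a.eml"], fake_process)
print(failures)
```

With glob.glob('*.eml') as the list of names and the real parsing as process, one pass yields every offending file instead of one per crash.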

More on the files that failed in a separate post.

Clinton/Podesta Emails, Dirty Data, Dirty Script For Testing

Monday, October 24th, 2016

Despite Michael Best’s (@NatSecGeek) efforts at collecting the Podesta emails for convenient bulk download, Podesta Emails Zipped, the bulk downloads don’t appear to have attracted a lot of attention. Some 276 views as of today.

Many of us deeply appreciate Michael’s efforts and would like to see the press and others taking fuller advantage of this remarkable resource.

To encourage you in that direction, what follows is a very dirty script for testing the DKIM signatures in the emails and extracting data from the emails for writing to a “|” delimited file.


import dateutil.parser
import email
import dkim
import glob

output = open("verify.txt", 'w')

# Header row naming the seven "|" delimited fields.
output.write ("id|verified|date|from|to|subject|message-id \n")

for name in glob.glob('*.eml'):
    filename = name
    f = open(filename, 'r')
    data = f.read()
    f.close()
    msg = email.message_from_string(data)

    # True when the DKIM signature verifies, False otherwise.
    verified = dkim.verify(data)

    date = dateutil.parser.parse(msg['date'])

    # Collapse runs of whitespace, including folded header lines.
    msg_from = str(msg['from'])
    msg_from1 = " ".join(msg_from.split())
    msg_to = str(msg['to'])
    msg_to1 = " ".join(msg_to.split())
    msg_subject = str(msg['subject'])
    msg_subject1 = " ".join(msg_subject.split())
    msg_message_id = msg['message-id']

    output.write (filename + '|' + str(verified) + '|' + str(date) +
                  '|' + msg_from1 + '|' + msg_to1 + '|' + msg_subject1 +
                  '|' + str(msg_message_id) + "\n")

output.close()


Download podesta-test.tar.gz, unpack that to a directory and then save/unpack to the same directory, then:


Import that into Gnumeric and with some formatting, your content should look like: test-clinton-24Oct2016.gnumeric.gz.

Verifying cryptographic signatures takes a moment, even on this sample of 754 files so don’t be impatient.

This script leaves much to be desired and as you can see, the results aren’t perfect by any means.

Comments and/or suggestions welcome!

This is just the first step in extracting information from this data set that could be used with similar data sets.

For example, if you want to graph this data, how are you going to construct IDs for the nodes, given the repetition of some nodes in the data set?

How are you going to model those relationships?

Bonus question: Is this output clean enough to run the script on the full data set? Which is increasing on a daily basis?

Data Science for Political and Social Phenomena [Special Interest Search Interface]

Sunday, October 23rd, 2016

Data Science for Political and Social Phenomena by Chris Albon.

From the webpage:

I am a data scientist and quantitative political scientist. I specialize in the technical and organizational aspects of applying data science to political and social issues.

Years ago I noticed a gap in the existing data literature. On one side was data science, with roots in mathematics and computer science. On the other side were the social sciences, with hard-earned expertise modeling and predicting complex human behavior. The motivation for this site and ongoing book project is to bridge that gap: to create a practical guide to applying data science to political and social phenomena.

Chris has organized three hundred and twenty-eight pages on Data Wrangling, Python, R, etc.

If you like learning from examples, this is the site for you!

Including this site, what other twelve (12) sites would you include in a Python/R Data Science search interface?

That is an interface that has indexed only that baker’s dozen of sites. So you don’t spend time wading through “the G that is not named” search results.

Serious question.

Not that I would want to maintain such a beast for external use, but having a local search engine tuned to your particular interests could be nice.

Boosting (in Machine Learning) as a Metaphor for Diverse Teams [A Quibble]

Sunday, October 23rd, 2016

Boosting (in Machine Learning) as a Metaphor for Diverse Teams by Renee Teate.

Renee’s summary:

tl;dr: Boosting ensemble algorithms in Machine Learning use an approach that is similar to assembling a diverse team with a variety of strengths and experiences. If machines make better decisions by combining a bunch of “less qualified opinions” vs “asking one expert”, then maybe people would, too.

Very much worth your while to read at length but to set up my quibble:

What a Random Forest does is build up a whole bunch of “dumb” decision trees by only analyzing a subset of the data at a time. A limited set of features (columns) from a portion of the overall records (rows) is used to generate each decision tree, and the “depth” of the tree (and/or size of the “leaves”, the number of examples that fall into each final bin) is limited as well. So the trees in the model are “trained” with only a portion of the available data and therefore don’t individually generate very accurate classifications.

However, it turns out that when you combine the results of a bunch of these “dumb” trees (also known as “weak learners”), the combined result is usually even better than the most finely-tuned single full decision tree. (So you can see how the algorithm got its name – a whole bunch of small trees, somewhat randomly generated, but used in combination is a random forest!)

All true but “weak learners” in machine learning are easily reconfigured, combined with different groups of other “weak learners,” or even discarded.

None of which is true for people who are hired to be part of a diverse team.

I don’t mean to discount Renee’s metaphor because I think it has much to recommend it, but diverse “weak learners” make poor decisions too.

Don’t take my word for it, watch the 2016 congressional election results.

Be sure to follow Renee on @BecomingDataSci. I’m interested to see how she develops this metaphor and where it leads.


Monetizing Twitter Trolls

Sunday, October 23rd, 2016

Alex Hern‘s coverage of Twitter’s fail-to-sell story, Did trolls cost Twitter $3.5bn and its sale?, is a typical short on facts story about abuse on Twitter.

When I say short on facts, I don’t deny any of the anecdotal accounts of abuse on Twitter and other social media.

Here’s the data problem with abuse at Twitter:

As of May of 2016, Twitter had 310 million active monthly users over 1.3 billion accounts.

Number of Twitter users who are abusive (trolls): unknown

Number of Twitter users who are victims: unknown

Number of abusive tweets, daily/weekly/monthly: unknown

Type/frequency of abusive tweets, language, images, disclosure: unknown

Costs to effectively control trolls: unknown

Trolls and abuse should be opposed both at Twitter and elsewhere, but without supporting data, creating corporate priorities and revenues to effectively block (not end, block) abuse isn’t possible.

Since troll hunting at present is a drain on the bottom line with no return for Twitter, what if Twitter were to monetize its trolls?

That is create a mechanism whereby trolls became the drivers of a revenue stream from Twitter.

One such approach would be to throw off all the filtering that Twitter does as part of its basic service. If you have Twitter basic service, you will see posts from everyone from committed jihadists to the Federal Reserve. No blocked accounts, no deleted accounts, etc.

Twitter removes material under direct court order only. Put the burden and expense of going to court for every tweet on both individuals and governments. No exceptions.

Next, Twitter creates the Twitter+ account, where for an annual fee, users can access advanced filtering that includes blocking people, language, image analysis of images posted to them, etc.

Price point experiments should set the fees for Twitter+ accounts. Filtering will be a decision based on real revenue numbers. Not flights of fancy by the Guardian or Salesforce.

BTW, the open Twitter I suggest creates more eyes for ads, which should also improve the bottom line at Twitter.

An “open” Twitter will attract more trolls and drive more users to Twitter+ accounts.

Twitter trolls generate the revenue to fight them.

I rather like that.


Twitter Logic: 1 call on Github v. 885,222 calls on Twitter

Sunday, October 23rd, 2016

Chris Albon’s collection of 885,222 tweets (ids only) for the third presidential debate of 2016 proves bad design decisions aren’t only made inside the Capital Beltway.

Chris could not post his tweet collection, only the tweet ids under Twitter’s terms of service.

The terms of service reference the Developer Policy and under that policy you will find:

F. Be a Good Partner to Twitter

1. Follow the guidelines for using Tweets in broadcast if you display Tweets offline.

2. If you provide Content to third parties, including downloadable datasets of Content or an API that returns Content, you will only distribute or allow download of Tweet IDs and/or User IDs.

a. You may, however, provide export via non-automated means (e.g., download of spreadsheets or PDF files, or use of a “save as” button) of up to 50,000 public Tweets and/or User Objects per user of your Service, per day.

b. Any Content provided to third parties via non-automated file download remains subject to this Policy.
…(emphasis added)

Just to be clear, I find Twitter extremely useful for staying current on CS research topics and think developers should be “…good partners to Twitter.”

However, Chris is prohibited from posting a data set of 885,222 tweets on Github, where users could download it with no impact on Twitter. Instead, every user who wants to explore that data set must submit 885,222 requests to Twitter servers.

Having one hit on Github for 885,222 tweets versus 885,222 on Twitter servers sounds like being a “good partner” to me.

Multiply that by all the researchers who are building Twitter data sets and the drain on Twitter resources grows without any benefit to Twitter.

It’s true that someday Twitter might be able to monetize references to its data collections, but server and bandwidth expenses are present line items in their budget.

Enabling the distribution of full tweet datasets is one step towards improving their bottom line.

PS: Please share this with anyone you know at Twitter. Thanks!

Political Noise Data (Tweets From 3rd 2016 Presidential Debate)

Sunday, October 23rd, 2016

Chris Albon has collected data on 885,222 debate tweets from the third Presidential Debate of 2016.

As you can see from the transcript, it wasn’t a “debate” in any meaningful sense of the term.

The quality of tweets about that debate is equally questionable.

However, the people behind those tweets vote, buy products, click on ads, etc., so despite the “political noise data” label in my title, it is important political noise data.

To conform to Twitter terms of service, Chris provides the relevant tweet ids and a script to enable construction of your own data set.
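Rebuilding a data set from ids usually starts with reading the id file and grouping the ids for lookup. The sketch below is not Chris’s script. It assumes one tweet id per line, and `read_ids`, `batches`, the file name in the comment and the batch size are all hypothetical.

```python
import io


def read_ids(lines):
    """Yield unique, non-empty tweet ids from an iterable of text lines."""
    seen = set()
    for line in lines:
        tweet_id = line.strip()
        if tweet_id and tweet_id not in seen:
            seen.add(tweet_id)
            yield tweet_id


def batches(ids, size=100):
    """Group ids into lists of at most `size` items (size is an assumption)."""
    batch = []
    for tweet_id in ids:
        batch.append(tweet_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


# Stand-in for an id file; real ids might come from open("ids.txt"),
# a hypothetical file name.
sample = io.StringIO("1001\n1002\n\n1002\n1003\n")
for group in batches(read_ids(sample), size=2):
    print(group)
```

Each batch would then be handed to whatever lookup call you use to fetch the full tweets.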

BTW, Chris includes his Twitter mining scripts.


Validating Wikileaks Emails [Just The Facts]

Saturday, October 22nd, 2016

A factual basis for reporting on alleged “doctored” or “falsified” emails from Wikileaks has emerged.

Now to see if the organizations and individuals responsible for repeating those allegations, some 260,000 times, will put their doubts to the test.

You know where my money is riding.

If you want to verify the Podesta emails or other email leaks from Wikileaks, consult the following resources.

Yes, we can validate the Wikileaks emails by Robert Graham.

From the post:

Recently, WikiLeaks has released emails from Democrats. Many have repeatedly claimed that some of these emails are fake or have been modified, that there’s no way to validate each and every one of them as being true. Actually, there is, using a mechanism called DKIM.

DKIM is a system designed to stop spam. It works by verifying the sender of the email. Moreover, as a side effect, it verifies that the email has not been altered.

Hillary’s team uses “”, which as DKIM enabled. Thus, we can verify whether some of these emails are true.

Recently, in response to a leaked email suggesting Donna Brazile gave Hillary’s team early access to debate questions, she defended herself by suggesting the email had been “doctored” or “falsified”. That’s not true. We can use DKIM to verify it.

Bob walks you through validating a raw email from Wikileaks with the DKIM verifier plugin for Thunderbird. He also demonstrates that the same process can detect “doctored” or “falsified” emails.

Bob concludes:

I was just listening to ABC News about this story. It repeated Democrat talking points that the WikiLeaks emails weren’t validated. That’s a lie. This email in particular has been validated. I just did it, and shown you how you can validate it, too.

Btw, if you can forge an email that validates correctly as I’ve shown, I’ll give you 1-bitcoin. It’s the easiest way of solving arguments whether this really validates the email — if somebody tells you this blogpost is invalid, then tell them they can earn about $600 (current value of BTC) proving it. Otherwise, no.

BTW, Bob also points to:

Here’s Cryptographic Proof That Donna Brazile Is Wrong, WikiLeaks Emails Are Real by Luke Rosiak, which includes this Python code to verify the emails:



Verifying Wikileaks DKIM-Signatures by teknotus, offers this manual approach for testing the signatures:


But those are all one-off methods and there are thousands of emails.

Fortunately, the post by teknotus goes on:

Preliminary results

I only got signature validation on some of the emails I tested initially but this doesn’t necessarily invalidate them as invisible changes to make them display correctly on different machines done automatically by browsers could be enough to break the signatures. Not all messages are signed. Etc. Many of the messages that failed were stuff like advertising where nobody would have incentive to break the signatures, so I think I can safely assume my test isn’t perfect. I decided at this point to try to validate as many messages as I could so that people researching these emails have any reference point to start from. Rather than download messages from wikileaks one at a time I found someone had already done that for the Podesta emails, and uploaded zip files to

Emails 1-4160
Emails 4161-5360
Emails 5361-7241
Emails 7242-9077
Emails 9078-11107

It only took me about 5 minutes to download all of them. Writing a script to test all of them was pretty straightforward. The program dkimverify just calls a python function to test a message. The tricky part is providing context, and making the results easy to search.

Automated testing of thousands of messages

It’s up on Github

It’s main output is a spreadsheet with test results, and some metadata from the message being tested. Results Spreadsheet 1.5 Megs

It has some significant bugs at the moment. For example Unicode isn’t properly converted, and spreadsheet programs think the Unicode bits are formulas. I also had to trap a bunch of exceptions to keep the program from crashing.

Warning: I have difficulty opening the verify.xlsx file in Calc, Excel and a CSV converter. Teknotus reports it opens in LibreOffice Calc, which just failed to install on an older Ubuntu distribution. Sing out if you can successfully open the file.
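For anyone who would rather script the batch run and sidestep the spreadsheet problems, here is a minimal sketch that writes plain CSV. It is not the teknotus program. `safe_cell` and `verify_all` are hypothetical names, and the verifier is passed in as a parameter; I am assuming something like the `dkim.verify` function from the dkimpy package would fill that role, which needs DNS access to fetch the signing keys.

```python
import csv
import io
from email import message_from_bytes

# Leading characters that spreadsheet programs may treat as a formula.
FORMULA_PREFIXES = ("=", "+", "-", "@")


def safe_cell(value):
    """Return text for a CSV cell, quoting values that look like formulas."""
    text = "" if value is None else str(value)
    return "'" + text if text.startswith(FORMULA_PREFIXES) else text


def verify_all(messages, verify, out):
    """Write one CSV row per message: name, DKIM result, date, from, subject.

    messages: iterable of (name, raw_bytes) pairs.
    verify:   callable taking raw bytes and returning True or False.
    """
    writer = csv.writer(out)
    writer.writerow(["name", "verified", "date", "from", "subject"])
    for name, raw in messages:
        try:
            ok = bool(verify(raw))
        except Exception:
            ok = False  # a verifier error is recorded as "not verified"
        msg = message_from_bytes(raw)
        cells = [safe_cell(msg[h]) for h in ("Date", "From", "Subject")]
        writer.writerow([name, ok] + cells)


# Demonstration with a stand-in verifier that accepts nothing.
raw = b"From: a@example.com\r\nSubject: =SUM(A1)\r\n\r\nbody\r\n"
out = io.StringIO()
verify_all([("1.eml", raw)], verify=lambda data: False, out=out)
print(out.getvalue())
```

Trapping exceptions per message keeps one malformed email from stopping the run, at the cost of lumping errors in with genuine verification failures.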

Journalists: Are you going to validate Podesta emails that you cite? Or that others claim are false/modified?

Python and Machine Learning in Astronomy (Rejuvenate Your Emotional Health)

Saturday, October 22nd, 2016

Python and Machine Learning in Astronomy (Episode #81) (Jake VanderPlas)

From the webpage:

The advances in Astronomy over the past century are both evidence of and confirmation of the highest heights of human ingenuity. We have learned by studying the frequency of light that the universe is expanding. By observing the orbit of Mercury that Einstein’s theory of general relativity is correct.

It probably won’t surprise you to learn that Python and data science play a central role in modern day Astronomy. This week you’ll meet Jake VanderPlas, an astrophysicist and data scientist from University of Washington. Join Jake and me while we discuss the state of Python in Astronomy.

Links from the show:

Jake on Twitter: @jakevdp

Jake on the web:

Python Data Science Handbook:

Python Data Science Handbook on GitHub:

Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data:

PyData Talk:

eScience Institute: @UWeScience

Large Synoptic Survey Telescope:

AstroML: Machine Learning and Data Mining for Astronomy:

Astropy project:

altair package:

If your social media feeds have been getting you down, rejoice! This interview with Jake VanderPlas covers Python, machine learning and astronomy.

Nary a mention of current social dysfunction around the globe!

Replace an hour of TV this weekend with this podcast. (Or more hours with others.)

Not only will you have more knowledge, you will be in much better emotional shape to face the coming week!

Validating Wikileaks/Podesta Emails

Friday, October 21st, 2016

A quick heads up that Robert Graham is working on:


While we wait for that post to appear at Errata Security, you should also take a look at DomainKeys Identified Mail (DKIM).

From the homepage:

DomainKeys Identified Mail (DKIM) lets an organization take responsibility for a message that is in transit. The organization is a handler of the message, either as its originator or as an intermediary. Their reputation is the basis for evaluating whether to trust the message for further handling, such as delivery. Technically DKIM provides a method for validating a domain name identity that is associated with a message through cryptographic authentication.

In particular, review RFC 5585 DomainKeys Identified Mail (DKIM) Service Overview. T. Hansen, D. Crocker, P. Hallam-Baker. July 2009. (Format: TXT=54110 bytes) (Status: INFORMATIONAL) (DOI: 10.17487/RFC5585), which notes:

2.3. Establishing Message Validity

Though man-in-the-middle attacks are historically rare in email, it is nevertheless theoretically possible for a message to be modified during transit. An interesting side effect of the cryptographic method used by DKIM is that it is possible to be certain that a signed message (or, if l= is used, the signed portion of a message) has not been modified between the time of signing and the time of verifying. If it has been changed in any way, then the message will not be verified successfully with DKIM.
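To make the l= remark concrete, here is a small sketch using only the Python standard library. It splits a DKIM-Signature header value into its tags and hashes only the portion of the body that l= covers. The function names and the sample header are hypothetical, body canonicalization is assumed to have been done already, and nothing here checks the signature itself.

```python
import base64
import hashlib


def parse_dkim_tags(header_value):
    """Split a DKIM-Signature header value into a dict of tag -> value."""
    tags = {}
    for part in header_value.split(";"):
        if "=" not in part:
            continue
        name, value = part.split("=", 1)  # base64 values may contain "="
        name = name.strip()
        value = value.strip()
        if name in ("b", "bh"):
            value = "".join(value.split())  # drop folding whitespace
        tags[name] = value
    return tags


def signed_body_hash(canonical_body, tags):
    """Base64 SHA-256 of the body bytes covered by the signature.

    canonical_body must already be canonicalized (not done here).
    Assumes a sha256 algorithm such as a=rsa-sha256.
    """
    if "l" in tags:
        canonical_body = canonical_body[: int(tags["l"])]
    digest = hashlib.sha256(canonical_body).digest()
    return base64.b64encode(digest).decode("ascii")


# Hypothetical header value: only the first 5 body bytes are signed.
tags = parse_dkim_tags("v=1; a=rsa-sha256; d=example.com; s=sel; l=5")
original = signed_body_hash(b"hello world\r\n", tags)
appended = signed_body_hash(b"hello, plus extra text\r\n", tags)
print(original == appended)
```

With l= present, text beyond the signed length does not affect the body hash, which is why the RFC limits its claim to “the signed portion of a message.”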

In a later tweet, Bob notes the “DKIM verifier” add-on for Thunderbird.

Any suggestions on scripting DKIM verification for the Podesta emails?

That level of validation may be unnecessary since after more than a week of “…may be altered…,” not one example of a modified email has surfaced.

Some media outlets will keep repeating the “…may be altered…” chant, along with attribution of the DNC hack to Russia.

It is noise, but it is also a way to select candidates for elimination from your news feeds.