Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

January 18, 2019

Data Mining Relevance Practice – Iraq War Report

Filed under: Data Mining,Relevance,Search Data,Text Mining — Patrick Durusau @ 9:47 pm

By now you realize how useless relevancy at the “document” level can be, considering documents can be ten, twenty, hundreds or even thousands of pages long.

Highly relevant “hits” are great, but are you going to read every page of every document?

The main report on the Iraq War, The U.S. Army in the Iraq War – Volume 1: Invasion – Insurgency – Civil War, 2003-2006 and The U.S. Army in the Iraq War — Volume 2: Surge and Withdrawal, 2007-2011, totals more than 1,400 pages.

Along with the report, nearly 30,000 unclassified documents used in the writing of the report are also available.

Other than being timely, the advantage for data miners is that the report, while a bit long, is readable, and you know in advance that the ~30,000 documents are relevant to it. Ignoring footnotes (that’s cheating), which documents go with which pages of the report? You can check your answers against the footnotes.

For bonus points, which pages of the ~30,000 documents should go with which pages of the report? The report’s authors weren’t citing entire documents, as some search engines do, but particular pages.

And no, I haven’t loaded these documents but hope to this weekend.

PS: The Army War College Publications office has a remarkable range of very high quality publications.

January 16, 2019

Finding Bias in Data Mining/Science: An Exercise

Filed under: Bias,Data Mining — Patrick Durusau @ 9:34 pm

I follow the MIT Technology Review on Twitter: @techreview and was amazed to see this AM:

Period of relative peace?

Really? Smells like someone has been cooking the numbers! (Another way to say bias in data mining/science.)

Unlike with annual reports from corporations and foundations, you can read the MIT post in full, Data mining adds evidence that war is baked into the structure of society, and the original paper, Pattern Analysis of World Conflicts over the past 600 years by Gianluca Martelloni, Francesca Di Patti, and Ugo Bardi, plus you can access the data used for the paper: Conflict by Dr. Peter Brecke.

From general news awareness, the claim of “…period of relative peace…” should trigger skepticism on your part. I take it as a generally accepted fact that the United States was bombing somewhere in the world every day of the Obama administration and that has continued under the current U.S. president. It’s relatively peaceful in New York, London, and Berlin, but other places in the world, not so much.

I skimmed the original article and encountered this remark: “…a relatively peaceful period during the 18th century is noticeable.” I don’t remember 18th century history all that well, but that strikes me as inconsistent with what I do remember. I have to wonder who was peaceful and who was not in the data set. I’m not saying it is wrong from one view of the data, but what underlies that statement?

Take this article, along with its data set as an exercise in finding bias in data mining/science. Bias doesn’t mean the paper or its conclusions are necessarily wrong, but choices were made with regard to the data and that shaped their paper and its conclusions.

PS: A cursory glance at the paper also finds the data used ends with the year 2000. Small comfort to the estimated 32 million Muslims who have died since 9/11, during this “period of relative peace.” You need to ask: peace for whom and at what price?

August 1, 2018

…R Clients for Web APIs

Filed under: Data Mining,R,Web Applications — Patrick Durusau @ 3:35 pm

Harnessing the Power of the Web via R Clients for Web APIs by Lucy D’Agostino McGowan.

Abstract:

We often want to harness the power of the internet in our daily data practices, i.e., collect data from the internet, share data on the internet, let a dataset evolve on the internet and analyze it periodically, put products up on the internet, etc. While many of these goals can be achieved in a browser via mouse clicks, these practices aren’t very reproducible and they don’t scale, as they are difficult to capture and replicate. Most of what can be done in a browser can also be implemented with code. Web application programming interfaces (APIs) are one tool for facilitating this communication in a reproducible and scriptable way. In this talk we will discuss the general framework of common R clients for web APIs, as well as dive into specific examples. We will focus primarily on the googledrive package, a package that allows the user to control their Google Drive from the comfort of their R console, as well as other common R clients for web APIs, while discussing best practices for efficient and reproducible coding.

The ability to document and replicate acquisition of data is a “best practice,” until you have acquired data you prefer to not be attributed to you. 😉

For cases where the “best practice” obtains, consult McGowan’s slides.

May 13, 2018

When Sed Appears To Lie (It’s Not Lying)

Filed under: Data Mining — Patrick Durusau @ 7:34 pm

I prefer Unix tools, bash scripts and sed in particular, for mining text files.

Most of my sed scripts are ad hoc and run at the command line. But this time I needed to convert text extracted from a PDF (with gs) for import into a spreadsheet.

I had 21 invocations of sed that started with:

sed -i 's/Ad\ ID\ //' $f

All the other scripts up to that point had run flawlessly so I was unprepared for:

sed: -e expression #1, char 73: unterminated `s' command

I love command line tools but error messages are not their strong point.

Disclosure: Yes, yes I did have an error in one of the sed regexes, but it was on line #15, not #1.

Ok, ok, laugh it up. The error message was correct because each invocation counts as a separate “expression #1.”

I did find the error but only by testing each regex.

Sed scripting tip: In a series of sed invocations, each invocation is “expression #1.”

Hope that saves you from looking for exotic problems with your sed distribution, interaction with your shell escapes, etc. (Yeah, all that and more.)

February 15, 2017

Unmet Needs for Analyzing Biological Big Data… [Data Integration #1 – Spells Market Opportunity]

Filed under: BigData,Bioinformatics,Biomedical,Data Integration,Data Management,Data Mining — Patrick Durusau @ 8:09 am

Unmet Needs for Analyzing Biological Big Data: A Survey of 704 NSF Principal Investigators by Lindsay Barone, Jason Williams, David Micklos.

Abstract:

In a 2016 survey of 704 National Science Foundation (NSF) Biological Sciences Directorate principal investigators (BIO PIs), nearly 90% indicated they are currently or will soon be analyzing large data sets. BIO PIs considered a range of computational needs important to their work, including high performance computing (HPC), bioinformatics support, multi-step workflows, updated analysis software, and the ability to store, share, and publish data. Previous studies in the United States and Canada emphasized infrastructure needs. However, BIO PIs said the most pressing unmet needs are training in data integration, data management, and scaling analyses for HPC, acknowledging that data science skills will be required to build a deeper understanding of life. This portends a growing data knowledge gap in biology and challenges institutions and funding agencies to redouble their support for computational training in biology.

In particular, the needs topic maps can address rank #1, #2, #6, #7, and #10, or as the authors found:


A majority of PIs—across bioinformatics/other disciplines, larger/smaller groups, and the four NSF programs—said their institutions are not meeting nine of 13 needs (Figure 3). Training on integration of multiple data types (89%), on data management and metadata (78%), and on scaling analysis to cloud/HP computing (71%) were the three greatest unmet needs. High performance computing was an unmet need for only 27% of PIs—with similar percentages across disciplines, different sized groups, and NSF programs.

or see figure 3 of the paper for the same result graphically.

So, cloud, distributed, parallel, pipelining, etc., processing is insufficient?

Pushing undocumented and unintegratable data at ever increasing speeds is impressive but gives no joy?

This report will provoke another round of Esperanto fantasies, that is, the creation of “universal” vocabularies which, if used by everyone and back-mapped to all existing literature, would solve the problem.

The sheer number of Esperanto fantasies and the cost/delay of back-mapping to legacy data defeat all such efforts. Those defeats haven’t prevented repeated funding of such fantasies in the past and present, and no doubt they will be funded in the future.

Perhaps those defeats are a question of scope.

That is rather than even attempting some “universal” interchange of data, why not approach it incrementally?

I suspect the PIs surveyed each had some particular data set in mind when they mentioned data integration (which itself is a very broad term).

Why not seek out, develop and publish data integrations in particular instances, as opposed to attempting to theorize what might work for data yet unseen?

The need topic maps set out to meet remains unmet, with no signs of lessening.

Opportunity knocks. Will we answer?

December 21, 2016

Mining Twitter Data with Python [Trump Years Ahead]

Filed under: Data Mining,Python,Twitter — Patrick Durusau @ 5:24 pm

Marco Bonzanini, author of Mastering Social Media Mining with Python, has a seven part series of posts on mining Twitter with Python.

If you haven’t been mining Twitter before now, President-elect Donald Trump is about to change all that.

What if Trump continues to tweet as President and authorizes his appointees to do the same? Spontaneity isn’t the same thing as openness but it could prove to be interesting.
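
If you want to skip straight to collecting, here is a minimal streaming sketch using tweepy’s 3.x-era interface (one common choice for this kind of work); the credentials, the output file name, and the track term are placeholders, not a recommendation:

import json
import tweepy

# Minimal streaming sketch with tweepy's 3.x-era interface.
# The credentials below are placeholders; fill in your own app's keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

class SaveListener(tweepy.StreamListener):
    def on_status(self, status):
        # Append each tweet as one JSON line for later analysis.
        with open("tweets.jsonl", "a") as f:
            f.write(json.dumps(status._json) + "\n")

    def on_error(self, status_code):
        return False  # stop on errors such as rate limiting (420)

stream = tweepy.Stream(auth=auth, listener=SaveListener())
stream.filter(track=["realDonaldTrump"])  # placeholder track term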

November 20, 2016

Refining The Dakota Access Pipeline Target List

Filed under: #DAPL,Data Mining,Government,Politics — Patrick Durusau @ 3:56 pm

I mentioned in Exploding the Dakota Access Pipeline Target List that while listing the banks financing the Dakota Access Pipeline is great, banks and other legal entities are owned and operated by, and act through, people. People who, unlike abstract legal entities, are subject to the persuasion of other people.

Unfortunately, almost all discussions of #DAPL focus on the on-site brutality towards Native Americans and/or the corporations involved in the project.

The protesters deserve our support, but resisting local pawns (read: police) may change the route of the pipeline; it won’t stop the pipeline.

To stop the Dakota Access Pipeline, there are only two options:

  1. Influence investors to abandon the project
  2. Make the project prohibitively expensive

In terms of #1, you have to strike through the corporate veil to reach the people who own and direct the affairs of the corporation.

“Piercing the corporate veil” is legal terminology, but I mean it here as knowing the named and located individuals who are making decisions for a corporation and the named and located individuals who are its owners.

A legal fiction, such as a corporation, cannot feel public pressure, distress, social ostracism, etc., all things that people are subject to suffering.

Even so, persuasion can only be brought to bear on named and located individuals.

News reports giving only corporate names, and not individual owners/agents, create a boil of obfuscation.

A boil of obfuscation that needs lancing. Shall we?

To get us off on a common starting point, here are some resources I will be reviewing/using:

Corporate Research Project

The Corporate Research Project assists community, environmental and labor organizations in researching companies and industries. Our focus is on identifying information that can be used to advance corporate accountability campaigns. [Sponsors Dirt Diggers Digest]

Dirt Diggers Digest

chronicling corporate misbehavior (and how to research it) [blog]

LittleSis

LittleSis* is a free database of who-knows-who at the heights of business and government.

* opposite of Big Brother

OpenCorporates

The largest open database of companies in the world [115,419,017 companies]

Revealing the World of Private Companies by Sheila Coronel

Coronel’s blog post has numerous resources and links.

She also points out that the United States is a top secrecy destination:


A top secrecy jurisdiction is the United States, which doesn’t collect the names of shareholders of private companies and is unsurprisingly one of the most favored nations for hiding illicit wealth. (See, for example, this Reuters report on shell companies in Wyoming.) As Senator Carl Levin says, “It takes more information to obtain a driver’s license or open a U.S. bank account than it does to form a U.S. corporation.” Levin has introduced a bill that would end the formation of companies for unidentified persons, but that is unlikely to pass Congress.

If we picked one of the non-U.S. sponsors of the #DAPL, we might get lucky and hit a transparent or semi-transparent jurisdiction.

Let’s start with a semi-tough case, a U.S. corporation but a publicly traded one, Wells Fargo.

Where would you go next?

November 13, 2016

Outbrain Challenges the Research Community with Massive Data Set

Filed under: Contest,Data,Data Mining,Prediction — Patrick Durusau @ 8:15 pm

Outbrain Challenges the Research Community with Massive Data Set by Roy Sasson.

From the post:

Today, we are excited to announce the release of our anonymized dataset that discloses the browsing behavior of hundreds of millions of users who engage with our content recommendations. This data, which was released on the Kaggle platform, includes two billion page views across 560 sites, document metadata (such as content categories and topics), served recommendations, and clicks.

Our “Outbrain Challenge” is a call out to the research community to analyze our data and model user reading patterns, in order to predict individuals’ future content choices. We will reward the three best models with cash prizes totaling $25,000 (see full contest details below).

The sheer size of the data we’ve released is unprecedented on Kaggle, the competition’s platform, and is considered extraordinary for such competitions in general. Crunching all of the data may be challenging to some participants—though Outbrain does it on a daily basis.

The rules caution:


The data is anonymized. Please remember that participants are prohibited from de-anonymizing or reverse engineering data or combining the data with other publicly available information.

That would be a more interesting question than the ones presented for the contest.

After the 2016 U.S. presidential election we know that racists, sexists, nationalists, etc., are driven by single factors so assuming you have good tagging, what’s the problem?

Yes?

Or is human behavior not only complex but variable?

Good luck!

November 8, 2016

We Should Feel Safer Than We Do

Filed under: Data Mining,R,Social Sciences — Patrick Durusau @ 5:38 pm

We Should Feel Safer Than We Do by Christian Holmes.

Christian’s Background and Research Goals:

Background

Crime is a divisive and important issue in the United States. It is routinely ranked as among the most important issues to voters, and many politicians have built their careers around their perceived ability to reduce crime. Over 70% of Americans believe that crime is increasing, according to a recent Gallup poll, but is that really the case? I seek to answer this question in this post, as well as determine if there is any clear correlation between government spending and crime.

Research Goals

-Is crime increasing or decreasing in this country?
-Is there a clear link between government spending and crime?

provide an interesting contrast with his conclusions:

From the crime data, it is abundantly clear that crime is on the decline, and has been for around 20 years. The reasons behind this decrease are quite nuanced, though, and I found no clear link between either increased education or police spending and decreasing crime rates. This does not mean that such a relationship does not exist. Rather, it merely means that there is no obvious correlation between the two variables over this specific time frame.

In his background, Christian says:

Over 70% of Americans believe that crime is increasing, according to a recent Gallup poll, but is that really the case? I seek to answer this question in this post,…

Christian presumes, without proof, a relationship between public beliefs about crime rates (rising or falling) and crime rates as recorded by government agencies.

Which also presumes:

  1. The public is aware that government collects crime statistics.
  2. The public is aware of current crime statistics.
  3. Current crime statistics influence public beliefs about the incidence of crime.

If the central focus of the paper is a comparison of “crime rates” as measured by government with other data on government spending, why even mention the disparity between public “belief” about crime and crime statistics?

I suspect, just as a rhetorical move, Christian is attempting to draw a favorable inference for his “evidence” by contrasting it with “public belief.” “Public belief” that is contrary to the “evidence” in this instance.

Christian doesn’t offer us any basis for judgments about public opinion on crime one way or the other. Any number of factors could be influencing public opinion on that issue, the crime rate as measured by government being only one of those.

The violent crime rate may be very low, statistically speaking, but if you are the victim of a violent crime, from your perspective crime is very prevalent.

Of R and Relationships

Christian uses R to compare crime data with government spending on education and policing.

The unhappy result is that no relationship is evident between government spending and a reduction in crime, so Christian cautions:

…This does not mean that such a relationship does not exist. Rather, it merely means that there is no obvious correlation between the two variables over this specific time frame….

Here is where we switch from relying on data to exploring the realm of “the data didn’t prove I was wrong.”

Since it isn’t possible to prove the absence of a relationship between the “crime rate” and government spending on education/police, no, the evidence didn’t prove Christian to be wrong.

On the other hand, it clearly shows that Christian has no evidence for that “relationship.”

The caution here is that using R and “reliable” data may lead to conclusions you would rather avoid.

PS: Crime and the public’s fear of crime are both extremely complex issues. Aggregate data can justify previously chosen positions, but little more.

November 6, 2016

Debate Night Twitter: Analyzing Twitter’s Reaction to the Presidential Debate

Filed under: Data Mining,Government,Twitter — Patrick Durusau @ 5:34 pm

Debate Night Twitter: Analyzing Twitter’s Reaction to the Presidential Debate by George McIntire.

A bit dated content-wise, but George covers techniques, from data gathering to analysis, that are useful for future events. Possible Presidential inauguration riots on January 20, 2017, for example. Or the 2017 Super Bowl, where Lady Gaga will be performing.

From the post:

This past Sunday, Donald Trump and Hillary Clinton participated in a town hall-style debate, the second of three such events in this presidential campaign. It was an extremely contentious affair that reverberated across social media.

The political showdown was massively anticipated; the negative atmosphere of the campaign and last week’s news of Trump making lewd comments about women on tape certainly contributed to the fire. Trump further escalated the immense tension by holding a press conference with women who’ve accused former President Bill Clinton of abusing them.

With a nearly unprecedented amount of attention and hostility surrounding the event, I wanted to gauge Twitter’s reaction to it. In this project, I streamed tweets under the hashtag #debate and analyzed them to discover trends in Twitter’s mood and how users were reacting to not just the debate overall but to certain events in the debate.

What techniques will you apply to your tweet data sets?
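
George’s post covers the full pipeline; as one small illustration of the “reaction over time” idea, here is a sketch that buckets streamed tweets (saved one JSON object per line) into per-minute counts. The file name and the assumption of standard v1.1 tweet objects are mine, not George’s:

import json
from collections import Counter

# Count tweets per minute from a file with one tweet JSON object per line.
# Assumes the standard v1.1 "created_at" format, e.g.
# "Sun Oct 09 01:30:15 +0000 2016"; the first 16 characters give the minute.
per_minute = Counter()
with open("debate-tweets.jsonl") as f:
    for line in f:
        tweet = json.loads(line)
        per_minute[tweet["created_at"][:16]] += 1

# Alphabetical order of the keys is chronological order for a single evening.
for minute, count in sorted(per_minute.items()):
    print(minute, count)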

November 5, 2016

Clinton/Podesta Map (through #30)

Filed under: Data Mining,Hillary Clinton,Politics,Wikileaks — Patrick Durusau @ 7:26 pm

Charlie Grapski created Navigating Wikileaks: A Guide to the Podesta Emails.

[image: podesta-map-grapski-460]

The listing takes 365 pages to date, so this is just a tiny sample image.

I don’t have a legend for the row coloring but have tweeted to Charlie about the same.

Enjoy!

November 2, 2016

ggplot2 cheatsheet updated – other R cheatsheets

Filed under: Data Mining,Ggplot2,R — Patrick Durusau @ 7:32 pm

RStudio Cheat Sheets

I saw a tweet that the ggplot2 cheatsheet has been updated.

Here’s a list of all the cheatsheets available at RStudio:

  • R Markdown Cheat Sheet
  • RStudio IDE Cheat Sheet
  • Shiny Cheat Sheet
  • Data Visualization Cheat Sheet
  • Package Development Cheat Sheet
  • Data Wrangling Cheat Sheet
  • R Markdown Reference Guide

Contributed Cheatsheets

  • Base R
  • Advanced R
  • Regular Expressions
  • How big is your graph? (base R graphics)

I have deliberately omitted links as when cheat sheets are updated, the links will break and/or you will get outdated information.

Use and reference the RStudio Cheat Sheets page.

Enjoy!

Does Verification Matter? Clinton/Podesta Emails Update

Filed under: Data Mining,Hillary Clinton,News,Reporting — Patrick Durusau @ 12:51 pm

As of today, 10,357 DKIM Verified Clinton/Podesta Emails (of 43,526 total). That’s releases 1-26.

I ask “Does Verification Matter?” in the title to this post because of the seeming lack of interest in verification of emails in the media. Not that it would ever be a lead, but some mention of the verified/not status of an email seems warranted.

Every Clinton/Podesta story mentions Anthony Weiner’s interest in sharing his sexual insecurities and nary a peep about the false Clinton/Obama/Clapper claims that emails have been altered. Easy enough to check. But no specifics are given or requested by the press.

Thanks to the Clinton/Podesta drops by Michael Best, @NatSecGeek, I have now uploaded:

DKIM-verified-podesta-1-26.txt.gz is a sub-set of 10,357 emails that have been verified by their DKIM keys.

The statements in or data attached to those emails may still be false. DKIM verification only validates the email being the same as when it left the email server, nothing more.

DKIM-complete-podesta-1-26.txt.gz is the full set of Podesta emails to date, some 43,526, with their DKIM results of either True or False.

Both files have these fields:

ID – 1| Verified – 2| Date – 3| From – 4| To – 5| Subject -6| Message-Id – 7

Enjoy!

PS: Perhaps verification doesn’t matter when the media repeats false and/or delusional statements of DNI Clapper in hopes of… I don’t know what they are hoping for, but I am hoping they are dishonest, not merely stupid.

October 31, 2016

9,477 DKIM Verified Clinton/Podesta Emails (of 39,878 total (today))

Filed under: Data Mining,Email,Hillary Clinton,News,Reporting — Patrick Durusau @ 3:19 pm

Still working on the email graph and at the same time, managed to catch up on the Clinton/Podesta drops by Michael Best, @NatSecGeek, at least for a few hours.

DKIM-verified-podesta-1-24.txt.gz is a sub-set of 9,477 emails that have been verified by their DKIM keys.

The statements in or data attached to those emails may still be false. DKIM verification only validates the email being the same as when it left the email server, nothing more.

DKIM-complete-podesta-1-24.txt.gz is the full set of Podesta emails to date, some 39,878, with their DKIM results of either True or False.

Both files have these fields:

ID – 1| Verified – 2| Date – 3| From – 4| To – 5| Subject -6| Message-Id – 7

Question: Have you seen any news reports that mention emails being “verified” in their reporting?

Emails in the complete set may be as accurate as those in the verified set, but I would think verification is newsworthy in and of itself.

You?

Parsing Emails With Python, A Quick Tip

Filed under: Data Mining,Email,Python — Patrick Durusau @ 1:32 pm

While some stuff runs in the background, a quick tip on parsing email with Python.

I got the following error message from Python:

Traceback (most recent call last):
File "test-clinton-script-31Oct2016.py", line 20, in
date = dateutil.parser.parse(msg['date'])
File "/usr/lib/python2.7/dist-packages/dateutil/parser.py", line 697, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File "/usr/lib/python2.7/dist-packages/dateutil/parser.py", line 301, in parse
res = self._parse(timestr, **kwargs)
File "/usr/lib/python2.7/dist-packages/dateutil/parser.py", line 349, in _parse
l = _timelex.split(timestr)
File "/usr/lib/python2.7/dist-packages/dateutil/parser.py", line 143, in split
return list(cls(s))
File "/usr/lib/python2.7/dist-packages/dateutil/parser.py", line 137, in next
token = self.get_token()
File "/usr/lib/python2.7/dist-packages/dateutil/parser.py", line 68, in get_token
nextchar = self.instream.read(1)
AttributeError: 'NoneType' object has no attribute 'read'

I have edited the email header in question but it reproduces the original error:

Delivered-To: john.podesta@gmail.com
Received: by 10.142.49.14 with SMTP id w14cs34683wfw;
Wed, 5 Nov 2008 08:11:39 -0800 (PST)
Received: by 10.114.144.1 with SMTP id r1mr728791wad.136.1225901498795;
Wed, 05 Nov 2008 08:11:38 -0800 (PST)
Return-Path:
Received: from QMTA09.emeryville.ca.mail.comcast.net (qmta09.emeryville.ca.mail.comcast.net [76.96.30.96])
by mx.google.com with ESMTP id m26si29354pof.3.2008.11.05.08.11.38;
Wed, 05 Nov 2008 08:11:38 -0800 (PST)
Received-SPF: pass (google.com: domain of sewallconroy@comcast.net designates
Received: from OMTA03.emeryville.ca.mail.comcast.net ([76.96.30.27])
by QMTA09.emeryville.ca.mail.comcast.net with comcast
id bUBY1a0010b6N64A9UBeJl; Wed, 05 Nov 2008 16:11:38 +0000
Received: from amailcenter06.comcast.net ([204.127.225.106])
by OMTA03.emeryville.ca.mail.comcast.net with comcast
id bUAV1a00L2JMgtY8PUAV7G; Wed, 05 Nov 2008 16:10:30 +0000
X-Authority-Analysis: v=1.0 c=1 a=1Ht49J2nGmlg0oY3xr8A:9
a=8nxvWDfACCTtBObdks-tTUtrMyYA:4 a=OA_lqj45gZcA:10 a=diNjy0DT58-4uIkuavEA:9
a=e0_VUgpf8QEu0XMU188OmzzKrzoA:4 a=37WNUvjkh6kA:10
Received: from [24.34.75.99] by amailcenter06.comcast.net;
Wed, 05 Nov 2008 16:10:28 +0000
From: sewallconroy@comcast.net

To: “Podesta” , ricesusane@aol.com
CC: “Denis McDonough OFA” ,
djsberg@gmail.com”, marklippert@yahoo.com,
Subject: DOD leadership – immediate attention
Date: Wed, 05 Nov 2008 16:10:28 +0000
Message-Id: <110520081610.3048.4911C574000C2E2100000BE82216 55799697019D02010C04040E990A9C@comcast.net>
X-Mailer: AT&T Message Center Version 1 (Oct 30 2007)
X-Authenticated-Sender: c2V3YWxsY29ucm95QGNvbWNhc3QubmV0
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=”NextPart_Webmail_9m3u9jl4l_3048_1225901428_0″

–NextPart_Webmail_9m3u9jl4l_3048_1225901428_0
Content-Type: text/plain
Content-Transfer-Encoding: 8bit

I’m comparing “Date” to similar emails and getting no joy.

Absence is hard to notice, but once you know the rule, it’s obvious:

RFC822: Standard for ARPA Internet Text Messages says in part:

3. Lexical Analysis of Messages

3.1 General Description

A message consists of header fields and, optionally, a body. The body is simply a sequence of lines containing ASCII characters. It is separated from the headers by a null line (i.e., a line with nothing preceding the CRLF). (emphasis added)

Yep, the blank line I introduced while removing an errant double quote on a line by itself created the start of the body of the message.

Meaning that my Python script failed to find the “Date:” field and returned what someone thought would be a useful error message.

When you get errors parsing emails with Python (and I assume in other languages), check the format of your messages!
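
A cheap guard for this particular failure is to check whether the Date header survived at all before handing it to dateutil; a minimal sketch (the file name is just an example):

import email
import dateutil.parser

# If a stray blank line ends the headers early, msg['date'] comes back as
# None and dateutil blows up with the unhelpful traceback shown above.
with open("00012345.eml") as f:
    msg = email.message_from_file(f)

if msg['date'] is None:
    print("No Date header found -- check for a premature blank line in the headers")
else:
    print(dateutil.parser.parse(msg['date']))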

RFC822 has an appendix of parsing rules and a few examples.

Suggested listings of the most common email/email header format errors?

Clinton/Podesta Emails 23 and 24, True or False? Cryptographically Speaking

Filed under: Data Mining,Hillary Clinton,News,Politics,Reporting — Patrick Durusau @ 10:21 am

Catching up on the Clinton/Podesta email releases from Wikileaks, via Michael Best, NatSecGeek. Michael bundles the releases up and posts them at: Podesta emails (zipped).

For anyone coming late to the game, DKIM “verified” means that the DKIM signature on an email is valid for that email.

In lay person’s terms, that email has been proven by cryptography to have originated from a particular mail server and when it left that mail server, it read exactly as it does now, i.e., no changes by Russians or others.

What I have created are files that list the emails in the order they appear at Wikileaks, with the very next field being True or False on the verification issue.

Just because an email has “False” in the second column doesn’t mean it has been modified or falsified by the Russians.

DKIM signatures fail for all manner of reasons but when they pass, you have a guarantee the message is intact as sent.
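
If you want to spot-check a single message yourself, the test is essentially the dkim.verify call my script uses; a minimal sketch (the file name is only an example, and a malformed signature can raise an exception rather than returning False, as the corrupt-file post below shows):

import dkim

# Read the raw message bytes and test its DKIM signature.
with open("00010187.eml", "rb") as f:
    data = f.read()

print(dkim.verify(data))  # True if the signature validates, False otherwise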

For your research into these emails:

DKIM-complete-podesta-23.txt.gz

and

DKIM-complete-podesta-24.txt.gz.

For release 24, I did have to remove the DKIM signature on 39256 00010187.eml in order for the script to succeed. That is the only modification I made to either set of files.

October 30, 2016

Clinton/Podesta Emails – Towards A More Complete Graph (Part 3) New Dump!

Filed under: Data Mining,Graphs,Hillary Clinton,Politics — Patrick Durusau @ 8:07 pm

As you may recall from Clinton/Podesta Emails – Towards A More Complete Graph (Part 2), I didn’t check whether “|” was in use in the extracted emails’ subject lines, so when I tried to create node lists based on “|” as a separator, it failed.

That happens. More than many people are willing to admit.

In the meantime, a new dump of emails has arrived, so I created the new DKIM-incomplete-podesta-1-22.txt.gz file. Which meant picking a new separator to use for the resulting file.

Advice: Check your proposed separator against the data file before using it. I forgot to; you shouldn’t.

My new separator? |/|

Which I checked against the file to make sure there would be no conflicts.

The sed commands to remove < and > are the same as in Part 2.

Sigh, back to failure land, again.

Just as one sample:

awk 'FS="|/|" { print $7}' test.me

where test.me is:

9991 00013434.eml|/|False|/|2015-11-21 17:15:25-05:00|/|Eryn Sepp eryn.sepp@gmail.com|/|John Podesta john.podesta@gmail.com|/|Re: Nov 30 / Future Plans / Etc.!|/|8A6B3E93-DB21-4C0A-A548-DB343BD13A8C@gmail.com

returns:

Future Plans

I also checked that with gawk and nawk, with the same result.

For some (then unknown to me) reason, all three treat the first “/” in field 6 (by my count) as a separator, along with the second “/” in that field. The culprit: awk treats a multi-character FS as a regular expression, and “|” is the regex alternation operator, so FS="|/|" matches a bare “/”.

To test that theory, what do you think { print $8 } will return?

You’re right!

Etc.!|

So with the “|/|” separator, I’m going to have at least 9 fields, perhaps more, depending on whether “/” characters occur in the subject line.

🙁

That’s not going to work.

OK, so I toss the 10+ MB DKIM-complete-podesta-1-22.txt.gz into Emacs, whose regex treatment I trust, and change “|/|” to “@@@@@” and save that file as DKIM-complete-podesta-1-22-03.txt.

Another sanity check, which got us into all this trouble last time:

awk 'FS="@@@@@" { print $7}' podesta-1-22-03.txt | grep @ | wc -l

returns 36504, which plus the 16 files I culled as failures, equals 36520, the number of files in the Podesta 1-22 release.

Recall that all message-ids contain an @ sign, so getting the correct answer for the number of files gives us confidence the file is ready for further processing.

Apologies for it taking this much prose to go so little a distance.

Our fields (numbered for reference) are:

ID – 1| Verified – 2| Date – 3| From – 4| To – 5| Subject -6| Message-Id – 7

Our first node for the node list (Clinton/Podesta Emails – Towards A More Complete Graph (Part 1)) was to capture the emails themselves.

Using Message-Id (field 7) as the identifier and Subject (field 6) as its label.

We are about to encounter another problem but let’s walk through it.

An example of what we are expecting:

CAC9z1zL9vdT+9FN7ea96r+Jjf2=gy1+821u_g6VsVjr8U2eLEg
@mail.gmail.com;”Knox Knotes”;
CAKM1B-9+LQBXr7dgE0pKke7YhQC2dZ2akkgmSbRFGHUx-0NNPg
@mail.gmail.com;”Re: Tomorrow”;

We have the Message-Id with a closing “;”, followed by the Subject, surrounded in double quote marks and also terminated by a “;”.

FYI: Mixing single and double quotes in awk is a real pain. I struggled with it but then was reminded I can declare variables:

-v dq='"'

which allows me to do this:

awk -v dq='"' 'FS="@@@@@" { print $7 ";" dq $6 dq ";"}' podesta-1-22-03.txt

The awk variable trick will save you considerable puzzling over escape sequences and the like.

Ah, now we are to the problem I mentioned above.

In the part 1 post I mentioned that while:

CAC9z1zL9vdT+9FN7ea96r+Jjf2=gy1+821u_g6VsVjr8U2eLEg
@mail.gmail.com;”Knox Knotes”;
CAKM1B-9+LQBXr7dgE0pKke7YhQC2dZ2akkgmSbRFGHUx-0NNPg@mail.gmail.com;”Re: Tomorrow”;

works,

but having:

CAC9z1zL9vdT+9FN7ea96r+Jjf2=gy1+821u_g6VsVjr8U2eLEg
@mail.gmail.com;”Knox Knotes”;https://wikileaks.org/podesta-emails/emailid/9998;
CAKM1B-9+LQBXr7dgE0pKke7YhQC2dZ2akkgmSbRFGHUx-0NNPg@mail.gmail.com;”Re: Tomorrow”;https://wikileaks.org/podesta-emails/emailid/9999;

with Wikileaks links is more convenient for readers.

As you may recall, the last two lines read:

9998 00022160.eml@@@@@False@@@@@2015-06-23 23:01:55-05:00@@@@@Jerome Tatar jerry@TatarLawFirm.com@@@@@Jerome Tatar Jerome jerry@tatarlawfirm.com@@@@@Knox Knotes@@@@@CAC9z1zL9vdT+9FN7ea96r
+Jjf2=gy1+821u_g6VsVjr8U2eLEg@mail.gmail.com
9999 00013746.eml@@@@@False@@@@@2015-04-03 01:14:56-04:00@@@@@Eryn Sepp eryn.sepp@gmail.com@@@@@John Podesta john.podesta@gmail.com@@@@@Re: Tomorrow@@@@@CAKM1B-9+LQBXr7dgE0pKke7YhQC2dZ2akkgmSbRFGHUx-0NNPg@mail.gmail.com

Which means in addition to printing Message-Id and Subject as fields one and two, we need to split ID on the space and use the result to create the URL back to Wikileaks.
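
For what it’s worth, here is a rough Python sketch of that step, working from the @@@@@-separated file above; the output file name is my own, and a subject containing a stray double quote or semicolon would still need special handling:

# Build the email node list: message-id as Id, subject as Label, and a
# Wikileaks link built from the number at the front of the ID field.
with open("podesta-1-22-03.txt") as infile, open("email-nodes.csv", "w") as out:
    out.write("Id;Label;Link\n")
    for line in infile:
        fields = line.rstrip("\n").split("@@@@@")
        if len(fields) < 7:
            continue  # skip lines that failed extraction
        wikileaks_num = fields[0].split()[0]   # "9998 00022160.eml" -> "9998"
        message_id, subject = fields[6], fields[5]
        link = "https://wikileaks.org/podesta-emails/emailid/" + wikileaks_num
        out.write('%s;"%s";%s\n' % (message_id, subject, link))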

It’s late so I am going to leave you with DKIM-incomplete-podesta-1-22.txt.gz. This is complete save for 16 files that failed to parse. Will repost tomorrow with those included.

I have the first node file script working and that will form the basis for the creation of the edge lists.

PS: Look forward to running awk files tomorrow. It makes a number of things easier.

Exploding the Dakota Access Pipeline Target List

Filed under: #DAPL,Data Mining,Government,Politics — Patrick Durusau @ 2:05 pm

Who Is Funding the Dakota Access Pipeline? Bank of America, HSBC, UBS, Goldman Sachs, Wells Fargo by Amy Goodman and Juan González.

Great discussion of the funding behind the North Dakota Pipeline project!

They point to two important graphics to share, the first from: Who’s Banking on the Dakota Access Pipeline?:

[embedded map: view this map on LittleSis]

Thanks for the easy embedding! (Best viewed on LittleSis.)

And, from Who’s Banking on the Dakota Access Pipeline? (Food & Water Watch):

[image: 1608_bakkenplans-bakken-banking-fina-460]

The full scale Food & Water Watch image.

Both visualizations are great ways to see some of the culprits responsible for the Dakota Access Pipeline, but not all.

Tracking the funding from Bakken Dakota Access Pipeline back to, among others, Citibank, Credit Agricole, ING Bank, and Natixis, should be a clue as to the next step.

All of the sources of financing, Citibank, Credit Agricole, ING Bank, Natixis, etc., are owned, one way or another, by investors. Moreover, as abstract entities, they cannot act without the use of agents, both as staff and as contractors.

If you take the financing entities as nodes (the first visualization), those should explode into both investor/owners and staff/agents, who do their bidding.

Citibank, for example, may seem too large and diffuse a target for effective political, social or economic pressure, but the smaller the part, the greater the chance of having influence.

It’s true some nation states might be able to call Citibank to heel, and if you can whistle one up, give it a shot. But while waiting on you to make your move, the rest of us should be looking for targets more within our reach.

That lesson, of financiers exploding into more manageable targets (don’t overlook their political allies and their extended target lists), applies equally to the staffs and agents of Sunoco Logistics, Energy Transfer Partners, Energy Transfer Equity (a sister partnership to Energy Transfer Partners), and Bakken Dakota Access Pipeline.

I have yet to see an abstract entity drive a bulldozer, move pipe, etc. Despite the popular fiction that a corporation is a person, it’s somebody on the ground violating the earth, poisoning the water, disturbing sacred ground, all for the benefit of some other natural person.

Corporations, being abstract entities, cannot feel pressure. Their staffs and contractors, on the other hand, don’t have that immunity.

It will be post-election in the US, but I’m interested in demonstrating, and assisting others in demonstrating, how to explode these target lists in both directions.

As always, your comments and suggestions are most welcome.

PS: One source for picking up names and entities would be: Bakken.com, which self-describes as:

Bakken.com is owned and operated by Energy Media Group which is based in Fargo North Dakota. Bakken.com was brought to life to fill a gap in the way that news was brought to the people on this specific energy niche. We wanted to have a place where people could come for all the news related to the Bakken Shale. Energy Media group owns several hundred websites, most of which surround shale formations all over the world. Our sister site Marcellus.com went live at the beginning of December and already has a huge following. In the coming months, Bakken.com will be a place not only for news, but for jobs and classifieds as well. Thank-you for visiting us today, and make sure to sign up for our news letter to get the latest updates and also Like us on Facebook.

To give you the full flavor of their coverage: Oil pipeline protester accused of terrorizing officer:

North Dakota authorities have issued an arrest warrant accusing a pipeline protester on horseback of charging at a police officer.

Mason Redwing, of Fort Thompson, South Dakota, is wanted on felony charges of terrorizing and reckless endangerment in the Sept. 28 incident near St. Anthony. He’s also wanted on a previous warrant for criminal trespass.

The Morton County Sheriff’s Office says the officer shouted a warning and pointed a shotgun loaded with non-lethal beanbag rounds to defuse the situation.

I’m on a horse and you have a shotgun. Who is it that is being terrorized?

Yeah, it’s that kind of reporting.

October 29, 2016

Clinton/Podesta Emails – 1-22 – Progress Report

Filed under: Data Mining,Government,Hillary Clinton — Patrick Durusau @ 8:49 pm

Michael Best has posted a bulk download of the Clinton/Podesta Emails, 1-22.

Thinking (foolishly) that I could quickly pick up from where I left off yesterday, Clinton/Podesta Emails – Towards A More Complete Graph (Part 2), I grabbed the latest bulk download and tossed the verification script against it.

In addition to the usual funnies of having to repair known defective files, again, I also underestimated how long verification of DKIM signatures takes on 36,000+ emails. Even on a fast box with plenty of memory.

At this point I have the latest release DKIM signatures parsed, but there are several files that fail for no discernible reason.

I’m going to have another go at it in the morning and should have part 3 of the graph work up tomorrow.

Apologies for the delay!

October 28, 2016

Clinton/Podesta Emails – Towards A More Complete Graph (Part 2)

Filed under: Data Mining,Gephi,Graphs,Hillary Clinton — Patrick Durusau @ 7:46 pm

I assume you are starting with DKIM-complete-podesta-1-18.txt.gz.

If you are starting with another source, you will need different instructions. 😉

First, remember from Clinton/Podesta Emails – Towards A More Complete Graph (Part 1) that I wanted to delete all the < and > signs from the text.

That’s easy enough (uncompress the text first):

sed 's/<//g' DKIM-complete-podesta-1-18.txt > DKIM-complete-podesta-1-18-01.txt

followed by:

sed 's/>//g' DKIM-complete-podesta-1-18-01.txt > DKIM-complete-podesta-1-18-02.txt

Here’s where we started:

001 00032251.eml|False|2010-10-06 18:29:52-04:00|Joshua Dorner <jdorner@americanprogress.org>|”‘bigcampaign@googlegroups.com'” <bigcampaign@googlegroups.com>|[big campaign] Follow-up Materials from Background Briefing on the Chamber’s Foreign Funding, fyi|<A28459BA2B4D5D49BED0238513058A7F012ADC1EF58F @CAPMAILBOX.americanprogresscenter.org>
002 00032146.eml|True|2015-04-14 18:19:46-04:00|Josh Schwerin <jschwerin@hillaryclinton.com>|hrcrapid <HRCrapid@googlegroups.com>|=?UTF-8?Q?NYT=3A_Hillary_Clinton=E2=80=99s_Chipotle_Order=3A_Above_Avera?= =?UTF-8?Q?ge?=|<CAPrY+5KJ=NG+Vs-khDVpe-L=bP5=qvPcZTS5FDam5LixueQsKA@mail.gmail.com>

Here’s the result after the first two sed scripts:

001 00032251.eml|False|2010-10-06 18:29:52-04:00|Joshua Dorner jdorner@americanprogress.org|”‘bigcampaign@googlegroups.com'” bigcampaign@googlegroups.com|[big campaign] Follow-up Materials from Background Briefing on the Chamber’s Foreign Funding, fyi|
A28459BA2B4D5D49BED0238513058A7F012ADC1EF58F
@CAPMAILBOX.americanprogresscenter.org
002 00032146.eml|True|2015-04-14 18:19:46-04:00|Josh Schwerin jschwerin@hillaryclinton.com|hrcrapid HRCrapid@googlegroups.com|=?UTF-8?Q?NYT=3A_Hillary_Clinton=E2=80=99s_Chipotle_Order=3A_Above_Avera?= =?UTF-8?Q?ge?=|CAPrY+5KJ=NG+Vs-khDVpe-L=bP5=qvPcZTS5FDam5LixueQsKA@mail.gmail.com

BTW, I increment the numbers of my result files, DKIM-complete-podesta-1-18-01.txt, DKIM-complete-podesta-1-18-02.txt, because when I don’t, I run different sed commands on the same original file, expecting a cumulative result.

That’s spelled – disappointment and wasted effort looking for problems that aren’t problems. Number your result files.

The nodes and edges mentioned in Clinton/Podesta Emails – Towards A More Complete Graph (Part 1):

Nodes

  • Emails, message-id is ID and subject is label, make Wikileaks id into link
  • From/To, email addresses are ID and name is label
  • True/False, true/false as ID, True/False as labels
  • Date, truncated to 2015-07-24 (example), date as id and label

Edges

  • To/From – Edges with message-id (source) email-address (target) to/from (label)
  • Verify – Edges with message-id (source) true/false (target) verify (label)
  • Date – Edges with message-id (source) – date (target) date (label)

Am I missing anything? The longer I look at problems like this the more likely my thinking/modeling will change.

What follows is very crude, command line creation of the node and edge files. Something more elaborate could be scripted/written in any number of languages.

Our fields (numbered for reference) are:

ID – 1| Verified – 2| Date – 3| From – 4| To – 5| Subject -6| Message-Id – 7

You don’t have to take my word for it, try this:

awk 'FS="|" { print $7}' DKIM-complete-podesta-1-18-02.txt

The output prints to the console. Those all look like message-ids to me, well, with the exception of the one that reads ” 24 September.”

How much dirty data do you think is in the message-id field?

A crude starting assumption is that any message-id field without the “@” character is dirty.

Let’s try:

awk 'FS="|" { print $7}' DKIM-complete-podesta-1-18-02.txt | grep -v @ | wc -l

Which means we are going to extract the 7th field, search (grep) over those results for the “@” sign, where the -v switch means only print lines that DO NOT match, and we will count those lines with wc -l.

Ready? Press return.

I get 594 “dirty” message-ids.

Here is a sampling:


Rolling Stone
TheHill
MSNBC, Jeff Weaver interview on Sanders health care plan and his Wall St. ad
Texas Tribune
20160120 Burlington, IA Organizing Event
start spreadin’ the news…..
Building Trades Union (Keystone XL)
Rubio Hits HRC On Cuba, Russia
2.19.2016
Sourcing Story
Drivers Licenses
Day 1
H1N1 Flu Shot Available

Those look an awful lot like subject lines to me. You?

I suspect those subject lines contained the separator character “|” before we extracted the data from the .eml files.

I’ve tried to repair the existing files but the cleaner solution is to return to the extraction script and the original email files.

More on that tomorrow!

Clinton/Podesta Emails – Towards A More Complete Graph (Part 1)

Filed under: Data Mining,Gephi,Graphs,Hillary Clinton — Patrick Durusau @ 11:48 am

Gephi is a great tool, but it’s only as good as its input.

The Gephi 8.2 email importer (missing in Gephi 9.*) is lossy, informationally speaking, as I have mentioned before.

Here’s a sample from the verification results on podesta-release-1-18:

9981 00045326.eml|False|2015-07-24 12:58:16-04:00|Oren Shur |John Podesta , Robby Mook , Joel Benenson , “Margolis, Jim” , Mandy Grunwald , David Binder , Teddy Goff , Jennifer Palmieri , Kristina Schake , Christina Reynolds , Katie Connolly , “Kaye, Anson” , Peter Brodnitz , “Rimel, John” , David Dixon , Rich Davis , Marlon Marshall , Michael Halle , Matt Paul , Elan Kriegel , Jake Sullivan |FW: July IA Poll Results|<168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com>

The Gephi 8.2 mail importer fails to create a node representing an email message.

I propose we cure that failure by taking the last field, here:

168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com

and the next to last field:

FW: July IA Poll Results

and putting them as id and label, respectively in a node list:

168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; “FW: July IA Poll Results”;

As part of the transformation, we need to remove the < and > signs around the message ID, then add a ; to mark the end of the ID field and put double quotes around the subject to use it as a label. Then close the second field with another ;.

While we are talking about nodes, all the email addresses change from:

Oren Shur

to:

oshur@hillaryclinton.com; “Oren Shur”;

which are ID and label of the node list, respectively.

I could remove the < and > characters as part of the extraction script but will use sed at the command line instead.

Reminder: Always work on a copy of your data, never the original.

Then we need to create an edge list, one that represents the relationships between the email (as node) to the sender and receivers of the email (also nodes). For this first iteration, I’m going to use labels on the edges to distinguish between senders and receivers.

Assuming my first row of the edges file reads:

Source; Target; Role (I did not use “Type” because I suspect that is a controlled term for Gephi.)

Then the first few edges would read:

168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; oshur@hillaryclinton.com; from;
168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; john.podesta@gmail.com; to;
168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; re47@hillaryclinton.com; to;
168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; jbenenson@bsgco.com; to;
168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; Jim.Margolis@gmmb.com; to;
….

As you can see, this is going to be a “busy” graph! 😉

Filtering is going to play an important role in exploring this graph, so let’s add nodes that will help with that process.

I propose we add to the node list:

true; True
false; False

as id and labels.

Which means for the edge list we can have:

168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; true; verify;

Do you have an opinion on the order, source/target for true/false?

Thinking this will enable us to filter nodes that have not been verified or to include only those that have failed verification.

For experimental purposes, I think we need to rework the date field:

2015-07-24 12:58:16-04:00

I would truncate that to:

2015-07-24

and add such truncated dates to the node list:

2015-07-24; 2015-07-24;

as ID and label, respectively.

Then for the edge list:

168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; 2015-07-24; date;

Reasoning that we can filter to include/exclude nodes based on dates, which if you add enough email boxes, could help visualize the reaction to and propagation of emails.
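
Here is a rough Python sketch of those edge rows, reading the “|”-separated file described above; Part 2 (higher up the page) shows where the “|” assumption breaks down on dirty subject lines, and the address-extraction regex is my own shortcut, so treat this as illustrative:

import re

# Field layout: ID|Verified|Date|From|To|Subject|Message-Id
addr = re.compile(r'[\w.+-]+@[\w.-]+')

with open("DKIM-complete-podesta-1-18-02.txt") as infile, open("edges.csv", "w") as out:
    out.write("Source;Target;Role\n")
    for line in infile:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 7:
            continue  # dirty line, handled separately
        msg_id = fields[6]
        verified = fields[1].lower()      # matches the true/false node ids
        date = fields[2].split()[0]       # "2015-07-24 12:58:16-04:00" -> "2015-07-24"
        for sender in addr.findall(fields[3]):
            out.write("%s;%s;from\n" % (msg_id, sender))
        for recipient in addr.findall(fields[4]):
            out.write("%s;%s;to\n" % (msg_id, recipient))
        out.write("%s;%s;verify\n" % (msg_id, verified))
        out.write("%s;%s;date\n" % (msg_id, date))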

Even assuming named participants in these emails have “deleted” their inboxes, there are always automatic backups. It’s just a question of persistence before the rest of this network can be fleshed out.

Oh, one last thing. You have probably noticed the Wikileaks “ID” that forms part of the filename?

9981 00045326.eml

The first part forms the end of a URL to link to the original post at Wikileaks.

Thus, in this example, 9981 becomes:

https://wikileaks.org/podesta-emails/emailid/9981

The general form being:

https://wikileaks.org/podesta-emails/emailid/(Wikileaks-ID)

For the convenience of readers/users, I want to modify my earlier proposal for the email node list entry from:

168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; “FW: July IA Poll Results”;

to:

168005c7ccb3cbcc0beb3ffaa08a8767@mail.gmail.com; “FW: July IA Poll Results”; https://wikileaks.org/podesta-emails/emailid/9981;

Where the third field is “link.”

I am eliding over lots of relationships and subjects but I’m not reluctant to throw it all away and start over.

Your investment in a model isn’t lost by tossing the model, you learn something with every model you build.

Scripting underway, a post on that experience and the node/edge lists to follow later today.

October 27, 2016

Podesta/Clinton Emails: Filtering by Email Address (Pimping Bill Clinton)

Filed under: Data Mining,Gephi,Graphs,Hillary Clinton — Patrick Durusau @ 8:54 pm

The Bill Clinton, Inc. story reminds me of:

Although I steadfastly resist imagining either Bill or Hillary in that video. Just won’t go there!

Where a graph display can make a difference is that instead of just the one email/memo from Bill’s pimp, we can rapidly survey all of the emails in which he appears, in any role.

[image: import-spigot-email-filter-460]

I ran that on Gephi 8.2 against podesta-release-1-18 but the results were:

Nodes 0, Edges 0.

Hmmm, there is something missing, possibly the CSV file?

I checked and podesta-release-1-18 has 393 emails where doug@presidentclinton.com appears.

Could try to find the “right” way to filter on email addresses but for now, let’s take a dirty short-cut.

I created a directory to hold all emails with doug@presidentclinton.com and ran into all manner of difficulties because the file names are plagued with spaces!

So much so that I unintentionally (read “by mistake”) saved all the doug@presidentclinton.com posts from podesta-release-1-18 to a different folder than the ones from podesta-release-19.

🙁
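
If you are repeating this at home, a short Python sketch sidesteps the shell-quoting grief from spaces in the file names (the paths are examples and the destination folder is assumed to exist already):

import glob
import shutil

# Copy every .eml file mentioning the address into one folder.
# Python's file handling doesn't care about spaces in the names.
for name in sorted(glob.glob("podesta-release-1-18/*.eml")):
    with open(name, "rb") as f:
        if b"doug@presidentclinton.com" in f.read():
            shutil.copy(name, "doug-emails/")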

Well, but there is a happy outcome and an illustration of yet another Gephi capability.

I built the first graph from the doug@presidentclinton.com posts from podesta-release-1-18 and then, with that graph open, imported the doug@presidentclinton.com posts from podesta-release-19 and appended those results to the open graph.

How cool is that!

Imagine doing that across data sets, assuming you paid close attention to identifiers, etc.

Sorry, back to the graphs, here is the random layout once the graphs were combined:

[image: doug-default-460]

Applying the Yifan Hu network layout:

[image: doug-network-yifan-hu-460]

I ran network statistics on network diameter and applied colors based on betweenness:

[image: doug-network-centrality-460]

And finally, adjusted the font and turned on the labels:

[image: doub-network-labels-460]

I have spent a fair amount of time just moving stuff about but imagine if you could interactively explore the emails, creating and trashing networks based on to:, from:, cc:, dates, content, etc.

The limits of Gephi imports were a major source of pain today.

I’m dodging those tomorrow in favor of creating node and adjacency tables with scripts.

PS: Don’t John Podesta and Doug Band look like two pimps in a pod? 😉

PPS: If you haven’t read the pimping Bill Clinton memo. (I think it has some other official title.)

October 26, 2016

Clinton/Podesta 19, DKIM-verified-podesta-19.txt.gz, DKIM-complete-podesta-19.txt.gz

Filed under: Data Mining,Hillary Clinton,Python — Patrick Durusau @ 8:02 pm

Michael Best, @NatSecGeek, posted release 19 of the Clinton/Podesta emails at: https://archive.org/details/PodestaEmailszipped today.

A total of 1518 emails, zero (0) of which broke my script!

Three hundred and sixty-three were DKIM verified! DKIM-verified-podesta-19.txt.gz.

The full set of emails, verified and not: DKIM-complete-podesta-19.txt.gz.

I’m still pondering how to best organize the DKIM verified material for access.

I could segregate “verified” emails for indexing. So any “hits” from those searches are from “verified” emails?

Ditto for indexing only attachments of “verified” emails.

What about a graph constructed solely from “verified” emails?

Or should I make verified a property of the emails as nodes? Reasoning that aside from exploring the email importation in Gephi 8.2, it would not be that much more difficult to build node and adjacency lists from the raw emails.

Thoughts/suggestions?

Serious request for help.

Like Gollum, I know what I keep in my pockets, but I have no idea what other people keep in theirs.

What would make this data useful to you?

October 25, 2016

Clinton/Podesta 1-18, DKIM-verified-podesta-1-18.txt.zip, DKIM-complete-podesta-1-18.txt.zip

Filed under: Data Mining,Hillary Clinton,Python — Patrick Durusau @ 9:51 pm

After a long day of waiting for scripts to finish and re-running them to cross-check the results, I am happy to present:

DKIM-verified-podesta-1-18.txt.gz, which consists of the Podesta emails (7526) which returned true for a test of their DKIM signature.

The complete set of the results for all 31,819 emails, can be found in:

DKIM-complete-podesta-1-18.txt.gz.

An email that has been “verified” has a cryptographic guarantee that it reads now exactly as it did when it was sent.

An email that fails verification, may be just as trustworthy, but its DKIM signature has failed for any number of reasons.

One of my motivations for classifying these emails is to enable the exploration of why DKIM verification failed on some of these emails.

Question: What would make this data more useful/accessible to journalists/bloggers?

I ask because dumping data and/or transformations of data can be useful, but it is synthesizing data into a coherent narrative that is the essence of journalism/reporting.

I would enjoy doing the first in hopes of furthering the second.

PS: More emails will be added to this data set as they become available.

Corrupt (fails with my script) files in Clinton/Podesta Emails (14 files out of 31,819)

Filed under: Data Mining,Hillary Clinton,Python — Patrick Durusau @ 7:32 pm

You may use some other definition of “file corruption” but that’s mine and I’m sticking to it.

😉

The following are all the files that failed against my script and the actions I took to proceed with parsing the files. Not today but I will make a sed script to correct these files as future accumulations of emails appear.

13544 00047141.eml

Date string parse failed:

Date: Wed, 17 Dec 2008 12:35:42 -0700 (GMT-07:00)

Deleted (GMT-07:00).

15431 00059196.eml

Date string parse failed:

Date: Tue, 22 Sep 2015 06:00:43 +0800 (GMT+08:00)

Deleted (GMT+08:00).

155 00049680.eml

Date string parse failed:

Date: Mon, 27 Jul 2015 03:29:35 +0000

Assuming, as the email reports, info@centerpeace.org was the sender and podesta@law.georgetown.edu was the intended receiver, then the offset from UT is clearly wrong (+0000).

Deleted +0000.

6793 00059195.eml

Date string parse fail:

Date: Tue, 22 Sep 2015 05:57:54 +0800 (GMT+08:00)

Deleted (GMT+08:00).

9404 0015843.eml DKIM failure

All of the DKIM parse failures take the form:

Traceback (most recent call last):
File "test-clinton-script-24Oct2016.py", line 18, in
verified = dkim.verify(data)
File "/usr/lib/python2.7/dist-packages/dkim/__init__.py", line 604, in verify
return d.verify(dnsfunc=dnsfunc)
File "/usr/lib/python2.7/dist-packages/dkim/__init__.py", line 506, in verify
validate_signature_fields(sig)
File "/usr/lib/python2.7/dist-packages/dkim/__init__.py", line 181, in validate_signature_fields
if int(sig[b'x']) < int(sig[b't']):
KeyError: 't'

I simply deleted the DKIM-Signature in question. Will go down that rabbit hole another day.
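For what it is worth, the deletion itself is easy to script. A minimal sketch (a standalone helper, not part of the test script below) that drops the DKIM-Signature header and its folded continuation lines:

# Sketch: remove the DKIM-Signature header (including folded continuation
# lines) from the named .eml files so parsing can continue.
import re
import sys

def strip_dkim_signature(text):
    # Folded headers continue on lines that start with whitespace.
    return re.sub(r'(?im)^DKIM-Signature:.*(?:\r?\n[ \t].*)*\r?\n', '', text)

if __name__ == '__main__':
    for name in sys.argv[1:]:
        with open(name, 'r') as f:
            data = f.read()
        with open(name, 'w') as f:
            f.write(strip_dkim_signature(data))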

21960 00015764.eml

DKIM signature parse failure.

Deleted DKIM signature.

23177 00015850.eml

DKIM signature parse failure.

Deleted DKIM signature.

23728 00052706.eml

Invalid character in RFC822 header.

I discovered an errant ‘”‘ (double quote mark) at the start of a line.

Deleted the double quote mark.

And deleted ^M line endings.

25040 00015842.eml

DKIM signature parse failure.

Deleted DKIM signature.

26835 00015848.eml

DKIM signature parse failure.

Deleted DKIM signature.

28237 00015840.eml

DKIM signature parse failure.

Deleted DKIM signature.

29052 0001587.eml

DKIM signature parse failure.

Deleted DKIM signature.

29099 00015759.eml

DKIM signature parse failure.

Deleted DKIM signature.

29593 00015851.eml

DKIM signature parse failure.

Deleted DKIM signature.

Here’s an odd pattern for you: all nine (9) of the DKIM signature parse failures were on mail originating from:

From: Gene Karpinski

But there are approximately thirty-three (33) emails from Karpinski, so it doesn’t fail every time.

The file numbers are based on the 1-18 distribution of Podesta emails created by Michael Best, @NatSecGeek, at: Podesta Emails (zipped).
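Until the promised sed script materializes, here is a rough Python sketch of the same cleanup. It only covers the trailing “(GMT±HH:MM)” annotations noted above, not the one-off fixes (the stray double quote, the ^M line endings, the bad +0000 offset), and it rewrites files in place, so run it on a copy:

# Sketch: strip trailing "(GMT+08:00)"-style annotations from Date: headers
# so dateutil can parse them. Stands in for the sed script promised above.
import glob
import re

PATTERN = re.compile(r'^(Date:.*?)\s*\(GMT[+-]\d{1,2}:\d{2}\)\s*$', re.MULTILINE)

for name in glob.glob('*.eml'):
    with open(name, 'r') as f:
        data = f.read()
    fixed = PATTERN.sub(r'\1', data)
    if fixed != data:
        with open(name, 'w') as f:
            f.write(fixed)
        print "cleaned %s" % name   # Python 2, matching the scripts on this page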

Finding “unknown string format” in 1.7 GB of files – Parsing Clinton/Podesta Emails

Filed under: Data Mining,Hillary Clinton,Python — Patrick Durusau @ 4:26 pm

Testing my “dirty” script against Podesta Emails (1.7 GB), some 17,296 files, I got the following message:

Traceback (most recent call last):
  File "test-clinton-script-24Oct2016.py", line 20, in <module>
    date = dateutil.parser.parse(msg['date'])
  File "/usr/lib/python2.7/dist-packages/dateutil/parser.py", line 697, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/usr/lib/python2.7/dist-packages/dateutil/parser.py", line 303, in parse
    raise ValueError, "unknown string format"
ValueError: unknown string format

Now I have to find the file that broke the script.

Beginning Python programmers are laughing at this point because they know using:

for name in glob.glob('*.eml'):

is going to make finding the offending file difficult.

Why?

Consulting the programming oracle (Stack Overflow) on ordering of glob.glob in Python I learned:

By checking the source code of glob.glob you see that it internally calls os.listdir, described here:

http://docs.python.org/library/os.html?highlight=os.listdir#os.listdir

Key sentence: “os.listdir(path): Return a list containing the names of the entries in the directory given by path. The list is in arbitrary order. It does not include the special entries ‘.’ and ‘..’ even if they are present in the directory.”

Arbitrary order. 🙂

Interesting, but not quite an actionable answer!

Take a look at this:

Order is arbitrary, but you can sort them yourself

If you want sorted by name:

sorted(glob.glob('*.png'))

sorted by modification time:

import os
sorted(glob.glob('*.png'), key=os.path.getmtime)

sorted by size:

import os
sorted(glob.glob('*.png'), key=os.path.getsize)

etc.

So for ease in finding the offending file(s) I adjusted:

for name in glob.glob('*.eml'):

to:

for name in sorted(glob.glob('*.eml')):

Now I can tail the results file, and the file after the last entry written is where the script failed.
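A more direct option — a sketch, not what the test script above does — is to catch the parse error and name the offending file immediately, which avoids depending on output order at all:

# Sketch: report any file whose Date: header dateutil cannot parse,
# instead of inferring the culprit from the order of the output file.
import email
import glob

import dateutil.parser

for name in sorted(glob.glob('*.eml')):
    with open(name, 'r') as f:
        msg = email.message_from_string(f.read())
    try:
        dateutil.parser.parse(msg['date'])
    except (ValueError, TypeError):
        print "unparseable Date: header in %s: %r" % (name, msg['date'])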

More on the files that failed in a separate post.

October 24, 2016

Clinton/Podesta Emails, Dirty Data, Dirty Script For Testing

Filed under: Data Mining,Hillary Clinton,Politics — Patrick Durusau @ 9:05 pm

Despite Michael Best’s (@NatSecGeek) efforts at collecting the Podesta emails for convenient bulk download, Podesta Emails Zipped, the bulk downloads don’t appear to have attracted a lot of attention. Some 276 views as of today.

Many of us deeply appreciate Michael’s efforts and would like to see the press and others taking fuller advantage of this remarkable resource.

To encourage you in that direction, what follows is a very dirty script for testing the DKIM signatures in the emails and extracting data from them into a “|” delimited file.

#!/usr/bin/python
# Test the DKIM signature on each *.eml file and write selected headers
# to a "|" delimited file for spreadsheet import.

import dateutil.parser
import email
import dkim
import glob

output = open("verify.txt", 'w')

output.write("id|verified|date|from|to|subject|message-id \n")

for name in glob.glob('*.eml'):
    filename = name
    f = open(filename, 'r')
    data = f.read()
    msg = email.message_from_string(data)

    verified = dkim.verify(data)

    date = dateutil.parser.parse(msg['date'])

    # Collapse runs of whitespace so folded headers stay on one output line.
    msg_from = msg['from']
    msg_from1 = " ".join(msg_from.split())
    msg_to = str(msg['to'])
    msg_to1 = " ".join(msg_to.split())
    msg_subject = str(msg['subject'])
    msg_subject1 = " ".join(msg_subject.split())
    msg_message_id = msg['message-id']

    output.write(filename + '|' + str(verified) + '|' + str(date) +
                 '|' + msg_from1 + '|' + msg_to1 + '|' + msg_subject1 +
                 '|' + str(msg_message_id) + "\n")

output.close()

Download podesta-test.tar.gz, unpack that to a directory, then save/unpack test-clinton-script-24Oct2016.py.gz to the same directory, then:

python test-clinton-script-24Oct2016.py

Import the resulting verify.txt into Gnumeric and, with some formatting, your content should look like: test-clinton-24Oct2016.gnumeric.gz.

Verifying cryptographic signatures takes a moment, even on this sample of 754 files, so don’t be impatient.

This script leaves much to be desired and as you can see, the results aren’t perfect by any means.

Comments and/or suggestions welcome!

This is just the first step in extracting information from this data set that could be used with similar data sets.

For example, if you want to graph this data, how are you going to construct IDs for the nodes, given the repetition of some nodes in the data set?

How are you going to model those relationships?
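On the node ID question, one answer — a sketch only, assuming that a lower-cased address is a good-enough identity test for this data — is to normalize every From/To address with email.utils and use the result as the node ID, so repeated appearances of the same person collapse into a single node:

# Sketch: derive stable node IDs by normalizing addresses; repeated
# appearances of the same address map to the same node.
import email
import email.utils
import glob

nodes = {}   # normalized address -> display name first seen

for name in glob.glob('*.eml'):
    with open(name, 'r') as f:
        msg = email.message_from_string(f.read())
    headers = msg.get_all('from', []) + msg.get_all('to', [])
    for display_name, addr in email.utils.getaddresses(headers):
        addr = addr.strip().lower()
        if addr and addr not in nodes:
            nodes[addr] = display_name

for addr, display_name in sorted(nodes.items()):
    print "%s|%s" % (addr, display_name)   # Python 2, like the script above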

Bonus question: Is this output clean enough to run the script on the full data set, which is increasing on a daily basis?

March 29, 2016

Serious Non-Transparency (+ work around)

Filed under: Books,Data Mining,Law,Searching — Patrick Durusau @ 4:00 pm

I mentioned http://www.bkstr.com/ yesterday in my post: Courses -> Texts: A Hidden Relationship, where I lamented the inability to find courses by their titles.

So you could easily discover the required/suggested texts for any given course. Like browsing a physical campus bookstore.

Obscurity is an “information smell” (to build upon Felienne’s expansion of code smell to spreadsheets).

In this particular case, the “information smell” is skunk class.

I revisited http://www.bkstr.com/ today to extract its > 1200 bookstores for use in crawling a sample of those sites.

For ugly HTML, view the source of: http://www.bkstr.com/.

Parsing that is going to take time, and surely there is an easy way to get a sample of the sites for mining.

The idea didn’t occur to me immediately but I noticed yesterday that the general form of web addresses was:

bookstore-prefix.bkstr.com

So, after some flailing about with the HTML from bkstr.com, I searched for “bkstr.com” and requested all the results.

I’m picking a random ten bookstores with law books for further searching.
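For pulling that sample straight from the site rather than from search results, a sketch along these lines might work — requests and BeautifulSoup are my assumptions here, not anything bkstr.com offers, and it presumes the store links really do follow the store-prefix.bkstr.com pattern noted above:

# Sketch: collect store-prefix.bkstr.com hosts from the bkstr.com front page
# and pick a random sample of ten for closer inspection.
import random
import re
import urlparse

import requests
from bs4 import BeautifulSoup

html = requests.get("http://www.bkstr.com/").text
soup = BeautifulSoup(html, "html.parser")

hosts = set()
for a in soup.find_all("a", href=True):
    host = urlparse.urlparse(a["href"]).netloc
    if re.match(r'^[\w-]+\.bkstr\.com$', host) and host != "www.bkstr.com":
        hosts.add(host)

for host in random.sample(sorted(hosts), min(10, len(hosts))):
    print host   # Python 2, as elsewhere on this page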

Not a high priority but I am curious what lies behind the smoke, mirrors, complex HTML and poor interfaces.

Maybe something, maybe nothing. Won’t know unless we look.

PS: Perhaps a better query string:

www.bkstr.com textbooks-and-course-materials

Suggested refinements?

March 5, 2016

Data Mining Patterns in Crossword Puzzles [Patterns in Redaction?]

Filed under: Crossword Puzzle,Data Mining,Pattern Matching,Pattern Recognition,Security — Patrick Durusau @ 12:06 pm

A Plagiarism Scandal Is Unfolding In The Crossword World by Oliver Roeder.

From the post:

A group of eagle-eyed puzzlers, using digital tools, has uncovered a pattern of copying in the professional crossword-puzzle world that has led to accusations of plagiarism and false identity.

Since 1999, Timothy Parker, editor of one of the nation’s most widely syndicated crosswords, has edited more than 60 individual puzzles that copy elements from New York Times puzzles, often with pseudonyms for bylines, a new database has helped reveal. The puzzles in question repeated themes, answers, grids and clues from Times puzzles published years earlier. Hundreds more of the puzzles edited by Parker are nearly verbatim copies of previous puzzles that Parker also edited. Most of those have been republished under fake author names.

Nearly all this replication was found in two crosswords series edited by Parker: the USA Today Crossword and the syndicated Universal Crossword. (The copyright to both puzzles is held by Universal Uclick, which grew out of the former Universal Press Syndicate and calls itself “the leading distributor of daily puzzle and word games.”) USA Today is one of the country’s highest-circulation newspapers, and the Universal Crossword is syndicated to hundreds of newspapers and websites.

On Friday, a publicity coordinator for Universal Uclick, Julie Halper, said the company declined to comment on the allegations. FiveThirtyEight reached out to USA Today for comment several times but received no response.

Oliver does a great job setting up the background on crossword puzzles and exploring the data that underlies this story. A must read if you are interested in crossword puzzles or know someone who is.

I was more taken with “how” the patterns were mined, which Oliver also covers:


Tausig discovered this with the help of the newly assembled database of crossword puzzles created by Saul Pwanson [1. Pwanson changed his legal name from Paul Swanson] a software engineer. Pwanson wrote the code that identified the similar puzzles and published a list of them on his website, along with code for the project on GitHub. The puzzle database is the result of Pwanson’s own Web-scraping of about 30,000 puzzles and the addition of a separate digital collection of puzzles that has been maintained by solver Barry Haldiman since 1999. Pwanson’s database now holds nearly 52,000 crossword puzzles, and Pwanson’s website lists all the puzzle pairs that have a similarity score of at least 25 percent.

The .xd futureproof crossword format page reads in part:

.xd is a corpus-oriented format, modeled after the simplicity and intuitiveness of the markdown format. It supports 99.99% of published crosswords, and is intended to be convenient for bulk analysis of crosswords by both humans and machines, from the present and into the future.

My first thought was of mining patterns in government redacted reports.

My second thought was that an ASCII format should fit the bill: one that specifies line length in characters (to allow for varying font sizes), preserves line breaks, and encodes each line as characters, whitespace and markouts, one cell per character. Yes?
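To make that concrete, a purely hypothetical sketch: encode each page as a text grid where “x” is a visible character cell, a space is whitespace and “#” is a markout, then score the overlap between two pages’ redaction masks, in the spirit of Pwanson’s 25 percent similarity threshold:

# Hypothetical sketch: compare the redaction masks of two pages encoded as
# text grids ('x' = visible character, ' ' = whitespace, '#' = markout).
def redaction_mask(page):
    return [ch == '#' for line in page.splitlines() for ch in line]

def similarity(page_a, page_b):
    a, b = redaction_mask(page_a), redaction_mask(page_b)
    n = min(len(a), len(b))
    matches = sum(1 for i in range(n) if a[i] and b[i])
    redacted = max(sum(a), sum(b))
    return float(matches) / redacted if redacted else 0.0

page1 = "xxxx ####### xxxx\nxx ########## xxx"
page2 = "xxxx ####### xxxx\nxxxxx ###### xxxx"
print similarity(page1, page2)   # Python 2, as elsewhere on this page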

Surely such a format exists now, yes? Pointers please!

There are those who merit protection by redacted documents, but children are more often victimized by spy agencies than employed by them.

January 30, 2016

9 “Laws” for Data Mining [Be Careful With #5]

Filed under: Data Mining,Data Science — Patrick Durusau @ 10:13 pm

9 “Laws” for Data Mining

A Forbes piece on “laws” for data mining, that are equally applicable to data science.

Being Forbes, technology is valuable because it has value for business, not because “everyone is doing it,” “it’s really cool technology,” “it’s a graph,” or “it will bring all humanity to a new plane of existence.”

To be honest, Forbes is a welcome relief some days.

But even Forbes stumbles, as with law #5:

5. There are always patterns: In practice, your data always holds useful information to support decision-making and action.

What? “…your data always holds useful information to support decision-making and action.”

That’s as nutty as the “new plane of existence” stuff.

When I say “nutty,” I mean that in a professional sense. The term apophenia was coined to label the tendency to see meaningful patterns in random data. (Yes, that includes your data.) Apophenia.

The original work described the “…onset of delusional thinking in psychosis.”

No doubt you will find patterns in your data, but that the patterns “…holds useful information to support decision-making and action” isn’t a given.

That is an echo of the near fanatic belief that if businesses used big data, they would be more profitable.

Most of the other “laws” are more plausible than #5, but even there, don’t abandon your judgement just because Forbes says something is so.

I first saw this in a tweet by Data Science Renee.
