Archive for the ‘Data Mining’ Category

Unmet Needs for Analyzing Biological Big Data… [Data Integration #1 – Spells Market Opportunity]

Wednesday, February 15th, 2017

Unmet Needs for Analyzing Biological Big Data: A Survey of 704 NSF Principal Investigators by Lindsay Barone, Jason Williams, David Micklos.


In a 2016 survey of 704 National Science Foundation (NSF) Biological Sciences Directorate principal investigators (BIO PIs), nearly 90% indicated they are currently or will soon be analyzing large data sets. BIO PIs considered a range of computational needs important to their work, including high performance computing (HPC), bioinformatics support, multi-step workflows, updated analysis software, and the ability to store, share, and publish data. Previous studies in the United States and Canada emphasized infrastructure needs. However, BIO PIs said the most pressing unmet needs are training in data integration, data management, and scaling analyses for HPC, acknowledging that data science skills will be required to build a deeper understanding of life. This portends a growing data knowledge gap in biology and challenges institutions and funding agencies to redouble their support for computational training in biology.

In particular, needs topic maps can address rank #1, #2, #6, #7, and #10, or as found by the authors:

A majority of PIs—across bioinformatics/other disciplines, larger/smaller groups, and the four NSF programs—said their institutions are not meeting nine of 13 needs (Figure 3). Training on integration of multiple data types (89%), on data management and metadata (78%), and on scaling analysis to cloud/HP computing (71%) were the three greatest unmet needs. High performance computing was an unmet need for only 27% of PIs—with similar percentages across disciplines, different sized groups, and NSF programs.

or graphically (figure 3):

So, cloud, distributed, parallel, pipelining, etc., processing is insufficient?

Pushing undocumented and unintegratable data at ever increasing speeds is impressive but gives no joy?

This report will provoke another round of Esperanto fantasies, that is the creation of “universal” vocabularies, which if used by everyone and back-mapped to all existing literature, would solve the problem.

The number of Esperanto fantasies and the cost/delay of back-mapping to legacy data defeat all such efforts. Those defeats haven’t prevented repeated funding of such fantasies in the past, present and no doubt the future.

Perhaps those defeats are a question of scope.

That is, rather than even attempting some “universal” interchange of data, why not approach it incrementally?

I suspect the PIs surveyed each had some particular data set in mind when they mentioned data integration (which itself is a very broad term).

Why not seek out, develop and publish data integrations in particular instances, as opposed to attempting to theorize what might work for data yet unseen?

The need topic maps wanted to meet remains unmet. With no signs of lessening.

Opportunity knocks. Will we answer?

Mining Twitter Data with Python [Trump Years Ahead]

Wednesday, December 21st, 2016

Marco Bonzanini, author of Mastering Social Media Mining with Python, has a seven part series of posts on mining Twitter with Python.

If you haven’t been mining Twitter before now, President-elect Donald Trump is about to change all that.

What if Trump continues to tweet as President and authorizes his appointees to do the same? Spontaneity isn’t the same thing as openness but it could prove to be interesting.

Refining The Dakota Access Pipeline Target List

Sunday, November 20th, 2016

I mentioned in Exploding the Dakota Access Pipeline Target List that while a listing of the banks financing the Dakota Access Pipeline is great, banks and other legal entities are owned, operated and act through people. People who, unlike abstract legal entities, are subject to the persuasion of other people.

Unfortunately, almost all discussions of #DAPL focus on the on-site brutality towards Native Americans and/or the corporations involved in the project.

The protesters deserve our support but resisting local pawns (read police) may change the route of the pipeline, but it won’t stop the pipeline.

To stop the Dakota Access Pipeline, there are only two options:

  1. Influence investors to abandon the project
  2. Make the project prohibitively expensive

In terms of #1, you have to strike through the corporate veil to reach the people who own and direct the affairs of the corporation.

“Piercing the corporate veil” is legal terminology but I mean it as in knowing the named and located individuals who are making decisions for a corporation and the named and located individuals who are its owners.

A legal fiction, such as a corporation, cannot feel public pressure, distress, social ostracism, etc., all things that people are subject to suffering.

Even so, persuasion can only be brought to bear on named and located individuals.

News reports giving only corporate names and not individual owners/agents create a boil of obfuscation.

A boil of obfuscation that needs lancing. Shall we?

To get us off on a common starting point, here are some resources I will be reviewing/using:

Corporate Research Project

The Corporate Research Project assists community, environmental and labor organizations in researching companies and industries. Our focus is on identifying information that can be used to advance corporate accountability campaigns. [Sponsors Dirt Diggers Digest]

Dirt Diggers Digest

chronicling corporate misbehavior (and how to research it) [blog]


LittleSis* is a free database of who-knows-who at the heights of business and government.

* opposite of Big Brother


The largest open database of companies in the world [115,419,017 companies]

Revealing the World of Private Companies by Sheila Coronel

Coronel’s blog post has numerous resources and links.

She also points out that the United States is a top secrecy destination:

A top secrecy jurisdiction is the United States, which doesn’t collect the names of shareholders of private companies and is unsurprisingly one of the most favored nations for hiding illicit wealth. (See, for example, this Reuters report on shell companies in Wyoming.) As Senator Carl Levin says, “It takes more information to obtain a driver’s license or open a U.S. bank account than it does to form a U.S. corporation.” Levin has introduced a bill that would end the formation of companies for unidentified persons, but that is unlikely to pass Congress.

If we picked one of the non-U.S. sponsors of the #DAPL, we might get lucky and hit a transparent or semi-transparent jurisdiction.

Let’s start with a semi-tough case, a U.S. corporation but a publicly traded one, Wells Fargo.

Where would you go next?

Outbrain Challenges the Research Community with Massive Data Set

Sunday, November 13th, 2016

Outbrain Challenges the Research Community with Massive Data Set by Roy Sasson.

From the post:

Today, we are excited to announce the release of our anonymized dataset that discloses the browsing behavior of hundreds of millions of users who engage with our content recommendations. This data, which was released on the Kaggle platform, includes two billion page views across 560 sites, document metadata (such as content categories and topics), served recommendations, and clicks.

Our “Outbrain Challenge” is a call out to the research community to analyze our data and model user reading patterns, in order to predict individuals’ future content choices. We will reward the three best models with cash prizes totaling $25,000 (see full contest details below).

The sheer size of the data we’ve released is unprecedented on Kaggle, the competition’s platform, and is considered extraordinary for such competitions in general. Crunching all of the data may be challenging to some participants—though Outbrain does it on a daily basis.

The rules caution:

The data is anonymized. Please remember that participants are prohibited from de-anonymizing or reverse engineering data or combining the data with other publicly available information.

That would be a more interesting question than the ones presented for the contest.

After the 2016 U.S. presidential election we know that racists, sexists, nationalists, etc., are driven by single factors so assuming you have good tagging, what’s the problem?


Or is human behavior not only complex but variable?

Good luck!

We Should Feel Safer Than We Do

Tuesday, November 8th, 2016

We Should Feel Safer Than We Do by Christian Holmes.

Christian’s Background and Research Goals:


Crime is a divisive and important issue in the United States. It is routinely ranked as among the most important issue to voters, and many politicians have built their careers around their perceived ability to reduce crime. Over 70% of Americans believe that crime is increasing, according to a recent Gallup poll, but is that really the case? I seek to answer this question in this post, as well as determine if there is any clear correlation between government spending and crime.

Research Goals

-Is crime increasing or decreasing in this country?
-Is there a clear link between government spending and crime?

provide an interesting contrast with his conclusions:

From the crime data, it is abundantly clear that crime is on the decline, and has been for around 20 years. The reasons behind this decrease are quite nuanced, though, and I found no clear link between either increased education or police spending and decreasing crime rates. This does not mean that such a relationship does not exist. Rather, it merely means that there is no obvious correlation between the two variables over this specific time frame.

In his background, Christian says:

Over 70% of Americans believe that crime is increasing, according to a recent Gallup poll, but is that really the case? I seek to answer this question in this post,…

Christian presumes, without proof, a relationship between: public beliefs about crime rates (rising or falling) and crime rates as recorded by government agencies.

Which also presumes:

  1. The public is aware that government collects crime statistics.
  2. The public is aware of current crime statistics.
  3. Current crime statistics influence public beliefs about the incidence of crime.

If the central focus of the paper is a comparison of “crime rates” as measured by government with other data on government spending, why even mention the disparity between public “belief” about crime and crime statistics?

I suspect, just as a rhetorical move, Christian is attempting to draw a favorable inference for his “evidence” by contrasting it with “public belief.” “Public belief” that is contrary to the “evidence” in this instance.

Christian doesn’t offer us any basis for judgments about public opinion on crime one way or the other. Any number of factors could be influencing public opinion on that issue, the crime rate as measured by government being only one of those.

The violent crime rate may be very low, statistically speaking, but if you are the victim of a violent crime, from your perspective crime is very prevalent.

Of R and Relationships

Christian uses R to compare crime data with government spending on education and policing.

The unhappy result is that no relationship is evidenced between government spending and a reduction in crime so Christian cautions:

…This does not mean that such a relationship does not exist. Rather, it merely means that there is no obvious correlation between the two variables over this specific time frame….

Here is where we switch from relying on data to exploring the realms of “the data didn’t prove I was wrong.”

Since it isn’t possible to prove the absence of a relationship between the “crime rate” and government spending on education/police, no, the evidence didn’t prove Christian to be wrong.

On the other hand, it clearly shows that Christian has no evidence for that “relationship.”

The caution here is that using R and “reliable” data may lead to conclusions you would rather avoid.

PS: Crime and the public’s fear of crime are both extremely complex issues. Aggregate data can justify previously chosen positions, but little more.

Debate Night Twitter: Analyzing Twitter’s Reaction to the Presidential Debate

Sunday, November 6th, 2016

Debate Night Twitter: Analyzing Twitter’s Reaction to the Presidential Debate by George McIntire.

A bit dated content-wise but George covers techniques, from data gathering to analysis, useful for future events. Possible Presidential inauguration riots on January 20, 2017 for example. Or, the 2017 Super Bowl, where Lady GaGa will be performing.

From the post:

This past Sunday, Donald Trump and Hillary Clinton participated in a town hall-style debate, the second of three such events in this presidential campaign. It was an extremely contentious affair that reverberated across social media.

The political showdown was massively anticipated; the negative atmosphere of the campaign and last week’s news of Trump making lewd comments about women on tape certainly contributed to the fire. Trump further escalated the immense tension by holding a press conference with women who’ve accused former President Bill Clinton of abusing.

With having a near unprecedented amount of attention and hostility, I wanted to gauge Twitter’s reaction to the event. In this project, I streamed tweets under the hashtag #debate and analyzed them to discover trends in Twitter’s mood and how users were reacting to not just the debate overall but to certain events in the debate.

What techniques will you apply to your tweet data sets?

Clinton/Podesta Map (through #30)

Saturday, November 5th, 2016

Charlie Grapski created Navigating Wikileaks: A Guide to the Podesta Emails.


The listing takes 365 pages to date so this is just a tiny sample image.

I don’t have a legend for the row coloring but have tweeted to Charlie about the same.


ggplot2 cheatsheet updated – other R cheatsheets

Wednesday, November 2nd, 2016

RStudio Cheat Sheets

I saw a tweet that the ggplot2 cheatsheet has been updated.

Here’s a list of all the cheatsheets available at RStudio:

  • R Markdown Cheat Sheet
  • RStudio IDE Cheat Sheet
  • Shiny Cheat Sheet
  • Data Visualization Cheat Sheet
  • Package Development Cheat Sheet
  • Data Wrangling Cheat Sheet
  • R Markdown Reference Guide

Contributed Cheatsheets

  • Base R
  • Advanced R
  • Regular Expressions
  • How big is your graph? (base R graphics)

I have deliberately omitted links as when cheat sheets are updated, the links will break and/or you will get outdated information.

Use and reference the RStudio Cheat Sheets page.


Does Verification Matter? Clinton/Podesta Emails Update

Wednesday, November 2nd, 2016

As of today, 10,357 DKIM Verified Clinton/Podesta Emails (of 43,526 total). That’s releases 1-26.

I ask “Does Verification Matter?” in the title to this post because of the seeming lack of interest in verification of emails in the media. Not that it would ever be a lead, but some mention of the verified/not status of an email seems warranted.

Every Clinton/Podesta story mentions Anthony Weiner’s interest in sharing his sexual insecurities and nary a peep about the false Clinton/Obama/Clapper claims that emails have been altered. Easy enough to check. But no specifics are given or requested by the press.

Thanks to the Clinton/Podesta drops by Michael Best, @NatSecGeek, I have now uploaded:

DKIM-verified-podesta-1-26.txt.gz is a sub-set of 10,357 emails that have been verified by their DKIM keys.

The statements in or data attached to those emails may still be false. DKIM verification only validates the email being the same as when it left the email server, nothing more.

DKIM-complete-podesta-1-26.txt.gz is the full set of Podesta emails to date, some 43,526, with their DKIM results of either True or False.

Both files have these fields:

ID – 1| Verified – 2| Date – 3| From – 4| To – 5| Subject -6| Message-Id – 7
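A minimal Python sketch of reading either file into named fields and tallying the Verified column. The separator default and any file name you pass in are assumptions (the listing above only shows the field order), so adjust both to match the file you actually downloaded.

```python
import gzip
from collections import Counter

# Field order as listed above.
FIELDS = ["id", "verified", "date", "from", "to", "subject", "message_id"]


def parse_line(line, sep="|"):
    """Split one record into a dict keyed by field name.

    sep is an assumption; pass whatever separator the file really uses.
    """
    parts = line.rstrip("\n").split(sep)
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def tally_verified(lines, sep="|"):
    """Count how many records are marked True versus False."""
    return Counter(parse_line(line, sep)["verified"] for line in lines)


def tally_file(path, sep="|"):
    """Same tally, reading a gzip-compressed file line by line."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return tally_verified(handle, sep)
```

A record with the wrong number of fields raises an error instead of being silently miscounted, which is usually what you want when the separator may also occur inside a subject line.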


PS: Perhaps verification doesn’t matter when the media repeats false and/or delusional statements of DNI Clapper in hopes of…, I don’t know what they are hoping for but I am hoping they are dishonest, not merely stupid.

9,477 DKIM Verified Clinton/Podesta Emails (of 39,878 total (today))

Monday, October 31st, 2016

Still working on the email graph and at the same time, managed to catch up on the Clinton/Podesta drops by Michael Best, @NatSecGeek, at least for a few hours.

DKIM-verified-podesta-1-24.txt.gz is a sub-set of 9,477 emails that have been verified by their DKIM keys.

The statements in or data attached to those emails may still be false. DKIM verification only validates the email being the same as when it left the email server, nothing more.

DKIM-complete-podesta-1-24.txt.gz is the full set of Podesta emails to date, some 39,878, with their DKIM results of either True or False.

Both files have these fields:

ID – 1| Verified – 2| Date – 3| From – 4| To – 5| Subject -6| Message-Id – 7

Question: Have you seen any news reports that mention emails being “verified” in their reporting?

Emails in the complete set may be as accurate as those in the verified set, but I would think verification is newsworthy in and of itself.


Parsing Emails With Python, A Quick Tip

Monday, October 31st, 2016

While some stuff runs in the background, a quick tip on parsing email with Python.

I got the following error message from Python:

Traceback (most recent call last):
File “”, line 20, in
date = dateutil.parser.parse(msg[‘date’])
File “/usr/lib/python2.7/dist-packages/dateutil/”, line 697, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File “/usr/lib/python2.7/dist-packages/dateutil/”, line 301, in parse
res = self._parse(timestr, **kwargs)
File “/usr/lib/python2.7/dist-packages/dateutil/”, line 349, in _parse
l = _timelex.split(timestr)
File “/usr/lib/python2.7/dist-packages/dateutil/”, line 143, in split
return list(cls(s))
File “/usr/lib/python2.7/dist-packages/dateutil/”, line 137, in next
token = self.get_token()
File “/usr/lib/python2.7/dist-packages/dateutil/”, line 68, in get_token
nextchar =
AttributeError: ‘NoneType’ object has no attribute ‘read’

I have edited the email header in question but it reproduces the original error:

Received: by with SMTP id w14cs34683wfw;
Wed, 5 Nov 2008 08:11:39 -0800 (PST)
Received: by with SMTP id r1mr728791wad.136.1225901498795;
Wed, 05 Nov 2008 08:11:38 -0800 (PST)
Received: from ( [])
by with ESMTP id m26si29354pof.3.2008.;
Wed, 05 Nov 2008 08:11:38 -0800 (PST)
Received-SPF: pass ( domain of designates
Received: from ([])
by with comcast
id bUBY1a0010b6N64A9UBeJl; Wed, 05 Nov 2008 16:11:38 +0000
Received: from ([])
by with comcast
id bUAV1a00L2JMgtY8PUAV7G; Wed, 05 Nov 2008 16:10:30 +0000
X-Authority-Analysis: v=1.0 c=1 a=1Ht49J2nGmlg0oY3xr8A:9
a=8nxvWDfACCTtBObdks-tTUtrMyYA:4 a=OA_lqj45gZcA:10 a=diNjy0DT58-4uIkuavEA:9
a=e0_VUgpf8QEu0XMU188OmzzKrzoA:4 a=37WNUvjkh6kA:10
Received: from [] by;
Wed, 05 Nov 2008 16:10:28 +0000

To: “Podesta” ,
CC: “Denis McDonough OFA” ,”,,
Subject: DOD leadership – immediate attention
Date: Wed, 05 Nov 2008 16:10:28 +0000
Message-Id: <110520081610.3048.4911C574000C2E2100000BE82216>
X-Mailer: AT&T Message Center Version 1 (Oct 30 2007)
X-Authenticated-Sender: c2V3YWxsY29ucm95QGNvbWNhc3QubmV0
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=”NextPart_Webmail_9m3u9jl4l_3048_1225901428_0″

Content-Type: text/plain
Content-Transfer-Encoding: 8bit

I’m comparing “Date” to similar emails and getting no joy.

Absence is hard to notice, but once you know the rule, it’s obvious:

RFC822: Standard for ARPA Internet Text Messages says in part:

3. Lexical Analysis of Messages

3.1 General Description

A message consists of header fields and, optionally, a body. The body is simply a sequence of lines containing ASCII characters. It is separated from the headers by a null line (i.e., a line with nothing preceding the CRLF). (emphasis added)

Yep, the blank line I introduced while removing an errant double-quote on a line by itself, created the start for the body of the message.

Meaning that my Python script failed to find the “Date:” field and returned what someone thought would be a useful error message.

When you get errors parsing emails with Python (and I assume in other languages), check the format of your messages!
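The effect is easy to reproduce with the standard library alone. The sketch below uses two made-up messages (the addresses and body text are placeholders, not taken from the Podesta files): one well formed, one with a stray blank line in the middle of the headers. In the second, everything after the blank line becomes body text, so the lookup of “Date” comes back as None, which is what the date parser then choked on.

```python
from email import message_from_string

# Well-formed message: headers, one blank line, then the body.
GOOD = (
    "To: someone@example.com\n"
    "Subject: DOD leadership - immediate attention\n"
    "Date: Wed, 05 Nov 2008 16:10:28 +0000\n"
    "\n"
    "Body text\n"
)

# Same message with a stray blank line after the first header.
# Per RFC 822 the blank line ends the headers, so Subject and Date
# are now part of the body.
BROKEN = (
    "To: someone@example.com\n"
    "\n"
    "Subject: DOD leadership - immediate attention\n"
    "Date: Wed, 05 Nov 2008 16:10:28 +0000\n"
    "\n"
    "Body text\n"
)


def header_date(raw):
    """Return the raw Date header, or None if the parser never saw one."""
    return message_from_string(raw)["date"]


def safe_date(raw):
    """Fail with a readable message instead of passing None to a date parser."""
    value = header_date(raw)
    if value is None:
        raise ValueError("no Date header found; check for a blank line in the headers")
    return value
```

Checking for None before handing the value to a date parser turns an obscure AttributeError into a message that points at the real problem.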

RFC822 has an appendix of parsing rules and a few examples.

Suggested listings of the most common email/email header format errors?

Clinton/Podesta Emails 23 and 24, True or False? Cryptographically Speaking

Monday, October 31st, 2016

Catching up on the Clinton/Podesta email releases from Wikileaks, via Michael Best, NatSecGeek. Michael bundles the releases up and posts them at: Podesta emails (zipped).

For anyone coming late to the game, DKIM “verified” means that the DKIM signature on an email is valid for that email.

In lay person’s terms, that email has been proven by cryptography to have originated from a particular mail server and when it left that mail server, it read exactly as it does now, i.e., no changes by Russians or others.

What I have created are files that list the emails in the order they appear at Wikileaks, with the very next field being True or False on the verification issue.

Just because an email has “False” in the second column doesn’t mean it has been modified or falsified by the Russians.

DKIM signatures fail for all manner of reasons but when they pass, you have a guarantee the message is intact as sent.

For your research into these emails:




For release 24, I did have to remove the DKIM signature on 39256 00010187.eml in order for the script to succeed. That is the only modification I made to either set of files.

Clinton/Podesta Emails – Towards A More Complete Graph (Part 3) New Dump!

Sunday, October 30th, 2016

As you may recall from Clinton/Podesta Emails – Towards A More Complete Graph (Part 2), I didn’t check to see if “|” was in use as a separator in the extracted email subject lines so when I tried to create node lists based on “|” as a separator, it failed.

That happens. More than many people are willing to admit.

In the meantime, a new dump of emails has arrived so I created the new DKIM-incomplete-podesta-1-22.txt.gz file. Which means picking a new separator to use for the resulting file.

Advice: Check your proposed separator against the data file before using it. I forgot, you shouldn’t.

My new separator? |/|

Which I checked against the file to make sure there would be no conflicts.
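One way to make that check routine is a small helper that scans the raw field values for each candidate before anything is joined. This is only a sketch: the function names and the candidate list are made up for illustration.

```python
def separator_is_safe(lines, candidate):
    """Return True if candidate never occurs in any line of the raw data."""
    return not any(candidate in line for line in lines)


def first_safe_separator(lines, candidates):
    """Return the first candidate absent from the data, or None if all collide."""
    lines = list(lines)  # allow more than one pass over an iterator
    for candidate in candidates:
        if separator_is_safe(lines, candidate):
            return candidate
    return None
```

Absence from the data is necessary but not sufficient: a separator can still misbehave if the tool doing the splitting interprets it as a pattern rather than a literal string, so a test split on a few known lines is worth doing as well.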

The sed commands to remove < and > are the same as in Part 2.

Sigh, back to failure land, again.

Just as one sample:

awk 'FS="|/|" { print $7}'

where is:

9991 00013434.eml|/|False|/|2015-11-21 17:15:25-05:00|/|Eryn Sepp|/|John Podesta|/|Re: Nov 30 / Future Plans / Etc.!|/|


Future Plans

I also checked that with gawk and nawk, with the same result.

For some unknown (to me) reason, all three are treating the first “/” in field 6 (by my count) as a separator, along with the second “/” in that field.

To test that theory, what do you think { print $8 } will return?

You’re right!


So with the “|/|” separator, I’m going to have up to at least 9 fields, perhaps more, varying depending on whether “/” characters occur in the subject line.
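A likely explanation, offered as an assumption rather than something verified against these three awk implementations: awk treats any field separator longer than one character as a regular expression, and in a regular expression “|” means alternation, so “|/|” leaves “/” as the only literal character to match. The Python sketch below imitates that by splitting on “/” alone and compares it with a literal split on the full three-character string. The message-id at the end of the sample line is a made-up placeholder.

```python
# Sample record from above, with a placeholder message-id appended.
LINE = (
    "9991 00013434.eml|/|False|/|2015-11-21 17:15:25-05:00|/|Eryn Sepp|/|"
    "John Podesta|/|Re: Nov 30 / Future Plans / Etc.!|/|placeholder@example.com"
)


def literal_split(line, sep="|/|"):
    """Split on the exact three-character string, as intended."""
    return line.split(sep)


def slash_split(line):
    """Split on every '/', imitating what awk appears to be doing."""
    return line.split("/")


literal = literal_split(LINE)
slashed = slash_split(LINE)

# Seven fields, with the subject intact in position 6 (index 5).
print(len(literal), repr(literal[5]))
# Nine fields, with pieces of the subject spread over positions 6 to 8.
print(len(slashed), repr(slashed[6]), repr(slashed[7]))
```

If that explanation is right, writing the separator so each “|” is literal, for example as the bracket expression [|]/[|], should give awk the intended seven fields.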


That’s not going to work.

OK, so I toss the 10+ MB DKIM-complete-podesta-1-22.txt.gz into Emacs, whose regex treatment I trust, and change “|/|” to “@@@@@” and save that file as DKIM-complete-podesta-1-22-03.txt.

Another sanity check, which got us into all this trouble last time:

awk 'FS="@@@@@" { print $7}' podesta-1-22-03.txt | grep @ | wc -l

returns 36504, which plus the 16 files I culled as failures, equals 36520, the number of files in the Podesta 1-22 release.

Recall that all message-ids contain an @ sign, so getting the correct answer on the number of files gives us confidence the file is ready for further processing.

Apologies for it taking this much prose to go so little a distance.

Our fields (numbered for reference) are:

ID – 1| Verified – 2| Date – 3| From – 4| To – 5| Subject -6| Message-Id – 7

Our first node for the node list (Clinton/Podesta Emails – Towards A More Complete Graph (Part 1)) was to capture the emails themselves.

Using Message-Id (field 7) as the identifier and Subject (field 6) as its label.

We are about to encounter another problem but let’s walk through it.

An example of what we are expecting:

CAC9z1zL9vdT+9FN7ea96r+Jjf2=gy1+821u_g6VsVjr8U2eLEg;”Knox Knotes”;
CAKM1B-9+LQBXr7dgE0pKke7YhQC2dZ2akkgmSbRFGHUx-0NNPg;”Re: Tomorrow”;

We have the Message-Id with a closing “;”, followed by the Subject, surrounded in double quote marks and also terminated by a “;”.

FYI: Mixing single and double quotes in awk is a real pain. I struggled with it but then was reminded I can declare variables:

-v dq='"'

which allows me to do this:

awk -v dq='"' -F '@@@@@' '{ print $7 ";" dq $6 dq ";" }' podesta-1-22-03.txt

The awk variable trick will save you considerable puzzling over escape sequences and the like.

Ah, now we are to the problem I mentioned above.

In the part 1 post I mentioned that while:

CAC9z1zL9vdT+9FN7ea96r+Jjf2=gy1+821u_g6VsVjr8U2eLEg;”Knox Knotes”;;”Re: Tomorrow”;


but having:

CAC9z1zL9vdT+9FN7ea96r+Jjf2=gy1+821u_g6VsVjr8U2eLEg;”Knox Knotes”;;;”Re: Tomorrow”;;

with Wikileaks links is more convenient for readers.

As you may recall, the last two lines read:

9998 00022160.eml@@@@@False@@@@@2015-06-23 23:01:55-05:00@@@@@Jerome Tatar Tatar Jerome Knotes@@@@@CAC9z1zL9vdT+9FN7ea96r
9999 00013746.eml@@@@@False@@@@@2015-04-03 01:14:56-04:00@@@@@Eryn Sepp Podesta

Which means in addition to printing Message-Id and Subject as fields one and two, we need to split ID on the space and use the result to create the URL back to Wikileaks.
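A rough Python sketch of that step, under stated assumptions: the URL prefix is a guess at the Wikileaks link pattern (the original links did not survive in this post), the first token of ID is taken to be the Wikileaks email number, and the output layout of message-id, quoted subject, then link is only one plausible arrangement.

```python
SEP = "@@@@@"

# Assumed link pattern; replace with the real prefix if it differs.
BASE_URL = "https://wikileaks.org/podesta-emails/emailid/"


def node_line(record):
    """Build one node-list line: message-id;"subject";link;"""
    fields = record.rstrip("\n").split(SEP)
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    # ID looks like "9998 00022160.eml": a number, a space, then the file name.
    wikileaks_id, _eml_name = fields[0].split(" ", 1)
    subject, message_id = fields[5], fields[6]
    return f'{message_id};"{subject}";{BASE_URL}{wikileaks_id};'
```

Subjects that themselves contain a double quote or a semicolon would need escaping before this output is fed to a graph tool; the sketch does not attempt that.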

It’s late so I am going to leave you with DKIM-incomplete-podesta-1-22.txt.gz. This is complete save for 16 files that failed to parse. Will repost tomorrow with those included.

I have the first node file script working and that will form the basis for the creation of the edge lists.

PS: Look forward to running awk files tomorrow. It makes a number of things easier.

Exploding the Dakota Access Pipeline Target List

Sunday, October 30th, 2016

Who Is Funding the Dakota Access Pipeline? Bank of America, HSBC, UBS, Goldman Sachs, Wells Fargo by Amy Goodman and Juan González.

Great discussion of the funding behind the North Dakota Pipeline project!

They point to two important graphics to share, the first from: Who’s Banking on the Dakota Access Pipeline?:

view this map on LittleSis

Thanks for the easy embedding! (Best viewed at view this map on LittleSis)

And, from Who’s Banking on the Dakota Access Pipeline? (Food & Water Watch):


The full scale Food & Water Watch image.

Both visualizations are great ways to see some of the culprits responsible for the Dakota Access Pipeline, but not all.

Tracking the funding from Bakken Dakota Access Pipeline back to, among others, Citibank, Credit Agricole, ING Bank, and Natixis, should be a clue as to the next step.

All of the sources of financing, Citibank, Credit Agricole, ING Bank, Natixis, etc., are owned, one way or another, by investors. Moreover, as abstract entities, they cannot act without the use of agents, both as staff and as contractors.

If you take the financing entities as nodes (the first visualization), those should explode into both investor/owners and staff/agents, who do their bidding.

Citibank, for example, is too large and diffuse a target for effective political, social or economic pressure, but the smaller the part, the greater the chance of having influence.

It’s true some nation states might be able to call Citibank to heel and if you can whistle one up, give it a shot. But while waiting on you to make your move, the rest of us should be looking for targets more within our reach.

The same lesson, financiers exploding into more manageable targets (don’t overlook their political allies and their extended target lists), holds for the staffs and agents of Sunoco Logistics, Energy Transfer Partners, Energy Transfer Equity (a sister partnership to Energy Transfer Pipeline), and Bakken Dakota Access Pipeline.

I have yet to see an abstract entity drive a bulldozer, move pipe, etc. Despite the popular fiction that a corporation is a person, it’s somebody on the ground violating the earth, poisoning the water, disturbing sacred ground, all for the benefit of some other natural person.

Corporations, being abstract entities, cannot feel pressure. Their staffs and contractors, on the other hand, don’t have that immunity.

It will be post-election in the US, but I’m interested in demonstrating and assisting in demonstrating, how to explode these target lists in both directions.

As always, your comments and suggestions are most welcome.

PS: One source for picking up names and entities would be:, which self-describes as: is owned and operated by Energy Media Group which is based in Fargo North Dakota. was brought to life to fill a gap in the way that news was brought to the people on this specific energy niche. We wanted to have a place where people could come for all the news related to the Bakken Shale. Energy Media group owns several hundred websites, most of which surround shale formations all over the world.Our sister site went live at the beginning of December and already has a huge following. In the coming months, will be a place not only for news, but for jobs and classifieds as well. Thank-you for visiting us today, and make sure to sign up for our news letter to get the latest updates and also Like us on Facebook.

To give you the full flavor of their coverage: Oil pipeline protester accused of terrorizing officer:

North Dakota authorities have issued an arrest warrant accusing a pipeline protester on horseback of charging at a police officer.

Mason Redwing, of Fort Thompson, South Dakota, is wanted on felony charges of terrorizing and reckless endangerment in the Sept. 28 incident near St. Anthony. He’s also wanted on a previous warrant for criminal trespass.

The Morton County Sheriff’s Office says the officer shouted a warning and pointed a shotgun loaded with non-lethal beanbag rounds to defuse the situation.

I’m on a horse and you have a shotgun. Who is it that is being terrorized?

Yeah, it’s that kind of reporting.

Clinton/Podesta Emails – 1-22 – Progress Report

Saturday, October 29th, 2016

Michael Best has posted a bulk download of the Clinton/Podesta Emails, 1-22.

Thinking (foolishly) that I could quickly pick up from where I left off yesterday, Clinton/Podesta Emails – Towards A More Complete Graph (Part 2), I grabbed the latest bulk download and tossed the verification script against it.

In addition to the usual funnies of having to repair known defective files, again, I also underestimated how long verification of DKIM signatures takes on 36,000+ emails. Even on a fast box with plenty of memory.

At this point I have the latest release DKIM signatures parsed, but there are several files that fail for no discernible reason.

I’m going to have another go at it in the morning and should have part 3 of the graph work up tomorrow.

Apologies for the delay!

Clinton/Podesta Emails – Towards A More Complete Graph (Part 2)

Friday, October 28th, 2016

I assume you are starting with DKIM-complete-podesta-1-18.txt.gz.

If you are starting with another source, you will need different instructions. 😉

First, remember from Clinton/Podesta Emails – Towards A More Complete Graph (Part 1) that I wanted to delete all the < and > signs from the text.

That’s easy enough (uncompress the text first):

sed 's/<//g' DKIM-complete-podesta-1-18.txt > DKIM-complete-podesta-1-18-01.txt

followed by:

sed 's/>//g' DKIM-complete-podesta-1-18-01.txt > DKIM-complete-podesta-1-18-02.txt

Here’s where we started:

001 00032251.eml|False|2010-10-06 18:29:52-04:00|Joshua Dorner <>|”‘'” <>|[big campaign] Follow-up Materials from Background Briefing on the Chamber’s Foreign Funding, fyi|<A28459BA2B4D5D49BED0238513058A7F012ADC1EF58F>
002 00032146.eml|True|2015-04-14 18:19:46-04:00|Josh Schwerin <>|hrcrapid <>|=?UTF-8?Q?NYT=3A_Hillary_Clinton=E2=80=99s_Chipotle_Order=3A_Above_Avera?= =?UTF-8?Q?ge?=|<>

Here’s the result after the first two sed scripts:

001 00032251.eml|False|2010-10-06 18:29:52-04:00|Joshua Dorner|”‘'”|[big campaign] Follow-up Materials from Background Briefing on the Chamber’s Foreign Funding, fyi|
002 00032146.eml|True|2015-04-14 18:19:46-04:00|Josh Schwerin|hrcrapid|=?UTF-8?Q?NYT=3A_Hillary_Clinton=E2=80=99s_Chipotle_Order=3A_Above_Avera?= =?UTF-8?Q?ge?=|

BTW, I increment the numbers of my result files, DKIM-complete-podesta-1-18-01.txt, DKIM-complete-podesta-1-18-02.txt, because when I don’t, I run different sed commands on the same original file, expecting a cumulative result.

That’s spelled – disappointment and wasted effort looking for problems that aren’t problems. Number your result files.
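For anyone who prefers a script to two sed passes, here is a minimal Python sketch of the same clean-up. The function names are made up for illustration and UTF-8 input is an assumption; like the sed commands, it writes to a new, numbered file and leaves the source alone.

```python
def strip_angle_brackets(text):
    """Remove every '<' and '>' character, like the two sed passes."""
    return text.translate(str.maketrans("", "", "<>"))


def clean_file(src, dst):
    """Write a bracket-free copy of src to dst; src is left untouched."""
    with open(src, encoding="utf-8", errors="replace") as infile, \
         open(dst, "w", encoding="utf-8") as outfile:
        for line in infile:
            outfile.write(strip_angle_brackets(line))


# Example call (file names follow the numbering habit described above):
# clean_file("DKIM-complete-podesta-1-18.txt",
#            "DKIM-complete-podesta-1-18-02.txt")
```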

The nodes and edges mentioned in Clinton/Podesta Emails – Towards A More Complete Graph (Part 1):


  • Emails, message-id is ID and subject is label, make Wikileaks id into link
  • From/To, email addresses are ID and name is label
  • True/False, true/false as ID, True/False as labels
  • Date, truncated to 2015-07-24 (example), date as id and label


  • To/From – Edges with message-id (source) email-address (target) to/from (label)
  • Verify – Edges with message-id (source) true/false (target) verify (label)
  • Date – Edges with message-id (source) – date (target) date (label)
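To make the two lists above concrete, here is a rough Python sketch that turns one parsed record into nodes and edges. It is an illustration under assumptions, not the script used for these posts: the function name is hypothetical, the addresses in any example are made up, and it assumes the From and To values still carry their addresses in angle brackets.

```python
from email.utils import getaddresses


def record_to_graph(verified, date, sender, recipients, subject, message_id):
    """Turn one parsed record into (nodes, edges) per the lists above.

    nodes maps id -> label; edges are (source, target, label) tuples.
    """
    nodes = {message_id: subject}
    edges = []

    # True/False node: lower-case id, capitalised label.
    flag = str(verified)
    nodes[flag.lower()] = flag
    edges.append((message_id, flag.lower(), "verify"))

    # Date node, truncated to the day (YYYY-MM-DD).
    day = date[:10]
    nodes[day] = day
    edges.append((message_id, day, "date"))

    # From/To nodes: address is the id, display name is the label.
    for role, header in (("from", sender), ("to", recipients)):
        for name, address in getaddresses([header]):
            if address:
                nodes[address] = name or address
                edges.append((message_id, address, role))
    return nodes, edges
```

Because nodes is a dictionary keyed on id, a person who appears in many emails still ends up as a single node.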

Am I missing anything? The longer I look at problems like this the more likely my thinking/modeling will change.

What follows is very crude, command line creation of the node and edge files. Something more elaborate could be scripted/written in any number of languages.

Our fields (numbered for reference) are:

ID – 1| Verified – 2| Date – 3| From – 4| To – 5| Subject – 6| Message-Id – 7

You don’t have to take my word for it, try this:

awk -F'|' '{ print $7 }' DKIM-complete-podesta-1-18-02.txt

The output prints to the console. Those all look like message-ids to me, well, with the exception of the one that reads ” 24 September.”

How much dirty data do you think is in the message-id field?

A crude starting assumption is that any message-id field without the “@” character is dirty.

Let’s try:

awk -F'|' '{ print $7 }' DKIM-complete-podesta-1-18-02.txt | grep -v @ | wc -l

Which means we are going to extract the 7th field, search (grep) over those results for the “@” sign, where the -v switch means only print lines that DO NOT match, and we will count those lines with wc -l.

Ready? Press return.

I get 594 “dirty” message-ids.
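The same crude test can be sketched in Python, which makes it easier to experiment with other rules later. The function name and defaults are assumptions for illustration; the logic mirrors the awk, grep -v, wc -l pipeline.

```python
def count_dirty_message_ids(lines, sep="|", field=7):
    """Count lines whose 7th field lacks an '@' (the crude test above)."""
    dirty = 0
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        # Lines with fewer than 7 fields count as dirty too, since awk
        # prints an empty string for a missing field.
        value = fields[field - 1] if len(fields) >= field else ""
        if "@" not in value:
            dirty += 1
    return dirty


# Example call:
# with open("DKIM-complete-podesta-1-18-02.txt", encoding="utf-8") as f:
#     print(count_dirty_message_ids(f))
```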

Here is a sampling:

Rolling Stone
MSNBC, Jeff Weaver interview on Sanders health care plan and his Wall St. ad
Texas Tribune
20160120 Burlington, IA Organizing Event
start spreadin’ the news…..
Building Trades Union (Keystone XL)
Rubio Hits HRC On Cuba, Russia
Sourcing Story
Drivers Licenses
Day 1
H1N1 Flu Shot Available

Those look an awful lot like subject lines to me. You?

I suspect those subject lines had the separator character “|” in those lines, before we extracted the data from the .eml files.

I’ve tried to repair the existing files but the cleaner solution is to return to the extraction script and the original email files.
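One possible shape for that cleaner solution, offered as an assumption rather than what the extraction script will end up doing: let Python's csv module write the delimited file. With delimiter="|" and the default minimal quoting, a subject that contains "|" is wrapped in double quotes and reads back as a single field. The record below is made up.

```python
import csv
import io


def write_rows(rows, handle):
    """Write rows as '|' delimited text, quoting fields that contain '|'."""
    writer = csv.writer(handle, delimiter="|", lineterminator="\n")
    writer.writerows(rows)


def read_rows(handle):
    """Read rows written by write_rows back into lists of fields."""
    return list(csv.reader(handle, delimiter="|"))


# A made-up record whose subject contains the separator character.
row = ["00000001.eml", "False", "2016-01-20", "sender@example.org",
       "receiver@example.org", "News | Rolling Stone", "id-1@example.org"]

buffer = io.StringIO()
write_rows([row], buffer)
buffer.seek(0)
assert read_rows(buffer) == [row]   # still seven fields after the round trip
```

The trade-off is that a plain awk -F'|' no longer splits such a file correctly, because it knows nothing about the quoting; the file has to be read with a CSV-aware tool.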

More on that tomorrow!

Clinton/Podesta Emails – Towards A More Complete Graph (Part 1)

Friday, October 28th, 2016

Gephi is a great tool, but it’s only as good as its input.

The Gephi 8.2 email importer (missing in Gephi 9.*) is lossy, informationally speaking, as I have mentioned before.

Here’s a sample from the verification results on podesta-release-1-18:

9981 00045326.eml|False|2015-07-24 12:58:16-04:00|Oren Shur |John Podesta , Robby Mook , Joel Benenson , “Margolis, Jim” , Mandy Grunwald , David Binder , Teddy Goff , Jennifer Palmieri , Kristina Schake , Christina Reynolds , Katie Connolly , “Kaye, Anson” , Peter Brodnitz , “Rimel, John” , David Dixon , Rich Davis , Marlon Marshall , Michael Halle , Matt Paul , Elan Kriegel , Jake Sullivan |FW: July IA Poll Results|<>

The Gephi 8.2 mail importer fails to create a node representing an email message.

I propose we cure that failure by taking the last field, here:

and the next to last field:

FW: July IA Poll Results

and putting them as id and label, respectively in a node list:; “FW: July IA Poll Results”;

As part of the transformation, we need to remove the < and > signs around the message ID, then add a ; to mark the end of the ID field and put double quote ” “ around the subject to use it as a label. Then close the second field with another ;.

While we are talking about nodes, all the email addresses change from:

Oren Shur

to:; “Oren Shur”;

which are ID and label of the node list, respectively.

I could remove the < and > characters as part of the extraction script but will use sed at the command line instead.

Reminder: Always work on a copy of your data, never the original.

Then we need to create an edge list, one that represents the relationships between the email (as node) to the sender and receivers of the email (also nodes). For this first iteration, I’m going to use labels on the edges to distinguish between senders and receivers.

Assuming my first row of the edges file reads:

Source; Target; Role (I did not use “Type” because I suspect that is a controlled term for Gephi.)

Then the first few edges would read:;>; from;;>; to;;; to;;; to;;; to;

As you can see, this is going to be a “busy” graph! 😉

Filtering is going to play an important role in exploring this graph, so let’s add nodes that will help with that process.

I propose we add to the node list:

true; True
false; False

as id and labels.

Which means for the edge list we can have:; true; verify;

Do you have an opinion on the order, source/target for true/false?

Thinking this will enable us to filter nodes that have not been verified or to include only those that have failed verification.

For experimental purposes, I think we need to rework the date field:

2015-07-24 12:58:16-04:00

I would truncate that to:


and add such truncated dates to the node list:

2015-07-24; 2015-07-24;

as ID and label, respectively.

Then for the edge list:; 2015-07-24; date;

Reasoning that we can filter to include/exclude nodes based on dates, which if you add enough email boxes, could help visualize the reaction to and propagation of emails.
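A small Python sketch of the date step, assuming the timestamps keep the form shown above. The function name is hypothetical and the message ID in any example is a placeholder. Parsing the value, rather than slicing the string, means a malformed date raises an error instead of slipping into the node list.

```python
from datetime import datetime


def date_node_and_edge(message_id, timestamp):
    """Return the node row and edge row for a truncated date."""
    day = datetime.fromisoformat(timestamp).date().isoformat()
    node_row = f"{day}; {day};"
    edge_row = f"{message_id}; {day}; date;"
    return node_row, edge_row
```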

Even assuming named participants in these emails have “deleted” their inboxes, there are always automatic backups. It’s just a question of persistence before the rest of this network can be fleshed out.

Oh, one last thing. You have probably noticed the Wikileaks “ID” that forms part of the filename?

9981 00045326.eml

The first part forms the end of a URL to link to the original post at Wikileaks.

Thus, in this example, 9981 becomes:

The general form being:

For the convenience of readers/users, I want to modify my earlier proposal for the email node list entry from:; “FW: July IA Poll Results”;

to:; “FW: July IA Poll Results”;;

Where the third field is “link.”

I am eliding over lots of relationships and subjects but I’m not reluctant to throw it all away and start over.

Your investment in a model isn’t lost by tossing the model, you learn something with every model you build.

Scripting underway, a post on that experience and the node/edge lists to follow later today.

Podesta/Clinton Emails: Filtering by Email Address (Pimping Bill Clinton)

Thursday, October 27th, 2016

The Bill Clinton, Inc. story reminds me of:

Although I steadfastly resist imagining either Bill or Hillary in that video. Just won’t go there!

Where a graph display can make a difference is that instead of just the one email/memo from Bill’s pimp, we can rapidly survey all of the emails in which he appears, in any role.


I ran that on Gephi 8.2 against podesta-release-1-18 but the results were:

Nodes 0, Edges 0.

Hmmm, there is something missing, possibly the CSV file?

I checked and podesta-release-1-18 has 393 emails where appears.

Could try to find the “right” way to filter on email addresses but for now, let’s take a dirty short-cut.

I created a directory to hold all emails with and ran into all manner of difficulties because the file names are plagued with spaces!
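File names with spaces are mostly a shell quoting problem. A sketch of the same dirty short-cut in Python, where paths are passed as objects and never re-parsed by a shell, might look like the following. The function name is made up, and needle stands in for whatever address you are filtering on.

```python
import shutil
from pathlib import Path


def copy_matching(src_dir, dst_dir, needle):
    """Copy every .eml file whose text contains needle into dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(Path(src_dir).glob("*.eml")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if needle in text:
            # Path objects are passed as-is, so spaces need no quoting.
            shutil.copy2(path, dst / path.name)
            copied.append(path.name)
    return copied
```

This is a plain substring match over the whole file, so it will also pick up emails that merely quote the address in the body.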

So much so that I unintentionally (read “by mistake”) saved all the posts from podesta-release-1-18 to a different folder than the ones from podesta-release-19.


Well, but there is a happy outcome and an illustration of yet another Gephi capability.

I built the first graph from the posts from podesta-release-1-18 and then, with that graph open, imported the posts from podesta-release-19 and appended those results to the open graph.

How cool is that!

Imagine doing that across data sets, assuming you paid close attention to identifiers, etc.

Sorry, back to the graphs, here is the random layout once the graphs were combined:


Applying the Yifan Hu network layout:


I ran network statistics on network diameter and applied colors based on betweenness:


And finally, adjusted the font and turned on the labels:


I have spent a fair amount of time just moving stuff about but imagine if you could interactively explore the emails, creating and trashing networks based on to:, from:, cc:, dates, content, etc.

The limits of Gephi imports were a major source of pain today.

I’m dodging those tomorrow in favor of creating node and adjacency tables with scripts.

PS: Don’t John Podesta and Doug Band look like two pimps in a pod? 😉

PPS: If you haven’t read the pimping Bill Clinton memo. (I think it has some other official title.)

Clinton/Podesta 19, DKIM-verified-podesta-19.txt.gz, DKIM-complete-podesta-19.txt.gz

Wednesday, October 26th, 2016

Michael Best, @NatSecGeek, posted release 19 of the Clinton/Podesta emails at: today.

A total of 1518 emails, zero (0) of which broke my script!

Three hundred and sixty-three were DKIM verified! DKIM-verified-podesta-19.txt.gz.

The full set of emails, verified and not: DKIM-complete-podesta-19.txt.gz.

I’m still pondering how to best organize the DKIM verified material for access.

I could segregate “verified” emails for indexing. So any “hits” from those searches are from “verified” emails?

Ditto for indexing only attachments of “verified” emails.

What about a graph constructed solely from “verified” emails?

Or should I make verified a property of the emails as nodes? Reasoning that aside from exploring the email importation in Gephi 8.2, it would not be that much more difficult to build node and adjacency lists from the raw emails.


Serious request for help.

Like Gollum, I know what I keep in my pockets, but I have no idea what other people keep in theirs.

What would make this data useful to you?

Clinton/Podesta 1-18,,

Tuesday, October 25th, 2016

After a long day of waiting for scripts to finish and re-running them to cross-check the results, I am happy to present:

DKIM-verified-podesta-1-18.txt.gz, which consists of the Podesta emails (7526) which returned true for a test of their DKIM signature.

The complete set of the results for all 31,819 emails, can be found in:


An email that has been “verified” has a cryptographic guarantee that it was sent even as it appears to you now.

An email that fails verification, may be just as trustworthy, but its DKIM signature has failed for any number of reasons.

One of my motivations for classifying these emails is to enable the exploration of why DKIM verification failed on some of these emails.

Question: What would make this data more useful/accessible to journalists/bloggers?

I ask because, while dumping data and/or transformations of data can be useful, it is synthesizing data into a coherent narrative that is the essence of journalism/reporting.

I would enjoy doing the first in hopes of furthering the second.

PS: More emails will be added to this data set as they become available.

Corrupt (fails with my script) files in Clinton/Podesta Emails (14 files out of 31,819)

Tuesday, October 25th, 2016

You may use some other definition of “file corruption” but that’s mine and I’m sticking to it.


The following are all the files that failed against my script and the actions I took to proceed with parsing the files. Not today but I will make a sed script to correct these files as future accumulations of emails appear.

13544 00047141.eml

Date string parse failed:

Date: Wed, 17 Dec 2008 12:35:42 -0700 (GMT-07:00)

Deleted (GMT-07:00).

15431 00059196.eml

Date string parse failed:

Date: Tue, 22 Sep 2015 06:00:43 +0800 (GMT+08:00)

Deleted (GMT+08:00).

155 00049680.eml

Date string parse failed:

Date: Mon, 27 Jul 2015 03:29:35 +0000

Assuming, as the email reports, was the sender and was the intended receiver, then the offset from UT is clearly wrong (+0000).

Deleted +0000.

6793 00059195.eml

Date string parse failed:

Date: Tue, 22 Sep 2015 05:57:54 +0800 (GMT+08:00)

Deleted (GMT+08:00).

9404 0015843.eml DKIM failure

All of the DKIM parse failures take the form:

Traceback (most recent call last):
File "", line 18, in
verified = dkim.verify(data)
File "/usr/lib/python2.7/dist-packages/dkim/", line 604, in verify
return d.verify(dnsfunc=dnsfunc)
File "/usr/lib/python2.7/dist-packages/dkim/", line 506, in verify
File "/usr/lib/python2.7/dist-packages/dkim/", line 181, in validate_signature_fields
if int(sig[b'x']) < int(sig[b't']):
KeyError: 't'

I simply deleted the DKIM-Signature in question. Will go down that rabbit hole another day.

21960 00015764.eml

DKIM signature parse failure.

Deleted DKIM signature.

23177 00015850.eml

DKIM signature parse failure.

Deleted DKIM signature.

23728 00052706.eml

Invalid character in RFC822 header.

I discovered an errant ‘”‘ (double quote mark) at the start of a line.

Deleted the double quote mark.

And deleted ^M line endings.

25040 00015842.eml

DKIM signature parse failure.

Deleted DKIM signature.

26835 00015848.eml

DKIM signature parse failure.

Deleted DKIM signature.

28237 00015840.eml

DKIM signature parse failure.

Deleted DKIM signature.

29052 0001587.eml

DKIM signature parse failure.

Deleted DKIM signature.

29099 00015759.eml

DKIM signature parse failure.

Deleted DKIM signature.

29593 00015851.eml

DKIM signature parse failure.

Deleted DKIM signature.

Here’s an odd pattern for you: all nine (9) of the failures to parse the DKIM signatures were on mail originating from:

From: Gene Karpinski

But there are approximately thirty-three (33) emails from Karpinski so it doesn’t fail every time.

The file numbers are based on the 1-18 distribution of Podesta emails created by Michael Best, @NatSecGeek, at: Podesta Emails (zipped).
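Three of the date repairs above amount to deleting a trailing "(GMT±hh:mm)" comment from the Date header. Until the promised sed script exists, here is a hedged Python sketch of that one repair. The pattern and function name are assumptions based only on the examples in this post, and it should be run on copies of the files.

```python
import re

# Matches a trailing comment such as "(GMT-07:00)" or "(GMT+08:00)".
TRAILING_GMT = re.compile(r"\s*\(GMT[+-]\d{2}:\d{2}\)\s*$")


def strip_gmt_comment(date_header):
    """Drop the trailing (GMT+hh:mm) or (GMT-hh:mm) comment from a Date value."""
    return TRAILING_GMT.sub("", date_header)
```

Values without such a comment pass through unchanged, so the function can be applied to every Date header before parsing.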

Finding “unknown string format” in 1.7 GB of files – Parsing Clinton/Podesta Emails

Tuesday, October 25th, 2016

Testing my “dirty” script against Podesta Emails (1.7 GB), some 17,296 files, I got the following message:

Traceback (most recent call last):
File "", line 20, in
date = dateutil.parser.parse(msg['date'])
File "/usr/lib/python2.7/dist-packages/dateutil/", line 697, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File "/usr/lib/python2.7/dist-packages/dateutil/", line 303, in parse
raise ValueError, "unknown string format"
ValueError: unknown string format

Now I have to find the file that broke the script.

Beginning Python programmers are laughing at this point because they know using:

for name in glob.glob('*.eml'):

is going to make finding the offending file difficult.


Consulting the programming oracle (Stack Overflow) on ordering of glob.glob in Python I learned:

By checking the source code of glob.glob you see that it internally calls os.listdir, described here:

Key sentence: os.listdir(path) Return a list containing the names of the entries in the directory given by path. The list is in arbitrary order. It does not include the special entries ‘.’ and ‘..’ even if they are present in the directory.

Arbitrary order. 🙂

Interesting but not quite an actionable answer!

Take a look at:

Order is arbitrary, but you can sort them yourself

If you want sorted by name:

sorted(glob.glob('*.png'))


sorted by modification time:

import os
sorted(glob.glob('*.png'), key=os.path.getmtime)

sorted by size:

import os
sorted(glob.glob('*.png'), key=os.path.getsize)


So for ease in finding the offending file(s) I adjusted:

for name in glob.glob('*.eml'):

to:

for name in sorted(glob.glob('*.eml')):

Now I can tail the results file in question and the next file is where the script failed.
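An alternative that does not depend on file order is to catch the exception and record the file name. The sketch below is not what the script above does. To stay self-contained it uses the standard library's email.utils.parsedate_to_datetime in place of dateutil, which is stricter, so the set of failures it reports may differ. The function name is hypothetical.

```python
from email.utils import parsedate_to_datetime


def find_bad_dates(date_headers):
    """Return (filename, value) pairs whose Date header will not parse.

    date_headers maps a file name to its raw Date header value.
    """
    failures = []
    for filename in sorted(date_headers):
        value = date_headers[filename]
        try:
            parsedate_to_datetime(value)
        except (TypeError, ValueError):
            # Older Pythons raise TypeError, newer ones ValueError.
            failures.append((filename, value))
    return failures
```

The script keeps running past the first bad file, so one pass over the directory yields the complete list of offenders.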

More on the files that failed in a separate post.

Clinton/Podesta Emails, Dirty Data, Dirty Script For Testing

Monday, October 24th, 2016

Despite Michael Best’s (@NatSecGeek) efforts at collecting the Podesta emails for convenient bulk download, Podesta Emails Zipped, the bulk downloads don’t appear to have attracted a lot of attention. Some 276 views as of today.

Many of us deeply appreciate Michael’s efforts and would like to see the press and others taking fuller advantage of this remarkable resource.

To encourage you in that direction, what follows is a very dirty script for testing the DKIM signatures in the emails and extracting data from the emails for writing to a “|” delimited file.


# Python 2.7
import dateutil.parser
import email
import dkim
import glob

output = open("verify.txt", 'w')

# header row for the "|" delimited file
output.write("id|verified|date|from|to|subject|message-id \n")

for name in glob.glob('*.eml'):
    filename = name
    f = open(filename, 'r')
    data = f.read()
    f.close()
    msg = email.message_from_string(data)

    # True if the DKIM signature verifies, False otherwise
    verified = dkim.verify(data)

    date = dateutil.parser.parse(msg['date'])

    # str() guards against missing headers; split/join collapses
    # folded headers and runs of whitespace into single spaces
    msg_from = str(msg['from'])
    msg_from1 = " ".join(msg_from.split())
    msg_to = str(msg['to'])
    msg_to1 = " ".join(msg_to.split())
    msg_subject = str(msg['subject'])
    msg_subject1 = " ".join(msg_subject.split())
    msg_message_id = msg['message-id']

    output.write(filename + '|' + str(verified) + '|' + str(date) +
                 '|' + msg_from1 + '|' + msg_to1 + '|' + msg_subject1 +
                 '|' + str(msg_message_id) + "\n")

output.close()


Download podesta-test.tar.gz, unpack that to a directory and then save/unpack to the same directory, then:


Import that into Gnumeric and with some formatting, your content should look like: test-clinton-24Oct2016.gnumeric.gz.

Verifying cryptographic signatures takes a moment, even on this sample of 754 files so don’t be impatient.

This script leaves much to be desired and as you can see, the results aren’t perfect by any means.

Comments and/or suggestions welcome!

This is just the first step in extracting information from this data set that could be used with similar data sets.

For example, if you want to graph this data, how are you going to construct IDs for the nodes, given the repetition of some nodes in the data set?

How are you going to model those relationships?

Bonus question: Is this output clean enough to run the script on the full data set? Which is increasing on a daily basis?

Serious Non-Transparency (+ work around)

Tuesday, March 29th, 2016

In yesterday’s post, Courses -> Texts: A Hidden Relationship, I lamented the inability to find courses by their titles.

So you could easily discover the required/suggested texts for any given course. Like browsing a physical campus bookstore.

Obscurity is an “information smell” (to build upon Felienne‘s expansion of code smell to spreadsheets).

In this particular case, the “information smell” is skunk class.

I revisited today to extract its > 1200 bookstores for use in crawling a sample of those sites.

For ugly HTML, view the source of:

Parsing that is going to take time and surely there is an easy way to get a sample of the sites for mining.

The idea didn’t occur to me immediately but I noticed yesterday that the general form of web addresses was:

So, after some flailing about with the HTML from, I searched for “” and requested all the results.

I’m picking a random ten bookstores with law books for further searching.

Not a high priority but I am curious what lies behind the smoke, mirrors, complex HTML and poor interfaces.

Maybe something, maybe nothing. Won’t know unless we look.

PS: Perhaps a better query string: textbooks-and-course-materials

Suggested refinements?

Data Mining Patterns in Crossword Puzzles [Patterns in Redaction?]

Saturday, March 5th, 2016

A Plagiarism Scandal Is Unfolding In The Crossword World by Oliver Roeder.

From the post:

A group of eagle-eyed puzzlers, using digital tools, has uncovered a pattern of copying in the professional crossword-puzzle world that has led to accusations of plagiarism and false identity.

Since 1999, Timothy Parker, editor of one of the nation’s most widely syndicated crosswords, has edited more than 60 individual puzzles that copy elements from New York Times puzzles, often with pseudonyms for bylines, a new database has helped reveal. The puzzles in question repeated themes, answers, grids and clues from Times puzzles published years earlier. Hundreds more of the puzzles edited by Parker are nearly verbatim copies of previous puzzles that Parker also edited. Most of those have been republished under fake author names.

Nearly all this replication was found in two crosswords series edited by Parker: the USA Today Crossword and the syndicated Universal Crossword. (The copyright to both puzzles is held by Universal Uclick, which grew out of the former Universal Press Syndicate and calls itself “the leading distributor of daily puzzle and word games.”) USA Today is one of the country’s highest-circulation newspapers, and the Universal Crossword is syndicated to hundreds of newspapers and websites.

On Friday, a publicity coordinator for Universal Uclick, Julie Halper, said the company declined to comment on the allegations. FiveThirtyEight reached out to USA Today for comment several times but received no response.

Oliver does a great job setting up the background on crossword puzzles and exploring the data that underlies this story. A must read if you are interested in crossword puzzles or know someone who is.

I was more taken with “how” the patterns were mined, which Oliver also covers:

Tausig discovered this with the help of the newly assembled database of crossword puzzles created by Saul Pwanson [Pwanson changed his legal name from Paul Swanson], a software engineer. Pwanson wrote the code that identified the similar puzzles and published a list of them on his website, along with code for the project on GitHub. The puzzle database is the result of Pwanson’s own Web-scraping of about 30,000 puzzles and the addition of a separate digital collection of puzzles that has been maintained by solver Barry Haldiman since 1999. Pwanson’s database now holds nearly 52,000 crossword puzzles, and Pwanson’s website lists all the puzzle pairs that have a similarity score of at least 25 percent.

The .xd futureproof crossword format page reads in part:

.xd is a corpus-oriented format, modeled after the simplicity and intuitiveness of the markdown format. It supports 99.99% of published crosswords, and is intended to be convenient for bulk analysis of crosswords by both humans and machines, from the present and into the future.

My first thought was of mining patterns in government redacted reports.

My second thought was that an ASCII format that specifies line length (to allow for varying font sizes) in characters, plus line breaks and lines composed of characters, whitespace and markouts as single characters should fit the bill. Yes?
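To show what I have in mind, here is a toy Python sketch of reading such a format. Everything in it is hypothetical: "#" as the single-character markout, the function names, and the idea of recording each redaction as a (start, length) run so that patterns can be compared across documents.

```python
MARKOUT = "#"  # assumed single-character stand-in for redacted text


def redaction_runs(line):
    """Return (start, length) for each run of MARKOUT characters in line."""
    runs = []
    start = None
    for i, ch in enumerate(line):
        if ch == MARKOUT and start is None:
            start = i
        elif ch != MARKOUT and start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(line) - start))
    return runs


def page_profile(text, line_length):
    """Per line: pad to line_length, then list the redaction runs."""
    return [redaction_runs(line.ljust(line_length))
            for line in text.split("\n")]
```

A real format would need an escape for literal "#" characters in the text; this sketch ignores that.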

Surely such a format exists now, yes? Pointers please!

There are those who merit protection by redacted documents, but children are more often victimized by spy agencies than employed by them.

9 “Laws” for Data Mining [Be Careful With #5]

Saturday, January 30th, 2016

9 “Laws” for Data Mining

A Forbes piece on “laws” for data mining, that are equally applicable to data science.

Being Forbes, technology is valuable because it has value for business, not because “everyone is doing it,” “it’s really cool technology,” “it’s a graph,” or “it will bring all humanity to a new plane of existence.”

To be honest, Forbes is a welcome relief some days.

But even Forbes stumbles, as with law #5:

5. There are always patterns: In practice, your data always holds useful information to support decision-making and action.

What? “…your data always holds useful information to support decision-making and action.”

That’s as nutty as the “new plane of existence” stuff.

When I say “nutty,” I mean that in a professional sense. The term apophenia was coined to label the tendency to see meaningful patterns in random data. (Yes, that includes your data.) Apophenia.

The original work described the “…onset of delusional thinking in psychosis.”

No doubt you will find patterns in your data but that the patterns “…holds useful information to support decision-making and action” isn’t a given.
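If you doubt that random data will hand you patterns, try it. The following is a small sketch (the names, series length and candidate count are arbitrary choices for illustration) that compares one series of random numbers against many others and reports the strongest correlation it finds.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_spurious_correlation(candidates=1000, length=20, seed=0):
    """Best |correlation| between one random series and many others."""
    rng = random.Random(seed)
    target = [rng.random() for _ in range(length)]
    best = 0.0
    for _ in range(candidates):
        other = [rng.random() for _ in range(length)]
        best = max(best, abs(pearson(target, other)))
    return best
```

Every series here is noise, yet search enough candidates and one of them will track the target well enough to look like a finding. None of it supports any decision or action.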

That is an echo of the near fanatic belief that if businesses used big data, they would be more profitable.

Most of the other “laws” are more plausible than #5, but even there, don’t abandon your judgement even if Forbes says that something is so.

I first saw this in a tweet by Data Science Renee.

Amazon Top 20 Books in Data Mining – 18? Low Quality Listicle?

Monday, January 25th, 2016

Amazon Top 20 Books in Data Mining by Matthew Mayo.

Matthew’s bio says:

Bio: Matthew Mayo is a computer science graduate student currently working on his thesis parallelizing machine learning algorithms. He is also a student of data mining, a data enthusiast, and an aspiring machine learning scientist.

So, puzzle me this:

  • Why does this listicle have “Data Science From Scratch: First Principles with Python” by Joel Grus, listed twice?
  • Why does David Pogue’s “iPhone: The Missing Manual” appear in this list?

“Data Science From Scratch: First Principles with Python” appears twice because one is paperback and the other is Kindle. Amazon treats those as separate subjects for sales purposes, although to a reader they are more likely a single subject, which has several formats.

The appearance of “iPhone: The Missing Manual” in this listing is a category error.

If you want to generate unproofed listicles of bestsellers, start with the Amazon best link for computer science or choose one of its many sub-categories such as data mining.

The measure of a listicle isn’t how easy it was to generate but how useful it is to the targeted community.

Duplication and irrelevant results detract from the usefulness of a listicle.


Playboy Exposed [Complete Archive]

Wednesday, December 30th, 2015

Playboy Exposed by Univision’s Data Visualization Unit.

From the post:

The first time Pamela Anderson got naked for a Playboy cover, with a straw hat covering her inner thighs, she was barely 22 years old. It was 1989 and the magazine was starting to favor displaying young blondes on its covers.

On Friday, December 11, 2015, a quarter century later, the popular American model, now 48, graced the historical last nude edition of the magazine, which lost the battle for undress and decided to cover up its women in order to survive.

Univision Noticias analyzed all the covers published in the US, starting with Playboy’s first issue in December 1953, to study the cover models’ physical attributes: hair and skin color, height, age and body measurements. With these statistics, a model of the prototype woman for each decade emerged. It can be viewed in this interactive special.

I’ve heard people say they bought Playboy magazine for the short stories but this is my first time to hear of someone just looking at the covers. 😉

The possibilities for analysis of Playboy and its contents are nearly endless.

Consider the history of “party jokes” or “Playboy Advisor,” not to mention the cartoons in every issue.

I did check the Playboy Store but wasn’t able to find a DVD set with all the issues.

You can subscribe to Playboy Archive for $8.00 a month and access every issue from the first issue to the current one.

I don’t have a subscription so I’m not sure how you would do the OCR to capture the jokes.

Great R packages for data import, wrangling & visualization [+ XQuery]

Tuesday, December 29th, 2015

Great R packages for data import, wrangling & visualization by Sharon Machlis.

From the post:

One of the great things about R is the thousands of packages users have written to solve specific problems in various disciplines — analyzing everything from weather or financial data to the human genome — not to mention analyzing computer security-breach data.

Some tasks are common to almost all users, though, regardless of subject area: data import, data wrangling and data visualization. The table below show my favorite go-to packages for one of these three tasks (plus a few miscellaneous ones tossed in). The package names in the table are clickable if you want more information. To find out more about a package once you’ve installed it, type help(package = "packagename") in your R console (of course substituting the actual package name ).

Forty-seven (47) “favorites” sounds a bit on the high side but some people have more than one “favorite” ice cream, or obsession. 😉

You know how I feel about sort-order and I could not detect an obvious one in Sharon’s listing.

So, I extracted the package links/name plus the short description into a new table:

car data wrangling
choroplethr mapping
data.table data wrangling, data analysis
devtools package development, package installation
downloader data acquisition
dplyr data wrangling, data analysis
DT data display
dygraphs data visualization
editR data display
fitbitScraper misc
foreach data wrangling
ggplot2 data visualization
gmodels data wrangling, data analysis
googlesheets data import, data export
googleVis data visualization
installr misc
jsonlite data import, data wrangling
knitr data display
leaflet mapping
listviewer data display, data wrangling
lubridate data wrangling
metricsgraphics data visualization
openxlsx misc
plotly data visualization
plotly data visualization
plyr data wrangling
psych data analysis
quantmod data import, data visualization, data analysis
rcdimple data visualization
RColorBrewer data visualization
readr data import
readxl data import
reshape2 data wrangling
rga Web analytics
rio data import, data export
RMySQL data import
roxygen2 package development
RSiteCatalyst Web analytics
rvest data import, web scraping
scales data wrangling
shiny data visualization
sqldf data wrangling, data analysis
stringr data wrangling
tidyr data wrangling
tmap mapping
XML data import, data wrangling
zoo data wrangling, data analysis


I want to use XQuery at least once a day in 2016 on my blog. To keep myself honest, I will be posting any XQuery I use.

To sort and extract two of the columns from Sharon’s table, I copied the table to a separate file and ran this XQuery:

  1. xquery version "1.0";
  2. <html>
  3. <table>{
  4. for $row in doc("/home/patrick/working/favorite-R-packages.xml")/table/tr
  5. order by lower-case(string($row/td[1]/a))
  6. return <tr>{$row/td[1]} {$row/td[2]}</tr>
  7. }</table>
  8. </html>

One of the nifty aspects of XQuery is that you can sort, as on line 5, in all lower-case on the first <td> element, while returning the same element as written in the original table. Which gives better (IMHO) sort order than UPPERCASE followed by lowercase.
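For readers without an XQuery processor at hand, the same idea (sort on a lower-cased key, return the value as written) can be sketched in Python. The function name is made up, and the rows are assumed to be already extracted as (name, description) pairs.

```python
def sort_rows(rows):
    """Sort (name, description) rows by lower-cased name, keeping the
    names exactly as written."""
    return sorted(rows, key=lambda row: row[0].lower())


rows = [("DT", "data display"),
        ("car", "data wrangling"),
        ("dplyr", "data wrangling, data analysis")]

# A plain sort puts "DT" first because uppercase sorts before lowercase;
# the keyed sort places it after "dplyr".
print(sort_rows(rows))
```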

This same technique should make you the master of any simple tables you encounter on the web.

PS: You should always acknowledge the source of your data and the original author.

I first saw Sharon’s list in a tweet by Christophe Lalanne.

Data scientists: Question the integrity of your data [Relevance/Fitness – Not “Integrity”]

Saturday, December 12th, 2015

Data scientists: Question the integrity of your data by Rebecca Merrett.

From the post:

If there’s one lesson website traffic data can teach you, it’s that information is not always genuine. Yet, companies still base major decisions on this type of data without questioning its integrity.

At ADMA’s Advancing Analytics in Sydney this week, Claudia Perlich, chief scientist of Dstillery, a marketing technology company, spoke about the importance of filtering out noisy or artificial data that can skew an analysis.

“Big data is killing your metrics,” she said, pointing to the large portion of bot traffic on websites.

“If the metrics are not really well aligned with what you are truly interested in, they can find you a lot of clicking and a lot of homepage visits, but these are not the people who will buy the product afterwards because they saw the ad.”

Predictive models that look at which users go to some brands’ home pages, for example, are open to being completely flawed if data integrity is not called into question, she said.

“It turns out it is much easier to predict bots than real people. People write apps that skim advertising, so a model can very quickly pick up what that traffic pattern of bots was; it can predict very, very well who would go to these brands’ homepages as long as there was bot traffic there.”

The predictive model in this case will deliver accurate results when testing its predictions. However, that doesn’t bring marketers or the business closer to reaching its objective of real human ad conversions, Perlich said.

The on-line Merriam-Webster’s defines “integrity” as:

  1. firm adherence to a code of especially moral or artistic values : incorruptibility
  2. an unimpaired condition : soundness
  3. the quality or state of being complete or undivided : completeness

None of those definitions of “integrity” apply to the data Perlich describes.

What Perlich criticizes is measuring data with no relationship to the goal of the analysis, “…human ad conversions.”

That’s not “integrity” of data. Perhaps appropriateness or fitness for use, or relevance, but not “integrity.”

Avoid vague and moralizing terminology when discussing data and data science.

Discussions of ethics are difficult enough without introducing confusion with unrelated issues.

I first saw this in a tweet by Data Science Renee.