Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

November 11, 2016

“connecting the dots” requires dots (Support Michael Best)

Filed under: Government Data,Politics,Transparency — Patrick Durusau @ 9:45 pm

Michael Best is creating a massive archive of government documents.

From the post:

Since 2015, I’ve published millions of government documents (about 10% of the text items on the Internet Archive, with some items containing thousands of documents) and terabytes of data; but in order to keep going, I need your help. Since I’ve gotten started, no outlet has matched the number of government documents that I’ve published and made freely available. The only non-governmental publisher that rivals the size and scope of the government files I’ve uploaded is WikiLeaks. While I analyze and write about these documents, I consider publishing them to be more important because it enables and empowers an entire generation of journalists, researchers and students of history.

I’ve also pressured government agencies into making their documents more widely available. This includes the more than 13,000,000 pages of CIA documents that are being put online soon, partially in response to my Kickstarter and publishing efforts. These documents are coming from CREST, which is a special CIA database of declassified records. Currently, it can only be accessed from four computers in the world, all of them just outside of Washington D.C. These records, which represent more than 3/4 of a million CIA files, will soon be more accessible than ever – but even once that’s done, there’s a lot more work left to do.

Question: Do you want a transparent and accountable Trump presidency?

Potential Answers include:

1) Yes, but I’m going to spend time and resources hyper-ventilating with others and roaming the streets.

2) Yes, and I’m going to support Michael Best and FOIA efforts.

Governments, even Trump’s presidency, don’t spring from ocean foam.

[Image: Sandro Botticelli, La nascita di Venere]

The people chosen to fill cabinet and other posts have history, in many cases government history.

For example, I heard a rumor today that Ed Meese, a former government crime lord, is on the Trump transition team. Hell, I thought he was dead.

Michael’s efforts produce the dots that connect past events, places, people, and even present administrations.

The dots Michael produces may support your exposé, winning story and/or indictment.

Are you in or out?

November 9, 2016

Trump Wins! Trump Wins! A Diversity Lesson For Data Scientists

Filed under: Diversity,Government,Politics,Statistics — Patrick Durusau @ 9:50 pm

Here’s Every Major Poll That Got Donald Trump’s Election Win Wrong by Brian Flood.

From the post:

When Donald Trump shocked the world to become the president-elect on Tuesday night, the biggest loser wasn’t his opponent Hillary Clinton, it was the polling industry that tricked America into thinking we’d be celebrating the first female president right about now.

The polls, which Trump has been calling inaccurate and rigged for months, made it seem like Clinton was a lock to occupy the White House come January.

Nate Silver’s FiveThirtyEight is supposed to specialize in data-based journalism, but the site reported on Tuesday morning that Clinton had a 71.4 percent chance of winning the election. The site was wrong about the outcome in major battleground states including Florida, North Carolina and Pennsylvania, and Trump obviously won the election in addition to the individual states that were supposed to vote Clinton. Silver wasn’t the only pollster to botch the 2016 election.

Trump’s victory should teach would-be data scientists this important lesson:

Diversity is important in designing data collection

Some of the reasons given for the failure of prediction in this election:

  1. People without regular voting records voted.
  2. People polled weren’t honest about their intended choices.
  3. Pollsters weren’t looking for a large, angry segment of the population.

All of which can be traced back to a lack of imagination/diversity in the preparation of the polling instruments.

Ironic isn’t it?

Strive for diversity, including people whose ideas you find distasteful.

Such as vocal Trump supporters. (Substitute your favorite villain.)

November 7, 2016

Election for Sale

Filed under: Government,MapD,Mapping,Politics — Patrick Durusau @ 8:23 pm

Election for Sale by Keir Clarke.

[Image: screenshot of MapD’s US Political Donations map]

MapD’s US Political Donations map allows you to explore the donations made to the Democratic and Republican parties dating back to 2001. The map includes a number of tools which allow you to filter the map by political party, by recipient and by date.

After filtering the map by party and date you can explore details of the donations received using the markers on the map. If you select the colored markers on the map you can view details on the amount of the donation, the name of the recipient & recipient’s party and the name of the donor. It is also possible to share a link to your personally filtered map.

The MapD blog has used the map to pick out a number of interesting stories that emerge from the map. These stories include an analysis of the types of donations received by both Hillary Clinton and Donald Trump.

An appropriate story for November 7th, the day prior to the U.S. Government sale day, November 8th.

It’s a great map but that isn’t to say it could not be enhanced by merging in other data.

While everyone acknowledges donations, especially small ones, are made for a variety of reasons, consistent and larger donations are made with an expectation of something in return.

One feature this map is missing: what did consistent and larger donors get in return?

Harder to produce and maintain than a map based on public campaign donation records but far more valuable to the voting public.

Imagine that level of transparency for the tawdry story of Hillary Clinton and Big Oil. How Hillary Clinton’s State Department Fought For Oil 5,000 Miles Away.

Apparent Browser Incompatibility: The MapD map loads fine with Firefox (49.0.2) but crashes with Chrome (Version 54.0.2840.90 (64-bit)) (Failed to load dashboard. TypeError: Cannot read property ‘resize’ of undefined). Both on Ubuntu 14.04.

Drip, Drip, Drip, Leaking At Wikileaks

Filed under: Hillary Clinton,Politics,Wikileaks — Patrick Durusau @ 3:58 pm

[Image: Wikileaks DNC email release]

Two days before the U.S. Presidential election, Wikileaks released 8,200 emails from the Democratic National Committee (DNC). Which were in addition to its daily drip, drip, drip leaking of emails from John Podesta, Hillary Clinton’s campaign chair.

The New York Times, a sometimes collaborator with Wikileaks (The War Logs (NYT)), has sponsored a series of disorderly and nearly incoherent attacks on Wikileaks for these leaks.

The dominant theme in those attacks is that readers should not worry their shallow and insecure minds about social media but rely upon media outlets to clearly state any truth readers need to know.

I am not exaggerating. The exact language that appears in one such attack was:

…people rarely act like rational, civic-minded automatons. Instead, we are roiled by preconceptions and biases, and we usually do what feels easiest — we gorge on information that confirms our ideas, and we shun what does not.

Is that how you think of yourself? It is how the New York Times thinks about you.

There are legitimate criticisms concerning Wikileaks and its drip, drip, drip leaking but the Times manages to miss all of them.

For example, the daily drops of Podesta emails, selected on some “unknown to the public” criteria, prevented the creation of a coherent narrative by reporters and the public. The next day’s leak might contain some critical link, or not.

Reporters, curators and the public were teased with drips and drabs of information, which served to drive traffic to the Wikileaks site, traffic that serves no public interest.

If that sounds ungenerous, consider that as the game draws to a close, Wikileaks has finally posted a link to the Podesta emails in bulk: https://file.wikileaks.org/file/podesta-emails/.

To be sure, some use has been made of the Podesta emails, including my work and that of others on DKIM signatures (unacknowledged by Wikileaks when it “featured” such verification on email webpages), graphs, etc., but an early bulk release of the emails would have enabled much more.

For example:

  • Concordances of the emails and merging those with other sources (see the sketch after this list)
  • Connecting the dots to public or known only to others data
  • Entity recognition and linking in extant resources and news stories
  • Fitting the events, people, places into coherent narratives
  • Sentiment analysis
  • etc.
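
To make the first suggestion concrete, here is a minimal concordance sketch in Python, mapping terms to the emails that mention them. The directory name is a placeholder and the tokenization is deliberately naive:

import collections
import email
import glob
import re

concordance = collections.defaultdict(set)

for name in glob.glob('podesta-emails/*.eml'):  # placeholder path
    with open(name, errors='ignore') as f:
        msg = email.message_from_file(f)
    body = msg.get_payload()
    if not isinstance(body, str):  # skip multipart messages in this sketch
        continue
    for term in re.findall(r'[A-Za-z]{4,}', body):
        concordance[term.lower()].add(name)

# terms that appear in the most emails, first
for term, names in sorted(concordance.items(), key=lambda kv: -len(kv[1]))[:25]:
    print(len(names), term)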

All of that lost because of the “Wikileaks look at me” strategy for releasing the Podesta emails.

I applaud Wikileaks obtaining and leaking data, including the Podesta emails, but a “look at me” strategy impairs the full exploration and use of leaked data.

Is that really the goal of Wikileaks?

PS: If you are interested in leaking without games or redaction, ping me. I’m interested in helping with such leaks.

November 5, 2016

Freedom of Speech/Press – Great For “Us” – Not So Much For You (Wikileaks)

Filed under: Free Speech,Politics,Social Media,Wikileaks — Patrick Durusau @ 8:33 pm

The New York Times, sensing a possible defeat of its neo-liberal agenda on November 8, 2016, has loosed the dogs of war on social media in general and Wikileaks in particular.

Consider the sleight of hand in Farhad Manjoo’s How the Internet Is Loosening Our Grip on the Truth, which argues on one hand,


You’re Not Rational

The root of the problem with online news is something that initially sounds great: We have a lot more media to choose from.

In the last 20 years, the internet has overrun your morning paper and evening newscast with a smorgasbord of information sources, from well-funded online magazines to muckraking fact-checkers to the three guys in your country club whose Facebook group claims proof that Hillary Clinton and Donald J. Trump are really the same person.

A wider variety of news sources was supposed to be the bulwark of a rational age — “the marketplace of ideas,” the boosters called it.

But that’s not how any of this works. Psychologists and other social scientists have repeatedly shown that when confronted with diverse information choices, people rarely act like rational, civic-minded automatons. Instead, we are roiled by preconceptions and biases, and we usually do what feels easiest — we gorge on information that confirms our ideas, and we shun what does not.

This dynamic becomes especially problematic in a news landscape of near-infinite choice. Whether navigating Facebook, Google or The New York Times’s smartphone app, you are given ultimate control — if you see something you don’t like, you can easily tap away to something more pleasing. Then we all share what we found with our like-minded social networks, creating closed-off, shoulder-patting circles online.

This gets to the deeper problem: We all tend to filter documentary evidence through our own biases. Researchers have shown that two people with differing points of view can look at the same picture, video or document and come away with strikingly different ideas about what it shows.

You caught the invocation of authority by Manjoo, “researchers have shown,” etc.

But did you notice he never shows his other hand?

If the public is so bat-shit crazy that it takes all social media content as equally trustworthy, what are we to do?

Well, that is the question isn’t it?

Manjoo invokes “dozens of news outlets” who are tirelessly but hopelessly fact checking on our behalf in his conclusion.

The strong implication is that without the help of “media outlets,” you are a bundle of preconceptions and biases doing what feels easiest.

“News outlets,” on the other hand, are free from those limitations.

You bet.

If you thought Manjoo was bad, enjoy seething through Zeynep Tufekci’s claims that Wikileaks is an opponent of privacy, sponsor of censorship and opponent of democracy, all in a little over 1,000 words (1,069 by exact count). Wikileaks Isn’t Whistleblowing.

It’s a breathtaking piece of half-truths.

For example, playing for your sympathy, Tufekci invokes the need of dissidents for privacy. Even to the point of invoking the ghost of the former Soviet Union.

Tufekci overlooks, and hopes you do as well, that these emails weren’t from dissidents, but from people who traded in and on the whims and caprices at the pinnacles of American power.

Perhaps realizing that is too transparent a ploy, she recounts other data dumps by Wikileaks to which she objects. As lawyers say, if the facts are against you, pound on the table.

In an echo of Manjoo, did you know you are too dumb to distinguish critical information from trivial?

Tufekci writes:


These hacks also function as a form of censorship. Once, censorship worked by blocking crucial pieces of information. In this era of information overload, censorship works by drowning us in too much undifferentiated information, crippling our ability to focus. These dumps, combined with the news media’s obsession with campaign trivia and gossip, have resulted in whistle-drowning, rather than whistle-blowing: In a sea of so many whistles blowing so loud, we cannot hear a single one.

I don’t think you are that dumb.

Do you?

But who will save us? You can guess Tufekci’s answer, but here it is in full:


Journalism ethics have to transition from the time of information scarcity to the current realities of information glut and privacy invasion. For example, obsessively reporting on internal campaign discussions about strategy from the (long ago) primary, in the last month of a general election against a different opponent, is not responsible journalism. Out-of-context emails from WikiLeaks have fueled viral misinformation on social media. Journalists should focus on the few important revelations, but also help debunk false misinformation that is proliferating on social media.

If you weren’t frightened into agreement by the end of her parade of horrors:


We can’t shrug off these dangers just because these hackers have, so far, largely made relatively powerful people and groups their targets. Their true target is the health of our democracy.

So now Wikileaks is gunning for democracy?

You bet. 😉

Journalists of my youth, think Vietnam, Watergate, were aggressive critics of government and the powerful. The Panama Papers project is evidence that level of journalism still exists.

Instead of whining about releases by Wikileaks and others, journalists* need to step up and provide context they see as lacking.

It would sure beat the hell out of repeating news releases from military commanders, “justice” department mouthpieces, and official but “unofficial” leaks from the American intelligence community.

* Like any generalization this is grossly unfair to the many journalists who work on behalf of the public every day but lack the megaphone of the government lapdog New York Times. To those journalists, and only them, do I apologize in advance for any offense given. The rest of you, take such offense as is appropriate.

Clinton/Podesta Map (through #30)

Filed under: Data Mining,Hillary Clinton,Politics,Wikileaks — Patrick Durusau @ 7:26 pm

Charlie Grapski created Navigating Wikileaks: A Guide to the Podesta Emails.

[Image: sample page from Navigating Wikileaks: A Guide to the Podesta Emails]

The listing takes 365 pages to date, so this is just a tiny sample image.

I don’t have a legend for the row coloring but have tweeted to Charlie about the same.

Enjoy!

November 1, 2016

How To DeDupe Clinton/Weiner/Abedin Emails….By Tomorrow

Filed under: FBI,Hillary Clinton,Politics — Patrick Durusau @ 1:43 pm

The report by Halimah Abdullah, FBI Working to Winnow Through Emails From Anthony Weiner’s Laptop, casts serious doubt on the technical prowess of the FBI when it says:


Officials have been combing through the emails since Sunday night — using a program designed to find only the emails to and from Abedin within the time when Clinton was secretary of state. Agents will compare the latest batch of messages with those that have already been investigated to determine whether any classified information was sent from Clinton’s server.

This process will take some time, but officials tell NBC News that they hope that they will wrap up the winnowing process this week.

Since Sunday night?

Here’s how the FBI, using standard Unix tools, could have finished the “winnowing” in time for the Monday evening news cycle (a rough Python sketch follows the list):

  1. Transform (if not already) all the emails into .eml format (to give you separate files for each email).
  2. Grep the resulting file set for emails that contain the Clinton email server by name or address.
  3. Save the result of #2 to a file and copy all those messages to a separate directory.
  4. Extract the digital signature from each of the copied messages (see below), save to the Abedin file digital signature + file name where found.
  5. Extract the digital signatures from previously reviewed Clinton email server emails, save digital signatures only to the prior-Clinton-review file.
  6. Search for each digital signature in the Abedin file in the prior-Clinton-review file. If found, reviewed. If not found, new email.
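
Here is a rough Python sketch of steps 2 through 6, in the spirit of the script further down this page. The server string and file names are placeholders and the b= extraction is deliberately crude:

import email
import glob
import re

SERVER = 'clintonemail.com'  # placeholder for the Clinton server name/address

def dkim_b(msg):
    # pull the b= tag (the signature proper) from the DKIM-Signature header
    header = msg.get('DKIM-Signature', '')
    m = re.search(r'\bb=([^;]+)', header)
    return re.sub(r'\s+', '', m.group(1)) if m else None

# digital signatures saved from the previously reviewed Clinton server emails
with open('prior-clinton-review.txt') as f:
    reviewed = set(line.strip() for line in f)

for name in glob.glob('*.eml'):
    with open(name, errors='ignore') as f:
        text = f.read()
    if SERVER not in text:
        continue  # step 2: not to or from the Clinton email server
    sig = dkim_b(email.message_from_string(text))
    # steps 4-6: previously reviewed or new?
    print(name, 'reviewed' if sig and sig in reviewed else 'new')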

The digital signatures are unique to each email and can therefore be used to dedupe or in this case, identify previously reviewed emails.

Here’s a DKIM example signature:

How can I read the DKIM header?

Here is an example DKIM signature (recorded as an RFC2822 header field) for the signed message:

DKIM-Signature: a=rsa-sha1; q=dns;
d=example.com;
i=user@eng.example.com;
s=jun2005.eng; c=relaxed/simple;
t=1117574938; x=1118006938;
h=from:to:subject:date;
b=dzdVyOfAKCdLXdJOc9G2q8LoXSlEniSb
av+yuU4zGeeruD00lszZVoG4ZHRNiYzR

Let’s take this piece by piece to see what it means. Each “tag” is associated with a value.

  • b = the actual digital signature of the contents (headers and body) of the mail message
  • bh = the body hash
  • d = the signing domain
  • s = the selector
  • v = the version
  • a = the signing algorithm
  • c = the canonicalization algorithm(s) for header and body
  • q = the default query method
  • l = the length of the canonicalized part of the body that has been signed
  • t = the signature timestamp
  • x = the expire time
  • h = the list of signed header fields, repeated for fields that occur multiple times

We can see from this email that:

  • The digital signature is dzdVyOfAKCdLXdJOc9G2q8LoXSlEniSb
    av+yuU4zGeeruD00lszZVoG4ZHRNiYzR
    .
    This signature is matched with the one stored at the sender’s domain.
  • The body hash is not listed.
  • The signing domain is example.com.
    This is the domain that sent (and signed) the message.
  • The selector is jun2005.eng.
  • The version is not listed.
  • The signing algorithm is rsa-sha1.
    This is the algorithm used to generate the signature.
  • The canonicalization algorithm(s) for header and body are relaxed/simple.
  • The default query method is DNS.
    This is the method used to look up the key on the signing domain.
  • The length of the canonicalized part of the body that has been signed is not listed.
    The signing domain can generate a key based on the entire body or only some portion of it. That portion would be listed here.
  • The signature timestamp is 1117574938.
    This is when it was signed.
  • The expire time is 1118006938.
    Because an already signed email can be reused to “fake” the signature, signatures are set to expire.
  • The list of signed header fields includes from:to:subject:date.
    This is the list of fields that have been “signed” to verify that they have not been modified.

From: What is DKIM? Everything You Need to Know About Digital Signatures by Geoff Phillips.

Altogether now, to eliminate previously reviewed emails we need only compare:

dzdVyOfAKCdLXdJOc9G2q8LoXSlEniSbav+yuU4zGeeruD00lszZVoG4ZHRNiYzR (example, use digital signatures from Abedin file)

to the digital signatures in the prior-Clinton-review file.

Those that don’t match are new files to review.

Why the news media hasn’t pressed the FBI on its extremely poor data processing performance is a mystery to me.

You?

PS: FBI field agents with data mining questions, I do off-your-books freelance consulting. Apologies but on-my-books for the tax man. If they don’t tell, neither will I.

October 31, 2016

Clinton/Podesta Emails 23 and 24, True or False? Cryptographically Speaking

Filed under: Data Mining,Hillary Clinton,News,Politics,Reporting — Patrick Durusau @ 10:21 am

Catching up on the Clinton/Podesta email releases from Wikileaks, via Michael Best, NatSecGeek. Michael bundles the releases up and posts them at: Podesta emails (zipped).

For anyone coming late to the game, DKIM “verified” means that the DKIM signature on an email is valid for that email.

In lay person’s terms, that email has been proven by cryptography to have originated from a particular mail server and when it left that mail server, it read exactly as it does now, i.e., no changes by Russians or others.

What I have created are files that list the emails in the order they appear at Wikileaks, with the very next field being True or False on the verification issue.

Just because an email has “False” in the second column doesn’t mean it has been modified or falsified by the Russians.

DKIM signatures fail for all manner of reasons but when they pass, you have a guarantee the message is intact as sent.
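
If you want to check a message yourself, verification is a single call with the Python dkim library (dkimpy), the same call the scripts further down this page rely on. The file name is only an example:

import dkim

with open('some-email.eml', 'rb') as f:  # example file name
    data = f.read()

# True only if the DKIM signature validates for the message exactly as-is
print(dkim.verify(data))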

For your research into these emails:

DKIM-complete-podesta-23.txt.gz

and

DKIM-complete-podesta-24.txt.gz.

For release 24, I did have to remove the DKIM signature on 39256 00010187.eml in order for the script to succeed. That is the only modification I made to either set of files.

October 30, 2016

Clinton/Podesta Emails – Towards A More Complete Graph (Part 3) New Dump!

Filed under: Data Mining,Graphs,Hillary Clinton,Politics — Patrick Durusau @ 8:07 pm

As you may recall from Clinton/Podesta Emails – Towards A More Complete Graph (Part 2), I didn’t check to see if “|” was in use as a separator in the extracted emails’ subject lines, so when I tried to create node lists based on “|” as a separator, it failed.

That happens. More than many people are willing to admit.

In the meantime, a new dump of emails has arrived so I created the new DKIM-incomplete-podesta-1-22.txt.gz file. Which meant picking a new separator to use for the resulting file.

Advice: Check your proposed separator against the data file before using it. I forgot; you shouldn’t.

My new separator? |/|

Which I checked against the file to make sure there would be no conflicts.

The sed commands to remove < and > are the same as in Part 2.

Sigh, back to failure land, again.

Just as one sample:

awk 'FS="|/|" { print $7}' test.me

where test.me is:

9991 00013434.eml|/|False|/|2015-11-21 17:15:25-05:00|/|Eryn Sepp eryn.sepp@gmail.com|/|John Podesta john.podesta@gmail.com|/|Re: Nov 30 / Future Plans / Etc.!|/|8A6B3E93-DB21-4C0A-A548-DB343BD13A8C@gmail.com

returns:

Future Plans

I also checked that with gawk and nawk, with the same result.

All three treat the first “/” in field 6 (by my count) as a separator, along with the second “/” in that field. The reason: a multi-character FS is interpreted as a regular expression, and an unescaped “|” is the regex alternation operator, so FS="|/|" in effect matches every bare “/”. (Escaping the pipes, e.g. FS="[|]/[|]", would avoid this.)

To confirm that every “/” acts as a separator, what do you think { print $8 } will return?

You’re right!

Etc.!|

So with the “|/|” separator, I’m going to have at least 9 fields, perhaps more, depending on how many “/” characters occur in the subject line.

🙁

That’s not going to work.

OK, so I toss the 10+ MB DKIM-complete-podesta-1-22.txt.gz into Emacs, whose regex treatment I trust, and change “|/|” to “@@@@@” and save that file as DKIM-complete-podesta-1-22-03.txt.

Another sanity check, which got us into all this trouble last time:

awk 'FS="@@@@@" { print $7}' podesta-1-22-03.txt | grep @ | wc -l

returns 36504, which plus the 16 files I culled as failures, equals 36520, the number of files in the Podesta 1-22 release.

Recall that all message-ids contain an @ sign, so getting the correct count of files gives us confidence the file is ready for further processing.

Apologies for it taking this much prose to go so little a distance.

Our fields (numbered for reference) are:

ID – 1| Verified – 2| Date – 3| From – 4| To – 5| Subject – 6| Message-Id – 7

Our first node for the node list (Clinton/Podesta Emails – Towards A More Complete Graph (Part 1)) was to capture the emails themselves.

Using Message-Id (field 7) as the identifier and Subject (field 6) as its label.

We are about to encounter another problem but let’s walk through it.

An example of what we are expecting:

CAC9z1zL9vdT+9FN7ea96r+Jjf2=gy1+821u_g6VsVjr8U2eLEg@mail.gmail.com;”Knox Knotes”;
CAKM1B-9+LQBXr7dgE0pKke7YhQC2dZ2akkgmSbRFGHUx-0NNPg@mail.gmail.com;”Re: Tomorrow”;

We have the Message-Id with a closing “;”, followed by the Subject, surrounded in double quote marks and also terminated by a “;”.

FYI: Mixing single and double quotes in awk is a real pain. I struggled with it but then was reminded I can declare variables:

-v dq='"'

which allows me to do this:

awk -v dq='"' 'FS="@@@@@" { print $7 ";" dq $6 dq ";"}' podesta-1-22-03.txt

The awk variable trick will save you considerable puzzling over escape sequences and the like.

Ah, now we are to the problem I mentioned above.

In the part 1 post I mentioned that while:

CAC9z1zL9vdT+9FN7ea96r+Jjf2=gy1+821u_g6VsVjr8U2eLEg@mail.gmail.com;”Knox Knotes”;
CAKM1B-9+LQBXr7dgE0pKke7YhQC2dZ2akkgmSbRFGHUx-0NNPg@mail.gmail.com;”Re: Tomorrow”;

works,

but having:

CAC9z1zL9vdT+9FN7ea96r+Jjf2=gy1+821u_g6VsVjr8U2eLEg@mail.gmail.com;”Knox Knotes”;https://wikileaks.org/podesta-emails/emailid/9998;
CAKM1B-9+LQBXr7dgE0pKke7YhQC2dZ2akkgmSbRFGHUx-0NNPg@mail.gmail.com;”Re: Tomorrow”;https://wikileaks.org/podesta-emails/emailid/9999;

with Wikileaks links is more convenient for readers.

As you may recall, the last two lines read:

9998 00022160.eml@@@@@False@@@@@2015-06-23 23:01:55-05:00@@@@@Jerome Tatar jerry@TatarLawFirm.com@@@@@Jerome Tatar Jerome jerry@tatarlawfirm.com@@@@@Knox Knotes@@@@@CAC9z1zL9vdT+9FN7ea96r+Jjf2=gy1+821u_g6VsVjr8U2eLEg@mail.gmail.com
9999 00013746.eml@@@@@False@@@@@2015-04-03 01:14:56-04:00@@@@@Eryn Sepp eryn.sepp@gmail.com@@@@@John Podesta john.podesta@gmail.com@@@@@Re: Tomorrow@@@@@CAKM1B-9+LQBXr7dgE0pKke7YhQC2dZ2akkgmSbRFGHUx-0NNPg@mail.gmail.com

Which means in addition to printing Message-Id and Subject as fields one and two, we need to split ID on the space and use the result to create the URL back to Wikileaks.
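
For anyone who would rather skip the awk quoting games, a short Python sketch of the same node list, Wikileaks links included (file and field layout as described above):

# read the @@@@@-delimited file, emit Message-Id;"Subject";URL; lines
with open('podesta-1-22-03.txt') as f:
    for line in f:
        fields = line.rstrip('\n').split('@@@@@')
        if len(fields) != 7:
            continue  # skip malformed lines
        email_id = fields[0].split()[0]  # "9999 00013746.eml" -> "9999"
        print('%s;"%s";https://wikileaks.org/podesta-emails/emailid/%s;'
              % (fields[6], fields[5], email_id))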

It’s late so I am going to leave you with DKIM-incomplete-podesta-1-22.txt.gz. This is complete save for 16 files that failed to parse. Will repost tomorrow with those included.

I have the first node file script working and that will form the basis for the creation of the edge lists.

PS: Look forward to running awk files tomorrow. It makes a number of things easier.

Exploding the Dakota Access Pipeline Target List

Filed under: #DAPL,Data Mining,Government,Politics — Patrick Durusau @ 2:05 pm

Who Is Funding the Dakota Access Pipeline? Bank of America, HSBC, UBS, Goldman Sachs, Wells Fargo by Amy Goodman and Juan González.

Great discussion of the funding behind the North Dakota Pipeline project!

They point to two important graphics, the first from: Who’s Banking on the Dakota Access Pipeline?:

[Interactive map embed: view this map on LittleSis]

Thanks for the easy embedding! (Best viewed at view this map on LittleSis)

And, from Who’s Banking on the Dakota Access Pipeline? (Food & Water Watch):

[Image: Food & Water Watch graphic, Who’s Banking on the Dakota Access Pipeline?]

The full scale Food & Water Watch image.

Both visualizations are great ways to see some of the culprits responsible for the Dakota Access Pipeline, but not all.

Tracking the funding from Bakken Dakota Access Pipeline back to, among others, Citibank, Credit Agricole, ING Bank, and Natixis, should be a clue as to the next step.

All of the sources of financing, Citibank, Credit Agricole, ING Bank, Natixis, etc., are owned, one way or another, by investors. Moreover, as abstract entities, they cannot act without the use of agents, both as staff and as contractors.

If you take the financing entities as nodes (the first visualization), those should explode into both investor/owners and staff/agents, who do their bidding.

Citibank, for example, is too large and diffuse a target for effective political, social or economic pressure, but the smaller the part, the greater the chance of having influence.

It’s true some nation states might be able to call Citibank to heel and if you can whistle one up, give it a shot. But while waiting on you to make your move, the rest of us should be looking for targets more within our reach.

That lesson, financiers exploding into more manageable targets (don’t overlook their political allies and their extended target lists), holds equally for the staffs and agents of Sunoco Logistics, Energy Transfer Partners, Energy Equity Transfer (a sister partnership to Energy Transfer Pipeline), and Bakken Dakota Access Pipeline.

I have yet to see an abstract entity drive a bulldozer, move pipe, etc. Despite the popular fiction that a corporation is a person, it’s somebody on the ground violating the earth, poisoning the water, disturbing sacred ground, all for the benefit of some other natural person.

Corporations, being abstract entities, cannot feel pressure. Their staffs and contractors, on the other hand, don’t have that immunity.

It will be post-election in the US, but I’m interested in demonstrating, and assisting others in demonstrating, how to explode these target lists in both directions.

As always, your comments and suggestions are most welcome.

PS: One source for picking up names and entities would be: Bakken.com, which self-describes as:

Bakken.com is owned and operated by Energy Media Group which is based in Fargo North Dakota. Bakken.com was brought to life to fill a gap in the way that news was brought to the people on this specific energy niche. We wanted to have a place where people could come for all the news related to the Bakken Shale. Energy Media group owns several hundred websites, most of which surround shale formations all over the world. Our sister site Marcellus.com went live at the beginning of December and already has a huge following. In the coming months, Bakken.com will be a place not only for news, but for jobs and classifieds as well. Thank-you for visiting us today, and make sure to sign up for our news letter to get the latest updates and also Like us on Facebook.

To give you the full flavor of their coverage: Oil pipeline protester accused of terrorizing officer:

North Dakota authorities have issued an arrest warrant accusing a pipeline protester on horseback of charging at a police officer.

Mason Redwing, of Fort Thompson, South Dakota, is wanted on felony charges of terrorizing and reckless endangerment in the Sept. 28 incident near St. Anthony. He’s also wanted on a previous warrant for criminal trespass.

The Morton County Sheriff’s Office says the officer shouted a warning and pointed a shotgun loaded with non-lethal beanbag rounds to defuse the situation.

I’m on a horse and you have a shotgun. Who is it that is being terrorized?

Yeah, it’s that kind of reporting.

October 29, 2016

Digital Redlining At Facebook

Filed under: Government,Mapping,Maps,Politics — Patrick Durusau @ 4:46 pm

“Redlining” has gone digital.

Facebook Lets Advertisers Exclude Users by Race by Julia Angwin and Terry Parris Jr. illustrates my point that improved technology isn’t making us better people, it’s enabling our bigotry to be practiced in new and more efficient ways.

Angwin and Parris write:

Imagine if, during the Jim Crow era, a newspaper offered advertisers the option of placing ads only in copies that went to white readers.

That’s basically what Facebook is doing nowadays.

The ubiquitous social network not only allows advertisers to target users by their interests or background, it also gives advertisers the ability to exclude specific groups it calls “Ethnic Affinities.” Ads that exclude people based on race, gender and other sensitive factors are prohibited by federal law in housing and employment.

It’s a great read and Facebook points out that it wags its policy finger at use of:

…the targeting options for discrimination, harassment, disparagement or predatory advertising practices.

“We take a strong stand against advertisers misusing our platform: Our policies prohibit using our targeting options to discriminate, and they require compliance with the law,” said Steve Satterfield, privacy and public policy manager at Facebook. “We take prompt enforcement action when we determine that ads violate our policies.”

Bigots near and far are shaking in their boots, just thinking about the policy finger of Facebook.

In discussion of this modernized form of “redlining,” it may be helpful to know the origin of the term and its impact on society.

Here’s a handy synopsis of the practice:


The FHA also explicitly practiced a policy of “redlining” when determining which neighborhoods to approve mortgages in. Redlining is the practice of denying or limiting financial services to certain neighborhoods based on racial or ethnic composition without regard to the residents’ qualifications or creditworthiness. The term “redlining” refers to the practice of using a red line on a map to delineate the area where financial institutions would not invest (see residential security maps).

The FHA allowed personal and agency bias in favor of all white suburban subdivisions to affect the kinds of loans it guaranteed, as applicants in these subdivisions were generally considered better credit risks. In fact, according to James Loewen in his 2006 book Sundown Towns, FHA publications implied that different races should not share neighborhoods, and repeatedly listed neighborhood characteristics like “inharmonious racial or nationality groups” alongside such noxious disseminates as “smoke, odors, and fog.” One example of the harm done by the FHA is as follows:

In the late 1930’s, as Detroit grew outward, white families began to settle near a black enclave adjacent to Eight Mile Road. By 1940, the blacks were surrounded, but neither they nor the whites could get FHA insurance because of the proximity of an inharmonious racial group. So, in 1941, an enterprising white developer built a concrete wall between the white and black areas. The FHA appraisers then took another look and approved the mortgages on the white properties.

Yes, segregated housing was due in part to official U.S. (not Southern) government policies.

I live near Atlanta, GA. so here’s a portion of an actual “redlining” map:

[Image: portion of a redlining map of Atlanta, GA]

You can see the full version here.

Racially segregated housing wasn’t a matter of chance or birds of a feather, it was official government policy. Public government policy. They lacked the moral sensitivity to be ashamed of their actions.

There are legitimate targeting ad decisions.

Showing me golf club ads is a lost cause. 😉 As with a number of similar items.

But when does race become a legitimate exclusion category? And for what products?

For more historical data on the Home Owners’ Loan Corporation and a multitude of maps, see: Digital HOLC Maps by LaDale Winling. You may also enjoy his main site: Urban Oasis.

Just so you know, redlining isn’t a racist practice of the distant past. Redlining, a/k/a, housing discrimination, is alive and well today.

Does a 50% discrimination rate in Boston (Mass.) sound like it remains a problem?

PS: New Clinton/Podesta posts are coming! I’m posting while my scripts run in the background. New 3.5 GB dump.

October 26, 2016

No Frills Gephi (8.2) Import of Clinton/Podesta Emails (1-18)

Filed under: Gephi,Graphs,Hillary Clinton,Neo4j,Networks,Politics,Visualization — Patrick Durusau @ 7:19 pm

Using Gephi 8.2, you can create graphs of the Clinton/Podesta emails based on terms in subject lines or the body of the emails. You can interactively work with all 30K+ (as of today) emails and extract networks based on terms in the posts. No programming required. (Networks based on terms will appear tomorrow.)

If you have Gephi 8.2 (I can’t find the import spigot in 9.0 or 9.1), you can import the Clinton/Podesta Emails (1-18) for analysis as a network.

To save you the trouble of regressing to Gephi 8.2, I performed a no frills/default import and exported that file as podesta-1-18-network.gephi.gz.

Download and uncompress podesta-1-18-network.gephi.gz, then you can pick up at time mark 3:49.

Open the file (your location may differ):

[Image: Gephi open-file dialog]

Obligatory hair-ball graph visualization. 😉

[Image: initial graph visualization]

Considerably less appealing than Jennifer Golbeck’s but be patient!

First step, Layout -> Yifan Hu. My results:

[Image: graph after Yifan Hu layout]

Second step, Network Diameter statistics (right side, run).

No visible impact on the graph but now you can change the color and size of nodes in the graph. That is, they have attributes on which you can base the assignment of color and size.

Tutorial gotcha: Not one of Jennifer’s tutorials, but I was watching a Gephi tutorial that skipped the part about running statistics on the graph prior to assigning color and size. Or I just didn’t hear it. The menu options appear in the documentation but you can’t access them unless and until you run network statistics or have attributes for the assignment of color and size. Run statistics first!

Next, assign colors based on betweenness centrality:

[Image: graph colored by betweenness centrality]

The densest node is John Podesta, but if you remove his node, rerun the network statistics and re-layout the graph, here is part of what results:

[Image: graph after removing the Podesta node]

A no frills import of 31,819 emails results in a graph of 3235 nodes and 11,831 edges.

That’s because nodes and edges combine (merge, for you topic map readers) when they have the same identifier or, for edges, run between the same nodes.

Subject to correction, when that combining/merging occurs, the properties on the respective nodes/edges are accumulated.

Topic mappers already realize there are important subjects missing, some 31,819 of them. That is, the emails themselves don’t, by default, appear as nodes in the network.

Ian Robinson, Jim Webber & Emil Eifrem illustrate this lossy modeling in Graph Databases this way:

[Image: lossy email modeling illustration from Graph Databases]

Modeling emails without the emails is rather lossy. 😉

Other nodes/subjects we might want:

  • Multiple to: emails – Is who was also addressed important?
  • Multiple cc: emails – Same question as with to:.
  • Date sent as properties? So evolution of network/emails can be modeled.
  • Capture “reply-to” for relationships between emails?

Other modeling concerns?

Bear in mind that we can suppress a large amount of the detail so you can interactively explore the graph and only zoom into/display data after finding interesting patterns.

Some helpful links:

https://archive.org/details/PodestaEmailszipped: The email collection as bulk download, thanks to Michael Best, @NatSecGeek.

https://github.com/gephi/gephi/releases: Where you can grab a copy of Gephi 8.2.

Spying On Government Oppression: Est. Oct. 25 – Nov. 4, 2016 – #NoDAPL camps (North Dakota)

Filed under: Government,Maps,Politics — Patrick Durusau @ 1:02 pm

Police forces have suborned the FAA into declaring a no-fly zone in a seven (7) mile radius of #NoDAPL camps in North Dakota.

Truthful video of unprovoked violence against peaceful protesters may “interfere with the election,” and/or with the continuance of this cultural/environmental/social outrage.

The FAA posted this helpful map:

[Image: FAA temporary flight restriction map, North Dakota]

James Peach reports contact information in FAA Issues “No Fly Zone” Over Area of DAPL Protests at Standing Rock, ND:

FAA Regional Office
Contact: Laurie Suttmeier
Telephone: (701)-667-3224

“No Fly Zone” Notice –
FAA Temporary Flight Restrictions

Morton County Sheriff’s Department
Facebook Page: Morton County Sheriff’s Department on FB
Contact: Sheriff Kyle Kirchmeier
Telephone: (701) 667-3330

In case you are behind on this particular government crime against a sovereign people and its own citizens, catch up with: The fight over the Dakota Access Pipeline, explained by Brad Plumer or Life in the Native American oil protest camps (BBC).

The deployment of government forces is fluid so drones are essential to effective resistance.

And to capture and stream government atrocities in real time.

Small wonder the FAA is a co-conspirator in this criminal enterprise.

October 24, 2016

Clinton/Podesta Emails, Dirty Data, Dirty Script For Testing

Filed under: Data Mining,Hillary Clinton,Politics — Patrick Durusau @ 9:05 pm

Despite Michael Best’s (@NatSecGeek) efforts at collecting the Podesta emails for convenient bulk download, Podesta Emails Zipped, the bulk downloads don’t appear to have attracted a lot of attention. Some 276 views as of today.

Many of us deeply appreciate Michael’s efforts and would like to see the press and others taking fuller advantage of this remarkable resource.

To encourage you in that direction, what follows is a very dirty script for testing the DKIM signatures in the emails and extracting data from the emails for writing to a “|” delimited file.

#!/usr/bin/python

import dateutil.parser
import email
import dkim
import glob

output = open("verify.txt", 'w')

output.write("id|verified|date|from|to|subject|message-id\n")

for name in glob.glob('*.eml'):
    filename = name
    f = open(filename, 'r')
    data = f.read()
    msg = email.message_from_string(data)

    # True only if the DKIM signature validates for the message as-is
    verified = dkim.verify(data)

    date = dateutil.parser.parse(msg['date'])

    # collapse folded headers (embedded newlines/whitespace) to one line
    msg_from = msg['from']
    msg_from1 = " ".join(msg_from.split())
    msg_to = str(msg['to'])
    msg_to1 = " ".join(msg_to.split())
    msg_subject = str(msg['subject'])
    msg_subject1 = " ".join(msg_subject.split())
    msg_message_id = msg['message-id']

    output.write(filename + '|' + str(verified) + '|' + str(date) +
                 '|' + msg_from1 + '|' + msg_to1 + '|' + msg_subject1 +
                 '|' + str(msg_message_id) + "\n")

output.close()

Download podesta-test.tar.gz, unpack that to a directory and then save/unpack test-clinton-script-24Oct2016.py.gz to the same directory, then:

python test-clinton-script-24Oct2016.py

Import that into Gnumeric and with some formatting, your content should look like: test-clinton-24Oct2016.gnumeric.gz.

Verifying cryptographic signatures takes a moment, even on this sample of 754 files, so don’t be impatient.

This script leaves much to be desired and as you can see, the results aren’t perfect by any means.

Comments and/or suggestions welcome!

This is just the first step in extracting information from this data set that could be used with similar data sets.

For example, if you want to graph this data, how are you going to construct IDs for the nodes, given the repetition of some nodes in the data set?

How are you going to model those relationships?

Bonus question: Is this output clean enough to run the script on the full data set? Which is increasing on a daily basis?

October 23, 2016

Political Noise Data (Tweets From 3rd 2016 Presidential Debate)

Filed under: Government,Politics,Tweets,Twitter — Patrick Durusau @ 12:42 pm

Chris Albon has collected data on 885,222 debate tweets from the third Presidential Debate of 2016.

As you can see from the transcript, it wasn’t a “debate” in any meaningful sense of the term.

The quality of tweets about that debate is equally questionable.

However, the people behind those tweets vote, buy products, click on ads, etc., so despite my title description as “political noise data,” it is important political noise data.

To conform to Twitter terms of service, Chris provides the relevant tweet ids and a script to enable construction of your own data set.
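
For the curious, rehydrating tweet ids generally looks something like the sketch below. It assumes the tweepy library, valid Twitter API credentials and a tweet_ids.txt file, all placeholders here rather than Chris’s actual script:

import tweepy

# placeholder credentials
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth, wait_on_rate_limit=True)

with open('tweet_ids.txt') as f:
    ids = [line.strip() for line in f if line.strip()]

# the statuses/lookup endpoint accepts at most 100 ids per request
for i in range(0, len(ids), 100):
    for status in api.statuses_lookup(ids[i:i + 100]):
        print(status.id, status.text.replace('\n', ' '))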

BTW, Chris includes his Twitter mining scripts.

Enjoy!

October 19, 2016

#Truth2016 – The year when truth “interfered” with a democratic election.

Filed under: Government,Politics — Patrick Durusau @ 3:27 pm

Unless you have been in solitary confinement or a medically induced coma for the last several weeks, you are aware that Wikileaks has been accused of “interfering” with the 2016 US presidential election.

The crux of that complaint is the release by Wikileaks of a series of emails collectively known as the Podesta Emails, which are centered on the antics of Hillary Clinton and her crew as she runs for the presidency.

The untrustworthy who made these accusations include the Department of Homeland Security and the Office of the Director of National Intelligence on Election Security. In a no-facts-revealed statement, Joint Statement from the Department Of Homeland Security and Office of the Director of National Intelligence on Election Security, the claim of interference is made but not substantiated.

The cry of “interference” has been taken up by an uncritical media and echoed by President Barack Obama.

There’s just one problem.

We know who was sent the emails in question and despite fanciful casting of doubt on their accuracy, out of hundreds of participants, not one, nary one, has stepped forward with an original email to prove these are false.

Simple enough to ask some third-party expert to retrieve the emails in question from a server and then to compare to the Wikileaks releases.

But I have heard of no moves in that direction.

Have you?

The crux of the current line by the US government is that truthful documents may influence the coming presidential election. In a direction they don’t like.

Think about that for a moment: Truthful documents (in the sense of accuracy) interfering with a democratic election.

That makes me wonder what definition of “democratic” that Clinton, Obama and the media must share?

Not anything I would recognize as a democracy. You?

October 17, 2016

How To Read: “War Goes Viral” (with caution, propaganda ahead)

Filed under: Politics,Social Media — Patrick Durusau @ 7:14 pm

[Image: War Goes Viral illustration]

War Goes Viral – How social media is being weaponized across the world by Emerson T. Brooking and P. W. Singer.

One of the highlights of the post reads:


Perhaps the greatest danger in this dynamic is that, although information that goes viral holds unquestionable power, it bears no special claim to truth or accuracy. Homophily all but ensures that. A multi-university study of five years of Facebook activity, titled “The Spreading of Misinformation Online,” was recently published in Proceedings of the National Academy of Sciences. Its authors found that the likelihood of someone believing and sharing a story was determined by its coherence with their prior beliefs and the number of their friends who had already shared it—not any inherent quality of the story itself. Stories didn’t start new conversations so much as echo preexisting beliefs.

This extreme ideological segregation, the authors concluded, “comes at the expense of the quality of the information and leads to proliferation of biased narratives fomented by unsubstantiated rumors, mistrust, and paranoia.” As smartphone cameras and streaming video turn every bystander into a reporter (and everyone with an internet connection into an analyst), “truth” becomes a matter of emotional resonance.

Ooooh, “…’truth’ becomes a matter of emotional resonance.”

That is always true but, to give the authors their due, “War Goes Viral” is a masterful piece of propaganda to the contrary.

Calling something “propaganda” or “media bias” is easy and commonplace.

Let’s do the hard part and illustrate why that is the case with “War Goes Viral.”

The tag line:

How social media is being weaponized across the world

preps us to think:

Someone or some group is weaponizing social media.

So before even starting the article proper, we are prepared to be on the look out for the “bad guys.”

The authors are happy to oblige with #AllEyesOnISIS, first paragraph, second sentence. “The self-styled Islamic State…” appears in the second paragraph and ISIS in the third paragraph. Not much doubt who the “bad guys” are at this point in the article.

Listing only each change of current actors, “bad guys” in red, the article from start to finish names:

  • Islamic State
  • Russia
  • Venezuela
  • China
  • U.S. Army training to combat “bad guys”
  • Israel – neutral
  • Islamic State (Hussain)

The authors leave you with little doubt who they see as the “bad guys,” a one-sided view of propaganda and social media in particular.

For example, there is:

No mention of Voice of America (VOA), perhaps one of the longest running, continuous disinformation campaigns in history.

No mention of Pentagon admits funding online propaganda war against Isis.

No mention of any number of similar projects and programs which weren’t constructed with an eye on “truth and accuracy” by the United States.

The treatment here is as one-sided as the “weaponized” social media of which the authors complain.

Not that the authors are lacking in skill. They piggyback their own slant onto The Spreading of Misinformation Online:


This extreme ideological segregation, the authors concluded, “comes at the expense of the quality of the information and leads to proliferation of biased narratives fomented by unsubstantiated rumors, mistrust, and paranoia.” As smartphone cameras and streaming video turn every bystander into a reporter (and everyone with an internet connection into an analyst), “truth” becomes a matter of emotional resonance.

How much of that is supported by The Spreading of Misinformation Online?

  • First sentence
  • Second sentence
  • Both sentences

The answer is:

This extreme ideological segregation, the authors concluded, “comes at the expense of the quality of the information and leads to proliferation of biased narratives fomented by unsubstantiated rumors, mistrust, and paranoia.”

The remainder of that paragraph was invented out of whole cloth by the authors and positioned, with “truth” in quotes, to piggyback on the legitimate academic work just quoted.

As smartphone cameras and streaming video turn every bystander into a reporter (and everyone with an internet connection into an analyst), “truth” becomes a matter of emotional resonance.

That is popular cant among media and academic types, but no more than that.

Skilled reporting can put information in a broad context and weave a coherent narrative, but disparaging social media authors doesn’t make that any more likely.

“War Goes Viral” being a case in point.

October 16, 2016

Why I Distrust US Intelligence Experts, Let Me Count the Ways

Filed under: Government,Intelligence,Politics — Patrick Durusau @ 8:42 pm

Some US Intelligence failures, oldest to most recent:

  1. Pearl Harbor
  2. The Bay of Pigs Invasion
  3. Cuban Missile Crisis
  4. Vietnam
  5. Tet Offensive
  6. Yom Kippur War
  7. Iranian Revolution
  8. Soviet Invasion of Afghanistan
  9. Collapse of the Soviet Union
  10. Indian Nuclear Test
  11. 9/11 Attacks
  12. Iraq War (WMDs)
  13. Invasion of Afghanistan (US)
  14. Israeli moles in US intelligence, various dates

Those are just a few of the failures of US intelligence, some of which cost hundreds of thousands if not millions of lives.

Yet, you can read today: Trump’s refusal to accept intelligence briefing on Russia stuns experts.

There are only three reasons I can think of to accept findings by the US intelligence community:

  1. You are on their payroll and for that to continue, well, you know.
  2. As a member of the media, future tips/leaks depends upon your acceptance of current leaks. Anyone who mocks intelligence service lies is cut off from future lies.
  3. As a politician, the intelligence findings discredit facts unfavorable to you.

For completeness’ sake, I should mention that intelligence “experts” could be telling the truth but given their track record, it is an edge case.

Before repeating the mindless cant of “the Russians are interfering with the US election,” stop to ask your sources, “…based on what?” Opinions of all the members of the US intelligence community = one opinion. Ask for facts. No facts offered, report that instead of the common “opinion.”

October 13, 2016

Obama on Fixing Government with Technology (sigh)

Filed under: Government,Politics,Project Management — Patrick Durusau @ 4:27 pm

Obama on Fixing Government with Technology by Caitlin Fairchild.

Like any true technology cultist, President Obama mentions technology and inefficiency, but never the people who make up government as the source of government “problems.” Nor does he appear to realize that technology cannot fix the people who make up government.

Those out-dated information systems he alludes to were built and are maintained under contract with vendors. Systems that are used by users who are accustomed to those systems and will resist changing to others. Still other systems rely upon those systems being as they are in terms of work flow. And so on. At its very core, the problem of government isn’t technology.

It’s the twin requirement that it be composed of and supplied by people, all of whom have a vested interest and comfort level with the technology they use and, don’t forget, government has to operate 24/7, 365 days a year.

There is no time to take down part of the government to develop new technology, train users in its use and at the same time, run all the current systems which are, to some degree, meeting current requirements.

As an antidote to the technology cultism that infects President Obama and his administration, consider reading Geek Heresy, the description of which reads:

In 2004, Kentaro Toyama, an award-winning computer scientist, moved to India to start a new research group for Microsoft. Its mission: to explore novel technological solutions to the world’s persistent social problems. Together with his team, he invented electronic devices for under-resourced urban schools and developed digital platforms for remote agrarian communities. But after a decade of designing technologies for humanitarian causes, Toyama concluded that no technology, however dazzling, could cause social change on its own.

Technologists and policy-makers love to boast about modern innovation, and in their excitement, they exuberantly tout technology’s boon to society. But what have our gadgets actually accomplished? Over the last four decades, America saw an explosion of new technologies – from the Internet to the iPhone, from Google to Facebook – but in that same period, the rate of poverty stagnated at a stubborn 13%, only to rise in the recent recession. So, a golden age of innovation in the world’s most advanced country did nothing for our most prominent social ill.

Toyama’s warning resounds: Don’t believe the hype! Technology is never the main driver of social progress. Geek Heresy inoculates us against the glib rhetoric of tech utopians by revealing that technology is only an amplifier of human conditions. By telling the moving stories of extraordinary people like Patrick Awuah, a Microsoft millionaire who left his lucrative engineering job to open Ghana’s first liberal arts university, and Tara Sreenivasa, a graduate of a remarkable South Indian school that takes children from dollar-a-day families into the high-tech offices of Goldman Sachs and Mercedes-Benz, Toyama shows that even in a world steeped in technology, social challenges are best met with deeply social solutions.

Government is a social problem and to reach for a technology fix first, is a guarantee of yet another government failure.

October 8, 2016

Chasing File Names – Check My Work

Filed under: Government,Hillary Clinton,Politics — Patrick Durusau @ 9:04 pm

I encountered a stream of tweets of which the following are typical:

[Image: Guccifer 2.0 tweets referencing cf.7z]

Hmmm, is cf.7z a different set of files from ebd-cf.7z?

You could “eye-ball” the directory listings but that is tedious and error-prone.

Building on what we saw in Guccifer 2.0’s October 3rd 2016 Data Drop – Old News? (7 Duplicates out of 2085 files), let’s combine cf-7z-file-Sorted-Uniq.txt and ebd-cf-file-Sorted-Uniq.txt, and sort that file into cf-7z-and-ebd-cf-files-Sorted.txt.

Running

uniq -d cf-7z-and-ebd-cf-files-Sorted.txt | wc -l

(“-d” for duplicate lines) on the resulting file, piping it into wc -l, will give you the result of 2177 duplicates. (The total length of the file is 4354 lines.)

Running

uniq -u cf-7z-and-ebd-cf-files-Sorted.txt

(“-u” for unique lines), will give you no return (no unique lines).

With experience, you will be able to check very large file archives for duplicates. In this particular case, despite circulating under different names, it appears these two archives contain the same files.
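
The same check takes only a few lines of Python, using the two sorted listing files from above:

# load each archive's file listing into a set
with open('cf-7z-file-Sorted-Uniq.txt') as f:
    cf = set(f.read().splitlines())
with open('ebd-cf-file-Sorted-Uniq.txt') as f:
    ebd = set(f.read().splitlines())

print(len(cf & ebd), 'file names in common')              # expect 2177
print(len(cf ^ ebd), 'file names unique to one archive')  # expect 0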

BTW, do you think a similar technique could be applied to spreadsheets?

October 7, 2016

The “Fact Free” U.S. Intelligence Community (USIC)

Filed under: Cybersecurity,Government,Politics — Patrick Durusau @ 7:49 pm

The Joint Statement from the Department of Homeland Security and Office of the Director of National Intelligence on Election Security is a reminder of why the U.S. Intelligence Community (USIC) fails so very often.

The first paragraph:

The U.S. Intelligence Community (USIC) is confident that the Russian Government directed the recent compromises of e-mails from US persons and institutions, including from US political organizations. The recent disclosures of alleged hacked e-mails on sites like DCLeaks.com and WikiLeaks and by the Guccifer 2.0 online persona are consistent with the methods and motivations of Russian-directed efforts. These thefts and disclosures are intended to interfere with the US election process. Such activity is not new to Moscow—the Russians have used similar tactics and techniques across Europe and Eurasia, for example, to influence public opinion there. We believe, based on the scope and sensitivity of these efforts, that only Russia’s senior-most officials could have authorized these activities.

Do you see any facts in that first paragraph?

I see the conclusion “…are consistent with the methods and motivations of Russian-directed efforts,” but no facts to back that statement up.

Moreover, the second paragraph leaps for the “smoking gun” with:

Some states have also recently seen scanning and probing of their election-related systems, which in most cases originated from servers operated by a Russian company….

You would hope the U.S. Intelligence Community (USIC) would have heard of VPNs (virtual private networks).

No facts, just allegations that favor one party in the fast-approaching U.S. presidential election.

Yes, intelligence agencies are interfering with the U.S. election, but it’s not Russian intelligence agencies.

DNC/DCCC/CF Excel Files, As Of October 7, 2016

Filed under: Cybersecurity,Excel,Government,Hillary Clinton,Politics — Patrick Durusau @ 4:36 pm

A continuation of my post Avoiding Viruses in DNC/DCCC/CF Excel Files.

Where Avoiding Viruses… focused on the hazards of Excel-borne viruses, this post focuses on preparing the DNC/DCCC/CF Excel files from Guccifer 2.0, as of October 7, 2016, for further analysis.

As I mentioned before, you could search through all 517 files to date, separately, using Excel. That thought doesn’t bring me any joy. You?

Instead, I’m proposing that we prepare the files to be concatenated together, resulting in one fairly large file, which we can then search and manipulate as one entity.

As a data cleanliness task, I prefer to prefix every line in every csv export with the name of its original file. That will enable us to extract lines that mention the same person across several files and still have a bread-crumb trail back to the original files.

Munging all the files together without such a step would leave us grepping across the collection and/or using some other search mechanism. Why not plan on avoiding that hassle?

Given the number of files requiring prefixing, I suggest the following:

for f in *.csv*; do
   sed -i "s/^/$f,/" "$f"
done

This shell script uses sed with the -i switch, which makes sed change each file in place (think overwriting). In the substitution s/^/$f,/, the ^ anchors the match at the start of each line, and $f, the current filename plus a comma separator, is inserted there; the final "$f" names the file being processed.

There are any number of ways to accomplish this task. Your community may use a different approach.
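With the prefixes in place, producing the single large file is a matter of concatenation and compression; a minimal sketch (gzip appends its own extension, so the final name may differ slightly from the archive linked below):

cat *.csv* > guccifer2.0-all-spreadsheets-07October2016.csv
gzip guccifer2.0-all-spreadsheets-07October2016.csv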

The result of my efforts is: guccifer2.0-all-spreadsheets-07October2016.gz, which weighs in at 61 MB compressed and 231 MB uncompressed.

I did check and despite having variable row lengths, it does load in my oldish version of gnumeric. All 1030828 lines.

That’s not at all surprising for gnumeric, considering I’m running 24 GB of physical RAM. Your performance may vary. (It did hesitate while loading.)

There is much left to be done, such as deciding what padding is needed to even out all the rows. (I have ideas, suggestions?)

Then come tools to manipulate the CSV. I have a couple of stand-bys, plus a new one that I located while writing this post.

And, of course, once the CSV is cleaned up, what other means can we use to explore the data?
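Even before fancier tooling arrives, plain grep on the combined file pays dividends; a sketch, with a hypothetical name to search for (thanks to the filename prefixes added above, each hit identifies its source spreadsheet):

zcat guccifer2.0-all-spreadsheets-07October2016.gz | grep -i 'jane doe'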

My focus will be on free and high-performance (amazing how often those are found together, Larry Ellison) tools that can easily be used for exploring vast seas of spreadsheet data.

Next post on these Excel files, Monday, October 10, 2016.


I am downloading the cf.7z Guccifer 2.0 drop as I write this update.

Watch for updates on the comprehensive file list and Excel files next Monday. October 8, 2016, 01:04 UTC.

Avoiding Viruses in DNC/DCCC/CF Excel Files

Filed under: Cybersecurity,Excel,Government,Hillary Clinton,Politics — Patrick Durusau @ 4:36 pm

I hope you haven’t opened any of the DNC/DCCC/CF Excel files outside of a VM. See: 517 Excel Files Led The Guccifer2.0 Parade (October 6, 2016).

Yes?

Files from trusted sources can contain viruses. Files from unknown or rogue sources, even more so. However tempting (and easy) it is to open up allegedly purloined files on your desktop, even minimally security-conscious users will resist the temptation.

Warning: I did NOT scan the Excel files for viruses. The best way to avoid Excel viruses is to NOT open Excel files.

I used ssconvert, one of the utilities included with gnumeric, to bulk convert the Excel files to csv format. (Comma Separated Values is documented in RFC 4180.)

Tip: If you are looking for a high performance spreadsheet application, take a look at gnumeric.

Ssconvert relies on file extensions (although other options are available) so I started with:

ssconvert -S donors.xlsx donors.csv

The -S option takes care of workbooks with multiple worksheets. You need a later version of ssconvert (mine is 1.12.9-1, from 2013; the current version of gnumeric and ssconvert is 1.12.31, from August 2016) to convert the .xlsx files without warnings.

I’m upgrading to Ubuntu 16.04 soon so it wasn’t worth the trouble trying to stuff a later version of gnumeric/ssconvert onto my present Ubuntu 14.04.

Despite the warnings, the conversion appears to have worked properly:

[Screenshot: donors.xlsx before conversion]

to its csv output:

[Screenshot: donors.csv output]

I don’t see any problems.

I’m checking a sampling of the other conversions as well.

BTW, do notice the confirmation of reports from some commentators that they contacted donors who confirmed donating, but could not recall the amounts.

Could be true. If you pay protection money often enough, I’m sure it’s hard to recall a specific payment.

Sorry, I got distracted.

So, only 516 files to go.

I don’t recommend you do:

ssconvert -S filename.xlsx filename.csv

516 times. That would be tedious and error-prone.

At least for Linux, I recommend:

for f in *.xls*; do
   ssconvert -S "$f" "$f.csv"
done

The *.xls* pattern captures both .xls and .xlsx files; the loop then invokes ssconvert -S on each file and saves the output under the original name plus the extension .csv.
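As a quick sanity check on the batch conversion, you can compare workbook and export counts; a minimal sketch (with -S, multi-sheet workbooks produce more than one csv file each, so the second count may legitimately be larger):

ls | grep -c '\.xlsx\?$'   # .xls and .xlsx workbooks
ls | grep -c '\.csv'       # csv exports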

The wc -l command reports 1030828 lines in the consolidated csv file for these spreadsheets.

That’s a lot of lines!

I have some suggestions on processing that file, see: DNC/DCCC/CF Excel Files, As Of October 7, 2016.

October 6, 2016

517 Excel Files Led The Guccifer2.0 Parade (October 6, 2016)

Filed under: Government,Hillary Clinton,Politics — Patrick Durusau @ 8:36 pm

As of today, the data dumps by Guccifer2.0 have contained 517 Excel files.

The vehemence of posts dismissing these dumps makes me wonder two things:

  1. How many of these Excel files have the commentators reviewed?
  2. What might one find in them that worries them so?

I don’t know the answer to #1 and I won’t speculate on their diligence in examining these files. You can reach your own conclusions in that regard.

Nor can I give you an answer to #2, but I may be able to help you explore these spreadsheets.

The old-fashioned way, opening each file at one Excel file per minute (assuming normal Office performance, ;-) ), would take longer than an eight-hour day to open them all.

You still must understand and compare the spreadsheets.

To make 517 Excel files more than a number, here’s a list of all the Guccifer2.0 released Excel files as of today: guccifer2.0-excel-files-sorted.txt.
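To check that count for yourself (assuming you have downloaded the list):

wc -l guccifer2.0-excel-files-sorted.txt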

(I do have an unfair advantage in that I am willing to share the files I generate, enabling you to check my statements for yourself. A personal preference for fact-based pleading as opposed to conclusory hand waving.)

If you think of each line in the spreadsheets as a record, this sounds like a record linkage problem. Except they have no uniform number of fields, headers, etc.

With record linkage, we would munge all the records into a single record format and then and only then, match up records to see which ones have data about the same subjects.

Thinking about that, the number 517 looms large because all the formats must be reconciled to one master format, before we start getting useful comparisons.

I think we can do better than that.

First step, let’s consider how to create a master record set that keeps all the data as it exists now in the spreadsheets, but as a single file.

See you tomorrow!

October 5, 2016

Guccifer 2.0’s October 3rd 2016 Data Drop – Old News? (7 Duplicates out of 2085 files)

Filed under: Cybersecurity,Government,Hillary Clinton,Politics — Patrick Durusau @ 8:25 pm

However amusing the headline ‘Guccifer 2.0’ Is Bullshitting Us About His Alleged Clinton Foundation Hack may be, Lorenzo Franceschi-Bicchierai offers no factual evidence to support his claim:

… the hacker’s latest alleged feat appears to be a complete lie.

Or should I say that:

  • Clinton Foundation denies it has been hacked
  • The Hill whines about who is a donor where
  • The Daily Caller says, “nothing to see here, move along, move along”

hardly qualifies as anything I would rely on.

Checking the file names is one rough check for duplication.

First, you need a set of the file names for all the releases on Guccifer 2.0’s blog:

Relying on file names alone is iffy as the same “content” can be in files with different names, or different content in files with the same name. But this is a rough cut against thousands of documents, so file names it is.
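When the archives are unpacked locally, a content-based cross-check is also possible with checksums; a minimal sketch, where the two directory names are hypothetical stand-ins for wherever you extracted each archive:

find cf-7z ebd-cf -type f -exec sha256sum {} + | sort > hashes.txt
awk '{print $1}' hashes.txt | uniq -d | wc -l   # count of duplicated file contents

A count of zero would mean no shared content at all.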

So you can check my work, I saved a copy of the files listed at the blog in date order: guccifer2.0-File-List-By-Blog-Date.txt.

For combining files for use with uniq, you will need a sorted, uniq version of that file: guccifer2.0-File-List-Blog-Sorted-Uniq-lc-final.txt.

Next, there was a major dump of files under the file name 7dc58-ngp-van.7z, approximately 820 MB of files. (Not listed on the blog but from Guccifer 2.0.)

You can use your favorite tool set or grab a copy of: 7dc58-ngp-van-Sorted-Uniq-lc-final.txt.

You need to combine those file names with those from the blog to get a starting set of names for comparison against the alleged Clinton Foundation hack.

Combining those two file name lists together, sorting them and creating a unique list of file names results in: guccifer2.0-30Sept2016-Sorted-Unique.txt.

Follow the same process for ebd-cf.7z, the file that dropped on the 3rd of October 2016. Or grab: ebd-cf-file-Sorted-Uniq-lc-final.txt.

Next, combine guccifer2.0-30Sept2016-Sorted-Unique.txt (the files we knew about before the 3rd of October) with ebd-cf-file-Sorted-Uniq.txt, and sort those file names, resulting in: guccifer2.0-30Sept2016-plus-ebd-cf-file-Sorted.txt.

The final step is to apply uniq -d to guccifer2.0-30Sept2016-plus-ebd-cf-file-Sorted.txt, which should give you the duplicate files, comparing the files in ebd-cf.7z to those known before September 30, 2016.
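Collected in one place, the steps above amount to this short pipeline (file names as in the links):

cat guccifer2.0-30Sept2016-Sorted-Unique.txt ebd-cf-file-Sorted-Uniq.txt | sort > guccifer2.0-30Sept2016-plus-ebd-cf-file-Sorted.txt
uniq -d guccifer2.0-30Sept2016-plus-ebd-cf-file-Sorted.txt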

The results?

11-26-08 nfc members raised.xlsx
db1.mdb
donorsbymm.xlsx
donorsbymm_2.xlsx
netdem03-02.xlsx
thumbs.db
viewfecfiling.xls

Seven files out of 2085 doesn’t sound like a high degree of duplication.

At least not to me.

You?

PS: On the allegations about the Russians, you could ask the Communists in the State Department or try the Army General Staff. 😉 Some of McCarthy’s records are opening up if you need leads.

PPS: Use the final sorted, unique file list to check future releases by Guccifer 2.0. It might help you avoid bullshitting the public.

October 4, 2016

#Guccifer 2.0 Drop – Oct. 4, 2016 – File List

Filed under: Cybersecurity,Government,Politics — Patrick Durusau @ 9:37 pm

While you wait for your copy of the October 4, 2016 drop by #Guccifer 2.0 to download, you may want to peruse the file list for that drop: ebd-cf-file-list.gz.
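If you would rather generate such a list yourself once the download finishes, the 7z command-line tool will produce one; a minimal sketch, assuming p7zip is installed (7z's listing includes header and size columns you may want to trim):

7z l ebd-cf.7z | gzip > ebd-cf-file-list.gz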

A good starting place for comments on this drop is: Guccifer 2.0 posts DCCC docs, says they’re from Clinton Foundation – Files appear to be from Democratic Congressional Campaign Committee and DNC hacks, by Sean Gallagher.

The paragraph in Sean’s post that I find the most interesting is:


However, a review by Ars found that the files are clearly not from the Clinton Foundation. While some of the individual files contain real data, much of it came from other breaches Guccifer 2.0 has claimed credit for at the Democratic National Committee and the Democratic Congressional Campaign Committee—hacks that researchers and officials have tied to “threat groups” connected to the Russian Government. Other data could have been aggregated from public information, while some appears to be fabricated as propaganda.

To verify Sean’s claim of duplication, compare the file names in this dump against those from prior dumps.

Sean is not specific about which files/data are alleged to be “fabricated as propaganda.”

I continue to be amused by allegations of Russian Government involvement. When seeking funding, Russians (substitute other nationalities) possess super-human hacking capabilities. Yet, in cases like this one, which regurgitates old data, Russian Government involvement is presumed.

The inconsistency between Russian Government super-hackers and Russian Government copy-n-paste data leaks doesn’t seem to be getting much play in the media.

Perhaps you can help on that score.

Enjoy!

October 3, 2016

“Just the texts, Ma’am, just the texts” – Colin Powell Emails Sans Attachments

Filed under: Colin Powell Emails,Government,Politics,Uncategorized — Patrick Durusau @ 7:55 pm

As I reported in Bulk Access to the Colin Powell Emails – Update, I was looking for a host for the complete Colin Powell emails at 2.5 GB, but I failed on that score.

I can’t say if that result is lack of interest in making the full emails easily available or if I didn’t ask the right people. Please circulate my request when you have time.

In the meantime, I have been jumping from one “easy” solution to another, most of which involved parsing the .eml files.

But my requirement is to separate the attachments from the emails, quickly and easily, not to parse the .eml files in preparation for further processing.

How does a 22-character command-line sed expression sound?

Do you know of an “easier” solution?

sed -i '/base64/,$d' *

The reasoning: the first attachment (in the event of multiple attachments) will include a line containing the string “base64”, so I pass a range expression that starts at the first match and runs to the end of the message (“$”), delete that range (“d”), and write the files in place (“-i”).
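Because -i overwrites the files in place, it is worth running the command against a copy of the email directory; a minimal sketch (the directory names are hypothetical):

cp -r powell-emails powell-emails-no-attachments
cd powell-emails-no-attachments
sed -i '/base64/,$d' *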

There are far more sophisticated solutions to this problem but as crude as this may be, I have reduced the 2.5 GB archive file that includes all the emails and their attachments down to 63 megabytes.

Attachments are important too but my first steps were to make these and similar files more accessible.

Obtaining > 29K files through the drinking straw at DCLeaks, or waiting until I find a host for a consolidated 2.5 GB file, doesn’t make these files more accessible.

A 63 MB download of the Colin Powell Emails With No Attachments may.

Please feel free to mirror these files.

PS: One oddity I noticed in testing the download: with Chrome, the file size inflates to 294MB; with Mozilla, it is 65MB. Both unpack properly. Suggestions?

PPS: More sophisticated processing of the raw emails and other post-processing to follow.

September 28, 2016

Election Prediction and STEM [Concealment of Bias]

Filed under: Bias,Government,Politics,Prediction — Patrick Durusau @ 8:21 pm

Election Prediction and STEM by Sheldon H. Jacobson.

From the post:

Every U.S. presidential election attracts the world’s attention, and this year’s election will be no exception. The decision between the two major party candidates, Hillary Clinton and Donald Trump, is challenging for a number of voters; this choice is resulting in third-party candidates like Gary Johnson and Jill Stein collectively drawing double-digit support in some polls. Given the plethora of news stories about both Clinton and Trump, November 8 cannot come soon enough for many.

In the Age of Analytics, numerous websites exist to interpret and analyze the stream of data that floods the airwaves and newswires. Seemingly contradictory data challenges even the most seasoned analysts and pundits. Many of these websites also employ political spin and engender subtle or not-so-subtle political biases that, in some cases, color the interpretation of data to the left or right.

Undergraduate computer science students at the University of Illinois at Urbana-Champaign manage Election Analytics, a nonpartisan, easy-to-use website for anyone seeking an unbiased interpretation of polling data. Launched in 2008, the site fills voids in the national election forecasting landscape.

Election Analytics lets people see the current state of the election, free of any partisan biases or political innuendos. The methodologies used by Election Analytics include Bayesian statistics, which estimate the posterior distributions of the true proportion of voters that will vote for each candidate in each state, given both the available polling data and the states’ previous election results. Each poll is weighted based on its age and its size, providing a highly dynamic forecasting mechanism as Election Day approaches. Because winning a state translates into winning all the Electoral College votes for that state (with Nebraska and Maine using Congressional districts to allocate their Electoral College votes), winning by one vote or 100,000 votes results in the same outcome in the Electoral College race. Dynamic programming then uses the posterior probabilities to compile a probability mass function for the Electoral College votes. By design, Election Analytics cuts through the media chatter and focuses purely on data.

If you have ever taken a social science methodologies course then you know:

Election Analytics lets people see the current state of the election, free of any partisan biases or political innuendos.

is as false as anything uttered by any of the candidates seeking nomination and/or the office of the U.S. presidency since January 1, 2016.

It’s an annoying conceit when you realize that every poll is biased, however clean the subsequent number crunching may be.

Bias one step removed isn’t the absence of bias, but the concealment of bias.

September 27, 2016

Bulk Access to the Colin Powell Emails – Update

Filed under: Colin Powell Emails,Government,Journalism,News,Politics,Reporting — Patrick Durusau @ 7:31 pm

Still working on finding a host for the 2.5 GB tarred, gzipped archive of the Colin Powell emails.

As an alternative, I am working on splitting the attachments (the main source of bulk) from the emails themselves.

My thinking at this point is to produce a message-only version of the emails. Emails with attachments will have auto-generated links to the source emails at DCLeaks.com.

Other processing is planned for the message-only version of the emails.

Anyone interested in indexing the attachments? Generating lists of those with pointers shouldn’t be a problem.

Hope to have more progress to report tomorrow!

September 26, 2016

Bulk Access to the Colin Powell Emails

Filed under: Colin Powell Emails,Government,Politics — Patrick Durusau @ 7:26 pm

The Colin Powell Email leak is important, but if you visit the DCLeaks page for Powell emails, June, July and August of 2014, this is what you find:

[Screenshot: DCLeaks search page for the Powell emails, June–August 2014]

If you attempt to use the “search” box, you discover that your search is limited to June, July and August of 2014.

Then you remember the main page:

[Screenshot: DCLeaks contents page for the Powell emails]

Which means every search must be repeated thirteen (13) times to find all relevant emails.

The phone is ringing, your pager is going off, emails and IMs are piling up, and you’re on deadline. How useful is this interface to you as a reporter?

Have your own methods for processing large leaks of documents?

Not relevant here, because access to the Powell emails is one email at a time.

Put your drinking straw into a lake of 29,641 emails.

Best of luck with that drinking straw approach.

I’m suggesting a different approach.

What if someone automated that drinking straw and created a mirrored set of those 29,641 emails? Along with correcting the twelve (12) emails that choked a .eml-to-.mbox converter.
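The automation need not be exotic; a rough sketch with wget, where the URL is a hypothetical stand-in and I assume the per-month pages link directly to downloadable .eml files:

wget --recursive --level=2 --accept '*.eml' --wait=1 https://example.com/powell-emails/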

Interested?

Hosting Request: The full data set runs 2.5 GB, which, if popular, is far more traffic than I can support.

Requirements for hosting:

  1. Distribute the file as delivered to you.
  2. Distribute the file for free.

If you are interested, drop me a line at: patrick@durusau.net.

Warning: I have not checked the files or their attachments for malware, hostile links, etc. Open untrusted files in VMs without network connections. At a minimum.

Test your interest against the emails for March-April of 2016: powell-sample.tar.gz (roughly 108MB).

Manipulation, enhancement and analysis of samples and the full set to follow.
