DNC/DCCC/CF Excel Files, As Of October 7, 2016

October 7th, 2016

A continuation of my post Avoiding Viruses in DNC/DCCC/CF Excel Files.

Where Avoiding Viruses… focused on avoiding the hazards of Excel-borne viruses, this post focuses on preparing the DNC/DCCC/CF Excel files from Guccifer 2.0, as of October 7, 2016, for further analysis.

As I mentioned before, you could search through all 517 files to date, separately, using Excel. That thought doesn’t bring me any joy. You?

Instead, I’m proposing that we prepare the files to be concatenated together, resulting in one fairly large file, which we can then search and manipulate as one entity.

As a data cleanliness task, I prefer to prefix every line in every csv export with the name of its original file. That will enable us to extract lines that mention the same person across several files and still have a breadcrumb trail back to the original files.

Munging all the files together without such a step would leave us grepping across the collection and/or using some other search mechanism. Why not plan on avoiding that hassle?

Given the number of files requiring prefixing, I suggest the following:

for f in *.csv*; do
  sed -i "s/^/$f,/" "$f"
done

This loop uses sed with the -i switch, which means sed changes each file in place (think overwriting the specified part). The s/^/$f,/ substitutes at ^, the start of each line, inserting the filename $f plus a comma separator; the final "$f" is the file being rewritten on that pass through the loop.
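If a file name contains a character that sed’s s/// command treats specially (a slash or an ampersand), the in-place substitution above will misfire. A non-destructive variant using awk sidesteps that and leaves the originals untouched (the prefixed/ directory and the combined output name below are my own choices, not files from the drop):

```shell
# Write prefixed copies instead of editing in place; awk -v passes the
# file name as a literal string, so no characters are special to it.
mkdir -p prefixed
for f in *.csv*; do
    awk -v name="$f" '{ print name "," $0 }' "$f" > "prefixed/$f"
done
# Concatenate the prefixed copies into one searchable file.
cat prefixed/* > all-spreadsheets.txt
```

Either way, the result is one large file with a breadcrumb back to each source spreadsheet.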

There are any number of ways to accomplish this task. Your community may use a different approach.

The result of my efforts is: guccifer2.0-all-spreadsheets-07October2016.gz, which weighs in at 61 MB compressed and 231 MB uncompressed.

I did check: despite having variable row lengths, it does load in my oldish version of gnumeric. All 1030828 lines.

That’s not at all surprising for gnumeric, considering I’m running 24 GB of physical RAM. Your performance may vary. (It did hesitate loading it.)

There is much left to be done, such as deciding what padding is needed to even out all the rows. (I have ideas, suggestions?)
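On the padding question, a two-pass awk sketch: the first pass finds the maximum field count, the second extends every shorter row with empty fields. This splits naively on commas, so quoted fields that themselves contain commas will be miscounted; the input name is this post’s consolidated file, uncompressed:

```shell
f=guccifer2.0-all-spreadsheets-07October2016.csv
# Pass 1 records the widest row. In pass 2, assigning $max to itself
# forces awk to grow shorter records to max fields before printing.
awk -F',' -v OFS=',' '
    NR == FNR { if (NF > max) max = NF; next }
    { $max = $max; print }
' "$f" "$f" > padded.csv
```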

Then there are tools to manipulate the CSV. I have a couple of stand-bys and a new one that I located while writing this post.

And, of course, once the CSV is cleaned up, what other means can we use to explore the data?

My focus will be on free and high-performance tools (amazing how often those are found together, Larry Ellison notwithstanding) that can be easily used for exploring vast seas of spreadsheet data.

Next post on these Excel files, Monday, October 10, 2016.

I am downloading the cf.7z Guccifer 2.0 drop as I write this update.

Watch for updates on the comprehensive file list and Excel files next Monday. (Updated October 8, 2016, 01:04 UTC.)

Avoiding Viruses in DNC/DCCC/CF Excel Files

October 7th, 2016

I hope you haven’t opened any of the DNC/DCCC/CF Excel files outside of a VM. See: 517 Excel Files Led The Guccifer2.0 Parade (October 6, 2016).


Files from trusted sources can contain viruses. Files from unknown or rogue sources even more so. However tempting (and easy) it is to open up alleged purloined files on your desktop, minimally security-conscious users will resist the temptation.

Warning: I did NOT scan the Excel files for viruses. The best way to avoid Excel viruses is to NOT open Excel files.

I used ssconvert, one of the utilities included with gnumeric, to bulk convert the Excel files to csv format. (Comma-Separated Values is documented in RFC 4180.)

Tip: If you are looking for a high performance spreadsheet application, take a look at gnumeric.

Ssconvert relies on file extensions (although other options are available) so I started with:

ssconvert -S donors.xlsx donors.csv

The -S option takes care of workbooks with multiple worksheets. You need a later version of ssconvert (mine is 1.12.9-1, from 2013; the current gnumeric/ssconvert release is 1.12.31, from August 2016) to convert the .xlsx files without warnings.

I’m upgrading to Ubuntu 16.04 soon so it wasn’t worth the trouble trying to stuff a later version of gnumeric/ssconvert onto my present Ubuntu 14.04.

Despite the errors, the conversion appears to have worked properly. Comparing the original spreadsheet to its csv output (screenshots omitted), I don’t see any problems.

I’m checking a sampling of the other conversions as well.

BTW, do notice the confirmation of reports from some commentators that they contacted donors who confirmed donating, but could not recall the amounts.

Could be true. If you pay protection money often enough, I’m sure it’s hard to recall a specific payment.

Sorry, I got distracted.

So, only 516 files to go.

I don’t recommend you do:

ssconvert -S filename.xlsx filename.csv

516 times. That would be tedious and error-prone.

At least for Linux, I recommend:

for f in *.xls*; do
   ssconvert -S "$f" "$f.csv"
done
The *.xls* pattern captures both .xls and .xlsx files; the loop invokes ssconvert -S on each file and saves the output under the original name plus the extension .csv.
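One caution before re-running that loop: *.xls* also matches the foo.xls.csv files the first pass created, so a second pass would feed csv output back into ssconvert. Listing the two extensions explicitly avoids that (a sketch, same ssconvert invocation as above):

```shell
for f in *.xls *.xlsx; do
   [ -e "$f" ] || continue   # skip a literal, unmatched glob
   ssconvert -S "$f" "$f.csv"
done
```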

The wc -l command reports 1030828 lines in the consolidated csv file for these spreadsheets.

That’s a lot of lines!

I have some suggestions on processing that file, see: DNC/DCCC/CF Excel Files, As Of October 7, 2016.

517 Excel Files Led The Guccifer2.0 Parade (October 6, 2016)

October 6th, 2016

As of today, the data dumps by Guccifer2.0 have contained 517 Excel files.

The vehemence of posts dismissing these dumps makes me wonder two things:

  1. How many of the Excel files have these commentators reviewed?
  2. What is it that you might find in them that worries them so?

I don’t know the answer to #1 and I won’t speculate on their diligence in examining these files. You can reach your own conclusions in that regard.

Nor can I give you an answer to #2, but I may be able to help you explore these spreadsheets.

The old-fashioned way, opening each file at one Excel file per minute (assuming normal Office performance ;-) ), would take longer than an eight-hour day to open them all.

You still must understand and compare the spreadsheets.

To make 517 Excel files more than a number, here’s a list of all the Guccifer2.0 released Excel files as of today: guccifer2.0-excel-files-sorted.txt.

(I do have an unfair advantage in that I am willing to share the files I generate, enabling you to check my statements for yourself. A personal preference for fact-based pleading as opposed to conclusory hand waving.)

If you think of each line in the spreadsheets as a record, this sounds like a record linkage problem. Except they have no uniform number of fields, headers, etc.

With record linkage, we would munge all the records into a single record format and then and only then, match up records to see which ones have data about the same subjects.

Thinking about that, the number 517 looms large because all the formats must be reconciled to one master format, before we start getting useful comparisons.

I think we can do better than that.

First step, let’s consider how to create a master record set that keeps all the data as it exists now in the spreadsheets, but as a single file.

See you tomorrow!

Unmasking Tor users with DNS

October 6th, 2016

Unmasking Tor users with DNS by Mark Stockley.

From the post:

Researchers at the KTH Royal Institute of Technology, Stockholm, and Princeton University in the USA have unveiled a new way to attack Tor and deanonymise its users.

The attack, dubbed DefecTor by the researchers in their recently published paper The Effect of DNS on Tor’s Anonymity, uses the DNS lookups that accompany our browsing, emailing and chatting to create a new spin on Tor’s most well-established weakness: correlation attacks.

If you want the lay-person’s explanation of the DNS issue with Tor, see Mark’s post. If you want the technical details, read The Effect of DNS on Tor’s Anonymity.

The immediate take away for the average user is this:

Donate, volunteer, support the Tor project.

Your privacy or lack thereof is up to you.

Arabic/Russian Language Internet

October 6th, 2016

No matter the result of the 2016 US presidential election, mis-information on areas where Arabic and/or Russian are spoken will increase.

If you are creating topic maps and/or want to do useful reporting on such areas consider:

How to get started investigating the Arabic-language internet by Tom Trewinnard, or,

How to get started investigating the Russian-language internet by Aric Toler.

Any hack can quote releases from official sources and leave their readers uninformed.

A journalist takes monotone “facts” from an “official” release and weaves a story of compelling interest to their readers.

Any other guides to language/country specific advice for journalists?

XQuery Snippets on Gist

October 6th, 2016

@XQuery tweeted today:

Check out some of the 1,637 XQuery code snippets on GitHub’s gist service


Not a bad way to get in a daily dose of XQuery!

You can also try Stack Overflow:

XQuery (3,000)

xquery-sql (293)

xquery-3.0 (70)

xquery-update (55)


Terrorist HoneyPots?

October 6th, 2016

I was reading Checking my honeypot day by Mark Hofman when it occurred to me that discovering CIA/NSA/FBI cybertools may not be as hard as I previously thought.

Imagine creating a <insert-current-popular-terrorist-group-name> website, replete with content ripped off from other terrorist websites, including those sponsored by the U.S. government.

Sharpen your skills at creating fake Twitter followers, AI-generated tweets, etc.

Instead of getting a Booz Allen staffer to betray their employer, you can sit back and collect exploits as they are used.

With just a little imagination, you can create honeypots on and off the Dark Web to attract particular intelligence or law enforcement agencies, security software companies, political hackers and others.

If the FBI can run a porn site, you can use a honeypot to collect offensive cyberweapons.

Guccifer 2.0’s October 3rd 2016 Data Drop – Old News? (7 Duplicates out of 2085 files)

October 5th, 2016

However amusing the headline ‘Guccifer 2.0’ Is Bullshitting Us About His Alleged Clinton Foundation Hack may be, Lorenzo Franceschi-Bicchierai offers no factual evidence to support his claim:

… the hacker’s latest alleged feat appears to be a complete lie.

Or should I say that:

  • Clinton Foundation denies it has been hacked
  • The Hill whines about who is a donor where
  • The Daily Caller says, “nothing to see here, move along, move along”

hardly qualifies as anything I would rely on.

Checking the file names is one rough check for duplication.

First, you need a set of the file names for all the releases on Guccifer 2.0’s blog:

Relying on file names alone is iffy as the same “content” can be in files with different names, or different content in files with the same name. But this is a rough cut against thousands of documents, so file names it is.

So you can check my work, I saved a copy of the files listed at the blog in date order: guccifer2.0-File-List-By-Blog-Date.txt.

For combining files for use with uniq, you will need a sorted, unique version of that file: guccifer2.0-File-List-Blog-Sorted-Uniq-lc-final.txt.
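The normalization those posted lists imply (the lc suffix suggests lowercasing, plus sorting and de-duplication) fits in one pipeline; the input and output names here are illustrative, not the actual drop files:

```shell
# lowercase, sort, and drop duplicate names in a single pass
tr '[:upper:]' '[:lower:]' < file-list-raw.txt | sort -u > file-list-sorted-uniq-lc.txt
```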

Next, there was a major dump of files under the file name 7dc58-ngp-van.7z, approximately 820 MB of files. (Not listed on the blog but from Guccifer 2.0.)

You can use your favorite tool set or grab a copy of: 7dc58-ngp-van-Sorted-Uniq-lc-final.txt.

You need to combine those file names with those from the blog to get a starting set of names for comparison against the alleged Clinton Foundation hack.

Combining those two file name lists together, sorting them and creating a unique list of file names results in: guccifer2.0-30Sept2016-Sorted-Unique.txt.

Follow the same process for ebd-cf.7z, the file that dropped on the 3rd of October 2016. Or grab: ebd-cf-file-Sorted-Uniq-lc-final.txt.

Next, combine guccifer2.0-30Sept2016-Sorted-Unique.txt (the files we knew about before the 3rd of October) with ebd-cf-file-Sorted-Uniq.txt, and sort those file names, resulting in: guccifer2.0-30Sept2016-plus-ebd-cf-file-Sorted.txt.

The final step is to apply uniq -d to guccifer2.0-30Sept2016-plus-ebd-cf-file-Sorted.txt, which should give you the duplicate files, comparing the files in ebd-cf.7z to those known before September 30, 2016.
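Put together, the combine-and-compare steps are a short pipeline (file names as used in the post). uniq -d only reports adjacent repeats, which is exactly right here: each input list is duplicate-free, so after a sorted merge any repeated line must occur in both lists:

```shell
# merge the two already-unique name lists, then report names in both
sort guccifer2.0-30Sept2016-Sorted-Unique.txt \
     ebd-cf-file-Sorted-Uniq-lc-final.txt \
     > guccifer2.0-30Sept2016-plus-ebd-cf-file-Sorted.txt
uniq -d guccifer2.0-30Sept2016-plus-ebd-cf-file-Sorted.txt
```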

The results?

11-26-08 nfc members raised.xlsx

Seven files out of 2085 doesn’t sound like a high degree of duplication.

At least not to me.


PS: On the allegations about the Russians, you could ask the Communists in the State Department or try the Army General Staff. ;-) Some of McCarthy’s records are opening up if you need leads.

PPS: Use the final sorted, unique file list to check future releases by Guccifer 2.0. It might help you avoid bullshitting the public.

#Guccifer 2.0 Drop – Oct. 4, 2016 – File List

October 4th, 2016

While you wait for your copy of the October 4, 2016 drop by #Guccifer 2.0 to download, you may want to peruse the file list for that drop: ebd-cf-file-list.gz.

A good starting place for comments on this drop is: Guccifer 2.0 posts DCCC docs, says they’re from Clinton Foundation – Files appear to be from Democratic Congressional Campaign Committee and DNC hacks, by Sean Gallagher.

The paragraph in Sean’s post that I find the most interesting is:

However, a review by Ars found that the files are clearly not from the Clinton Foundation. While some of the individual files contain real data, much of it came from other breaches Guccifer 2.0 has claimed credit for at the Democratic National Committee and the Democratic Congressional Campaign Committee—hacks that researchers and officials have tied to “threat groups” connected to the Russian Government. Other data could have been aggregated from public information, while some appears to be fabricated as propaganda.

To verify Sean’s claim of duplication, compare the file names in this dump against those from prior dumps.

Sean is not specific about which files/data are alleged to be “fabricated as propaganda.”

I continue to be amused by allegations of Russian Government involvement. When seeking funding, Russians (substitute other nationalities) possess super-human hacking capabilities. Yet, in cases like this one, which regurgitates old data, Russian Government involvement is presumed.

The inconsistency between Russian Government super-hackers and Russian Government copy-n-paste data leaks, doesn’t seem to be getting much play in the media.

Perhaps you can help on that score.


An introduction to data cleaning with R

October 4th, 2016

An introduction to data cleaning with R by Edwin de Jonge and Mark van der Loo.


Data cleaning, or data preparation is an essential part of statistical analysis. In fact, in practice it is often more time-consuming than the statistical analysis itself. These lecture notes describe a range of techniques, implemented in the R statistical environment, that allow the reader to build data cleaning scripts for data suffering from a wide range of errors and inconsistencies, in textual format. These notes cover technical as well as subject-matter related aspects of data cleaning. Technical aspects include data reading, type conversion and string matching and manipulation. Subject-matter related aspects include topics like data checking, error localization and an introduction to imputation methods in R. References to relevant literature and R packages are provided throughout.

These lecture notes are based on a tutorial given by the authors at the useR!2013 conference in Albacete, Spain.

Pure gold!

Plus this tip (among others):

Tip. To become an R master, you must practice every day.

The more data you clean, the better you will become!


Deep-Fried Data […money laundering for bias…]

October 4th, 2016

Deep-Fried Data by Maciej Ceglowski. (paper) (video of same presentation) Part of Collections as Data event at the Library of Congress.

If the “…money laundering for bias…” quote doesn’t capture your attention, try:

I find it helpful to think of algorithms as a dim-witted but extremely industrious graduate student, whom you don’t fully trust. You want a concordance made? An index? You want them to go through ten million photos and find every picture of a horse? Perfect.

You want them to draw conclusions on gender based on word use patterns? Or infer social relationships from census data? Now you need some adult supervision in the room.

Besides these issues of bias, there’s also an opportunity cost in committing to computational tools. What irks me about the love affair with algorithms is that they remove a lot of the potential for surprise and serendipity that you get by working with people.

If you go searching for patterns in the data, you’ll find patterns in the data. Whoop-de-doo. But anything fresh and distinctive in your digital collections will not make it through the deep frier.

We’ve seen entire fields disappear down the numerical rabbit hole before. Economics came first, sociology and political science are still trying to get out, bioinformatics is down there somewhere and hasn’t been heard from in a while.

A great read and equally enjoyable presentation.


Moral Machine [Research Design Failure]

October 4th, 2016

Moral Machine

From the webpage:

Welcome to the Moral Machine! A platform for gathering a human perspective on moral decisions made by machine intelligence, such as self-driving cars.

We show you moral dilemmas, where a driverless car must choose the lesser of two evils, such as killing two passengers or five pedestrians. As an outside observer, you judge which outcome you think is more acceptable. You can then see how your responses compare with those of other people.

If you’re feeling creative, you can also design your own scenarios, for you and others to browse, share, and discuss.

The first time I recall hearing this type of discussion was over thirty years ago when a friend, taking an ethics class related the following problem:

You are driving a troop transport with twenty soldiers in the back and are about to enter a one lane bridge. You see a baby sitting in the middle of the bridge. Do you swerve, going down an embankment, killing all on board, or do you go straight?

A lively college classroom discussion erupted and continued for the entire class. Various theories and justifications were offered, etc. When the class bell rang, the professor announced that the child had perished 59 minutes and 59 seconds ago.

As you may guess, not a single person in the class called out “Swerve” when the question was posed.

The exercise was to illustrate that many “moral” decisions are made at the limits of human reaction time. Typically, between 150 and 300 milliseconds. (Speedy Science: How Fast Can You React? is a great activity from Scientific American to test your reaction time.)

The examples in MIT’s Moral Machine perpetuate the myth that moral decisions are the result of reflection and consideration of multiple factors.

Considered moral decisions do exist. Dietrich Bonhoeffer deciding to participate in a conspiracy to assassinate Adolf Hitler. Lyndon Johnson supporting civil rights in the South. But those are not the subject of the “Moral Machine.”

Nor is the “Moral Machine” even a useful simulation of what a driven and/or driverless car would confront. Visibility isn’t an issue as it often is, there are no distractions, no smart phones ringing, no conflicting input from passengers, etc.

In short, the “Moral Machine” creates a fictional choice, about which to solicit your “moral” advice, under conditions you will never experience.

Separating pedestrians from vehicles (once suggested by Buckminster Fuller I think) is a far more useful exercise than college level discussion questions.

Resource: Malware analysis – …

October 4th, 2016

Resource: Malware analysis – learning How To Reverse Malware: A collection of guides and tools by Claus Cramon Houmann.

This resource will provide you theory around learning malware analysis and reverse engineering malware. We keep the links up to date as the infosec community creates new and interesting tools and tips.

Some technical reading to enjoy instead of political debates!


“Just the texts, Ma’am, just the texts” – Colin Powell Emails Sans Attachments

October 3rd, 2016

As I reported in Bulk Access to the Colin Powell Emails – Update, I was looking for a host for the complete Colin Powell emails at 2.5 GB, but I failed on that score.

I can’t say if that result is lack of interest in making the full emails easily available or if I didn’t ask the right people. Please circulate my request when you have time.

In the meantime, I have been jumping from one “easy” solution to another, most of which involved parsing the .eml files.

But my requirement is to separate the attachments from the emails, quickly and easily. Not to parse the .eml files in preparation for further processing.

How does a 22 character, command line sed expression sound?

Do you know of an “easier” solution?

sed -i '/base64/,$d' *

Reasoning that the first attachment (in the event of multiple attachments) will include the string "base64", I pass a range expression that starts at the first matching line and runs to the end of the file ($), delete that range (d), and write each file in place (-i).
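A rough after-the-fact check: no file should still contain the marker string. (False positives are possible if a message body mentions base64 before any attachment, so treat a nonzero count as a prompt to look, not proof of failure.)

```shell
# count files still mentioning base64; 0 means the cut looks clean
grep -l base64 * 2>/dev/null | wc -l
```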

There are far more sophisticated solutions to this problem but as crude as this may be, I have reduced the 2.5 GB archive file that includes all the emails and their attachments down to 63 megabytes.

Attachments are important too but my first steps were to make these and similar files more accessible.

Obtaining > 29K files through the drinking straw at DCLeaks or waiting until I find a host for a consolidated 2.5 GB files, doesn’t make these files more accessible.

A 63 MB download of the Colin Powell Emails With No Attachments may.

Please feel free to mirror these files.

PS: One oddity I noticed in testing the download. With Chrome, the file size inflates to 294MB. With Mozilla, the file size is 65MB. Why? Both unpack properly. Suggestions?

PPS: More sophisticated processing of the raw emails and other post-processing to follow.

Security Community “Reasoning” About Botnets (and malware)

October 2nd, 2016

In case you missed it: Source Code for IoT Botnet ‘Mirai’ Released by Brian Krebs offers this “reasoning” about a recent release of botnet software:

The source code that powers the “Internet of Things” (IoT) botnet responsible for launching the historically large distributed denial-of-service (DDoS) attack against KrebsOnSecurity last month has been publicly released, virtually guaranteeing that the Internet will soon be flooded with attacks from many new botnets powered by insecure routers, IP cameras, digital video recorders and other easily hackable devices.

The leak of the source code was announced Friday on the English-language hacking community Hackforums. The malware, dubbed “Mirai,” spreads to vulnerable devices by continuously scanning the Internet for IoT systems protected by factory default or hard-coded usernames and passwords.

Being a recent victim of a DDoS attack, perhaps Krebs’ anger about the release of Mirai is understandable. But only to a degree.

Non-victims of such DDoS attacks have been quick to take up the “sky is falling” refrain.

Consider Hacker releases code for huge IoT botnet, or, Hacker Releases Code That Powered Record-Breaking Botnet Attack, or, Brace yourselves—source code powering potent IoT DDoSes just went public: Release could allow smaller and more disciplined Mirai botnet to go mainstream, as samples.

Mirai is now available to “anyone,” but here is where the reasoning of Krebs and others breaks down: there is no evidence that “everyone” wants to run a botnet.

Even if the botnet was as easy (sic) to use as Outlook.

For example, gun ownership in the United States now stands at 36% of the adult population, yet that roughly one-third of the population will not commit murder this coming week.

As of 2010, there were roughly 210 million licensed drivers in the United States. Yet, this coming week, it is highly unlikely that any of them will commandeer a truck and run down pedestrians with it.

The point is that the vast majority of users, even if they were competent to read and use the Mirai code, aren’t criminals. Nor does possession of the Mirai code make them criminals.

It could be they are just curious. Or interested in how it was coded. Or, by some off chance, they could even have good intentions and want to study it to fight botnets.

Attempting to prevent the spread of information hasn’t resulted in any apparent benefit, at least to the cyber community at large.

Perhaps it’s time to treat the cyber community as adults, some of whom will make good decisions and some less so.

Value-Add Of Mapping The Food Industry

October 2nd, 2016

Did you know that ten (10) companies control all of the major food/drink brands in the world?


(From These 10 companies control everything you buy, where you can find a larger version of this image.)

You could, with enough searching, have put together all ten of these mini-maps, but then that effort would have to be repeated by everyone seeking the same information.

But, instead of duplicating an initial investment to identify players and their relationships, you can focus on identifying their IP addresses, process control machinery, employees, and other useful data.

What are your value-add of mapping examples?

Nuremberg Trial Verdicts [70th Anniversary]

October 1st, 2016

Nuremberg Trial Verdicts by Jenny Gesley.

From the post:

Seventy years ago – on October 1, 1946 – the Nuremberg trial, one of the most prominent trials of the last century, concluded when the International Military Tribunal (IMT) issued the verdicts for the main war criminals of the Second World War. The IMT sentenced twelve of the defendants to death, seven to terms of imprisonment ranging from ten years to life, and acquitted three.

The IMT was established on August 8, 1945 by the United Kingdom (UK), the United States of America, the French Republic, and the Union of Soviet Socialist Republics (U.S.S.R.) for the trial of war criminals whose offenses had no particular geographical location. The defendants were indicted for (1) crimes against peace, (2) war crimes, (3) crimes against humanity, and of (4) a common plan or conspiracy to commit those aforementioned crimes. The trial began on November 20, 1945 and a total of 403 open sessions were held. The prosecution called thirty-three witnesses, whereas the defense questioned sixty-one witnesses, in addition to 143 witnesses who gave evidence for the defense by means of written answers to interrogatories. The hearing of evidence and the closing statements were concluded on August 31, 1946.

The individuals named as defendants in the trial were Hermann Wilhelm Göring, Rudolf Hess, Joachim von Ribbentrop, Robert Ley, Wilhelm Keitel, Ernst Kaltenbrunner, Alfred Rosenberg, Hans Frank, Wilhelm Frick, Julius Streicher, Walter Funk, Hjalmar Schacht, Karl Dönitz, Erich Raeder, Baldur von Schirach, Fritz Sauckel, Alfred Jodl, Martin Bormann, Franz von Papen, Arthur Seyss-Inquart, Albert Speer, Constantin von Neurath, Hans Fritzsche, and Gustav Krupp von Bohlen und Halbach. All individual defendants appeared before the IMT, except for Robert Ley, who committed suicide in prison on October 25, 1945; Gustav Krupp von Bohlen und Halbach, who was seriously ill; and Martin Bormann, who was not in custody and whom the IMT decided to try in absentia. Pleas of “not guilty” were entered by all the defendants.

The trial record is spread over forty-two volumes, “The Blue Series,” Trial of the Major War Criminals before the International Military Tribunal Nuremberg, 14 November 1945 – 1 October 1946.

All forty-two volumes are available in PDF format and should prove to be a more difficult indexing, mining, modeling, searching challenge than twitter feeds.

Imagine instead of “text” similarity, these volumes were mined for “deed” similarity. Similarity to deeds being performed now. By present day agents.

Instead of seldom visited dusty volumes in the library stacks, “The Blue Series” could develop a sharp bite.

Data Science Toolbox

October 1st, 2016

Data Science Toolbox

From the webpage:

Start doing data science in minutes

As a data scientist, you don’t want to waste your time installing software. Our goal is to provide a virtual environment that will enable you to start doing data science in a matter of minutes.

As a teacher, author, or organization, making sure that your students, readers, or members have the same software installed is not straightforward. This open source project will enable you to easily create custom software and data bundles for the Data Science Toolbox.

A virtual environment for data science

The Data Science Toolbox is a virtual environment based on Ubuntu Linux that is specifically suited for doing data science. Its purpose is to get you started in a matter of minutes. You can run the Data Science Toolbox either locally (using VirtualBox and Vagrant) or in the cloud (using Amazon Web Services).

We aim to offer a virtual environment that contains the software that is most commonly used for data science while keeping it as lean as possible. After a fresh install, the Data Science Toolbox contains the following software:

  • Python, with the following packages: IPython Notebook, NumPy, SciPy, matplotlib, pandas, scikit-learn, and SymPy.
  • R, with the following packages: ggplot2, plyr, dplyr, lubridate, zoo, forecast, and sqldf.
  • dst, a command-line tool for installing additional bundles on the Data Science Toolbox (see next section).

Let us know if you want to see something added to the Data Science Toolbox.

Great resource for doing or teaching data science!

And an example of using a VM to distribute software in a learning environment.

Type-driven Development … [Further Reading]

October 1st, 2016

The Further Reading slide from Edwin Brady’s presentation Type-driven Development of Communicating Systems in Idris (Lambda World, 2016) was tweeted as an image, eliminating the advantages of hyperlinks.

I have reproduced that slide with the links as follows:

Further Reading

On total functional programming

On interactive programming with dependent types

On types for communicating systems:

On Wadler’s paper, you may enjoy the video of his presentation, Propositions as Sessions or his slides (2016), Propositions as Sessions, Philip Wadler, University of Edinburgh, Betty Summer School, Limassol, Monday 27 June 2016.

Government Contractor Persistence

October 1st, 2016

Persistence of data is a hot topic in computer science but did you know government contractors exhibit persistence as well?

Remember the 22,000,000+ record leak from the US Office of Personnel Management?

Leaks don’t happen on their own and it turns out that Keypoint Government Solutions was the weak link in the chain that resulted in that loss.

Cory Doctorow reports in Company suspected of blame in Office of Personnel Management breach will help run new clearance agency:

It’s still not clear how OPM got hacked, but signs point to a failure at one of its contractors, Keypoint Government Solutions, who appear to have lost control of their logins/passwords for sensitive OPM services.

In the wake of the hacks, the job of giving out security clearances has been given to a new government agency, the National Background Investigations Bureau.

NBIB is about to get started, and they’ve announced that they’re contracting out significant operations to Keypoint. Neither Keypoint nor the NBIB would comment on this arrangement.

The loss of 22,000,000 records? Well, that could happen to anybody.


Initiatives, sprints, proclamations, collaborations with industry, academia, etc., are unlikely to change the practice of cybersecurity in the U.S. government.

Changing cybersecurity practices in government requires:

  • Elimination of contractor persistence. One failure is enough.
  • Immediate and permanent separation of management and staff who fail to implement and follow standard security practices.
  • Separated staff and management permanently barred from employment with any government contractor.
  • Staff of prior failed contractors barred from employment at present contractors. (An incentive for contractor staff to report shortfalls in current contracts.)
  • Multi-year funded contracts that include funding for independent red team testing of security.

A “no consequences for failure” security policy defeats all known security policies.

Version 2 of the Hubble Source Catalog [Model For Open Access – Attn: Security Researchers]

September 30th, 2016

Version 2 of the Hubble Source Catalog

From the post:

The Hubble Source Catalog (HSC) is designed to optimize science from the Hubble Space Telescope by combining the tens of thousands of visit-based source lists in the Hubble Legacy Archive (HLA) into a single master catalog.

Version 2 includes:

  • Four additional years of ACS source lists (i.e., through June 9, 2015). All ACS source lists go deeper than in version 1. See current HLA holdings for details.
  • One additional year of WFC3 source lists (i.e., through June 9, 2015).
  • Cross-matching between HSC sources and spectroscopic COS, FOS, and GHRS observations.
  • Availability of magauto values through the MAST Discovery Portal. The maximum number of sources displayed has increased from 10,000 to 50,000.

The HSC v2 contains members of the WFPC2, ACS/WFC, WFC3/UVIS and WFC3/IR Source Extractor source lists from HLA version DR9.1 (data release 9.1). The crossmatching process involves adjusting the relative astrometry of overlapping images so as to minimize positional offsets between closely aligned sources in different images. After correction, the astrometric residuals of crossmatched sources are significantly reduced, to typically less than 10 mas. The relative astrometry is supported by using Pan-STARRS, SDSS, and 2MASS as the astrometric backbone for initial corrections. In addition, the catalog includes source nondetections. The crossmatching algorithms and the properties of the initial (Beta 0.1) catalog are described in Budavari & Lubow (2012).
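To make the crossmatching idea above concrete, here is a toy brute-force matcher in Python. This is only a sketch: a real pipeline uses spatial indexing and proper spherical geometry, while this uses a small-angle approximation and invented coordinates.

```python
import math

def crossmatch(cat_a, cat_b, tol_arcsec):
    """Match each source in cat_a to its nearest cat_b source within tolerance.

    cat_a, cat_b: lists of (ra_deg, dec_deg); returns list of (i, j) index pairs.
    """
    tol_deg = tol_arcsec / 3600.0
    matches = []
    for i, (ra1, dec1) in enumerate(cat_a):
        best, best_d = None, tol_deg
        for j, (ra2, dec2) in enumerate(cat_b):
            # Small-angle approximation: scale the RA offset by cos(dec).
            dra = (ra1 - ra2) * math.cos(math.radians(dec1))
            d = math.hypot(dra, dec1 - dec2)
            if d <= best_d:
                best, best_d = j, d
        if best is not None:
            matches.append((i, best))
    return matches
```

The HSC does this at scale, across tens of thousands of source lists, after first correcting the relative astrometry so that the residuals fall under the matching tolerance.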


There are currently three ways to access the HSC as described below. We are working towards having these interfaces consolidated into one primary interface, the MAST Discovery Portal.

  • The MAST Discovery Portal provides one-stop web access to a wide variety of astronomical data. To access the Hubble Source Catalog v2 through this interface, select Hubble Source Catalog v2 in the Select Collection dropdown, enter your search target, click search and you are on your way. Please try the use case Using the Discovery Portal to Query the HSC.
  • The HSC CasJobs interface permits you to run large and complex queries, phrased in the Structured Query Language (SQL).
  • HSC Home Page

    – The HSC Summary Search Form displays a single row entry for each object, as defined by a set of detections that have been cross-matched and hence are believed to be a single object. Averaged values for magnitudes and other relevant parameters are provided.

    – The HSC Detailed Search Form displays an entry for each separate detection (or nondetection if nothing is found at that position) using all the relevant Hubble observations for a given object (i.e., different filters, detectors, separate visits).
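To give a feel for the CasJobs interface, here is the shape of a query one might submit, run against a toy in-memory SQLite stand-in. The table and column names (HSCTable, MatchID, TargetName, MagAper2) are assumptions for illustration, not the real HSC schema.

```python
import sqlite3

# Toy stand-in for an HSC CasJobs query; schema and data are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE HSCTable (MatchID INTEGER, TargetName TEXT, MagAper2 REAL)")
conn.executemany(
    "INSERT INTO HSCTable VALUES (?, ?, ?)",
    [(1, "M31", 21.3), (2, "M31", 19.8), (3, "M51", 22.1)])

# Bright sources for one target, sorted by magnitude.
rows = conn.execute(
    "SELECT MatchID, MagAper2 FROM HSCTable "
    "WHERE TargetName = ? AND MagAper2 < 22 ORDER BY MagAper2",
    ("M31",)).fetchall()
```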

Amazing, isn’t it?

The astronomy community long ago vanquished data hoarding and constructed tools to avoid moving very large data sets across the network.

All while enabling more and not less access and research using the data.

Contrast that to the sorry state of security research, where example code is condemned, if not actually prohibited by law.

Yet, if you believe current news reports (always an iffy proposition), cybercrime is growing by leaps and bounds. (PwC Study: Biggest Increase in Cyberattacks in Over 10 Years)

How successful is the “data hoarding” strategy of the security research community?

Going My Way? – Explore 1.2 billion taxi rides

September 30th, 2016

Explore 1.2 billion taxi rides by Hannah Judge.

From the post:

Last year the New York City Taxi and Limousine Commission released a massive dataset of pickup and dropoff locations, times, payment types, and other attributes for 1.2 billion trips between 2009 and 2015. The dataset is a model for municipal open data, a tool for transportation planners, and a benchmark for database and visualization platforms looking to test their mettle.

MapD, a GPU-powered database that uses Mapbox for its visualization layer, made it possible to quickly and easily interact with the data. Mapbox enables MapD to display the entire results set on an interactive map. That map powers MapD’s dynamic dashboard, updating the data as you zoom and pan across New York.

Very impressive demonstration of the capabilities of MapD!
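Under the hood, a dashboard like MapD’s aggregates billions of points into cells before drawing anything. A toy sketch of that binning step in Python (cell size and coordinates are illustrative, not MapD’s actual method):

```python
from collections import Counter

def bin_pickups(trips, cell_deg=0.01):
    """trips: list of (lat, lon) pickup points.

    Returns a Counter mapping grid cells (roughly 1 km squares at this
    latitude; the cell size is illustrative) to pickup counts.
    """
    counts = Counter()
    for lat, lon in trips:
        counts[(int(lat // cell_deg), int(lon // cell_deg))] += 1
    return counts

# Two pickups near one corner, one further uptown (coordinates invented).
cells = bin_pickups([(40.7128, -74.0060), (40.7130, -74.0059), (40.805, -73.955)])
```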

Imagine how you can visualize data from your hundreds of users geo-spotting security forces with their smartphones.

Or visualizing data from security forces tracking your citizens.

Technology cuts both ways.

The question is whether the sharper technological sword will be in your hands or those of your opponents.

Introducing the Open Images Dataset

September 30th, 2016

Introducing the Open Images Dataset by Ivan Krasin and Tom Duerig.

From the post:

In the last few years, advances in machine learning have enabled Computer Vision to progress rapidly, allowing for systems that can automatically caption images to apps that can create natural language replies in response to shared photos. Much of this progress can be attributed to publicly available image datasets, such as ImageNet and COCO for supervised learning, and YFCC100M for unsupervised learning.

Today, we introduce Open Images, a dataset consisting of ~9 million URLs to images that have been annotated with labels spanning over 6000 categories. We tried to make the dataset as practical as possible: the labels cover more real-life entities than the 1000 ImageNet classes, there are enough images to train a deep neural network from scratch and the images are listed as having a Creative Commons Attribution license*.

The image-level annotations have been populated automatically with a vision model similar to Google Cloud Vision API. For the validation set, we had human raters verify these automated labels to find and remove false positives. On average, each image has about 8 labels assigned. Here are some examples:

Impressive data set, if you want to recognize a muffin, gherkin, pebble, etc., see the full list at dict.csv.
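If you want to work with the label list programmatically, a minimal sketch follows. I’m assuming dict.csv maps machine label ids to human-readable names; the sample rows and ids below are illustrative, not taken from the actual file.

```python
import csv
import io

# Toy stand-in for dict.csv (assumed two-column "label_id,label_name" layout).
sample = "/m/01yrx,cat\n/m/0bt9lr,dog\n/m/08yi0,muffin\n"

def load_labels(fileobj):
    """Map machine label ids to human-readable names."""
    return {row[0]: row[1] for row in csv.reader(fileobj) if row}

labels = load_labels(io.StringIO(sample))
```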

Hopefully the techniques you develop with these images will lead to more focused image recognition. ;-)

I lightly searched the list and no “non-safe” terms jumped out at me. Suitable for family image training.

ggplot2 2.2.0 coming soon! [Testers Needed!]

September 30th, 2016

ggplot2 2.2.0 coming soon! by Hadley Wickham.

From the post:

I’m planning to release ggplot2 2.2.0 in early November. In preparation, I’d like to announce that a release candidate is now available. Please try it out, and file an issue on GitHub if you discover any problems. I hope we can find and fix any major issues before the official release.

Install the pre-release version with:

# install.packages("devtools")
devtools::install_github("hadley/ggplot2")

If you discover a major bug that breaks your plots, please file a minimal reprex, and then roll back to the released version with:

install.packages("ggplot2")
ggplot2 2.2.0 will be a relatively major release.

The majority of this work was carried out by Thomas Lin Pedersen, who I was lucky to have as my “ggplot2 intern” this summer. Make sure to check out his other visualisation packages: ggraph, ggforce, and tweenr.

Just in case you are casual about time, tomorrow is October 1st, which on most calendars means that “early November” isn’t far off.

Here’s an easy opportunity to test ggplot2 2.2.0 and related visualization packages before the official release.


ORWL – Downside of a Physically Secure Computer

September 30th, 2016

Meet ORWL. The first open source, physically secure computer


If someone has physical access to your computer with secure documents present, it’s game over! ORWL is designed to solve this as the first open source physically secure computer. ORWL (pronounced or-well) is the combination of the physical security from the banking industry (used in ATMs and Point of Sale terminals) and a modern Intel-based personal computer. We’ve designed a stylish glass case which contains the latest processor from Intel – exactly the same processor as you would find in the latest ultrabooks and we added WiFi and Bluetooth wireless connectivity for your accessories. It also has two USB Type C connectors for any accessories you prefer to connect via cables. We then use the built-in Intel 515 HD Video which can output up to 4K video with audio.

The physical security enhancements we’ve added start with a second authentication factor (wireless keyfob) which is processed before the main processor is even powered up. This ensures we are able to check the system’s software for authenticity and security before we start to run it. We then monitor how far your keyfob is from your PC – when you leave the room, your PC will be locked automatically, requiring the keyfob to unlock it again. We’ve also ensured that all information on the system drive is encrypted via the hardware on which it runs. The encryption key for this information is managed by the secure microcontroller which also handles the pre-boot authentication and other security features of the system. And finally, we protect everything with a high security enclosure (inside the glass) that prevents working around our security by physically accessing hardware components.

Any attempt to get physical access to the internals of your PC will delete the cryptographic key, rendering all your data permanently inaccessible!

The ORWL is a good illustration that good security policies can lead to unforeseen difficulties.

Or as the blog post brags:

Any attempt to get physical access to the internals of your PC will delete the cryptographic key, rendering all your data permanently inaccessible!

All I need do to deprive you of your data (think ransomware), is to physically tamper with your ORWL.

Of interest to journalists who need the ability to deprive others of data on very short notice.

Perhaps a fragile version for journalists and a more abuse-resistant version for the average user.


Multiple Backdoors found in D-Link DWR-932 B LTE Router [There is an upside.]

September 29th, 2016

Multiple Backdoors found in D-Link DWR-932 B LTE Router by Swati Khandelwal.

From the post:

If you own a D-Link wireless router, especially DWR-932 B LTE router, you should get rid of it, rather than wait for a firmware upgrade that never lands soon.

D-Link DWR-932B LTE router is allegedly vulnerable to over 20 issues, including backdoor accounts, default credentials, leaky credentials, firmware upgrade vulnerabilities and insecure UPnP (Universal Plug-and-Play) configuration.

If successfully exploited, these vulnerabilities could allow attackers to remotely hijack and control your router, as well as network, leaving all connected devices vulnerable to man-in-the-middle and DNS poisoning attacks.

Moreover, your hacked router can be easily abused by cybercriminals to launch massive Distributed Denial of Service (DDoS) attacks, as the Internet has recently witnessed record-breaking 1 Tbps DDoS attack that was launched using more than 150,000 hacked Internet-connected smart devices.

Security researcher Pierre Kim has discovered multiple vulnerabilities in the D-Link DWR-932B router that’s available in several countries to provide the Internet with an LTE network.

The current list price for this cyber-horror at Amazon.co.uk is £95.97. Wow!

Once word spreads about its Swiss-cheese-like security characteristics, one hopes its used price will fall rapidly.

Swati’s post makes a great starting checklist for grading penetration of the router for exam purposes.


PS: I’m willing to pay $10.00 plus shipping for one. (Contact me for details.)

The Simpsons by the Data [South Park as well]

September 29th, 2016

The Simpsons by the Data by Todd Schneider.

From the post:

The Simpsons needs no introduction. At 27 seasons and counting, it’s the longest-running scripted series in the history of American primetime television.

The show’s longevity, and the fact that it’s animated, provides a vast and relatively unchanging universe of characters to study. It’s easier for an animated show to scale to hundreds of recurring characters; without live-action actors to grow old or move on to other projects, the denizens of Springfield remain mostly unchanged from year to year.

As a fan of the show, I present a few short analyses about Springfield, from the show’s dialogue to its TV ratings. All code used for this post is available on GitHub.

Alert! You must run Flash in order to access Simpsons World, the source of Todd’s data.

Advice: Treat Flash as malware and run in a VM.

Todd covers the number of words spoken per character, gender imbalance, focus on characters, viewership, and episode summaries (tf-idf).
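If you want to try the tf-idf step yourself, here is a minimal pure-Python sketch. Todd’s analysis is in R; this is only the textbook formula applied to toy token lists, with invented Simpsons-flavored data.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists; returns one {term: score} dict per document."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (c / len(doc)) * math.log(n / df[t])
                       for t, c in tf.items()})
    return scores

# Terms common to every document score zero; distinctive terms stand out.
episodes = [["duff", "beer", "donut"], ["beer", "donut", "monorail"]]
scores = tf_idf(episodes)
```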

Other analysis awaits your imagination and interest.

BTW, if you want comedy data a bit closer to the edge, try Text Mining South Park by Kaylin Walker. Kaylin uses R for her analysis as well.

Other TV programs with R-powered analysis?

Graph Computing with Apache TinkerPop

September 29th, 2016

From the description:

Apache TinkerPop serves as an Apache governed, vendor-agnostic, open source initiative providing a standard interface and query language for both OLTP- and OLAP-based graph systems. This presentation will outline the means by which vendors implement TinkerPop and then, in turn, how the Gremlin graph traversal language is able to process the vendor’s underlying graph structure. The material will be presented from the perspective of the DSEGraph team’s use of Apache TinkerPop in enabling graph computing features for DataStax Enterprise customers.

Slides: https://www.slideshare.net/DataStax/datastax-graph-computing-with-apache-tinkerpop-marko-rodriguez-cassandra-summit-2016

Marko is brutally honest.

He warns the early part of his presentation is stream of consciousness and that is the truth!


Skip to time mark 11:37, where the description of Gremlin as a language begins.

Marko slows, momentarily, but rapidly picks up speed.

Watch the video, then grab the slides and mark what has captured your interest. Use the slides as your basis for exploring Gremlin and Apache TinkerPop documentation.
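If Gremlin is new to you, its core idea, walking a graph by repeatedly following labeled edges, can be sketched in a few lines of Python over a toy adjacency map. The vertex and edge names below are illustrative, loosely echoing TinkerPop’s sample graph; real Gremlin traversals compose many more step types.

```python
# A toy in-memory property graph: vertex -> {edge label -> neighbors}.
graph = {
    "marko": {"knows": ["vadas", "josh"], "created": ["lop"]},
    "josh":  {"created": ["ripple", "lop"]},
}

def out(vertices, label):
    """Follow outgoing edges with the given label from each vertex, in order."""
    return [v for src in vertices for v in graph.get(src, {}).get(label, [])]

# Roughly analogous to the Gremlin traversal g.V('marko').out('knows')
result = out(["marko"], "knows")
```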


Are You A Moral Manipulator?

September 29th, 2016

I appreciated Nir’s reminder about the #1 rule for drug dealers.

If you don’t know it, the video is only a little over six minutes long.


Election Prediction and STEM [Concealment of Bias]

September 28th, 2016

Election Prediction and STEM by Sheldon H. Jacobson.

From the post:

Every U.S. presidential election attracts the world’s attention, and this year’s election will be no exception. The decision between the two major party candidates, Hillary Clinton and Donald Trump, is challenging for a number of voters; this choice is resulting in third-party candidates like Gary Johnson and Jill Stein collectively drawing double-digit support in some polls. Given the plethora of news stories about both Clinton and Trump, November 8 cannot come soon enough for many.

In the Age of Analytics, numerous websites exist to interpret and analyze the stream of data that floods the airwaves and newswires. Seemingly contradictory data challenges even the most seasoned analysts and pundits. Many of these websites also employ political spin and engender subtle or not-so-subtle political biases that, in some cases, color the interpretation of data to the left or right.

Undergraduate computer science students at the University of Illinois at Urbana-Champaign manage Election Analytics, a nonpartisan, easy-to-use website for anyone seeking an unbiased interpretation of polling data. Launched in 2008, the site fills voids in the national election forecasting landscape.

Election Analytics lets people see the current state of the election, free of any partisan biases or political innuendos. The methodologies used by Election Analytics include Bayesian statistics, which estimate the posterior distributions of the true proportion of voters that will vote for each candidate in each state, given both the available polling data and the states’ previous election results. Each poll is weighted based on its age and its size, providing a highly dynamic forecasting mechanism as Election Day approaches. Because winning a state translates into winning all the Electoral College votes for that state (with Nebraska and Maine using Congressional districts to allocate their Electoral College votes), winning by one vote or 100,000 votes results in the same outcome in the Electoral College race. Dynamic programming then uses the posterior probabilities to compile a probability mass function for the Electoral College votes. By design, Election Analytics cuts through the media chatter and focuses purely on data.
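The dynamic-programming step described in the quote is straightforward to sketch: treat each state as an independent Bernoulli trial worth its electoral votes and convolve the outcomes. A toy Python version follows; the state sizes and win probabilities are invented, not Election Analytics’ actual model.

```python
def ev_distribution(states):
    """states: list of (electoral_votes, win_probability) per state.

    Returns {total_votes: probability}, the probability mass function built
    by dynamic programming, folding in one state at a time.
    """
    dist = {0: 1.0}
    for ev, p in states:
        new = {}
        for votes, prob in dist.items():
            new[votes] = new.get(votes, 0.0) + prob * (1 - p)       # lose state
            new[votes + ev] = new.get(votes + ev, 0.0) + prob * p   # win state
        dist = new
    return dist

# Three invented states; probability of reaching a 62-vote majority of 122.
dist = ev_distribution([(29, 0.6), (38, 0.3), (55, 0.9)])
p_win = sum(prob for votes, prob in dist.items() if votes >= 62)
```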

If you have ever taken a social science methodologies course then you know:

Election Analytics lets people see the current state of the election, free of any partisan biases or political innuendos.

is as false as anything uttered by any of the candidates seeking nomination and/or the office of the U.S. presidency since January 1, 2016.

It’s an annoying conceit when you realize that every poll is biased, however clean the subsequent number crunching may be.

Bias one step removed isn’t the absence of bias, but the concealment of bias.