Archive for July, 2016

How-To Track Projects Like A Defense Contractor

Sunday, July 31st, 2016

Transparency Tip: How to Track Government Projects Like a Defense Contractor by Dave Maass.

From the post:

Over the last year, thousands of pages of sensitive documents outlining the government’s intelligence practices have landed on our desktops.

One set of documents describes the Director of National Intelligence’s goal of funding “dramatic improvements in unconstrained face recognition.” A presentation from the Navy uses examples from Star Trek to explain its electronic warfare program. Other records show the FBI was purchasing mobile phone extraction devices, malware and fiber network-tapping systems. A sign-in list shows the names and contact details of hundreds of cybersecurity contractors who turned up at a Department of Homeland Security “Industry Day.” Yet another document, a heavily redacted contract, provides details of U.S. assistance with drone surveillance programs in Burundi, Kenya and Uganda.

But these aren’t top-secret records carefully leaked to journalists. They aren’t classified dossiers pasted haphazardly on the Internet by hacktivists. They weren’t even liberated through the Freedom of Information Act. No, these public documents are available to anyone who looks at the U.S. government’s contracting website. In this case, “anyone” is usually just contractors looking to sell goods, services, or research to the government. But, because the government often makes itself more accessible to businesses than to the general public, it’s also a useful tool for watchdogs. Every government program costs money, and whenever money is involved, there’s a paper trail.

Searching is difficult enough that there are firms that offer search services to assist contractors with locating business opportunities.

Collating data with topic maps (read adding data) will be a value-add to watchdogs, potential contractors (including yourself), or watchers watching watchers.

Dave’s post will get you started on your way.

Digital Humanities In the Library

Sunday, July 31st, 2016

Digital Humanities In the Library / Of the Library: A dh+lib Special Issue

A special issue of dh + lib introduced by Sarah Potvin, Thomas Padilla and Caitlin Christian-Lamb in their essay: Digital Humanities In the Library / Of the Library, saying:

What are the points of contact between digital humanities and libraries? What is at stake, and what issues arise when the two meet? Where are we, and where might we be going? Who are “we”? By posing these questions in the CFP for a new dh+lib special issue, the editors hoped for sharp, provocative meditations on the state of the field. We are proud to present the result, ten wide-ranging contributions by twenty-two authors, collectively titled “Digital Humanities In the Library / Of the Library.”

We make the in/of distinction pointedly. Like the Digital Humanities (DH), definitions of library community are typically prefigured by “inter-” and “multi-” frames, rendered as work and values that are interprofessional, interdisciplinary, and multidisciplinary. Ideally, these characterizations attest to diversified yet unified purpose, predicated on the application of disciplinary expertise and metaknowledge to address questions that resist resolution from a single perspective. Yet we might question how a combinatorial impulse obscures the distinct nature of our contributions and, consequently, our ability to understand and respect individual agency. Working across the similarly encompassing and amorphous contours of the Digital Humanities compels the library community to reckon with its composite nature.

All of the contributions merit your attention but I was especially taken by When Metadata Becomes Outreach: Indexing, Describing, and Encoding For DH by Emma Annette Wilson and Mary Alexander, which has this gem that will resonate with topic map fans:

DH projects require high-quality metadata in order to thrive, and the bigger the project, the more important that metadata becomes to make data discoverable, navigable, and open to computational analysis. The functions of all metadata are to allow our users to identify and discover resources through records acting as surrogates of resources, and to discover similarities, distinctions, and other nuances within single texts or across a corpus. High quality metadata brings standardization to the project by recording elements’ definitions, obligations, repeatability, rules for hierarchical structure, and attributes. Input guidelines and the use of controlled vocabularies bring consistencies that promote findability for researchers and users alike.

Modulo my reservations about the data/metadata distinction depending upon a point of view (and all of them being subjects in any event), it’s hard to think of a clearer statement of the value that a topic map could bring to a DH project.

Consistencies can peacefully co-exist with historical or present-day inconsistencies, at least so long as you are using a topic map.

I commend the entire issue to you for reading!

NGREP – Julia Evans

Sunday, July 31st, 2016

Julia Evans demonstrates how to get around the limits of Twitter and introduces you to a “starter network spy tool.”


A demonstration of her writing skills as well!

Ngrep at sourceforge.

Installing on Ubuntu 14.04:

sudo apt-get update
sudo apt-get install ngrep

I’m a follower of Julia’s but even so, I checked the man page for ngrep before running the example.

The command:

sudo ngrep -d any metafilter

is interpreted as:

sudo – runs ngrep as superuser (hence my caution)

ngrep – network grep

-d any – ngrep listens to “any” interface *

metafilter – match expression, packets that match are dumped.

* The “any” value following -d was the hardest value to track down. The man page for ngrep describes the -d switch this way:

-d dev

By default ngrep will select a default interface to listen on. Use this option to force ngrep to listen on interface dev.

Well, that’s less than helpful. 😉

Until you discover on the tcpdump man page:

Listen on interface. If unspecified, tcpdump searches the system interface list for the lowest numbered, configured up interface (excluding loopback), which may turn out to be, for example, “eth0”.
On Linux systems with 2.2 or later kernels, an interface argument of “any” can be used to capture packets from all interfaces. Note that captures on the “any” device will not be done in promiscuous mode. (bold highlight added)

If you are running a Linux system with a 2.2 or later kernel, you can use the “any” argument with ngrep’s -d (interface) switch.

Understanding the entire command, I then felt safe running it as root. 😉 Not that I expected a bad outcome but I learned something in the process of researching the command.

Be aware that ngrep has a plethora of switches, options, bpf filters (Berkeley packet filters) and the like. The man page runs eight pages of, well, man-page-type material.
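Conceptually, ngrep’s match expression is just a pattern search over packet payloads. A toy illustration in Python (no packet capture involved; the payloads here are made-up byte strings, not real traffic):

```python
import re

def grep_payloads(payloads, pattern):
    """Return payloads whose bytes match the pattern, mimicking
    ngrep's match expression (case-insensitive here, like ngrep -i)."""
    regex = re.compile(pattern.encode(), re.IGNORECASE)
    return [p for p in payloads if regex.search(p)]

# Hypothetical payloads, as if captured off the wire:
packets = [
    b"GET / HTTP/1.1\r\nHost: www.metafilter.com\r\n",
    b"GET /feed HTTP/1.1\r\nHost: example.org\r\n",
]

print(len(grep_payloads(packets, "metafilter")))  # → 1
```

The real tool does this against live interfaces, which is why it needs root.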


Who Decides On Data Access?

Sunday, July 31st, 2016

In a Twitter dust-up following The Privileged Cry: Boo, Hoo, Hoo Over Release of OnionScan Data, the claim was made by [Λ•]ltSciFi@altscifi_ that:

@SarahJamieLewis You take an ethical stance. @patrickDurusau does not. Note his regression to a childish tone. Also:…

To which I responded:

@altscifi_ @SarahJamieLewis Interesting. Questioning genuflection to privilege is a “childish tone?” Is name calling the best you can do?

Which earned this response from [Λ•]ltSciFi@altscifi_:

@patrickDurusau @SarahJamieLewis Not interested in wasting time arguing with you. Your version of “genuflection” doesn’t merit the effort.

Anything beyond name calling is too much effort for [Λ•]ltSciFi@altscifi_. Rather than admit they haven’t thought about the issue of the ethics of data access beyond “me too!,” it saves face to say discussion is a waste of time.

I have never denied that access to data can raise ethical issues or that such issues merit discussion.

What I do object to is that in such discussions, it has been my experience (important qualifier), that those urging ethics of data access have someone in mind to decide on data access. Almost invariably, themselves.

Take the recent “weaponized transparency” rhetoric of the Sunlight Foundation as an example. We can argue about the ethics of particular aspects of the DNC data leak, but the fact remains that the Sunlight Foundation considers itself, and not you, as the appropriate arbiter of access to an unfiltered version of that data.

I assume the Sunlight Foundation would include as appropriate arbiters many of the usual news organizations that accept leaked documents and reveal to the public only so much as they choose to reveal.

Not to pick on the Sunlight Foundation alone: there is an alphabet soup of U.S. government agencies that make similar claims about what should or should not be revealed to the public. I have no more sympathy for their claims of a right to limit data access than for those of more public-minded organizations.

Take the data dump of OnionScan data for example. Sarah Jamie Lewis may choose to help sites for victims of abuse (a good thing in my opinion) whereas others of us may choose to fingerprint and out government spy agencies (some may see that as a bad thing).

The point being that the OnionScan data dump enables more people to make those “ethical” choices, rather than being preempted by claims that data such as the OnionScan data should not be widely available.

BTW, in a later tweet Sarah Jamie Lewis says:

In which I am called privileged for creating an open source tool & expressing concerns about public deanonymization.

Missing the issue entirely, as she was quoted expressing concerns over the OnionScan data dump. Public deanonymization is a legitimate concern, so long as we all get to decide those concerns for ourselves. Lewis is trying to swap her weak claim over the data dump for the stronger claim over public deanonymization.

Unlike most of the discussants you will find, I don’t want to decide on what data you can or cannot see.

Why would I? I can’t foresee all uses and/or what data you might combine it with. Or with what intent?

If you consider the history of data censorship by governments, we haven’t done terribly well in our choices of censors or in the results of their censorship.

Let’s allow people to exercise their own sense of ethics. We could hardly do worse than we have so far.

Pandas Exercises

Saturday, July 30th, 2016

Pandas Exercises

From the post:

Fed up with a ton of tutorials but no easy way to find exercises I decided to create a repo just with exercises to practice pandas. Don’t get me wrong, tutorials are great resources, but to learn is to do. So unless you practice you won’t learn.

There will be three different types of files:

  1. Exercise instructions
  2. Solutions without code
  3. Solutions with code and comments

My suggestion is that you learn a topic in a tutorial or video and then do exercises. Learn one more topic and do exercises. If you got the answer wrong, don’t go to the solution with code, follow this advice instead.

Suggestions and collaborations are more than welcome. 🙂

I’m sure you will find this useful but when I search for pandas exercise python, I get 298,000 “hits.”

Adding exercises here isn’t going to improve the findability of pandas for particular subject areas or domains.

Perhaps as exercises are added here, links to exercises by subject area can be added as well.

With nearly 300K potential sources, there is no shortage of exercises to go around!
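To give a flavor of the drill the repo encourages (learn a topic, then practice it), here is a tiny made-up exercise of my own, not one from the repository:

```python
import pandas as pd

# Made-up data in the spirit of the exercises.
df = pd.DataFrame({
    "species": ["cat", "dog", "cat", "dog"],
    "weight_kg": [4.0, 12.0, 5.0, 9.0],
})

# Exercise: compute the mean weight per species.
mean_weights = df.groupby("species")["weight_kg"].mean()
print(mean_weights["cat"], mean_weights["dog"])  # → 4.5 10.5
```

Try writing the groupby line yourself before peeking at a solution — that’s the whole point of exercises over tutorials.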

The Privileged Cry: Boo, Hoo, Hoo Over Release of OnionScan Data

Saturday, July 30th, 2016

It hasn’t taken long for the privileged to cry “boo, hoo, hoo,” over Justin Seitz’s releasing the results of using OnionScan on over 8,000 Dark Web sites. You can find Justin’s dump here.

Joseph Cox writes in: Hacker Mass-Scans Dark Web Sites for Vulnerabilities, Dumps Results:

…Sarah Jamie Lewis, the creator of OnionScan, warns that publishing the full dataset like this may lead to some Tor hidden services being unmasked. In her own reports, Lewis has not pointed to specific sites or released the detailed results publicly, and instead only provided summaries of what she found.

“If more people begin publishing these results then I imagine there are a whole range of deanonymization vectors that come from monitoring page changes over time. Part of the reason I destroy OnionScan results once I’m done with them is because people deserve a chance to fix the issue and move on—especially when it comes to deanonymization vectors,” Lewis told Motherboard in an email, and added that she has, when legally able to, contacted some sites to help them fix issues quietly.

Sarah Jamie Lewis and others who seek to keep vulnerability data secret are making two assumptions:

  1. They should have exclusive access to data.
  2. Widespread access to data diminishes their power and privilege.

I agree only with #2 and it is the reason I support public and widespread distribution of data, all data.

Widespread access to data means it is your choices and abilities that determine its uses and not privilege of access.

BTW, Justin has the better of the exchange:

Seitz, meanwhile, thinks his script could be a useful tool to many people. “Too often we set the bar so high for the general practitioner (think journalists, detectives, data geeks) to do some of this larger scale data work that people just can’t get into it in a reasonable way. I wanted to give people a starting point,” he said.

“I am a technologist, so it’s the technology and resulting data that interest me, not the moral pros and cons of data dumping, anonymity, etc. I leave that to others, and it is a grey area that as an offensive security guy I am no stranger to,” he continued.

The question is: Do you want privileged access to data for Sarah Jamie Lewis and a few others or do you think everyone should have equal access to data?

I know my answer.

What’s yours?

Dark Web OSINT With Python and OnionScan: Part One

Saturday, July 30th, 2016

Dark Web OSINT With Python and OnionScan: Part One by Justin.

When you tire of what passes for political discussion on Twitter and/or Facebook this weekend, why not try your hand at something useful?

Like looking for data leaks on the Dark Web?

You could, in theory at least, notify the sites of their data leaks. 😉

One aspect of announced leaks that never ceases to amaze me is reports that read:

Well, we pwned the (some string of letters) database and then notified them of the issue.

Before getting a copy of the entire database? What’s the point?

All you have accomplished is making another breach more difficult and demonstrating your ability to breach a system where the root password was most likely “god.”

Anyway, Justin gets you started on seeking data leaks on the Dark Web saying:

You may have heard of this awesome tool called OnionScan that is used to scan hidden services in the dark web looking for potential data leaks. Recently the project released some cool visualizations and a high level description of what their scanning results looked like. What they didn’t provide is how to actually go about scanning as much of the dark web as possible, and then how to produce those very cool visualizations that they show.

At a high level we need to do the following:

  1. Setup a server somewhere to host our scanner 24/7 because it takes some time to do the scanning work.
  2. Get TOR running on the server.
  3. Get OnionScan setup.
  4. Write some Python to handle the scanning and some of the other data management to deal with the scan results.
  5. Write some more Python to make some cool graphs. (Part Two of the series)

Let’s get started!
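For step 4, the skeleton is just Python shelling out to the scanner and parsing its JSON report. A minimal sketch of my own (the --jsonReport flag and the error handling are my assumptions; Justin’s post has the real code):

```python
import json
import subprocess

def build_scan_command(onion_address):
    """Command line for scanning one hidden service. Assumes an
    `onionscan` binary on the PATH that supports --jsonReport."""
    return ["onionscan", "--jsonReport", onion_address]

def scan_onion(onion_address, timeout=300):
    """Run OnionScan against one hidden service and return its
    parsed JSON report, or None on failure. Hidden services are
    slow and flaky, so a timeout is essential."""
    try:
        result = subprocess.run(build_scan_command(onion_address),
                                capture_output=True, timeout=timeout)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return None
    if result.returncode != 0:
        return None
    return json.loads(result.stdout)

print(" ".join(build_scan_command("exampleonionaddr.onion")))
# → onionscan --jsonReport exampleonionaddr.onion
```

Looping that function over a list of onion addresses and saving the reports to disk is essentially steps 1–4 in miniature.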

Very much looking forward to Part 2!


A Study in News Verification

Friday, July 29th, 2016

Turkey, propaganda and eyewitness media: A case study in verification for news by Sam Dubberley.

I would amend Michael Garibaldi‘s line in Babylon 5: Exercise of Vital Powers (#4.16):

Everybody lies.

to read:

Everybody lies. [The question is why?]

No report (“true” or “false”) is made to you without motivation. The attempt to discern that motivation can improve your handling of such reports.

Sam’s account is a great illustration of taking the motivation for a report into account.

How-To Get Published In Scientific American

Friday, July 29th, 2016

Summarize the obvious:


“There are hackers, hackers I say that are breaking into computer systems!”

If the near omnipresence of hackers in all information systems surprises you, may I suggest that you join a survivalist community at your earliest opportunity?

The rest of the post summarizes the conclusion-rich but fact-poor popular opinions of US security contractors whose Magic-8 ball pointed towards Russia for this latest hacking incident.

Skip this article if you are looking for “scientific” content in Scientific American.

QRLJacking [July 28, 2016]

Thursday, July 28th, 2016

QRLJacking — Hacking Technique to Hijack QR Code Based Quick Login System by Swati Khandelwal.

I put today’s date in the title so several years from now, when a “security expert” breathlessly reports on “terrorists” using QRLJacking, you can easily find that it has been in use for years.

For some reason, “security experts” fail to mention that governments, banks, privacy advocates and numerous others in all walks of life and business use cybersecure services. Maybe that’s not a selling point for them. You think?

In any event, Swati gives a great introduction to QRLJacking, starting with:

Do you know that you can access your WeChat, Line and WhatsApp chats on your desktop as well using an entirely different, but fastest authentication system?

It’s SQRL, or Secure Quick Response Login, a QR-code-based authentication system that allows users to quickly sign into a website without having to memorize or type in any username or password.

QR codes are two-dimensional barcodes that contain a significant amount of information such as a shared key or session cookie.

A website that implements QR-code-based authentication system would display a QR code on a computer screen and anyone who wants to log-in would scan that code with a mobile phone app.

Once scanned, the site would log the user in without typing in any username or password.

Since passwords can be stolen using a keylogger, a man-in-the-middle (MitM) attack, or even brute force attack, QR codes have been considered secure as it randomly generates a secret code, which is never revealed to anybody else.

But, no technology is immune to being hacked when hackers are motivated.

Following this post and the resources therein, you will be well prepared for when your usual targets decide to “upgrade” to SQRL, or Secure Quick Response Login.


PS: There is a well-known pattern in this attack, one that is true for other online security systems. Do you see it?

U.S. Climate Resilience Toolkit

Thursday, July 28th, 2016

Bringing climate information to your backyard: the U.S. Climate Resilience Toolkit by Tamara Dickinson and Kathryn Sullivan.

From the post:

Climate change is a global challenge that requires local solutions. Today, a new version of the Climate Resilience Toolkit brings climate information to your backyard.

The Toolkit, called for in the President’s Climate Action Plan and developed by the National Oceanic and Atmospheric Administration (NOAA), in collaboration with a number of Federal agencies, was launched in 2014. After collecting feedback from a diversity of stakeholders, the team has updated the Toolkit to deliver more locally-relevant information and to better serve the needs of its users. Starting today, Toolkit users will find:

  • A redesigned user interface that is responsive to mobile devices;
  • County-scale climate projections through the new version of the Toolkit’s Climate Explorer;
  • A new “Reports” section that includes state and municipal climate-vulnerability assessments, adaptation plans, and scientific reports; and
  • A revised “Steps to Resilience” guide, which communicates steps to identifying and addressing climate-related vulnerabilities.

Thanks to the Toolkit’s Climate Explorer, citizens, communities, businesses, and policy leaders can now visualize both current and future climate risk on a single interface by layering up-to-date, county-level, climate-risk data with maps. The Climate Explorer allows coastal communities, for example, to overlay anticipated sea-level rise with bridges in their jurisdiction in order to identify vulnerabilities. Water managers can visualize which areas of the country are being impacted by flooding and drought. Tribal nations can see which of their lands will see the greatest mean daily temperature increases over the next 100 years.  

A number of decision makers, including the members of the State, Local, and Tribal Leaders Task Force, have called on the Federal Government to develop actionable information at local-to-regional scales.  The place-based, forward-looking information now available through the Climate Explorer helps to meet this demand.

The Climate Resilience Toolkit update builds upon the Administration’s efforts to boost access to data and information through resources such as the National Climate Assessment and the Climate Data Initiative. The updated Toolkit is a great example of the kind of actionable information that the Federal Government can provide to support community and business resilience efforts. We look forward to continuing to work with leaders from across the country to provide the tools, information, and support they need to build healthy and climate-ready communities.

Check out the new capabilities today at!

I have only started to explore this resource but thought I should pass it along.

Of particular interest to me is the integration of data/analysis from this resource with other data.


greek-accentuation 1.0.0 Released

Thursday, July 28th, 2016

greek-accentuation 1.0.0 Released by James Tauber.

From the post:

greek-accentuation has finally hit 1.0.0 with a couple more functions and a module layout change.

The library (which I’ve previously written about here) has been sitting on 0.9.9 for a while and I’ve been using it successfully in my inflectional morphology work for 18 months. There were, however, a couple of functions that lived in the inflectional morphology repos that really belonged in greek-accentuation. They have now been moved there.

If that sounds a tad obscure, some additional explanation from an earlier post by James:

It [greek-accentuation] consists of three modules:

  • characters
  • syllabify
  • accentuation

The characters module provides basic analysis and manipulation of Greek characters in terms of their Unicode diacritics as if decomposed. So you can use it to add, remove or test for breathing, accents, iota subscript or length diacritics.

The syllabify module provides basic analysis and manipulation of Greek syllables. It can syllabify words, give you the onset, nucleus, coda, rime or body of a syllable, judge syllable length or give you the accentuation class of a word.

The accentuation module uses the other two modules to accentuate Ancient Greek words. As well as listing possible_accentuations for a given unaccented word, it can produce recessive and (given another form with an accent) persistent accentuations.
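To see the raw material the characters module works with, you can watch a precomposed Greek vowel fall apart under Unicode normalization using nothing but the standard library (this is plain unicodedata, not the library’s own API):

```python
import unicodedata

# Alpha with smooth breathing and acute accent, one code point:
ch = "ἄ"  # U+1F04

for c in unicodedata.normalize("NFD", ch):
    print(f"U+{ord(c):04X} {unicodedata.name(c)}")
# → U+03B1 GREEK SMALL LETTER ALPHA
# → U+0313 COMBINING COMMA ABOVE
# → U+0301 COMBINING ACUTE ACCENT
```

Adding, removing, and testing those combining marks (breathing, accent, iota subscript, length) is exactly the territory the characters module covers.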

Another name from my past and a welcome reminder that not all of computer science is focused on recommending ephemera for our consumption.

Free & Interactive Online Introduction to LaTeX

Thursday, July 28th, 2016

Free & Interactive Online Introduction to LaTeX by John Lees-Miller.

From the webpage:

Part 1: The Basics

Welcome to the first part of our free online course to help you learn LaTeX. If you have never used LaTeX before, or if it has been a while and you would like a refresher, this is the place to start. This course will get you writing LaTeX right away with interactive exercises that can be completed online, so you don’t have to download and install LaTeX on your own computer.

In this part of the course, we’ll take you through the basics of how LaTeX works, explain how to get started, and go through lots of examples. Core LaTeX concepts, such as commands, environments, and packages, are introduced as they arise. In particular, we’ll cover:

  • Setting up a LaTeX Document
  • Typesetting Text
  • Handling LaTeX Errors
  • Typesetting Equations
  • Using LaTeX Packages

In part two and part three, we’ll build up to writing beautiful structured documents with figures, tables and automatic bibliographies, and then show you how to apply the same skills to make professional presentations with beamer and advanced drawings with TikZ. Let’s get started!
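If you want to see where you will end up after Part 1, a minimal document exercising those five topics looks something like this (my own toy example, not taken from the course):

```latex
\documentclass{article}
\usepackage{amsmath} % a first taste of packages

\begin{document}

\section{Hello, \LaTeX}

Ordinary text is typeset as you write it, while commands such as
\emph{this one} control formatting. Equations are first class:
\begin{equation}
  e^{i\pi} + 1 = 0.
\end{equation}

\end{document}
```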

Since I mentioned fonts earlier today, Learning a Manifold of Fonts, it seems only fair to post about the only typesetting language that can take full advantage of any font you care to use.

TeX was released in 1978 and it has yet to be equaled by any non-TeX/LaTeX system.

It’s almost forty (40) years old, widely used and still sui generis.


MorganaXProc

Thursday, July 28th, 2016


From the webpage:

MorganaXProc is an implementation of W3C’s XProc: An XML Pipeline Language written in Java™. It is free software, released under GNU General Public License version 2.0 (GPLv2).

The current version is 0.95 (public beta). It is very close to the recommendation with all related tests of the XProc Test Suite passed.

News: MorganaXProc 0.95-11 released

You can follow <xml-project/> on Twitter: @xml_project and peruse their documentation.

I haven’t worked my way through A User’s Guide to MorganaXProc but it looks promising.


Entropy Explained, With Sheep

Thursday, July 28th, 2016

Entropy Explained, With Sheep by Aatish Bhatia.

Entropy is relevant to information theory and encryption (think Shannon), but I mention it here because of the cleverness of the explanation.

Aatish sets a very high bar for taking a difficult concept and creating a compelling explanation that does not involve hand-waving and/or leaps of faith on the part of the reader.

Highly recommended as a model for explanation!
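The information-theoretic cousin of Aatish’s sheep is Shannon entropy, H = −Σ p·log₂(p), which takes only a few lines to compute:

```python
from math import log2

def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution; zero-probability
    outcomes contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))  # fair coin → 1.0
print(shannon_entropy([0.25] * 4))  # two fair coins → 2.0
print(shannon_entropy([0.9, 0.1]))  # biased coin → about 0.47
```

The biased coin carries less than half a bit per flip, which is the same “fewer ways to arrange it” intuition the sheep are selling.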


What That Election Probability Means
[500 Simulated Clinton-Trump Elections]

Thursday, July 28th, 2016

What That Election Probability Means by Nathan Yau.

From the post:

We now have our presidential candidates, and for the next few months you get to hear about the changing probability of Hillary Clinton and Donald Trump winning the election. As of this writing, the Upshot estimates a 68% probability for Clinton and 32% for Donald Trump. FiveThirtyEight estimates 52% and 48% for Clinton and Trump, respectively. Forecasts are kind of all over the place this far out from November. Plus, the numbers aren’t especially accurate post-convention.

But the probabilities will start to converge and grow more significant.

So what does it mean when Clinton has a 68% chance of becoming president? What if there were a 90% chance that Trump wins?

Some interpret a high percentage as a landslide, which often isn’t the case with these election forecasts, and it certainly doesn’t mean the candidate with a low chance will lose. If this were the case, the Cleveland Cavaliers would not have beaten the Golden State Warriors, and I would not be sitting here hating basketball.

Fiddle with the probabilities in the graphic below to see what I mean.

As always, visualizations from Nathan are a joy to view and valuable in practice.

You need to run it several times but here’s the result I got with “FiveThirtyEight estimates 52% and 48% for Clinton and Trump, respectively.”
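The gist of the graphic is easy to reproduce: treat each simulated election as one biased coin flip and count the split over 500 runs. A sketch of my own (not Nathan’s code), using the FiveThirtyEight 52/48 figures:

```python
import random

def simulate_elections(p_clinton, n=500, seed=538):
    """Count Clinton vs. Trump wins across n simulated elections,
    each an independent draw with the given win probability."""
    rng = random.Random(seed)
    clinton = sum(rng.random() < p_clinton for _ in range(n))
    return clinton, n - clinton

clinton_wins, trump_wins = simulate_elections(0.52)
print(clinton_wins, trump_wins)
```

Run it with a few different seeds and the point of the post becomes obvious: a 52% probability is nowhere near a landslide, since Trump wins a large share of the simulated elections.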


You have to wonder what a similar simulation for breach/no-breach would look like for your enterprise?

Would that be an effective marketing tool for cybersecurity?

Perhaps not if you are putting insecure code on top of insecure code but there are other solutions.

For example, having state legislatures prohibit the operation of escape-from-liability clauses in EULAs.

Assuming, that is, that someone has read one in sufficient detail to draft legislation. 😉

That could be an interesting data project. Anyone have a pointer to a collection of EULAs?

Saxon-JS – Beta Release (EE-License)

Thursday, July 28th, 2016


From the webpage:

Saxon-JS is an XSLT 3.0 run-time written in pure JavaScript. It’s designed to execute Stylesheet Export Files compiled by Saxon-EE.

The first beta release is Saxon-JS 0.9 (released 28 July 2016), for use on web browsers. This can be used with Saxon-EE or later.

The beta release has been tested with current versions of Safari, Firefox, and Chrome browsers. It is known not to work under Internet Explorer. Browser support will be extended in future releases. Please let us know of any problems.

Saxon-JS documentation.

Goodies from the documentation:

Because people want to write rich interactive client-side applications, Saxon-JS does far more than simply converting XML to HTML, in the way that the original client-side XSLT 1.0 engines did. Instead, the stylesheet can contain rules that respond to user input, such as clicking on buttons, filling in form fields, or hovering the mouse. These events trigger template rules in the stylesheet which can be used to read additional data and modify the content of the HTML page.

We’re talking here primarily about running Saxon-JS in the browser. However, it’s also capable of running in server-side JavaScript environments such as Node.js (not yet fully supported in this beta release).

Grab a copy to get ready for discussions at Balisage!

Web Design in 4 minutes

Thursday, July 28th, 2016

Web Design in 4 minutes by Jeremy Thomas.

From the post:

Let’s say you have a product, a portfolio, or just an idea you want to share with everyone on your own website. Before you publish it on the internet, you want to make it look attractive, professional, or at least decent to look at.

What is the first thing you need to work on?

This is more for me than you, especially if you consider my much neglected homepage.

Over the years my blog has consumed far more of my attention than my website.

I have some new, longer material that is more appropriate for the website so this post is a reminder to me to get my act together over there!

Other web design resource suggestions welcome!

Learning a Manifold of Fonts

Thursday, July 28th, 2016

Learning a Manifold of Fonts by Neill D.F. Campbell and Jan Kautz.


The design and manipulation of typefaces and fonts is an area requiring substantial expertise; it can take many years of study to become a proficient typographer. At the same time, the use of typefaces is ubiquitous; there are many users who, while not experts, would like to be more involved in tweaking or changing existing fonts without suffering the learning curve of professional typography packages.

Given the wealth of fonts that are available today, we would like to exploit the expertise used to produce these fonts, and to enable everyday users to create, explore, and edit fonts. To this end, we build a generative manifold of standard fonts. Every location on the manifold corresponds to a unique and novel typeface, and is obtained by learning a non-linear mapping that intelligently interpolates and extrapolates existing fonts. Using the manifold, we can smoothly interpolate and move between existing fonts. We can also use the manifold as a constraint that makes a variety of new applications possible. For instance, when editing a single character, we can update all the other glyphs in a font simultaneously to keep them compatible with our changes.

To get a realistic feel for this proposal, try the interactive demo!

One major caveat:

In another lifetime, I contacted John Hudson of Tiro Typeworks about the development of the SBL Font series:


The origins of that project are not reflected on the SBL webpage, but the difference between John’s work and that of non-professional typographers is obvious even to untrained readers.

Nothing against experimentation with fonts, but realize that for truly professional results, you need to hire professionals who live and breathe the development of high-quality fonts.

First Steps In The 30K Hillary Clinton Email Hunt

Wednesday, July 27th, 2016

No, no tips from “Russian hackers,” but rather from the fine staff at the Wall Street Journal (WSJ).

You may have heard of the WSJ. So far as I know, they have never been accused of collaboration with Russian hackers, Putin or the KGB.

Anyway, the WSJ posted: Get and analyze Hillary Clinton’s email, which reads in part as follows:

In response to a public records request, the U.S. State Department is releasing Hillary Clinton’s email messages from her time as secretary of state. Every month, newly released messages are posted to as PDFs, with some metadata.

This collection of tools automates downloading and helps analyze the messages. The Wall Street Journal’s interactive graphics team uses some of this code to power our Clinton inbox search interactive.

Great step-by-step instructions on getting set up to analyze Clinton’s emails, with the one caveat that I had to change:

pip install virtualenv


sudo pip install virtualenv

With that one change, everything ran flawlessly on my Ubuntu 14.04 box.

Go ahead and get set up to analyze the emails.

Tomorrow: Clues from this data set to help in the hunt for the 30K deleted Hillary Clinton emails.

The Right to be Forgotten in the Media: A Data-Driven Study

Wednesday, July 27th, 2016

The Right to be Forgotten in the Media: A Data-Driven Study.


Due to the recent “Right to be Forgotten” (RTBF) ruling, for queries about an individual, Google and other search engines now delist links to web pages that contain “inadequate, irrelevant or no longer relevant, or excessive” information about that individual. In this paper we take a data-driven approach to study the RTBF in the traditional media outlets, its consequences, and its susceptibility to inference attacks. First, we do a content analysis on 283 known delisted UK media pages, using both manual investigation and Latent Dirichlet Allocation (LDA). We find that the strongest topic themes are violent crime, road accidents, drugs, murder, prostitution, financial misconduct, and sexual assault. Informed by this content analysis, we then show how a third party can discover delisted URLs along with the requesters’ names, thereby putting the efficacy of the RTBF for delisted media links in question. As a proof of concept, we perform an experiment that discovers two previously-unknown delisted URLs and their corresponding requesters. We also determine 80 requesters for the 283 known delisted media pages, and examine whether they suffer from the “Streisand effect,” a phenomenon whereby an attempt to hide a piece of information has the unintended consequence of publicizing the information more widely. To measure the presence (or lack of presence) of a Streisand effect, we develop novel metrics and methodology based on Google Trends and Twitter data. Finally, we carry out a demographic analysis of the 80 known requesters. We hope the results and observations in this paper can inform lawmakers as they refine RTBF laws in the future.

Not collecting data prior to laws and policies seems to be a trademark of the legislative process.

Otherwise, the “Right to be Forgotten” (RTBF) nonsense, which only impacts searching, and then only in particular ways, could have been avoided.

The article does helpfully outline how to discover delistings, of which they discovered 283 known delisted links.

Seriously? Considering that Facebook has 1 Billion+ users, much ink and electrons are being spilled over a minimum of 283 delisted links?

It’s time for the EU to stop looking for mites and mole hills to attack.

Especially since they are likely to resort to outright censorship as their next move.

That always ends badly.

The Hillary Clinton 30K Email Hunt – Defend Your Nation’s Honor – Enter Today!

Wednesday, July 27th, 2016

Would-be strongman (US President) Donald Trump insulted North Korean, Chinese, and East European hackers today, to say nothing of American ones:

Donald J. Trump said Wednesday that he hoped Russia had hacked Hillary Clinton’s email, essentially encouraging an adversarial foreign power’s cyberspying on a secretary of state’s correspondence.

“Russia, if you’re listening, I hope you’re able to find the 30,000 emails that are missing,” Mr. Trump said, staring directly into the cameras. “I think you will probably be rewarded mightily by our press.”

(Donald Trump Calls on Russia to Find Hillary Clinton’s Missing Emails by Ashley Parker.)

Russia’s name has been thrown around recently, like the “usual suspects” in Casablanca, but that’s no excuse for Trump to insult other worthy hackers.

No slight to Russian hackers but an open competition between all hackers is the best way to find the 30K deleted Clinton emails.

Trump hasn’t offered a cash prize but think of the street cred you would earn for your nation/group!

Don’t limit yourself to the deleted emails.

Making Clinton’s campaign security the equivalent of an extreme string bikini results in bragging rights as well.

Gasp! “The Jihadists’ Digital Toolbox:…”

Tuesday, July 26th, 2016

The Jihadists’ Digital Toolbox: How ISIS Keeps Quiet on the Web by Jett Goldsmith.

From the post:

As the world dives deeper into the digital age, jihadist groups like ISIS and the Taliban have taken increasingly diverse measures to secure their communications and espouse their actions and ideas across the planet.

Propaganda has been a key measure of any jihadist group’s legitimacy since at least 2001, when al-Qaeda operative Adam Yahiye Gadahn established the media house As-Sahab, which was intended to spread the group’s message to a regional audience throughout Pakistan and Afghanistan.

Over the years, jihadist propaganda has taken a broader and more sophisticated tone. Al-Qaeda published the first issue of its digital newsmagazine, Inspire, in June of 2010. Inspire was aimed at an explicitly Western audience, and intended to call to jihad the would-be mujahideen throughout Europe and the United States.

When ISIS first took hold in Iraq and Syria, and formally declared its caliphate in the summer of 2014, the group capitalized on the groundwork laid by its predecessors and established an expansive, highly sophisticated media network to espouse its ideology. The group established local wilayat (provincial) media hubs, and members of its civil service distributed weekly newsletters, pamphlets, and magazines to citizens living under its caliphate. Billboards were posted in major cities under its control, including in Raqqah and Mosul; FM band radio broadcasts across 13 of its provinces were set up to deliver a variety of content, from fatwas and sharia lessons to daily news, poetry, and nasheeds; and Al-Hayat Media Center distributed its digital newsmagazine, Dabiq, in over a dozen languages to followers across the world.

Jett covers:

  • Secure Browsers
  • Proxy Servers and VPNs
  • Propaganda Apps (read cellphone apps)
  • Encrypted Email
  • Mobile Privacy Apps
  • Encrypted Messages

That Jihadists, or anyone else, are using these tools may be a surprise to some Fortune or Economist readers, but every conscious person associated with IT can probably name one or more instances of each category.

I’m sure some Jihadists drive cars, ride hoverboards, or bicycles, but dramatic recitations about those don’t advance a discussion of Jihadists or their goals.

Privacy software is a fact of life in all walks and levels of a digital environment.

Crying “Look! Over there! Someone might be doing something we don’t like!” isn’t going to lead to any useful answers, to anything. Including Jihadists.

PornHub Payday! $20,000!

Monday, July 25th, 2016

PornHub Pays Hackers $20,000 to Find Zero-day Flaws in its Website by Wang Wei.

From the post:

Cyber attacks get bigger, smarter, more damaging.

PornHub launched its bug bounty program two months ago to encourage hackers and bug bounty hunters to find and responsibly report flaws in its services and get rewarded.

Now, it turns out that the world’s most popular pornography site has paid its first bounty payout. But how much?

US $20,000!

Not every day that a porn site pays users!

While PHP has fixed the issue, be mindful there are plenty of unpatched versions of PHP in the wild.

Details of this attack can be found at: How we broke PHP, hacked Pornhub and earned $20,000 and Fuzzing Unserialize.

Any estimate of how many non-patched PHP installations are on sites ending in .gov or .com?

Accessing IRS 990 Filings (Old School)

Monday, July 25th, 2016

Like many others, I was glad to see: IRS 990 Filings on AWS.

From the webpage:

Machine-readable data from certain electronic 990 forms filed with the IRS from 2011 to present are available for anyone to use via Amazon S3.

Form 990 is the form used by the United States Internal Revenue Service to gather financial information about nonprofit organizations. Data for each 990 filing is provided in an XML file that contains structured information that represents the main 990 form, any filed forms and schedules, and other control information describing how the document was filed. Some non-disclosable information is not included in the files.

This data set includes Forms 990, 990-EZ and 990-PF which have been electronically filed with the IRS and is updated regularly in an XML format. The data can be used to perform research and analysis of organizations that have electronically filed Forms 990, 990-EZ and 990-PF. Forms 990-N (e-Postcard) are not available within this data set. Forms 990-N can be viewed and downloaded from the IRS website.

I could use AWS but I’m more interested in deep analysis of a few returns than analysis of the entire dataset.

Fortunately the webpage continues:

An index listing all of the available filings is available at s3://irs-form-990/index.json. This file includes basic information about each filing including the name of the filer, the Employer Identification Number (EIN) of the filer, the date of the filing, and the path to download the filing.

All of the data is publicly accessible via the S3 bucket’s HTTPS endpoint at No authentication is required to download data over HTTPS. For example, the index file can be accessed at and the example filing mentioned above can be accessed at (emphasis in original).

I open a terminal window and type:


which as of today, results in:

-rw-rw-r-- 1 patrick patrick 1036711819 Jun 16 10:23 index.json

A trial grep:

grep "NATIONAL RIFLE" index.json > nra.txt

Which produces:

{"EIN": "530116130", "SubmittedOn": "2014-11-25", "TaxPeriod": "201312", "DLN": "93493309004174", "LastUpdated": "2016-03-21T17:23:53", "URL": "", "FormType": "990", "ObjectId": "201423099349300417", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "530116130", "SubmittedOn": "2013-12-20", "TaxPeriod": "201212", "DLN": "93493260005203", "LastUpdated": "2016-03-21T17:23:53", "URL": "", "FormType": "990", "ObjectId": "201302609349300520", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "530116130", "SubmittedOn": "2012-12-06", "TaxPeriod": "201112", "DLN": "93493311011202", "LastUpdated": "2016-03-21T17:23:53", "URL": "", "FormType": "990", "ObjectId": "201203119349301120", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "396056607", "SubmittedOn": "2011-05-12", "TaxPeriod": "201012", "FormType": "990EZ", "LastUpdated": "2016-06-14T01:22:09.915971Z", "OrganizationName": "EAU CLAIRE NATIONAL RIFLE CLUB", "IsElectronic": false, "IsAvailable": false},
{"EIN": "530116130", "SubmittedOn": "2011-11-09", "TaxPeriod": "201012", "DLN": "93493270005081", "LastUpdated": "2016-03-21T17:23:53", "URL": "", "FormType": "990", "ObjectId": "201132709349300508", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "530116130", "SubmittedOn": "2016-01-11", "TaxPeriod": "201412", "DLN": "93493259005035", "LastUpdated": "2016-04-29T13:40:20", "URL": "", "FormType": "990", "ObjectId": "201532599349300503", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},

We have one errant result, the “EAU CLAIRE NATIONAL RIFLE CLUB,” so let’s delete it and re-order by year; the NATIONAL RIFLE ASSOCIATION OF AMERICA results then read (most recent to oldest):

{"EIN": "530116130", "SubmittedOn": "2016-01-11", "TaxPeriod": "201412", "DLN": "93493259005035", "LastUpdated": "2016-04-29T13:40:20", "URL": "", "FormType": "990", "ObjectId": "201532599349300503", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "530116130", "SubmittedOn": "2014-11-25", "TaxPeriod": "201312", "DLN": "93493309004174", "LastUpdated": "2016-03-21T17:23:53", "URL": "", "FormType": "990", "ObjectId": "201423099349300417", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "530116130", "SubmittedOn": "2013-12-20", "TaxPeriod": "201212", "DLN": "93493260005203", "LastUpdated": "2016-03-21T17:23:53", "URL": "", "FormType": "990", "ObjectId": "201302609349300520", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "530116130", "SubmittedOn": "2012-12-06", "TaxPeriod": "201112", "DLN": "93493311011202", "LastUpdated": "2016-03-21T17:23:53", "URL": "", "FormType": "990", "ObjectId": "201203119349301120", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
{"EIN": "530116130", "SubmittedOn": "2011-11-09", "TaxPeriod": "201012", "DLN": "93493270005081", "LastUpdated": "2016-03-21T17:23:53", "URL": "", "FormType": "990", "ObjectId": "201132709349300508", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA", "IsElectronic": true, "IsAvailable": true},
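The delete-and-re-order step can be sketched with standard tools. A toy stand-in for nra.txt (EINs and tax periods copied from the listing above, other fields trimmed) makes the pipeline self-contained:

```shell
# a toy stand-in for the nra.txt produced by the grep step above
cat > nra.txt <<'EOF'
{"EIN": "530116130", "TaxPeriod": "201312", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA"},
{"EIN": "396056607", "TaxPeriod": "201012", "OrganizationName": "EAU CLAIRE NATIONAL RIFLE CLUB"},
{"EIN": "530116130", "TaxPeriod": "201412", "OrganizationName": "NATIONAL RIFLE ASSOCIATION OF AMERICA"},
EOF

# drop the errant club, then sort newest tax period first:
# decorate each line with its TaxPeriod value, sort on it, strip the decoration
grep -v 'EAU CLAIRE' nra.txt \
  | sed 's/.*"TaxPeriod": "\([0-9]*\)".*/\1|&/' \
  | sort -r \
  | cut -d'|' -f2- > nra-sorted.txt

head -n 1 nra-sorted.txt   # the 201412 filing now comes first
```

The decorate/sort/undecorate trick avoids needing a JSON-aware tool for a quick reshuffle.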

Of course, now you want the XML 990 returns, so extract the URLs for the 990s to a file, here nra-urls.txt (I would use awk if it is more than a handful):
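For more than a handful, the awk step might look like the sketch below. The URLs in the sample are invented for illustration; in the live index each record carries the real download path:

```shell
# toy stand-in for nra.txt; the URLs here are made up for illustration
cat > nra.txt <<'EOF'
{"EIN": "530116130", "URL": "https://example.com/201423099349300417_public.xml", "FormType": "990"},
{"EIN": "530116130", "URL": "https://example.com/201532599349300503_public.xml", "FormType": "990"},
EOF

# split on double quotes and print the value two fields past each "URL" key
awk -F'"' '{ for (i = 1; i <= NF; i++) if ($i == "URL") print $(i + 2) }' nra.txt > nra-urls.txt

cat nra-urls.txt
```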

Back to wget:

wget -i nra-urls.txt


-rw-rw-r-- 1 patrick patrick 111798 Mar 21 16:12 201132709349300508_public.xml
-rw-rw-r-- 1 patrick patrick 123490 Mar 21 19:47 201203119349301120_public.xml
-rw-rw-r-- 1 patrick patrick 116786 Mar 21 22:12 201302609349300520_public.xml
-rw-rw-r-- 1 patrick patrick 122071 Mar 21 15:20 201423099349300417_public.xml
-rw-rw-r-- 1 patrick patrick 132081 Apr 29 10:10 201532599349300503_public.xml

Ooooh, it’s in XML! 😉

For the XML you are going to need: Current Valid XML Schemas and Business Rules for Exempt Organizations Modernized e-File, not to mention a means of querying the data (may I suggest XQuery?).

Once you have the index.json file, with grep, a little awk and wget, you can quickly explore IRS 990 filings for further analysis or to prepare queries for running on AWS (such as discovery of common directors, etc.).
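For a quick peek before setting up a proper XQuery workflow, plain grep will do. The fragment below is invented for illustration and the element name is only a guess; consult the IRS schemas linked above for the real tags:

```shell
# a tiny, made-up 990 fragment; real filings follow the IRS MeF schemas
cat > sample_990.xml <<'EOF'
<Return xmlns="http://www.irs.gov/efile">
  <ReturnHeader>
    <Filer>
      <BusinessName>EXAMPLE ASSOCIATION OF AMERICA</BusinessName>
    </Filer>
  </ReturnHeader>
</Return>
EOF

# crude, namespace-blind peek at a single element's contents
grep -o '<BusinessName>[^<]*</BusinessName>' sample_990.xml
# -> <BusinessName>EXAMPLE ASSOCIATION OF AMERICA</BusinessName>
```

This is fine for triage; anything serious (nested schedules, repeated elements) wants a real XML query tool.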


Software Heritage – Universal Software Archive – Indexing/Semantic Challenges

Sunday, July 24th, 2016

Software Heritage

From the homepage:

We collect and preserve software in source code form, because software embodies our technical and scientific knowledge and humanity cannot afford the risk of losing it.

Software is a precious part of our cultural heritage. We curate and make accessible all the software we collect, because only by sharing it can we guarantee its preservation in the very long term.
(emphasis in original)

The project has already collected:

Even though we just got started, we have already ingested in the Software Heritage archive a significant amount of source code, possibly assembling the largest source code archive in the world. The archive currently includes:

  • public, non-fork repositories from GitHub
  • source packages from the Debian distribution (as of August 2015, via the snapshot service)
  • tarball releases from the GNU project (as of August 2015)

We currently keep up with changes happening on GitHub, and are in the process of automating syncing with all the above source code origins. In the future we will add many more origins and ingest into the archive software that we have salvaged from recently disappeared forges. The figures below allow to peek into the archive and its evolution over time.

The charters of the planned working groups:

Extending the archive

Evolving the archive

Connecting the archive

Using the archive

on quick review, did not seem to me to address the indexing and semantic challenges that searching such an archive will pose.

If you are familiar with the differences in metacharacters between different Unix programs, that is only a taste of the differences that will be faced when searching such an archive.

Looking forward to learning more about this project!

Wikileaks Mentions In DNC Email – .000718%. Hillary To/From Emails – .000000% (RDON)

Saturday, July 23rd, 2016

Cryptome tweeted today:


Would you believe that Hillary Clinton is more irrelevant than Wikileaks?

Consider the evidence:

Search for at Search the DNC email database

Scrape the 533 results, as of Saturday, 23 July 2016, into a file.

Grep for and pipe that to another file.

Clean out the remaining markup, insert line returns for commas in cc: field, lowercase and sort, then uniq.
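The cleanup steps above can be sketched as a single pipeline. The addresses in the sample are hypothetical stand-ins for the scraped results:

```shell
# hypothetical stand-in for the scraped sender/cc text
cat > scraped.txt <<'EOF'
Doe.J@dnc.example, Roe.R@dnc.example
roe.r@dnc.example
EOF

# one address per line, trimmed, lowercased, sorted, de-duplicated
tr ',' '\n' < scraped.txt \
  | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
  | tr '[:upper:]' '[:lower:]' \
  | sed '/^$/d' \
  | sort \
  | uniq > senders.txt

wc -l < senders.txt   # 2 unique addresses
```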


  1. – Adrienne K. Elrod
  2. – never a sender
  3. – Dennis Cheng
  4. – never a sender
  5. – Justin Klein
  6. – Josh Schwerin
  7. – Kathleen Gasperine
  8. – Lindsay Roitman
  9. – never a sender
  10. – Mary Rutherford Jennings
  11. – no author
  12. – 1 post, no sig
  13. – Zac Petkanas

That’s right! From January of 2015 until May of 2016, Hillary Clinton apparently had no emails to or from the DNC.

I find that to be unlikely to say the least.

What’s your explanation for the absence of Hillary Clinton emails to and from the DNC?

My explanation is that Wikileaks is manipulating both the data and all of us.

Here’s a motto for data leaks: Raw Data Or Nothing (RDON)

Say it, repeat it, demand it – RDON!

Yes Luis, There Is A Fuck You Emoji

Friday, July 22nd, 2016

Luis Miranda, Communications Director of the DNC asks:


Yes, there is a Fuck You emoji!

For example, here is the Google version:


I don’t know if Luis is still looking for an answer to that question but if so, consider it answered!

Searching the DNC email database can be amusing, even educational, as the question from Luis demonstrates, but I would prefer the ability to browse and to download the dataset for deeper analysis.

What have you found in the DNC email database?

Write Chelsea Manning

Friday, July 22nd, 2016

Write Chelsea Manning

From the post:

Thank you for supporting WikiLeaks whistle-blower US Army Private Chelsea (formerly Bradley) Manning! You can write her today. As of April 23, 2014, a Kansas district judge has approved PVT Manning’s request for legal name change, and you can address your envelopes to her as “Chelsea E. Manning.”

Mail must be addressed exactly as follows:


Notes regarding this address:

  • Do not include a hash (“#”) in front of Manning’s inmate number.
  • Do not include any title in front of Manning’s name, such as “Ms.,” “Mr.,” “PVT,” “PFC,” etc.
  • Do not include any additional information in the address, such as “US Army” or “US Disciplinary Barracks.”
  • Do not modify the address to conform to USPS standards, such as abbreviating “North,” “Road,” “Fort,” or “Kansas.”
  • For international mail, either “USA” or “UNITED STATES OF AMERICA” are acceptable on a separate line.

What you can send Chelsea

Chelsea Manning is currently eligible to receive mail, including birthday or holiday cards, from anyone who wishes to write. You are also permitted to mail unframed photographs. …

I contacted the project and was advised that the best gift for Chelsea is:

…money order or cashiers check made out to “Chelsea E. Manning” and mailed to her postal address. These funds will be deposited into Chelsea’s prison account. She uses this account to make phone calls, purchase stamps, and buy other small comfort items not provided by the prison.

Let Chelsea know you appreciate her bravery and sacrifice!

Introspection For Your iPhone (phone security)

Thursday, July 21st, 2016

Against the Law: Countering Lawful Abuses of Digital Surveillance by Andrew “bunnie” Huang and Edward Snowden.

From the post:

Front-line journalists are high-value targets, and their enemies will spare no expense to silence them. Unfortunately, journalists can be betrayed by their own tools. Their smartphones are also the perfect tracking device. Because of the precedent set by the US’s “third-party doctrine,” which holds that metadata on such signals enjoys no meaningful legal protection, governments and powerful political institutions are gaining access to comprehensive records of phone emissions unwittingly broadcast by device owners. This leaves journalists, activists, and rights workers in a position of vulnerability. This work aims to give journalists the tools to know when their smart phones are tracking or disclosing their location when the devices are supposed to be in airplane mode. We propose to accomplish this via direct introspection of signals controlling the phone’s radio hardware. The introspection engine will be an open source, user-inspectable and field-verifiable module attached to an existing smart phone that makes no assumptions about the trustability of the phone’s operating system.

If that sounds great, you have to love their requirements:

Our introspection engine is designed with the following goals in mind:

  1. Completely open source and user-inspectable (“You don’t have to trust us”)
  2. Introspection operations are performed by an execution domain completely separated from the phone’s CPU (“don’t rely on those with impaired judgment to fairly judge their state”)
  3. Proper operation of introspection system can be field-verified (guard against “evil maid” attacks and hardware failures)
  4. Difficult to trigger a false positive (users ignore or disable security alerts when there are too many positives)
  5. Difficult to induce a false negative, even with signed firmware updates (“don’t trust the system vendor” – state-level adversaries with full cooperation of system vendors should not be able to craft signed firmware updates that spoof or bypass the introspection engine)
  6. As much as possible, the introspection system should be passive and difficult to detect by the phone’s operating system (prevent black-listing/targeting of users based on introspection engine signatures)
  7. Simple, intuitive user interface requiring no specialized knowledge to interpret or operate (avoid user error leading to false negatives; “journalists shouldn’t have to be cryptographers to be safe”)
  8. Final solution should be usable on a daily basis, with minimal impact on workflow (avoid forcing field reporters into the choice between their personal security and being an effective journalist)

This work is not just an academic exercise; ultimately we must provide a field-ready introspection solution to protect reporters at work.

You need to copy those eight requirements out to a file for editing. When anyone proposes a cybersecurity solution, reword them as appropriate to serve as your own user requirements.

An artist’s conception of what protection for an iPhone might look like:


Interested in protecting reporters and personal privacy? Follow Andrew ‘bunnie’ Huang’s blog.