Archive for September, 2016

Version 2 of the Hubble Source Catalog [Model For Open Access – Attn: Security Researchers]

Friday, September 30th, 2016

Version 2 of the Hubble Source Catalog

From the post:

The Hubble Source Catalog (HSC) is designed to optimize science from the Hubble Space Telescope by combining the tens of thousands of visit-based source lists in the Hubble Legacy Archive (HLA) into a single master catalog.

Version 2 includes:

  • Four additional years of ACS source lists (i.e., through June 9, 2015). All ACS source lists go deeper than in version 1. See current HLA holdings for details.
  • One additional year of WFC3 source lists (i.e., through June 9, 2015).
  • Cross-matching between HSC sources and spectroscopic COS, FOS, and GHRS observations.
  • Availability of magauto values through the MAST Discovery Portal. The maximum number of sources displayed has increased from 10,000 to 50,000.

The HSC v2 contains members of the WFPC2, ACS/WFC, WFC3/UVIS and WFC3/IR Source Extractor source lists from HLA version DR9.1 (data release 9.1). The crossmatching process involves adjusting the relative astrometry of overlapping images so as to minimize positional offsets between closely aligned sources in different images. After correction, the astrometric residuals of crossmatched sources are significantly reduced, to typically less than 10 mas. The relative astrometry is supported by using Pan-STARRS, SDSS, and 2MASS as the astrometric backbone for initial corrections. In addition, the catalog includes source nondetections. The crossmatching algorithms and the properties of the initial (Beta 0.1) catalog are described in Budavari & Lubow (2012).
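(An aside, not part of the quoted post.) The positional matching that crossmatching depends on can be illustrated with a toy sketch. It is not the HSC algorithm, which is described in Budavari & Lubow (2012); it only shows how small a 10 mas offset is and how a nearest-neighbor match within a tolerance works. The function names and the tolerance value are illustrative assumptions.

```python
import math

MAS_PER_DEGREE = 3_600_000.0  # 1 degree = 3600 arcsec = 3.6 million mas


def separation_mas(ra1, dec1, ra2, dec2):
    """Angular separation in milliarcseconds between two positions given
    in degrees, using the haversine formula (stable for tiny angles)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h))) * MAS_PER_DEGREE


def nearest_match(source, candidates, max_sep_mas):
    """Return (index, separation_mas) of the closest candidate within
    max_sep_mas of source, or None if no candidate is close enough.
    Positions are (ra, dec) tuples in degrees."""
    best = None
    for i, (ra, dec) in enumerate(candidates):
        sep = separation_mas(source[0], source[1], ra, dec)
        if sep <= max_sep_mas and (best is None or sep < best[1]):
            best = (i, sep)
    return best
```

Calling `nearest_match((10.0, 20.0), detections, max_sep_mas=50.0)` on a hypothetical list of detections returns the closest one inside the 50 mas radius, a radius chosen here purely for illustration.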


There are currently three ways to access the HSC as described below. We are working towards having these interfaces consolidated into one primary interface, the MAST Discovery Portal.

  • The MAST Discovery Portal provides a one-stop web access to a wide variety of astronomical data. To access the Hubble Source Catalog v2 through this interface, select Hubble Source Catalog v2 in the Select Collection dropdown, enter your search target, click search and you are on your way. Please try Use Case Using the Discovery Portal to Query the HSC
  • The HSC CasJobs interface permits you to run large and complex queries, phrased in the Structured Query Language (SQL).
  • HSC Home Page

    – The HSC Summary Search Form displays a single row entry for each object, as defined by a set of detections that have been cross-matched and hence are believed to be a single object. Averaged values for magnitudes and other relevant parameters are provided.

    – The HSC Detailed Search Form displays an entry for each separate detection (or nondetection if nothing is found at that position) using all the relevant Hubble observations for a given object (i.e., different filters, detectors, separate visits).

Amazing isn’t it?

The astronomy community long ago vanquished data hoarding and constructed tools to avoid moving very large data sets across the network.

All while enabling more and not less access and research using the data.

Contrast that to the sorry state of security research, where example code is condemned, if not actually prohibited by law.

Yet, if you believe current news reports (always an iffy proposition), cybercrime is growing by leaps and bounds. (PwC Study: Biggest Increase in Cyberattacks in Over 10 Years)

How successful is the “data hoarding” strategy of the security research community?

Going My Way? – Explore 1.2 billion taxi rides

Friday, September 30th, 2016

Explore 1.2 billion taxi rides by Hannah Judge.

From the post:

Last year the New York City Taxi and Limousine Commission released a massive dataset of pickup and dropoff locations, times, payment types, and other attributes for 1.2 billion trips between 2009 and 2015. The dataset is a model for municipal open data, a tool for transportation planners, and a benchmark for database and visualization platforms looking to test their mettle.

MapD, a GPU-powered database that uses Mapbox for its visualization layer, made it possible to quickly and easily interact with the data. Mapbox enables MapD to display the entire results set on an interactive map. That map powers MapD’s dynamic dashboard, updating the data as you zoom and pan across New York.

Very impressive demonstration of the capabilities of MapD!

Imagine how you can visualize data from your hundreds of users geo-spotting security forces with their smartphones.

Or visualizing data from security forces tracking your citizens.

Technology cuts both ways.

The question is whether the sharper technology sword is going to be in your hands or those of your opponents?

Introducing the Open Images Dataset

Friday, September 30th, 2016

Introducing the Open Images Dataset by Ivan Krasin and Tom Duerig.

From the post:

In the last few years, advances in machine learning have enabled Computer Vision to progress rapidly, allowing for systems that can automatically caption images to apps that can create natural language replies in response to shared photos. Much of this progress can be attributed to publicly available image datasets, such as ImageNet and COCO for supervised learning, and YFCC100M for unsupervised learning.

Today, we introduce Open Images, a dataset consisting of ~9 million URLs to images that have been annotated with labels spanning over 6000 categories. We tried to make the dataset as practical as possible: the labels cover more real-life entities than the 1000 ImageNet classes, there are enough images to train a deep neural network from scratch and the images are listed as having a Creative Commons Attribution license*.

The image-level annotations have been populated automatically with a vision model similar to Google Cloud Vision API. For the validation set, we had human raters verify these automated labels to find and remove false positives. On average, each image has about 8 labels assigned. Here are some examples:

Impressive data set, if you want to recognize a muffin, gherkin, pebble, etc., see the full list at dict.csv.

Hopefully the techniques you develop with these images will lead to more focused image recognition. 😉

I lightly searched the list and no “non-safe” terms jumped out at me. Suitable for family image training.

ggplot2 2.2.0 coming soon! [Testers Needed!]

Friday, September 30th, 2016

ggplot2 2.2.0 coming soon! by Hadley Wickham.

From the post:

I’m planning to release ggplot2 2.2.0 in early November. In preparation, I’d like to announce that a release candidate is now available: version Please try it out, and file an issue on GitHub if you discover any problems. I hope we can find and fix any major issues before the official release.

Install the pre-release version with:

# install.packages("devtools")

If you discover a major bug that breaks your plots, please file a minimal reprex, and then roll back to the released version with:


ggplot2 2.2.0 will be a relatively major release including:

The majority of this work was carried out by Thomas Pederson, who I was lucky to have as my “ggplot2 intern” this summer. Make sure to check out other visualisation packages: ggraph, ggforce, and tweenr.

Just in case you are casual about time, tomorrow is October 1st. Which on most calendars means that “early November” isn’t far off.

Here’s an easy opportunity to test ggplot2 2.2.0 and related visualization packages before the official release.


ORWL – Downside of a Physically Secure Computer

Friday, September 30th, 2016

Meet ORWL. The first open source, physically secure computer


If someone has physical access to your computer with secure documents present, it’s game over! ORWL is designed to solve this as the first open source physically secure computer. ORWL (pronounced or-well) is the combination of the physical security from the banking industry (used in ATMs and Point of Sale terminals) and a modern Intel-based personal computer. We’ve designed a stylish glass case which contains the latest processor from Intel – exactly the same processor as you would find in the latest ultrabooks and we added WiFi and Bluetooth wireless connectivity for your accessories. It also has two USB Type C connectors for any accessories you prefer to connect via cables. We then use the built-in Intel 515 HD Video which can output up to 4K video with audio.

The physical security enhancements we’ve added start with a second authentication factor (wireless keyfob) which is processed before the main processor is even powered up. This ensures we are able to check the system’s software for authenticity and security before we start to run it. We then monitor how far your keyfob is from your PC – when you leave the room, your PC will be locked automatically, requiring the keyfob to unlock it again. We’ve also ensured that all information on the system drive is encrypted via the hardware on which it runs. The encryption key for this information is managed by the secure microcontroller which also handles the pre-boot authentication and other security features of the system. And finally, we protect everything with a high security enclosure (inside the glass) that prevents working around our security by physically accessing hardware components.

Any attempt to get physical access to the internals of your PC will delete the cryptographic key, rendering all your data permanently inaccessible!

The ORWL is a good illustration that good security policies can lead to unforeseen difficulties.

Or as the blog post brags:

Any attempt to get physical access to the internals of your PC will delete the cryptographic key, rendering all your data permanently inaccessible!

All I need do to deprive you of your data (think ransomware), is to physically tamper with your ORWL.

Of interest to journalists who need the ability to deprive others of data on very short notice.

Perhaps a fragile version for journalists and a more abuse-resistant version for the average user.


Multiple Backdoors found in D-Link DWR-932 B LTE Router [There is an upside.]

Thursday, September 29th, 2016

Multiple Backdoors found in D-Link DWR-932 B LTE Router by Swati Khandelwal.

From the post:

If you own a D-Link wireless router, especially DWR-932 B LTE router, you should get rid of it, rather than wait for a firmware upgrade that never lands soon.

D-Link DWR-932B LTE router is allegedly vulnerable to over 20 issues, including backdoor accounts, default credentials, leaky credentials, firmware upgrade vulnerabilities and insecure UPnP (Universal Plug-and-Play) configuration.

If successfully exploited, these vulnerabilities could allow attackers to remotely hijack and control your router, as well as network, leaving all connected devices vulnerable to man-in-the-middle and DNS poisoning attacks.

Moreover, your hacked router can be easily abused by cybercriminals to launch massive Distributed Denial of Service (DDoS) attacks, as the Internet has recently witnessed record-breaking 1 Tbps DDoS attack that was launched using more than 150,000 hacked Internet-connected smart devices.

Security researcher Pierre Kim has discovered multiple vulnerabilities in the D-Link DWR-932B router that’s available in several countries to provide the Internet with an LTE network.

The current list price on this cyber-horror is £95.97. Wow!

Once word spreads about its Swiss-cheese-like security characteristics, one hopes its used price will fall rapidly.

Swati’s post makes the start of a great checklist for grading penetration of the router for exam purposes.


PS: I’m willing to pay $10.00 plus shipping for one. (Contact me for details.)

The Simpsons by the Data [South Park as well]

Thursday, September 29th, 2016

The Simpsons by the Data by Todd Schneider.

From the post:

The Simpsons needs no introduction. At 27 seasons and counting, it’s the longest-running scripted series in the history of American primetime television.

The show’s longevity, and the fact that it’s animated, provides a vast and relatively unchanging universe of characters to study. It’s easier for an animated show to scale to hundreds of recurring characters; without live-action actors to grow old or move on to other projects, the denizens of Springfield remain mostly unchanged from year to year.

As a fan of the show, I present a few short analyses about Springfield, from the show’s dialogue to its TV ratings. All code used for this post is available on GitHub.

Alert! You must run Flash in order to access Simpsons World, the source of Todd’s data.

Advice: Treat Flash as malware and run in a VM.

Todd covers the number of words spoken per character, gender imbalance, focus on characters, viewership, and episode summaries (tf-idf).
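For readers who haven't met tf-idf, here is a minimal sketch in plain Python. It is not Todd's code (his analysis is in R and lives on GitHub), and it assumes the simplest textbook weighting: term frequency times the log of inverse document frequency, with whitespace tokenization.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    tf  = count of the term in the document / document length
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores
```

Run on a few made-up one-line summaries such as `["homer eats donuts", "homer works", "lisa plays saxophone"]`, a term that appears in several summaries ("homer") scores lower than one unique to a single summary ("donuts"), which is why tf-idf surfaces the words that distinguish one episode from the rest.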

Other analysis awaits your imagination and interest.

BTW, if you want comedy data a bit closer to the edge, try Text Mining South Park by Kaylin Walker. Kaylin uses R for her analysis as well.

Other TV programs with R-powered analysis?

Graph Computing with Apache TinkerPop

Thursday, September 29th, 2016

From the description:

Apache TinkerPop serves as an Apache governed, vendor-agnostic, open source initiative providing a standard interface and query language for both OLTP- and OLAP-based graph systems. This presentation will outline the means by which vendors implement TinkerPop and then, in turn, how the Gremlin graph traversal language is able to process the vendor’s underlying graph structure. The material will be presented from the perspective of the DSEGraph team’s use of Apache TinkerPop in enabling graph computing features for DataStax Enterprise customers.


Marko is brutally honest.

He warns the early part of his presentation is stream of consciousness and that is the truth!


That takes you to time mark 11:37, where the description of Gremlin as a language begins.

Marko slows, momentarily, but rapidly picks up speed.

Watch the video, then grab the slides and mark what has captured your interest. Use the slides as your basis for exploring Gremlin and Apache TinkerPop documentation.


Are You A Moral Manipulator?

Thursday, September 29th, 2016

I appreciated Nir’s reminder about the #1 rule for drug dealers.

If you don’t know it, the video is only a little over six minutes long.


Election Prediction and STEM [Concealment of Bias]

Wednesday, September 28th, 2016

Election Prediction and STEM by Sheldon H. Jacobson.

From the post:

Every U.S. presidential election attracts the world’s attention, and this year’s election will be no exception. The decision between the two major party candidates, Hillary Clinton and Donald Trump, is challenging for a number of voters; this choice is resulting in third-party candidates like Gary Johnson and Jill Stein collectively drawing double-digit support in some polls. Given the plethora of news stories about both Clinton and Trump, November 8 cannot come soon enough for many.

In the Age of Analytics, numerous websites exist to interpret and analyze the stream of data that floods the airwaves and newswires. Seemingly contradictory data challenges even the most seasoned analysts and pundits. Many of these websites also employ political spin and engender subtle or not-so-subtle political biases that, in some cases, color the interpretation of data to the left or right.

Undergraduate computer science students at the University of Illinois at Urbana-Champaign manage Election Analytics, a nonpartisan, easy-to-use website for anyone seeking an unbiased interpretation of polling data. Launched in 2008, the site fills voids in the national election forecasting landscape.

Election Analytics lets people see the current state of the election, free of any partisan biases or political innuendos. The methodologies used by Election Analytics include Bayesian statistics, which estimate the posterior distributions of the true proportion of voters that will vote for each candidate in each state, given both the available polling data and the states’ previous election results. Each poll is weighted based on its age and its size, providing a highly dynamic forecasting mechanism as Election Day approaches. Because winning a state translates into winning all the Electoral College votes for that state (with Nebraska and Maine using Congressional districts to allocate their Electoral College votes), winning by one vote or 100,000 votes results in the same outcome in the Electoral College race. Dynamic programming then uses the posterior probabilities to compile a probability mass function for the Electoral College votes. By design, Election Analytics cuts through the media chatter and focuses purely on data.

If you have ever taken a social science methodologies course then you know:

Election Analytics lets people see the current state of the election, free of any partisan biases or political innuendos.

is as false as anything uttered by any of the candidates seeking nomination and/or the office of the U.S. presidency since January 1, 2016.

It’s an annoying conceit when you realize that every poll is biased, however clean the subsequent number crunching may be.

Bias one step removed isn’t the absence of bias, but the concealment of bias.
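The dynamic programming step in the quoted description can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not Election Analytics' code: every state is treated as winner-take-all (ignoring the Nebraska and Maine district rules), states are assumed independent, and the per-state win probabilities are hypothetical inputs standing in for the Bayesian posteriors.

```python
def electoral_vote_pmf(states):
    """Probability mass function of a candidate's electoral vote total.

    states: list of (electoral_votes, win_probability) pairs.
    Returns a list pmf where pmf[k] is the probability of winning
    exactly k electoral votes, assuming states are independent.
    """
    pmf = [1.0]  # before any state is counted: 0 votes with certainty
    for votes, p_win in states:
        new = [0.0] * (len(pmf) + votes)
        for k, mass in enumerate(pmf):
            new[k] += mass * (1.0 - p_win)   # candidate loses this state
            new[k + votes] += mass * p_win   # candidate wins this state
        pmf = new
    return pmf


def win_probability(pmf, votes_needed):
    """Probability of reaching at least votes_needed electoral votes."""
    return sum(pmf[votes_needed:])
```

With two hypothetical states worth 3 and 4 votes, `win_probability(electoral_vote_pmf([(3, 0.5), (4, 0.5)]), 4)` is 0.5. Nudge both inputs to 0.6 and the answer becomes 0.6. The arithmetic is exact either way, yet the output inherits whatever bias the polls behind those inputs carry.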

Meet Apache Spot… [Additional Malware Requirement: Appear Benign]

Wednesday, September 28th, 2016

Meet Apache Spot, a new open source project for cybersecurity by Katherine Noyes.

From the post:

Hard on the heels of the discovery of the largest known data breach in history, Cloudera and Intel on Wednesday announced that they’ve donated a new open source project to the Apache Software Foundation with a focus on using big data analytics and machine learning for cybersecurity.

Originally created by Intel and launched as the Open Network Insight (ONI) project in February, the effort is now called Apache Spot and has been accepted into the ASF Incubator.

“The idea is, let’s create a common data model that any application developer can take advantage of to bring new analytic capabilities to bear on cybersecurity problems,” Mike Olson, Cloudera co-founder and chief strategy officer, told an audience at the Strata+Hadoop World show in New York. “This is a big deal, and could have a huge impact around the world.”

Essentially, it uses machine learning as a filter to separate bad traffic from benign and to characterize network traffic behavior. It also uses a process including context enrichment, noise filtering, whitelisting and heuristics to produce a shortlist of most likely security threats.

Consider the long tail for patch application, as described in Prioritizing Patch Management Critical to Security, which reads in part:

Patch management – two words that are vital to cybersecurity, but that rarely generate enough attention.

That lack of attention can cost. Recent stats from the Verizon Data Breach report showed that many of the most exploited vulnerabilities in 2014 were nearly a decade old, and some were even more ancient than that. Additional numbers from the NTT Group 2015 Global Threat Intelligence Report revealed that 76 percent of vulnerabilities they observed on enterprise networks in 2014 were two years old or more.

Apache Spot is not an immediate threat to hacking success, but that’s no reason to delay sharpening your malware skills.

Beyond making malware seem benign, have you considered making normal application traffic seem rogue?

When security becomes “too burdensome,” uninformed decision makers may do more damage than hackers.

I know machine learning has improved but I find the use case:


at the very best, implausible. 😉

Thoughts on a test environment to mimic target networks?


Oversight Concedes Too Much

Wednesday, September 28th, 2016

It’s deeply ironic that the Electronic Frontier Foundation writes in: Police Around the Country Regularly Abuse Law Enforcement Databases:

The AP investigation builds off more than a year’s worth of research by EFF into the California Law Enforcement Telecommunications System (CLETS). EFF previously found that the oversight body charged with combatting misuse had been systematically giving law enforcement agencies a pass by either failing to make sure agencies filed required misuse data or to hold hearings to get to the bottom of persistent problems with misuse. As EFF reported, confirmed misuse cases have more than doubled in California between 2010 and 2015.

Contrast that post with:

NSA’s Failure to Report Shadow Broker Vulnerabilities Underscores Need for Oversight and What to Do About Lawless Government Hacking and the Weakening of Digital Security, both of which are predicated on what? Oversight.

Sorry, it is one of those “facts” everyone talks about in the presidential debates that both the Senate Select Committee on Intelligence and the House Permanent Select Committee on Intelligence have been, are and in all likelihood will be, failures in terms of oversight of intelligence agencies. One particularly forceful summary of those failures can be found in: A Moon Base, Cyborg Army, and Congress’s Failed Oversight of the NSA by Eli Sugarman.

Eli writes:

Does the U.S. government have a moon base? How about a cyborg army? These questions were not posed by Stephen Colbert but rather by Rep. Justin Amash (R-MI) to highlight the futility of Congress’s intelligence oversight efforts. Amash decried how Congress is unable to reign in troubling NSA surveillance programs because it is not adequately informed about them or permitted to share the minimal information it does know. Congress is instead forced to tease out nuggets of information by playing twenty questions with uncooperative intelligence officials in classified briefings.

Oversight? When the overseen decide if, when, where and how much they will disclose to the overseers?

The EFF and others need to stop conceding the legitimacy of government surveillance and abandon their quixotic quest for implementation of a strategy, oversight, which is known to fail.

For anyone pointing at the latest “terrorism” attack in New York City, consider these stats from the Centers for Disease Control and Prevention (CDC, 2013):

Number of deaths for leading causes of death:

  • Heart disease: 614,348
  • Cancer: 591,699
  • Chronic lower respiratory diseases: 147,101
  • Accidents (unintentional injuries): 136,053
  • Stroke (cerebrovascular diseases): 133,103
  • Alzheimer’s disease: 93,541
  • Diabetes: 76,488
  • Influenza and Pneumonia: 55,227
  • Nephritis, nephrotic syndrome and nephrosis: 48,146
  • Intentional self-harm (suicide): 42,773

Do you see terrorism on that list?

Just so you know, toddlers with guns kill more people in the United States than terrorists.

Without terrorism, one of the knee-jerk justifications for government surveillance vanishes.

The EFF should be challenging the factual basis of government justifications for surveillance one by one.

Conceding that any justification for surveillance exists without contesting its factual basis is equivalent to conceding the existence of an unsupervised surveillance state.

Once surveillance is shown to have no factual justification, then the dismantling of the surveillance state can begin.

Bank Being Held Hostage (or rather its data)

Wednesday, September 28th, 2016

DarkNet Hackers ‘DarkOverlord’ Hack WestPark Capital Bank for Ransom tells a tale of secret/sensitive bank information being stolen and the bank then being threatened with its release unless a ransom is paid.

The hackers have dropped a “sample” of sensitive information, one assumes to prove the hack but also as incentive for WestPark Capital Bank to make payment.

I mention the story because publicly releasing information about the hack seems like an odd strategy for the hackers.

Contrast “holding” a copy of data with the recent spate of ransomware hacks, where victims are denied access to their data entirely. The inability to conduct their regular business provides a powerful incentive for payment of a ransom.

“Holding” a copy of a bank’s data in no way impairs its day-to-day operations. Considering the “normal” activities of banks, shaming for poor security, or anything else, is an unlikely lever to use against a bank.

Clearly a direct payment from WestPark Capital Bank is the preferred solution of ‘DarkOverlord.’

But you have to ask yourself, does WestPark Capital Bank or its customers have greater incentives to prevent release of the data?

Customers of WestPark Capital Bank need to assess their risk of civil and criminal liability from documents held by WestPark and act in their own best interests.

Six lessons from a five-year FOIA battle [Cheat Sheet]

Wednesday, September 28th, 2016

Six lessons from a five-year FOIA battle by Philip Eil

From the post:

I FILED MY FIRST Freedom of Information Act request on February 1, 2012. I was 26 years old, and chasing a story about my father’s med-school classmate, Dr. Paul Volkman, who had been convicted of a massive prescription drug dealing scheme the previous year. The aim of the request was simple: I wanted to see the evidence the jury saw during Volkman’s eight-week trial in Cincinnati for a book I’m writing about the case. (Volkman went to college and medical school with my father.) But everyone I asked—the US district court clerk, the appellate court clerk, the prosecutor, and the judge who presided over the case—declined to give me the documents. It was time to make an official request to the Department of Justice.

To make a very long story short, in March of 2015, that FOIA request turned into my first FOIA lawsuit. And, earlier this month, I received my first FOIA judgment, which, I’m happy to report, is also my first FOIA-lawsuit victory. In a 17-page decision, US District Court Judge Jack McConnell cited Serial and Making A Murderer, wrote “Public scrutiny of judicial proceedings produces a myriad of social benefits,” and ordered the Drug Enforcement Administration to fork over the requested documents within 60 days.

It took more than four and a half years to receive that judgment. And, during that time I began joking that I had attended “FOIA University.” Today, the prospect of postgraduate study looms; if the government appeals, it could extend this ordeal by months, if not years. But, with Judge McConnell’s decision in hand, I’d like to share a few of the things I’ve learned. What follows is the cheat-sheet I wish someone had handed me five years ago.
… (emphasis in the original)

Not only will you find Philip’s “cheat sheet” useful, it is also inspirational.


How media coverage of terrorism endorses a legal [4-Ply or More] standard

Tuesday, September 27th, 2016

How media coverage of terrorism endorses a legal double standard by Rafia Zakaria.

From the post:

On June 17, 2015, Dylann Roof entered a predominantly black church in Charleston, South Carolina, and opened fire. When he was done, nine people lay dead around him. For a few days after Roof’s grisly act, a debate raged in the media over whether the committed white supremacist and mass murderer should be considered a terrorist. Many, including The Washington Post’s Philip Bump, vehemently opposed the label, insisting that even though the Justice Department had dubbed Roof’s killing spree “an act of domestic terrorism,” calling Roof a terrorist would confer upon him the very notoriety he sought.

Like other journalists and analysts, Bump analyzed the sociological and ethical dimensions of the terror label, concerns about whether all who terrify are terrorists, and whether the wider application of the label somehow lessens the potency of the evil it represents. However, like nearly all other journalists who write about terrorism, Bump missed the most crucial point concerning the media’s use of the term: that American law does not currently recognize “domestic terror” as a crime. For an act, however bloody and hateful, to be considered terrorism in the United States, it must be connected to a “foreign” terror organization.

Rafia makes an important point about the “pass” being given to white supremacists, while law abiding Muslims are viewed with suspicion if not being actively persecuted in the United States.

But Rafia misses the opportunity to point to the more than double standard in place for use of “terrorism” and “terrorist.”

What label other than “terrorist” would you apply to the unknown military personnel who attack a known hospital? It has been alleged those responsible have been punished, but then without transparency, how do we know?

Or even the garden variety cruise missile or drone attacks that end the lives of innocents with every strike. Aren’t those acts of terrorism?

Or does “terrorism” require a non-U.S. government actor?

Does that mean only the U.S. government?

How “terrorized” would you be by a phone call followed by a “knock” from a missile on your roof, ordering you to leave immediately?

The claim that this practice is “designed to minimize civilian casualties” sounds like a quote from a modern day Marquis de Sade.

A little introspection by the media could explode the dishonest and manipulative use of the labels “terrorist” and “terrorism.”

Let’s hope that happens sooner rather than later.

Reinforcement Learning: An Introduction

Tuesday, September 27th, 2016

Reinforcement Learning: An Introduction, Second edition by Richard S. Sutton and Andrew G. Barto.

From Chapter 1:

The idea that we learn by interacting with our environment is probably the first to occur to us when we think about the nature of learning. When an infant plays, waves its arms, or looks about, it has no explicit teacher, but it does have a direct sensorimotor connection to its environment. Exercising this connection produces a wealth of information about cause and effect, about the consequences of actions, and about what to do in order to achieve goals. Throughout our lives, such interactions are undoubtedly a major source of knowledge about our environment and ourselves. Whether we are learning to drive a car or to hold a conversation, we are acutely aware of how our environment responds to what we do, and we seek to influence what happens through our behavior. Learning from interaction is a foundational idea underlying nearly all theories of learning and intelligence.

In this book we explore a computational approach to learning from interaction. Rather than directly theorizing about how people or animals learn, we explore idealized learning situations and evaluate the effectiveness of various learning methods. That is, we adopt the perspective of an artificial intelligence researcher or engineer. We explore designs for machines that are effective in solving learning problems of scientific or economic interest, evaluating the designs through mathematical analysis or computational experiments. The approach we explore, called reinforcement learning, is much more focused on goal-directed learning from interaction than are other approaches to machine learning.

When this draft was first posted, it was so popular a download that the account was briefly suspended.

Consider that as an indication of importance.
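To make "learning from interaction" concrete, here is a toy sketch of an epsilon-greedy agent on a multi-armed bandit, a standard introductory reinforcement learning problem. The code is my own illustration, not taken from the book; the arm payout probabilities, step count, and epsilon are arbitrary assumptions.

```python
import random


def run_bandit(arm_probs, steps, epsilon, seed=0):
    """Epsilon-greedy agent on a Bernoulli bandit.

    arm_probs: hidden payout probability of each arm (unknown to the agent).
    Returns (counts, values): pulls per arm and estimated reward per arm.
    """
    rng = random.Random(seed)
    counts = [0] * len(arm_probs)
    values = [0.0] * len(arm_probs)  # running average reward per arm

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random arm.
            arm = rng.randrange(len(arm_probs))
        else:
            # Exploit: pick the arm with the best estimate so far.
            arm = max(range(len(arm_probs)), key=lambda a: values[a])

        # The environment responds with a reward of 1 or 0.
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0

        # Incremental update of the running average for the chosen arm.
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]

    return counts, values
```

There is no teacher here. `run_bandit([0.1, 0.9], steps=5000, epsilon=0.1)` only ever sees the rewards its own choices produce, and from those it ends up pulling the better arm far more often.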



Collaboration Tools and smart use of Google (ask Pippa Middleton)

Tuesday, September 27th, 2016

Collaboration Tools and smart use of Google by Kaas & Mulvad.

As Kaas & Mulvad illustrate, collaboration with Google tools can be quite effective.

However, my attention was caught by the last sentence of their first paragraph:

Google Drive makes sharing your files simple. It also allows multiple people to edit the same file, allowing for real-time collaboration. But be aware – don’t share anything in Google, you want to keep secret. (emphasis added)

Pippa Middleton would tell you the same advice applies to the iCloud.

Bulk Access to the Colin Powell Emails – Update

Tuesday, September 27th, 2016

Still working on finding a host for the 2.5 GB tarred, gzipped archive of the Colin Powell emails.

As an alternative, working on splitting the attachments (the main source of bulk) from the emails themselves.

My thinking at this point is to produce a message-only version of the emails. Emails with attachments will have auto-generated links to the source emails at

Other processing is planned for the message-only version of the emails.

Anyone interested in indexing the attachments? Generating lists of those with pointers shouldn’t be a problem.
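A minimal sketch of what a message-only pass might look like, using only Python's standard `email` package. This is not the processing actually applied to the archive. The function name `split_message` and the `X-Stripped-Attachments` header are hypothetical, and the body is flattened to a single text part for simplicity.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def split_message(raw: bytes):
    """Return (message_only, attachment_names) for one raw .eml payload.

    Hypothetical helper: keeps a handful of headers and the text body,
    drops the attachments, and records their file names.
    """
    msg = message_from_bytes(raw, policy=policy.default)

    # File names of the attachments, usable later as an index with pointers.
    names = [part.get_filename() or "unnamed" for part in msg.iter_attachments()]

    # Prefer the plain-text body; fall back to HTML if that is all there is.
    body = msg.get_body(preferencelist=("plain", "html"))

    slim = EmailMessage()
    for header in ("From", "To", "Cc", "Date", "Subject", "Message-ID"):
        if msg[header] is not None:
            slim[header] = str(msg[header])
    slim.set_content(body.get_content() if body is not None else "")

    if names:
        # Assumed convention, not a standard header.
        slim["X-Stripped-Attachments"] = ", ".join(names)
    return slim, names
```

Calling `split_message(path.read_bytes())` on each file yields both the slimmed message and the attachment names, so the list of attachments falls out of the same pass.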

Hope to have more progress to report tomorrow!

Bulk Access to the Colin Powell Emails

Monday, September 26th, 2016

The Colin Powell Email leak is important, but if you visit the DCLeaks page for Powell emails, June, July and August of 2014, this is what you find:


If you attempt to use the “search” box, you discover that your search is limited to June, July and August of 2014.

Then you remember the main page:


Which means every search must be repeated thirteen (13) times to find all relevant emails.

The phone is ringing, your pager is going off, emails and IMs are piling up and you’re on deadline. How useful is this interface to you as a reporter?

Have your own methods for processing large leaks of documents?

Not relevant here because access to the Powell emails is one email at a time.

Put your drinking straw into a lake of 29,641 emails.

Best of luck with that drinking straw approach.

I’m suggesting a different approach.

What if someone automated that drinking straw and created a mirrored set of those 29,641 emails? Along with correcting the twelve (12) emails that choked a .eml to .mbox converter.
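For readers who want to try the .eml to .mbox step themselves, here is a rough sketch using Python's standard `mailbox` module. It is not the converter used on the Powell emails. The function name `eml_dir_to_mbox` and the directory layout are assumptions, and files that fail to convert are reported rather than repaired.

```python
import mailbox
from email import message_from_bytes, policy
from pathlib import Path


def eml_dir_to_mbox(eml_dir, mbox_path):
    """Append every .eml file under eml_dir to an mbox file.

    Returns a list of (file name, error) pairs for files that could not
    be converted, so the problem emails can be inspected by hand.
    """
    failed = []
    box = mailbox.mbox(str(mbox_path))  # created if it does not exist
    try:
        for path in sorted(Path(eml_dir).rglob("*.eml")):
            try:
                # compat32 is the lenient legacy policy; it tolerates
                # many malformed headers without raising.
                msg = message_from_bytes(path.read_bytes(), policy=policy.compat32)
                box.add(msg)
            except Exception as exc:
                failed.append((path.name, repr(exc)))
    finally:
        box.close()  # flushes pending writes to disk
    return failed
```

The returned list is the useful part: a dozen failures out of 29,641 files is a short enough list to fix manually.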


Hosting Request: The full data set runs 2.5 GB, which, if popular, is far more traffic than I can support.

Requirements for hosting:

  1. Distribute the file as delivered to you.
  2. Distribute the file for free.

If you are interested, drop me a line at:

Warning: I have not checked the files or their attachments for malware, hostile links, etc. Open untrusted files in VMs without network connections. At a minimum.

Test your interest against the emails for March-April of 2016: powell-sample.tar.gz. (roughly 108MB)

Manipulation, enhancement and analysis of samples and the full set to follow.

Value-Add Of Wikileaks Hillary Clinton Email Archive?

Monday, September 26th, 2016

I was checking Wikileaks today for any new document drops on Hillary Clinton, but only found:

WikiLeaks offers award for #LabourLeaks

Trade in Services Agreement

Assange Medical and Psychological Records

The lesson from the last item is to always seek asylum in a large embassy, preferably one with a pool. At Embassies you can search by which country an embassy belongs to and which country it is located in. I did not see an easy way to search for size and accommodations.

Oh, not finding any new data on Hillary Clinton, I checked the Hillary Clinton Email Archive at Wikileaks:


Compare that to the State Department FOIA server for Clinton_Email:


Do you see a value-add to Wikileaks re-posting the State Department’s posting of Hillary’s emails?

If yes, please report in comments below the value-add you see. (Thanks in advance.)

If not, what do you think would be a helpful value-add to the Hillary Clinton emails? (Suggestions deeply appreciated.)

20 Year Lesson On Operational Security

Monday, September 26th, 2016

Reports on Ardit Ferizi share a common lead:

A computer hacker who allegedly helped the terrorist organization ISIS by handing over data for 1,351 US government and military personnel has been sentenced to 20 years in a U.S. prison. (Hacker Who Helped ISIS to Build ‘Hit List’ Of US Military Personnel Jailed for 20 Years)

An ISIS supporter who hit the headlines after breaking into computer systems in order to steal and leak the details of military personnel has been awarded a sentence of 20 years in prison for his crimes. (Hacker who leaked US military ‘kill list’ for ISIS sent behind bars)

A 20-year-old computer science student from Kosovo described by the Justice Department as “the first terrorist hacker convicted in the United States” was sentenced Friday to two decades in prison for providing the Islamic State with a “kill list” containing the personal information of roughly 1,300 U.S. military members and government employees. (Islamic State hacker sentenced for assisting terrorist group with ‘kill list’)

Missing from those leads (and most stories) is that bad operational security led to Ardit Ferizi’s arrest and conviction.

Charlie Osborne reports in Hacker who leaked US military ‘kill list’ for ISIS sent behind bars:

Ferizi gave this information to the terrorist organization in order for ISIS to “hit them hard” and did not bother to conceal his activity — neither disguising his IP address or using a fake name on social media — which made it easier for law enforcement to track his activities.

Charlie also reports the obligatory blustering of the Assistant Attorney General:

“This case represents the first time we have seen the very real and dangerous national security cyber threat that results from the combination of terrorism and hacking. This was a wake-up call not only to those of us in law enforcement, but also to those in private industry. This successful prosecution also sends a message to those around the world that, if you provide material support to designated foreign terrorist organizations and assist them with their deadly attack planning, you will have nowhere to hide.

We will reach half-way around the world if necessary to hold accountable those who engage in this type of activity.”

A “wake-up call” about computer science students with histories of drug abuse and mental health issues, who don’t practice even minimal operational security, yet who are “…very real and dangerous national security cyber threat[s]…”

You bet.

A better lead for this story would be:

Failure to conceal his IP and identity online nets Kosovo student a 20-year prison sentence in overreaching US prosecution, presided over by callous judge.

Concealment of IP and identity should be practiced until it is second nature.

No identification = No prosecution.

Colin Powell Email Files

Sunday, September 25th, 2016

DCLeaks posted on September 14, 2016, a set of emails to and from Colin Luther Powell.

From the homepage for those leaked emails:

Colin Luther Powell is an American statesman and a retired four-star general in the United States Army. He was the 65th United States Secretary of State, serving under U.S. President George W. Bush from 2001 to 2005, the first African American to serve in that position. During his military career, Powell also served as National Security Advisor (1987–1989), as Commander of the U.S. Army Forces Command (1989) and as Chairman of the Joint Chiefs of Staff (1989–1993), holding the latter position during the Persian Gulf War. Born in Harlem as the son of Jamaican immigrants, Powell was the first, and so far the only, African American to serve on the Joint Chiefs of Staff, and the first of two consecutive black office-holders to serve as U.S. Secretary of State.

The leaked emails start in June of 2014 and end in August of 2016.

Access to the emails is by browsing and/or full text searching.

Try your luck at finding Powell’s comments on Hillary Clinton or former Vice-President Cheney. Searching one chunk of emails at a time.

I appreciate and admire DCLeaks for taking the lead in posting this and similar materials. And I hope they continue to do so in the future.

However, the access offered reduces a good leak to a random trickle.

This series will use the Colin Powell emails to demonstrate better leaking practices.

Coming Monday, September 26, 2016 – Bulk Access to the Colin Powell Emails.

What are we allowed to say? [Criticism]

Saturday, September 24th, 2016

What are we allowed to say? by David Bromwich.

From the post:

Free speech is an aberration – it is best to begin by admitting that. In most societies throughout history and in all societies some of the time, censorship has been the means by which a ruling group or a visible majority cleanses the channels of communication to ensure that certain conventional practices will go on operating undisturbed. It is not only traditional cultures that see the point of taboos on speech and expressive action. Even in societies where faith in progress is part of a common creed, censorship is often taken to be a necessary means to effect improvements that will convey a better life to all. Violent threats like the fatwa on Salman Rushdie and violent acts like the assassinations at Charlie Hebdo remind us that a militant religion is a dangerous carrier of the demand for the purification of words and images. Meanwhile, since the fall of Soviet communism, liberal bureaucrats in the North Atlantic democracies have kept busy constructing speech codes and guidelines on civility to soften the impact of unpleasant ideas. Is there a connection between the two?

Probably an inbred trait of human nature renders the attraction of censorship perennial. Most people (the highly literate are among the worst) believe that what is good for them will be good for others. Besides, a regime of censorship must claim to derive its authority from settled knowledge and not opinion. Once enforcement and exclusion have done their work, this assumption becomes almost irresistible; and it is relied on to produce a fortunate and economical result: self-censorship. We stay out of trouble by gagging ourselves. Among the few motives that may strengthen the power of resistance is the consciousness of having been deeply wrong oneself, either regarding some abstract question or in personal or public life. Another motive of resistance occasionally pitches in: a radical, quasi-physical horror of seeing people coerce other people without having to supply reasons. For better or worse, this second motive is likely to be mixed with misanthropy.

As far back as one can trace the vicissitudes of public speech and its suppression, the case for censorship seems to have begun in the need for strictures against blasphemy. The introductory chapter of Blasphemy, by the great American legal scholar Leonard Levy, covers ‘the Jewish trial of Jesus’; it is followed in close succession, in Levy’s account, by the Christian invention of the concept of heresy and the persecution of the Socinian and Arminian heretics and later of the Ranters, Antinomians and early Quakers. After an uncertain interval of state prosecutions and compromises in the 19th century, Levy’s history closes at the threshold of a second Enlightenment in the mid-20th: the endorsement by the North Atlantic democracies of a regime of almost unrestricted freedom of speech and expression.
… (emphasis in original)

Bromwich’s essay runs some twenty pages in print so refresh your coffee before starting!

It is a “must” read but not without problems.

The focus on Charlie Hebdo and The Satanic Verses gives readers a “safe context” in which to consider the issue of “free speech.”

The widespread censorship of “jihadist” speech, which for the most part passes unnoticed and without even a cursory nod towards “free speech,” is a more current and confrontational example.

Does Bromwich use safe examples to “stay out of trouble by gagging [himself]?”

Hundreds of thousands have been silenced by Western tech companies. Yet in an essay on freedom of speech, they don’t merit a single mention.

The failure to mention the largest current example of anti-freedom of speech in a freedom of speech essay should disturb every attentive reader.

Disturb them to ask: What of freedom of speech today? Not as a dry and desiccated abstraction but freedom of speech in the streets.

Where is the freedom of speech to incite others to action? Freedom of speech to oppose corrupt governments? Freedom of speech to advocate harsh measures against criminal oppressors?

The invocation of Milton and Mill provides a groundwork for confronting government-urged, if not government-required, censorship, but the opportunity is wasted on the vagaries of academic politics.

Freedom of speech is important on college campuses but people are dying where freedom of speech is being denied. To showcase the former over the latter is a form of censorship itself.

If the question is censorship, as Milton and Mill would agree, the answer is no. (full stop)

PS: For those who raise the bugaboo of child pornography, there are laws against the sexual abuse of children, laws that raise no freedom of speech issues.

Possession of child pornography is attacked because it gives the appearance of meaningful action, while allowing the cash flow from its production and distribution to continue unimpeded.

Police use-of-force data is finally coming to light (Evidence Based Citizen Safety)

Saturday, September 24th, 2016

Police use-of-force data is finally coming to light by Megan Rose Dickey.

From the post:


Since 2011, less than 3% of the country’s 18,000 state and local police agencies have reported information about police-involved shootings of citizens. That’s because there’s no mandatory federal requirement to do so. There is, however, a mandate in California (Assembly Bill 71) for all police departments to report their use of force incidents that happened after Jan. 1, 2016 by Jan. 1, 2017.

Winds of Data Change:

With URSUS, California police departments can use the open-source platform to collect and report use-of-force data, in the cases of serious injuries, to the CA DOJ. Back in February, the CA DOJ unveiled a revamped version of the OpenJustice platform featuring data around arrest rates, deaths in custody, arrest-related deaths and law enforcement officers assaulted on the job.

Unlike the first version of OpenJustice, the current platform makes it possible to break down data by specific law enforcement agencies. As URSUS collects data about police use-of-force, OpenJustice will publish that information in its database starting early next year.

Here’s an overview of how the system works:

Evidence Based Citizen Safety

In the analysis, Campaign Zero found that only 21 of the 91 police departments reviewed explicitly prohibit officers from using chokeholds. Even more, the average police department reviewed has only adopted three of the eight policies identified that could prevent unnecessary civilian deaths. Not one of the police departments reviewed has implemented all eight.


According to Campaign Zero’s analysis, if the police departments reviewed were to implement all eight of the use-of-force restrictions, there would be a 54% reduction in killings for the average police department.

With the CA DOJ’s new police use-of-force data system, plus initiatives driven by non-profit organizations and the media, we’re definitely moving in the right direction when it comes to transparency around policing. But if we want real change, the rest of the country’s law enforcement agencies are going to need to get on board. If the PRIDE Act passes, police departments nationwide will not only have to make their use-of-force policies publicly available, but also have to report police use-of-force incidents that result in deaths of civilians. But while the government is stepping up its game around policing data, there is still a need for a community-driven initiatives that track police killings of civilians.

Greater transparency around policing leads to fewer civilian deaths (those folks the police are sworn to serve) and can lead to greater trust/cooperation between the police and the communities they serve. Which means better police work and less danger/stress for police officers.

That’s a win-win situation.

But it starts with data transparency for police activities.

How transparent is your local police department?

Waiting for it to be required by law delays better service to the community and better policing.

Is that a goal of your local police department? You might better ask.

XQuery Working Group (Vanderbilt)

Saturday, September 24th, 2016

XQuery Working Group – Learn XQuery in the Company of Digital Humanists and Digital Scientists

From the webpage:

We meet from 3:00 to 4:30 p.m. on most Fridays in 800FA of the Central Library. Newcomers are always welcome! Check the schedule below for details about topics. Also see our Github repository for code samples. Contact Cliff Anderson with any questions or see the FAQs below.

Good thing we are all mindful of the distinction between the W3C XML Query Working Group and the XQuery Working Group (Vanderbilt).

Otherwise, you might need a topic map to sort out casual references. 😉

Even if you can’t attend meetings in person, support this project by Cliff Anderson.

Stress-Free #SkippingTheDebate Parties

Saturday, September 24th, 2016

Unlike the hackers-only show in Snow Crash, the first presidential debate of 2016 is projected to make a record number of viewers dumber.

Well, to be fair, Stephen Battaglio reports:

Millions of viewers are also expected to watch online as many websites and social media platforms, such as Facebook and Twitter, will offer free video streaming of the event.

“This one seems to have aroused the greatest attention and more debate-before-the-debate than any of them,” said Newton Minow, vice chairman of the Commission on Presidential Debates, whose involvement goes back to the first historic televised showdown between John F. Kennedy and Richard Nixon in 1960.

The reason viewing levels may skyrocket? Larry Sabato, director of the Center for Politics at the University of Virginia, cites the unpredictability of Trump, whose appearances in the Republican primary debates set audience records on four different cable networks over the past year.

“It’s the same reason why this election is different than all other elections,” Sabato said. “People will tune in to see the car crash. Trump’s gotten a big audience from the beginning because you knew you’d either see a fender bender or a fatality. This is the big stage and the first one-on-one debate he’s done.”

Rather than watch a moderator and the candidates indulge in the fiction that any substantive discussion of national or international issues can occur in ninety minutes, hold a #SkippingTheDebate party!

Here’s how:

  1. Invite your friends over for a #SkippingTheDebate Party
  2. Have ball-game-style snacks and drinks
  3. Have a minimum of 5 back issues of Mad Magazine for each guest
  4. Distribute the Mad magazines. After each 10-minute reading interval, each guest may share a favorite comment or observation; discuss and repeat

Unlike debate watching parties, your guests will be amused, have more quips in their quivers, have enjoyed each other’s company, and most importantly, they will not be dumber for the experience.

I do have data to demonstrate that Mad Magazine is the right choice for your #SkippingTheDebate party:



Mad didn’t quite capture the dried apricot complexion of Trump and Hillary looks, well, younger, but even Mad can be kind.

Avoid FBI Demands – Make Your Product Easily Crackable

Friday, September 23rd, 2016

Joshua Kopstein reports that Apple has discovered a way to dodge future requests for assistance from the FBI.

Make iOS 10 backups easily crackable.

From iOS 10 Has a ‘Severe’ Security Flaw, Says iPhone-Cracking Company:

Apple has introduced a “severe” flaw in its newly-released iOS 10 operating system that leaves backup data vulnerable to password-cracking tools, according to researchers at a smartphone forensics company that specializes in unlocking iPhones.

In a blog post published Friday by Elcomsoft, a Russian company that makes software to help law enforcement agencies access data from mobile devices, researcher Oleg Afonin showed that changes in the way local backup files are protected in iOS 10 has left backups dramatically more susceptible to password-cracking attempts than those produced by previous versions of Apple’s operating system.

Specifically, the company found that iOS 10 backups saved locally to a computer via iTunes allow password-cracking tools to try different password combinations at a rate of 6,000,000 attempts per second, more than 40 times faster than with backups created by iOS 9. Elcomsoft says this is due to Apple implementing a weaker password verification method than the one protecting backup data in previous versions. That means that cops and tech-savvy criminals could much more quickly and easily gain access to data from locally-stored iOS 10 backups than those produced by older versions.
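To put the quoted rate in perspective, a back-of-the-envelope sketch follows. The 6,000,000 attempts per second and the factor of 40 come from the quoted report. The example password policy (8 characters drawn from lowercase letters and digits) is purely an assumption for illustration, and the iOS 9 figure is only the upper bound implied by “more than 40 times faster.”

```python
IOS10_RATE = 6_000_000   # attempts per second, from the quoted report
SPEEDUP = 40             # "more than 40 times faster" than iOS 9
IOS9_RATE_BOUND = IOS10_RATE / SPEEDUP  # implied upper bound for iOS 9

SECONDS_PER_DAY = 86_400


def days_to_exhaust(alphabet_size, length, rate):
    """Worst-case days to try every password of the given length."""
    return alphabet_size ** length / rate / SECONDS_PER_DAY


# Assumed policy for illustration: 8 characters, a-z plus 0-9 (36 symbols).
ios10_days = days_to_exhaust(36, 8, IOS10_RATE)
ios9_days = days_to_exhaust(36, 8, IOS9_RATE_BOUND)
print(f"iOS 10: about {ios10_days:.1f} days; iOS 9: at least {ios9_days:.0f} days")
```

Under those assumptions the full keyspace falls in under a week on iOS 10, versus at least seven months at the implied iOS 9 rate. Longer passwords with a larger alphabet shift both numbers dramatically.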

After the NSA sat on a Cisco vulnerability for a decade or so, you have to wonder about the motives of Elcomsoft for quick disclosure.

Perhaps they wanted to take away an easy win from their potential competitors?

In any event, be aware that your iOS 10 has a vulnerability the size of a Mack truck.

For any Russian readers, that’s roughly equivalent to:


While looking for this image, I saw a number of impressive Russian trucks!

14 free digital tools that any newsroom can use

Friday, September 23rd, 2016

14 free digital tools that any newsroom can use by Sara Olstad.

From the post:

ICFJ’s Knight Fellows are global media innovators who foster news innovation and experimentation to deepen coverage, expand news delivery and better engage citizens. As part of their work, they’ve created tools that they are eager to share with journalists worldwide.

Their projects range from Push, a mobile app for news organizations that don’t have the time, money or resources to build their own, to Salama, a tool that assesses a reporter’s risk and recommends ways to stay safe. These tools and others developed by Knight Fellows can help news organizations everywhere find stories in complex datasets, better distribute their content and keep their journalists safe from online and physical security threats.

As part of the 2016 Online News Association conference, try out these 14 digital tools that any newsroom can use. If you adopt any of these tools or lead any new projects inspired by them, tweet about it to @ICFJKnight.

I was misled by the presentation of the “14 free digital tools.”

The box where African Network of Centers for Investigative Reporting (ANCIR) and Aleph appear has a scroll marker on the right hand side.

I’m not sure why I missed it or why the embedding of a scrolling box is considered good page design.

But the tools themselves merit your attention.


Tor is released, with important fixes

Friday, September 23rd, 2016

Tor is released, with important fixes

Source available today, packages over the next week.

Privacy is an active, not passive stance.

Steps to take:

  1. Upgrade your Tor software.
  2. Help someone upgrade their Tor software.
  3. Introduce one new person to Tor.

If you take those steps with every upgrade, Tor will spread more quickly.

I have this vision of James Clapper (Director of National Intelligence), waking up in a cold sweat as darkness spreads across a visualization of the Internet in real time.

Just a vision but an entertaining one.

5 lessons on the craft of journalism from Longform podcast

Friday, September 23rd, 2016

5 lessons on the craft of journalism from Longform podcast by Joe Freeman.

From the post:

AT FIRST I WAS RELUCTANT to dive into the Longform podcast, a series of interviews with nonfiction writers and journalists that recently produced its 200th episode. The reasons for my wariness were petty. What sane freelancer wants to listen to highly successful writers and editors droning on about their awards and awesome careers? Not this guy! But about a year ago, I succumbed, and quickly became a thankful convert. The more I listened, the more I realized that the show, started in 2012 on the website and produced in collaboration with The Atavist, was a veritable goldmine of information. It’s almost as if the top baseball players in the country sat down every week and casually explained how to hit home runs.

Whether they meant to or not, the podcast’s creators and interviewers—Aaron Lammer, Max Linsky, and Evan Ratliff—have produced a free master class on narrative reporting, with practitioners sharing tips and advice about the craft and, crucially, the business. As a journalist, I’ve learned a lot listening to the podcast, but a few consistent themes emerge that I have distilled into five takeaways from specific interviews.

(emphasis in original)

I’m impressed with Joe’s five takeaways but as I sit here repackaging leaked data, there is one common characteristic I would emphasize:

They all involve writing!

That is the actual production of content.

Not plans for content.

Not models for content.

Not abstractions for content.


Not to worry, I intend to keep my tools/theory edge but in addition to adding Longform podcast to my listening list, I’m going to try to produce more data content as well.

I started off with that intention using XQuery at the start of this year, a theme that is likely to re-appear in the near future.