Clinton/Podesta Emails, Dirty Data, Dirty Script For Testing

October 24th, 2016

Despite Michael Best’s (@NatSecGeek) efforts at collecting the Podesta emails for convenient bulk download, Podesta Emails Zipped, the bulk downloads don’t appear to have attracted a lot of attention: some 276 views as of today.

Many of us deeply appreciate Michael’s efforts and would like to see the press and others taking fuller advantage of this remarkable resource.

To encourage you in that direction, what follows is a very dirty script for testing the DKIM signatures in the emails and extracting data from the emails for writing to a “|” delimited file.


import dateutil.parser
import email
import dkim
import glob

output = open("verify.txt", 'w')

output.write("id|verified|date|from|to|subject|message-id\n")

for name in glob.glob('*.eml'):
    filename = name
    f = open(filename, 'r')
    data = f.read()          # raw message, as required by dkim.verify
    f.close()

    msg = email.message_from_string(data)

    verified = dkim.verify(data)

    date = dateutil.parser.parse(msg['date'])

    # collapse runs of whitespace so each field stays on one line
    msg_from = msg['from']
    msg_from1 = " ".join(msg_from.split())
    msg_to = str(msg['to'])
    msg_to1 = " ".join(msg_to.split())
    msg_subject = str(msg['subject'])
    msg_subject1 = " ".join(msg_subject.split())
    msg_message_id = msg['message-id']

    output.write(filename + '|' + str(verified) + '|' + str(date) +
                 '|' + msg_from1 + '|' + msg_to1 + '|' + msg_subject1 +
                 '|' + str(msg_message_id) + "\n")

output.close()


Download podesta-test.tar.gz, unpack it to a directory, save the script to the same directory, and then:


Import the resulting verify.txt into Gnumeric and, with some formatting, your content should look like: test-clinton-24Oct2016.gnumeric.gz.

Verifying cryptographic signatures takes a moment, even on this sample of 754 files, so don’t be impatient.

This script leaves much to be desired and as you can see, the results aren’t perfect by any means.
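One obvious improvement: skip messages without a DKIM-Signature header and trap the exceptions the dkim library can raise, so one bad file doesn’t halt the whole run. A minimal sketch in the same Python 2 style as above (dkim.DKIMException as the library’s base exception class is my assumption; adjust to your dkimpy version):

import email
import glob

import dkim

for name in glob.glob('*.eml'):
    f = open(name, 'r')
    data = f.read()
    f.close()

    msg = email.message_from_string(data)

    if msg['dkim-signature'] is None:
        status = "unsigned"                 # nothing to verify
    else:
        try:
            status = "pass" if dkim.verify(data) else "fail"
        except dkim.DKIMException as e:
            status = "error: " + str(e)     # malformed signature, key problems
        except Exception as e:
            status = "error: " + str(e)     # DNS trouble and other surprises

    print(name + '|' + status)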

Comments and/or suggestions welcome!

This is just the first step in extracting information from this data set that could be used with similar data sets.

For example, if you want to graph this data, how are you going to construct IDs for the nodes, given the repetition of some nodes in the data set?

How are you going to model those relationships?
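One minimal answer, offered as a sketch of my own assumptions rather than a prescription: use the lower-cased email address itself as the node ID, so repeated senders and recipients collapse to a single node, and emit one edge per (sender, recipient, email) triple, reading the “|” delimited verify.txt written above:

from email.utils import getaddresses

nodes = set()
edges = []

f = open("verify.txt")
f.readline()                                # skip the header row
for line in f:
    fields = line.rstrip("\n").split("|")
    if len(fields) < 7:
        continue                            # skip malformed rows
    msg_id, verified, date, from_, to, subject, message_id = fields[:7]

    senders = [addr.lower() for _, addr in getaddresses([from_]) if addr]
    recipients = [addr.lower() for _, addr in getaddresses([to]) if addr]

    nodes.update(senders)
    nodes.update(recipients)
    for s in senders:
        for r in recipients:
            edges.append((s, r, msg_id))    # one edge per sender/recipient pair
f.close()

print(str(len(nodes)) + " nodes, " + str(len(edges)) + " edges")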

Bonus question: Is this output clean enough to run the script on the full data set, which is increasing on a daily basis?

Data Science for Political and Social Phenomena [Special Interest Search Interface]

October 23rd, 2016

Data Science for Political and Social Phenomena by Chris Albon.

From the webpage:

I am a data scientist and quantitative political scientist. I specialize in the technical and organizational aspects of applying data science to political and social issues.

Years ago I noticed a gap in the existing data literature. On one side was data science, with roots in mathematics and computer science. On the other side were the social sciences, with hard-earned expertise modeling and predicting complex human behavior. The motivation for this site and ongoing book project is to bridge that gap: to create a practical guide to applying data science to political and social phenomena.

Chris has organized three hundred and twenty-eight pages on Data Wrangling, Python, R, etc.

If you like learning from examples, this is the site for you!

Including this site, what other twelve (12) sites would you include in a Python/R Data Science search interface?

That is an interface that has indexed only that baker’s dozen of sites. So you don’t spend time wading through “the G that is not named” search results.

Serious question.

Not that I would want to maintain such a beast for external use, but having a local search engine tuned to your particular interests could be nice.

Boosting (in Machine Learning) as a Metaphor for Diverse Teams [A Quibble]

October 23rd, 2016

Boosting (in Machine Learning) as a Metaphor for Diverse Teams by Renee Teate.

Renee’s summary:

tl;dr: Boosting ensemble algorithms in Machine Learning use an approach that is similar to assembling a diverse team with a variety of strengths and experiences. If machines make better decisions by combining a bunch of “less qualified opinions” vs “asking one expert”, then maybe people would, too.

Very much worth your while to read at length but to setup my quibble:

What a Random Forest does is build up a whole bunch of “dumb” decision trees by only analyzing a subset of the data at a time. A limited set of features (columns) from a portion of the overall records (rows) is used to generate each decision tree, and the “depth” of the tree (and/or size of the “leaves”, the number of examples that fall into each final bin) is limited as well. So the trees in the model are “trained” with only a portion of the available data and therefore don’t individually generate very accurate classifications.

However, it turns out that when you combine the results of a bunch of these “dumb” trees (also known as “weak learners”), the combined result is usually even better than the most finely-tuned single full decision tree. (So you can see how the algorithm got its name – a whole bunch of small trees, somewhat randomly generated, but used in combination is a random forest!)
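If you want to see the effect Renee describes for yourself, here is a minimal scikit-learn comparison of one fully grown tree against a forest of 200 shallow (“dumb”) trees on synthetic data. Scikit-learn is assumed to be installed; the exact numbers will vary, the point is just to make the comparison concrete.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic classification problem: 2000 rows, 20 columns, 8 of them informative
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=42)

single_expert = DecisionTreeClassifier(max_depth=None, random_state=42)
weak_committee = RandomForestClassifier(n_estimators=200, max_depth=3,
                                        random_state=42)

print("one deep tree:     " + str(cross_val_score(single_expert, X, y, cv=5).mean()))
print("200 shallow trees: " + str(cross_val_score(weak_committee, X, y, cv=5).mean()))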

All true but “weak learners” in machine learning are easily reconfigured, combined with different groups of other “weak learners,” or even discarded.

None of which is true for people who are hired to be part of a diverse team.

I don’t mean to discount Renee’s metaphor because I think it has much to recommend it, but diverse “weak learners” make poor decisions too.

Don’t take my word for it, watch the 2016 congressional election results.

Be sure to follow Renee on @BecomingDataSci. I’m interested to see how she develops this metaphor and where it leads.


Monetizing Twitter Trolls

October 23rd, 2016

Alex Hern’s coverage of Twitter’s fail-to-sell story, Did trolls cost Twitter $3.5bn and its sale?, is a typical short-on-facts story about abuse on Twitter.

When I say short on facts, I don’t deny any of the anecdotal accounts of abuse on Twitter and other social media.

Here’s the data problem with abuse at Twitter:

As of May of 2016, Twitter had 310 million monthly active users out of over 1.3 billion accounts created.

Number of Twitter users who are abusive (trolls): unknown

Number of Twitter users who are victims: unknown

Number of abusive tweets, daily/weekly/monthly: unknown

Type/frequency of abusive tweets, language, images, disclosure: unknown

Costs to effectively control trolls: unknown

Trolls and abuse should be opposed both at Twitter and elsewhere, but without supporting data, creating corporate priorities and revenues to effectively block (not end, block) abuse isn’t possible.

Since troll hunting at present is a drain on the bottom line with no return for Twitter, what if Twitter were to monetize its trolls?

That is create a mechanism whereby trolls became the drivers of a revenue stream from Twitter.

One such approach would be to throw off all the filtering that Twitter does as part of its basic service. If you have Twitter basic service, you will see posts from everyone from committed jihadists to the Federal Reserve. No blocked accounts, no deleted accounts, etc.

Twitter removes material under direct court order only. Put the burden and expense of going to court for every tweet on both individuals and governments. No exceptions.

Next, Twitter creates the Twitter+ account, where for an annual fee, users can access advanced filtering that includes blocking people, language, image analysis of images posted to them, etc.

Price point experiments should set the fees for Twitter+ accounts. Filtering will be a decision based on real revenue numbers. Not flights of fancy by the Guardian or Salesforce.

BTW, the open Twitter I suggest creates more eyes for ads, which should also improve the bottom line at Twitter.

An “open” Twitter will attract more trolls and drive more users to Twitter+ accounts.

Twitter trolls generate the revenue to fight them.

I rather like that.


Twitter Logic: 1 call on Github v. 885,222 calls on Twitter

October 23rd, 2016

Chris Albon’s collection of 885,222 tweets (ids only) for the third presidential debate of 2016 proves bad design decisions aren’t only made inside the Capital Beltway.

Under Twitter’s terms of service, Chris could not post his tweet collection, only the tweet ids.

The terms of service reference the Developer Policy and under that policy you will find:

F. Be a Good Partner to Twitter

1. Follow the guidelines for using Tweets in broadcast if you display Tweets offline.

2. If you provide Content to third parties, including downloadable datasets of Content or an API that returns Content, you will only distribute or allow download of Tweet IDs and/or User IDs.

a. You may, however, provide export via non-automated means (e.g., download of spreadsheets or PDF files, or use of a “save as” button) of up to 50,000 public Tweets and/or User Objects per user of your Service, per day.

b. Any Content provided to third parties via non-automated file download remains subject to this Policy.
…(emphasis added)

Just to be clear, I find Twitter extremely useful for staying current on CS research topics and think developers should be “…good partners to Twitter.”

However, Chris is prohibited from posting a data set of 885,222 tweets on Github, where users could download it with no impact on Twitter, while every user who wants to explore that data set must submit 885,222 requests to Twitter servers.

Having one hit on Github for 885,222 tweets versus 885,222 on Twitter servers sounds like being a “good partner” to me.

Multiply that by all the researchers who are building Twitter data sets and the drain on Twitter resources grows without any benefit to Twitter.

It’s true that someday Twitter might be able to monetize references to its data collections, but server and bandwidth expenses are present line items in their budget.

Enabling the distribution of full tweet datasets is one step towards improving their bottom line.

PS: Please share this with anyone you know at Twitter. Thanks!

Political Noise Data (Tweets From 3rd 2016 Presidential Debate)

October 23rd, 2016

Chris Albon has collected data on 885,222 debate tweets from the third Presidential Debate of 2016.

As you can see from the transcript, it wasn’t a “debate” in any meaningful sense of the term.

The quality of tweets about that debate is equally questionable.

However, the people behind those tweets vote, buy products, click on ads, etc., so despite my title description as “political noise data,” it is important political noise data.

To conform to Twitter terms of service, Chris provides the relevant tweet ids and a script to enable construction of your own data set.

BTW, Chris includes his Twitter mining scripts.


Validating Wikileaks Emails [Just The Facts]

October 22nd, 2016

A factual basis for reporting on alleged “doctored” or “falsified” emails from Wikileaks has emerged.

Now to see if the organizations and individuals responsible for repeating those allegations, some 260,000 times, will put their doubts to the test.

You know where my money is riding.

If you want to verify the Podesta emails or other email leaks from Wikileaks, consult the following resources.

Yes, we can validate the Wikileaks emails by Robert Graham.

From the post:

Recently, WikiLeaks has released emails from Democrats. Many have repeatedly claimed that some of these emails are fake or have been modified, that there’s no way to validate each and every one of them as being true. Actually, there is, using a mechanism called DKIM.

DKIM is a system designed to stop spam. It works by verifying the sender of the email. Moreover, as a side effect, it verifies that the email has not been altered.

Hillary’s team uses “”, which has DKIM enabled. Thus, we can verify whether some of these emails are true.

Recently, in response to a leaked email suggesting Donna Brazile gave Hillary’s team early access to debate questions, she defended herself by suggesting the email had been “doctored” or “falsified”. That’s not true. We can use DKIM to verify it.

Bob walks you through validating a raw email from Wikileaks with the DKIM verifier plugin for Thunderbird. And demonstrating the same process can detect “doctored” or “falsified” emails.

Bob concludes:

I was just listening to ABC News about this story. It repeated Democrat talking points that the WikiLeaks emails weren’t validated. That’s a lie. This email in particular has been validated. I just did it, and shown you how you can validate it, too.

Btw, if you can forge an email that validates correctly as I’ve shown, I’ll give you 1-bitcoin. It’s the easiest way of solving arguments whether this really validates the email — if somebody tells you this blogpost is invalid, then tell them they can earn about $600 (current value of BTC) proving it. Otherwise, no.

BTW, Bob also points to:

Here’s Cryptographic Proof That Donna Brazile Is Wrong, WikiLeaks Emails Are Real by Luke Rosiak, which includes this Python code to verify the emails:



Verifying Wikileaks DKIM-Signatures by teknotus, offers this manual approach for testing the signatures:


But those are all one-off methods and there are thousands of emails.

But the post by teknotus goes on:

Preliminary results

I only got signature validation on some of the emails I tested initially but this doesn’t necessarily invalidate them as invisible changes to make them display correctly on different machines done automatically by browsers could be enough to break the signatures. Not all messages are signed. Etc. Many of the messages that failed were stuff like advertising where nobody would have incentive to break the signatures, so I think I can safely assume my test isn’t perfect. I decided at this point to try to validate as many messages as I could so that people researching these emails have any reference point to start from. Rather than download messages from wikileaks one at a time I found someone had already done that for the Podesta emails, and uploaded zip files to

Emails 1-4160
Emails 4161-5360
Emails 5361-7241
Emails 7242-9077
Emails 9078-11107

It only took me about 5 minutes to download all of them. Writing a script to test all of them was pretty straightforward. The program dkimverify just calls a python function to test a message. The tricky part is providing context, and making the results easy to search.

Automated testing of thousands of messages

It’s up on Github

Its main output is a spreadsheet with test results, and some metadata from the message being tested. Results Spreadsheet 1.5 Megs

It has some significant bugs at the moment. For example Unicode isn’t properly converted, and spreadsheet programs think the Unicode bits are formulas. I also had to trap a bunch of exceptions to keep the program from crashing.

Warning: I have difficulty opening the verify.xlsx file. In Calc, Excel and in a CSV converter. Teknotus reports it opens in LibreOffice Calc, which just failed to install on an older Ubuntu distribution. Sing out if you can successfully open the file.

Journalists: Are you going to validate Podesta emails that you cite? Or that others claim are false/modified?

Python and Machine Learning in Astronomy (Rejuvenate Your Emotional Health)

October 22nd, 2016

Python and Machine Learning in Astronomy (Episode #81) (Jake VanderPlas)

From the webpage:

The advances in Astronomy over the past century are both evidence of and confirmation of the highest heights of human ingenuity. We have learned by studying the frequency of light that the universe is expanding. By observing the orbit of Mercury that Einstein’s theory of general relativity is correct.

It probably won’t surprise you to learn that Python and data science play a central role in modern day Astronomy. This week you’ll meet Jake VanderPlas, an astrophysicist and data scientist from University of Washington. Join Jake and me while we discuss the state of Python in Astronomy.

Links from the show:

Jake on Twitter: @jakevdp

Jake on the web:

Python Data Science Handbook:

Python Data Science Handbook on GitHub:

Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data:

PyData Talk:

eScience Institute: @UWeScience

Large Synoptic Survey Telescope:

AstroML: Machine Learning and Data Mining for Astronomy:

Astropy project:

altair package:

If your social media feeds have been getting you down, rejoice! This interview with Jake VanderPlas covers Python, machine learning and astronomy.

Nary a mention of current social dysfunction around the globe!

Replace an hour of TV this weekend with this podcast. (Or more hours with others.)

Not only will you have more knowledge, you will be in much better emotional shape to face the coming week!

Validating Wikileaks/Podesta Emails

October 21st, 2016

A quick heads up that Robert Graham is working on:


While we wait for that post to appear at Errata Security, you should also take a look at DomainKeys Identified Mail (DKIM).

From the homepage:

DomainKeys Identified Mail (DKIM) lets an organization take responsibility for a message that is in transit. The organization is a handler of the message, either as its originator or as an intermediary. Their reputation is the basis for evaluating whether to trust the message for further handling, such as delivery. Technically DKIM provides a method for validating a domain name identity that is associated with a message through cryptographic authentication.

In particular, review RFC 5585 DomainKeys Identified Mail (DKIM) Service Overview. T. Hansen, D. Crocker, P. Hallam-Baker. July 2009. (Format: TXT=54110 bytes) (Status: INFORMATIONAL) (DOI: 10.17487/RFC5585), which notes:

2.3. Establishing Message Validity

Though man-in-the-middle attacks are historically rare in email, it is nevertheless theoretically possible for a message to be modified during transit. An interesting side effect of the cryptographic method used by DKIM is that it is possible to be certain that a signed message (or, if l= is used, the signed portion of a message) has not been modified between the time of signing and the time of verifying. If it has been changed in any way, then the message will not be verified successfully with DKIM.

In a later tweet, Bob notes the “DKIM verifier” add-on for Thunderbird.

Any suggestions on scripting DKIM verification for the Podesta emails?
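Here is one minimal suggestion for a single message, using the dkimpy library on a raw email saved from Wikileaks (the file name is hypothetical; looping it over a directory of .eml files is straightforward):

import dkim

f = open("podesta-email.eml", 'r')
raw = f.read()
f.close()

# True if the DKIM signature verifies against the signer's published DNS key
print("DKIM signature verifies: " + str(dkim.verify(raw)))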

That level of validation may be unnecessary since after more than a week of “…may be altered…,” not one example of a modified email has surfaced.

Some media outlets will keep repeating the “…may be altered…” chant, along with attribution of the DNC hack to Russia.

Noise but it is a way to select candidates for elimination from your news feeds.

Guide to Making Search Relevance Investments, free ebook

October 20th, 2016

Guide to Making Search Relevance Investments, free ebook

Doug Turnbull writes:

How well does search support your business? Are your investments in smarter, more relevant search, paying off? These are business-level questions, not technical ones!

After writing Relevant Search we find ourselves helping clients evaluate their search and discovery investments. Many invest far too little, or struggle to find the areas to make search smarter, unsure of the ROI. Others invest tremendously in supposedly smarter solutions, but have a hard time justifying the expense or understanding the impact of change.

That’s why we’re happy to announce OpenSource Connection’s official search relevance methodology!

The free ebook? Guide to Relevance Investments.

I know, I know, the title is an interest killer.

Think Search ROI. Not something you hear about often but it sounds attractive.

Runs 16 pages and is a blessed relief from the “data has value (unspecified)” mantras.

Search and investment in search is a business decision and this guide nudges you in that direction.

What you do next is up to you.


Every Congressional Research Service Report – 8,000+ and growing!

October 19th, 2016

From the homepage:

We’re publishing reports by Congress’s think tank, the Congressional Research Service, which provides valuable insight and non-partisan analysis of issues of public debate. These reports are already available to the well-connected — we’re making them available to everyone for free.

From the about page:

Congressional Research Service reports are the best way for anyone to quickly get up to speed on major political issues without having to worry about spin — from the same source Congress uses.

CRS is Congress’ think tank, and its reports are relied upon by academics, businesses, judges, policy advocates, students, librarians, journalists, and policymakers for accurate and timely analysis of important policy issues. The reports are not classified and do not contain individualized advice to any specific member of Congress. (More: What is a CRS report?)

Until today, CRS reports were generally available only to the well-connected.

Now, in partnership with a Republican and Democratic member of Congress, we are making these reports available to everyone for free online.

A coalition of public interest groups, journalists, academics, students, some Members of Congress, and former CRS employees have been advocating for greater access to CRS reports for over twenty years. Two bills in Congress to make these reports widely available already have 10 sponsors (S. 2639 and H.R. 4702, 114th Congress) and we urge Congress to finish the job.

This website shows Congress one vision of how it could be done.

What does the site include? It includes 8,255 CRS reports. The number changes regularly.

It’s every CRS report that’s available on Congress’s internal website.

We redact the phone number, email address, and names of virtually all the analysts from the reports. We add disclaimer language regarding copyright and the role CRS reports are intended to play. That’s it.

If you’re looking for older reports, our good friends at may have them.

We also show how much a report has changed over time (whenever CRS publishes an update), provide RSS feeds, and we hope to add more features in the future. Help us make that possible.

To receive an email alert for all new reports and new reports in a particular topic area, use the RSS icon next to the topic area titles and a third-party service, like IFTTT, to monitor the RSS feed for new additions.

This is major joyful news for policy wonks and researchers everywhere.

A must bookmark and contribute to support site!

My joy was alloyed by the notice:

We redact the phone number, email address, and names of virtually all the analysts from the reports. We add disclaimer language regarding copyright and the role CRS reports are intended to play. That’s it.

The privileged, who get the CRS reports anyway, have that information?

What is the value in withholding it from the public?

Support the project but let’s put the public on an even footing with the privileged shall we?

The Podesta Emails [In Bulk]

October 19th, 2016

Wikileaks has been posting:

The Podesta Emails, described as:

WikiLeaks series on deals involving Hillary Clinton campaign Chairman John Podesta. Mr Podesta is a long-term associate of the Clintons and was President Bill Clinton’s Chief of Staff from 1998 until 2001. Mr Podesta also owns the Podesta Group with his brother Tony, a major lobbying firm and is the Chair of the Center for American Progress (CAP), a Washington DC-based think tank.

long enough for them to be decried as “interference” with the U.S. presidential election.

You have two search options, basic:


and, advanced:


As handy as these search interfaces are, you cannot easily:

  • Analyze relationships between multiple senders and/or recipients of emails
  • Perform entity recognition across the emails as a corpus
  • Process the emails with other software
  • Integrate the emails with other data sources
  • etc., etc.

Michael Best, @NatSecGeek, is posting all the Podesta emails as they are released at: Podesta Emails (zipped).

As of Podesta Emails 13, there is approximately 2 GB of zipped email files available for downloading.

The search interfaces at Wikileaks may work for you, but if you want to get closer to the metal, you have Michael Best to thank for that opportunity!


#Truth2016 – The year when truth “interfered” with a democratic election.

October 19th, 2016

Unless you have been in solitary confinement or a medically induced coma for the last several weeks, you are aware that Wikileaks has been accused of “interfering” with the 2016 US presidential election.

The crux of that complaint is the release by Wikileaks of a series of emails collectively known as the Podesta Emails, which are centered on the antics of Hillary Clinton and her crew as she runs for the presidency.

The untrustworthy who made these accusations include the Department of Homeland Security and the Office of the Director of National Intelligence on Election Security. In a no facts revealed statement: Joint Statement from the Department Of Homeland Security and Office of the Director of National Intelligence on Election Security, the claim of interference is made but not substantiated.

The cry of “interference” has been taken up by an uncritical media and echoed by President Barack Obama.

There’s just one problem.

We know who was sent the emails in question and despite fanciful casting of doubt on their accuracy, out of hundreds of participants, not one, nary one, has stepped forward with an original email to prove these are false.

Simple enough to ask some third-party expert to retrieve the emails in question from a server and then to compare to the Wikileaks releases.

But I have heard of no moves in that direction.

Have you?

The crux of the current line by the US government is that truthful documents may influence the coming presidential election. In a direction they don’t like.

Think about that for a moment: Truthful documents (in the sense of accuracy) interfering with a democratic election.

That makes me wonder what definition of “democratic” Clinton, Obama and the media must share.

Not anything I would recognize as a democracy. You?

S20-211a Hebrew Bible Technology Buffet – November 20, 2016 (save that date!)

October 18th, 2016

S20-211a Hebrew Bible Technology Buffet

From the webpage:

On Sunday, November 20th 2016, from 1:00 PM to 3:30 PM, GERT will host a session with the theme “Hebrew Bible Technology Buffet” at the SBL Annual Meeting in room 305 of the Convention Center. Barry Bandstra of Hope College will preside.

The session has four presentations:

Presentations will be followed by a discussion session.

You will need to register for the Annual Meeting to attend the session.

Assuming they are checking “badges” to make sure attendees have registered. Registration is very important to those who “foster” biblical scholarship by comping travel and rooms for their close friends.

PS: The website reports non-member registration is $490.00. I would like to think that is a misprint but I suspect it’s not.

That’s one way to isolate yourself from an interested public. By way of contrast, snail-mail Biblical Greek courses in the 1890’s had tens of thousands of subscribers. When academics complain of being marginalized, use this as an example of self-marginalization.

Threatening the President: A Signal/Noise Problem

October 18th, 2016

Even if you can’t remember why the pointy end of a pencil is important, you too can create national news.

This bit of noise reminded me of an incident when I was in high school, where a similar type of person bragged in a local bar about assassinating then-President Nixon*. He was arrested and sentenced to several years in prison.

At the time I puzzled briefly over the waste of time and effort in such a prosecution and then promptly forgot it.

Until this incident with the overly “clever” Trump supporter.

To get us off on the same foot:

18 U.S. Code § 871 – Threats against President and successors to the Presidency

(a) Whoever knowingly and willfully deposits for conveyance in the mail or for a delivery from any post office or by any letter carrier any letter, paper, writing, print, missive, or document containing any threat to take the life of, to kidnap, or to inflict bodily harm upon the President of the United States, the President-elect, the Vice President or other officer next in the order of succession to the office of President of the United States, or the Vice President-elect, or knowingly and willfully otherwise makes any such threat against the President, President-elect, Vice President or other officer next in the order of succession to the office of President, or Vice President-elect, shall be fined under this title or imprisoned not more than five years, or both.

(b) The terms “President-elect” and “Vice President-elect” as used in this section shall mean such persons as are the apparent successful candidates for the offices of President and Vice President, respectively, as ascertained from the results of the general elections held to determine the electors of President and Vice President in accordance with title 3, United States Code, sections 1 and 2. The phrase “other officer next in the order of succession to the office of President” as used in this section shall mean the person next in the order of succession to act as President in accordance with title 3, United States Code, sections 19 and 20.

Commonplace threatening letters, calls, etc., aren’t documented for the public but President Barack Obama has a Wikipedia page devoted to the more significant ones: Assassination threats against Barack Obama.

Just as no one knows you are a dog on the internet, no one can tell by looking at a threat online if you are still learning how to use a pencil or are a more serious opponent.

Leaving to one side that a truly serious opponent allows actions to announce their presence or goal.

The treatment of even idle bar threats as serious is an attempt to improve the signal-to-noise ratio:

In analog and digital communications, signal-to-noise ratio, often written S/N or SNR, is a measure of signal strength relative to background noise. The ratio is usually measured in decibels (dB) using a signal-to-noise ratio formula. If the incoming signal strength in microvolts is Vs, and the noise level, also in microvolts, is Vn, then the signal-to-noise ratio, S/N, in decibels is given by the formula: S/N = 20 log10(Vs/Vn)

If Vs = Vn, then S/N = 0. In this situation, the signal borders on unreadable, because the noise level severely competes with it. In digital communications, this will probably cause a reduction in data speed because of frequent errors that require the source (transmitting) computer or terminal to resend some packets of data.
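For concreteness, the quoted formula in a few lines of Python (the microvolt readings are made up):

import math

def snr_db(signal_uv, noise_uv):
    # S/N = 20 * log10(Vs / Vn), both measured in microvolts
    return 20 * math.log10(signal_uv / noise_uv)

print(snr_db(200.0, 10.0))   # strong signal: about 26 dB
print(snr_db(10.0, 10.0))    # Vs == Vn: 0 dB, signal borders on unreadable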

I’m guessing the reasoning is the more threats that go unspoken, the less chaff the Secret Service has to winnow in order to uncover viable threats.

One assumes they discard physical mail with return addresses of prisons, mental hospitals, etc., or at most request notice of the release of such people from state custody.

Beyond that, they don’t appear to be too picky about credible threats, noting that in one case an unspecified “death ray” was going to be used against President Obama.

The EuroNews description of that case must be shared:

Two American men have been arrested and charged with building a remote-controlled X-ray machine intended for killing Muslims and other perceived enemies of the U.S.

Following a 15-month investigation launched in April 2012, Glenn Scott Crawford and Eric J. Feight are accused of developing the device, which the FBI has described as “mobile, remotely operated, radiation emitting and capable of killing human targets silently and from a distance with lethal doses of radiation”.

Sure, right. I will post a copy of the 67-page complaint, which uses terminology rather loosely, to say the least, in a day or so. Suffice it to say that the defendants never acquired a source for the needed radioactivity production.

On the order of having a complete nuclear bomb but not the nuclear material to make it into a nuclear bomb. You would be in more danger from the conventional explosive degrading than from the bomb as a nuclear weapon.

Those charged with defending public officials want to deter the making of threats, so as to improve the signal/noise ratio.

The goal of those attacking public officials is a signal/noise ratio of exactly 0.0.

Viewing threats from an information science perspective suggests various strategies for either side. (Another dividend of studying information science.)

*They did find a good picture of Nixon for the White House page. Doesn’t look as much like a weasel as he did in real life. Gimp/Photoshop you think?

How To Read: “War Goes Viral” (with caution, propaganda ahead)

October 17th, 2016


War Goes Viral – How social media is being weaponized across the world by Emerson T. Brooking and P. W. Singer.

One of the highlights of the post reads:

Perhaps the greatest danger in this dynamic is that, although information that goes viral holds unquestionable power, it bears no special claim to truth or accuracy. Homophily all but ensures that. A multi-university study of five years of Facebook activity, titled “The Spreading of Misinformation Online,” was recently published in Proceedings of the National Academy of Sciences. Its authors found that the likelihood of someone believing and sharing a story was determined by its coherence with their prior beliefs and the number of their friends who had already shared it—not any inherent quality of the story itself. Stories didn’t start new conversations so much as echo preexisting beliefs.

This extreme ideological segregation, the authors concluded, “comes at the expense of the quality of the information and leads to proliferation of biased narratives fomented by unsubstantiated rumors, mistrust, and paranoia.” As smartphone cameras and streaming video turn every bystander into a reporter (and everyone with an internet connection into an analyst), “truth” becomes a matter of emotional resonance.

Ooooh, “…’truth’ becomes a matter of emotional resonance.”

That is always true but give the authors their due, “War Goes Viral” is a masterful piece of propaganda to the contrary.

Calling something “propaganda,” or “media bias” is easy and commonplace.

Let’s do the hard part and illustrate why that is the case with “War Goes Viral.”

The tag line:

How social media is being weaponized across the world

preps us to think:

Someone or some group is weaponizing social media.

So before even starting the article proper, we are prepared to be on the look out for the “bad guys.”

The authors are happy to oblige with #AllEyesOnISIS, first paragraph, second sentence. “The self-styled Islamic State…” appears in the second paragraph and ISIS in the third paragraph. Not much doubt who the “bad guys” are at this point in the article.

Listing only each change of current actors, “bad guys” in red, the article from start to finish names:

  • Islamic State
  • Russia
  • Venezuela
  • China
  • U.S. Army training to combat “bad guys”
  • Israel – neutral
  • Islamic State (Hussain)

The authors leave you with little doubt who they see as the “bad guys,” a one-sided view of propaganda and social media in particular.

For example, there is:

No mention of Voice of America (VOA), perhaps one of the longest running, continuous disinformation campaigns in history.

No mention of Pentagon admits funding online propaganda war against Isis.

No mention of any number of similar projects and programs which weren’t constructed with an eye on “truth and accuracy” by the United States.

The treatment here is as one-sided as the “weaponized” social media of which the authors complain.

Not that the authors are lacking in skill. They piggyback their own slant onto The Spreading of Misinformation Online:

This extreme ideological segregation, the authors concluded, “comes at the expense of the quality of the information and leads to proliferation of biased narratives fomented by unsubstantiated rumors, mistrust, and paranoia.” As smartphone cameras and streaming video turn every bystander into a reporter (and everyone with an internet connection into an analyst), “truth” becomes a matter of emotional resonance.

How much of that is supported by The Spreading of Misinformation Online?

  • First sentence
  • Second sentence
  • Both sentences

The answer is:

This extreme ideological segregation, the authors concluded, “comes at the expense of the quality of the information and leads to proliferation of biased narratives fomented by unsubstantiated rumors, mistrust, and paranoia.”

The remainder of that paragraph was invented out of whole cloth by the authors and positioned with “truth” in quotes to piggyback on the legitimate academic work just quoted.

As smartphone cameras and streaming video turn every bystander into a reporter (and everyone with an internet connection into an analyst), “truth” becomes a matter of emotional resonance.

Is popular cant among media and academic types but no more than that.

Skilled reporting can put information in a broad context and weave a coherent narrative, but disparaging social media authors doesn’t make that any more likely.

“War Goes Viral” being a case in point.

XML Prague 2017 is coming

October 16th, 2016

XML Prague 2017 is coming by Jirka Kosek.

From the post:

I’m happy to announce that call for papers for XML Prague 2017 is finally open. We are looking forward for your interesting submissions related to XML. We have switched from CMT to EasyChair for managing submission process – we hope that new system will have less quirks for users then previous one.

We are sorry for slightly delayed start than in past years. But we have to setup new non-profit organization for running the conference and sometimes we felt like characters from Kafka’s Der Process during this process.

We are now working hard on redesigning and opening of registration. Process should be more smooth then in the past.

But these are just implementation details. XML Prague will be again three day gathering of XML geeks, users, vendors, … which we all are used to enjoy each year. I’m looking forward to meet you in Prague in February.

Conference: February 9-11, 2017.

Important Dates:

  • December 15th – End of CFP (full paper or extended abstract)
  • January 8th – Notification of acceptance/rejection of paper to authors
  • January 29th – Final paper

You can see videos of last year’s presentation (to gauge the competition): Watch videos from XML Prague 2016 on Youtube channel.

December the 15th will be here sooner than you think!

Think of it as a welcome distraction from the barn yard posturing that is U.S. election politics this year!

Why I Distrust US Intelligence Experts, Let Me Count the Ways

October 16th, 2016

Some US Intelligence failures, oldest to most recent:

  1. Pearl Harbor
  2. The Bay of Pigs Invasion
  3. Cuban Missile Crisis
  4. Vietnam
  5. Tet Offensive
  6. Yom Kippur War
  7. Iranian Revolution
  8. Soviet Invasion of Afghanistan
  9. Collapse of the Soviet Union
  10. Indian Nuclear Test
  11. 9/11 Attacks
  12. Iraq War (WMDs)
  13. Invasion of Afghanistan (US)
  14. Israeli moles in US intelligence, various dates

Those are just a few of the failures of US intelligence, some of which cost hundreds of thousands if not millions of lives.

Yet, you can read today: Trump’s refusal to accept intelligence briefing on Russia stuns experts.

There are only three reasons I can think of to accept findings by the US intelligence community:

  1. You are on their payroll and for that to continue, well, you know.
  2. As a member of the media, future tips/leaks depends upon your acceptance of current leaks. Anyone who mocks intelligence service lies is cut off from future lies.
  3. As a politician, the intelligence findings discredit facts unfavorable to you.

For completeness sake, I should mention that intelligence “experts” could be telling the truth but given their track record, it is an edge case.

Before repeating the mindless cant of “the Russians are interfering with the US election,” stop to ask your sources, “…based on what?” Opinions of all the members of the US intelligence community = one opinion. Ask for facts. No facts offered, report that instead of the common “opinion.”

Why Journalists Should Not Rely On Wikileaks Indexing – Podesta Emails

October 15th, 2016

Clinton on Fracking, or, Another Reason to Avoid Wikileaks Indexing


The quote in the tweet is false.

Politico supplies the correct quotation in its post:

“Bernie Sanders is getting lots of support from the most radical environmentalists because he’s out there every day bashing the Keystone pipeline. And, you know, I’m not into it for that,” Clinton told the unions, according to the transcript. “My view is, I want to defend natural gas. … I want to defend fracking under the right circumstances.”

I’m guessing that “…under the right circumstances.” must have pushed Wikileaks too close to the 140 character barrier.

Ditto for the Wikileaks mis-quote of: “Get a life.”

Which, as reported in the tweet, appears to refer to unbridled fracking.

Not so in the Politico post:

“I’m already at odds with the most organized and wildest” of the environmental movement, Clinton told building trades unions in September 2015, according to a transcript of the remarks apparently circulated by her aides. “They come to my rallies and they yell at me and, you know, all the rest of it. They say, ‘Will you promise never to take any fossil fuels out of the earth ever again?’ No. I won’t promise that. Get a life, you know.”

Doesn’t read quite the same way does it?

I suppose once you start lying it’s really hard to stop. Clinton is a good example of that and Wikileaks should not follow her example.

It’s hard to spot these lies because Wikileaks isn’t indexing the attachments.

You can search all day for “defend fracking,” “get a life” (by Clinton) and you will come up empty (at least as of today).

So that you don’t have to search for: 20150909 Transcript | Building Trades Union (Keystone XL) at Wikileaks – Podesta Emails, I have produced a PDF version of that attachment, Building-Trades-Union-Clinton-Sept-09-2015.pdf (my naming), for your viewing pleasure.

Green’s Dictionary of Slang [New Commercializing Information Model?]

October 14th, 2016

Green’s Dictionary of Slang

From the about page:

Green’s Dictionary of Slang is the largest historical dictionary of English slang. Written by Jonathon Green over 17 years from 1993, it reached the printed page in 2010 in a three-volume set containing nearly 100,000 entries supported by over 400,000 citations from c. ad 1000 to the present day. The main focus of the dictionary is the coverage of over 500 years of slang from c. 1500 onwards.

The printed version of the dictionary received the Dartmouth Medal for outstanding works of reference from the American Library Association in 2012; fellow recipients include the Dictionary of American Regional English, the Oxford Dictionary of National Biography, and the New Grove Dictionary of Music and Musicians. It has been hailed by the American New York Times as ‘the pièce de résistance of English slang studies’ and by the British Sunday Times as ‘a stupendous achievement, in range, meticulous scholarship, and not least entertainment value’.

On this website the dictionary is now available in updated online form for the first time, complete with advanced search tools enabling search by definition and history, and an expanded bibliography of slang sources from the early modern period to the present day. Since the print edition, nearly 60,000 quotations have been added, supporting 5,000 new senses in 2,500 new entries and sub-entries, of which around half are new slang terms from the last five years.

Green’s Dictionary of Slang has an interesting commercial model.

You can search for any word, freely, but “more search features” requires a subscription:

By subscribing to Green’s Dictionary of Slang Online, you gain access to advanced search tools (including the ability to search for words by meaning, history, and usage), full historical citations in each entry, and a bibliography of over 9,000 slang sources.

Current rate for individuals is £ 49 (or about $59.96).

In addition to being a fascinating collection of information, is the free/commercial split here of interest?

An alternative to:

The Teaser Model

Contrast the Oxford Music Online:

Grove Music Online is the eighth edition of Grove’s Dictionary of Music and Musicians, and contains articles commissioned specifically for the site as well as articles from New Grove 2001, Grove Opera, and Grove Jazz. The recently published second editions of The Grove Dictionary of American Music and The Grove Dictionary of Musical Instruments are still being put online, and new articles are added to GMO with each site update.

Oh, Oxford Music Online isn’t all pay-per-view.

It offers the following thirteen (13) articles for free viewing:

Sotiria Bellou, Greek singer of rebetiko song, famous for the special quality and register of her voice

Cell [Mobile] Phone Orchestra, ensemble of performers using programmable mobile (cellular) phones

Crete, largest and most populous of the Greek islands

Lyuba Encheva, Bulgarian pianist and teacher

Gaaw, generic term for drums, and specifically the frame drum, of the Tlingit and Haida peoples of Alaska

Johanna Kinkel, German composer, writer, pianist, music teacher, and conductor

Lady’s Glove Controller, modified glove that can control sound, mechanical devices, and lights

Outsider music, a loosely related set of recordings that do not fit well within any pre-existing generic framework

Peter (Joshua) Sculthorpe, Australian composer, seen by the Australian musical public as the most nationally representative.

Slovenia, country in southern Central Europe

Sound art, a term encompassing a variety of art forms that utilize sound, or comment on auditory cultures

Alice (Bigelow) Tully, American singer and music philanthropist

Wars in Iraq and Afghanistan, soldiers’ relationship with music is largely shaped by contemporary audio technology

Hmmm, 160,000 slang terms for free from Green’s Dictionary of Slang versus 13 free articles from Oxford Music Online.

Show of hands for the teaser model of Oxford Music Online?

The Consumer As Product

You are aware that casual web browsing and alleged “free” sites are not just supported by ads, but by the information they collect on you?

Consider this rather boastful touting of information collection capabilities:

To collect online data, we use our native tracking tags as experience has shown that other methods require a great deal of time, effort and cost on both ends and almost never yield satisfactory coverage or results since they depend on data provided by third parties or compiled by humans (!!), without being able to verify the quality of the information. We have a simple universal server-side tag that works with most tag managers. Collecting offline marketing data is a bit trickier. For TV and radio, we will work with your offline advertising agency to collect post-log reports on a weekly basis, transmitted to a secure FTP. Typical parameters include flight and cost, date/time stamp, network, program, creative length, time of spot, GRP, etc.

Convertro is also able to collect other type of offline data, such as in-store sales, phone orders or catalog feeds. Our most popular proprietary solution involves placing a view pixel within a confirmation email. This makes it possible for our customers to tie these users to prior online activity without sharing private user information with us. For some customers, we are able to match almost 100% of offline sales. Other customers that have different conversion data can feed them into our system and match it to online activity by partnering with LiveRamp. These matches usually have a success rate between 30%-50%. Phone orders are tracked by utilizing a smart combination of our in-house approach, the inputting of special codes, or by third party vendors such as Mongoose and ResponseTap.

You don’t have to be on the web, you can be tracked “in-store,” on the phone, etc.

Convertro doesn’t explicitly mention “supercookies,” for which Verizon just paid a $1.35 million fine. From the post:

“Supercookies,” known officially as unique identifier headers [UIDH], are short-term serial numbers used by corporations to track customer data for advertising purposes. According to Jacob Hoffman-Andrews, a technologist with the Electronic Frontier Foundation, these cookies can be read by any web server one visits and used to build individual profiles of internet habits. These cookies are hard to detect, and even harder to get rid of.

If any of that sounds objectionable to you, remember that to be valuable, user habits must be tracked.

That is if you find the idea of being a product acceptable.

The Green’s Dictionary of Slang offers an economic model that enables free access to casual users, kids writing book reports, journalists, etc., while at the same time creating a value-add that power users will pay for.

Other examples of value-add models with free access to the core information?

What would that look like for the Podesta emails?

Becoming a Data Scientist: Advice From My Podcast Guests

October 13th, 2016

Becoming a Data Scientist: Advice From My Podcast Guests

Out-gassing from political candidates has kept pushing this summary by Renée Teate back in my queue. Well, fixing that today!

Renée has created more data science resources than I can easily mention, so in addition to this guide, I will mention only two:

Data Science Renee @BecomingDataSci, a Twitter account that will soon break into the rarefied air of > 10,000 followers. Not yet, but you may be the one that puts her over the top!

Looking for women to speak at data science conferences? Renée maintains Women in Data Science, which today has 815 members.

Sorry, three, her blog: Becoming a Data Scientist.

That should keep you busy/distracted until the political noise subsides. ;-)

Obama on Fixing Government with Technology (sigh)

October 13th, 2016

Obama on Fixing Government with Technology by Caitlin Fairchild.

Like any true technology cultist, President Obama mentions technology and inefficiency, but never the people who make up government, as the source of government “problems.” Nor does he appear to realize that technology cannot fix the people who make up government.

Those out-dated information systems he alludes to were built and are maintained under contract with vendors. Systems that are used by users who are accustomed to those systems and will resist changing to others. Still other systems rely upon those systems being as they are in terms of work flow. And so on. At its very core, the problem of government isn’t technology.

It’s the twin requirement that it be composed of and supplied by people, all of whom have a vested interest in and comfort level with the technology they use and, don’t forget, government has to operate 24/7, 365 days out of the year.

There is no time to take down part of the government to develop new technology, train users in its use and at the same time, run all the current systems which are, to some degree, meeting current requirements.

As an antidote to the technology cultism that infects President Obama and his administration, consider reading Geek Heresy, the description of which reads:

In 2004, Kentaro Toyama, an award-winning computer scientist, moved to India to start a new research group for Microsoft. Its mission: to explore novel technological solutions to the world’s persistent social problems. Together with his team, he invented electronic devices for under-resourced urban schools and developed digital platforms for remote agrarian communities. But after a decade of designing technologies for humanitarian causes, Toyama concluded that no technology, however dazzling, could cause social change on its own.

Technologists and policy-makers love to boast about modern innovation, and in their excitement, they exuberantly tout technology’s boon to society. But what have our gadgets actually accomplished? Over the last four decades, America saw an explosion of new technologies – from the Internet to the iPhone, from Google to Facebook – but in that same period, the rate of poverty stagnated at a stubborn 13%, only to rise in the recent recession. So, a golden age of innovation in the world’s most advanced country did nothing for our most prominent social ill.

Toyama’s warning resounds: Don’t believe the hype! Technology is never the main driver of social progress. Geek Heresy inoculates us against the glib rhetoric of tech utopians by revealing that technology is only an amplifier of human conditions. By telling the moving stories of extraordinary people like Patrick Awuah, a Microsoft millionaire who left his lucrative engineering job to open Ghana’s first liberal arts university, and Tara Sreenivasa, a graduate of a remarkable South Indian school that takes children from dollar-a-day families into the high-tech offices of Goldman Sachs and Mercedes-Benz, Toyama shows that even in a world steeped in technology, social challenges are best met with deeply social solutions.

Government is a social problem and to reach for a technology fix first, is a guarantee of yet another government failure.

IBM’s Program Of Security Via Obscurity (Censorship)

October 13th, 2016

Before today, my response to the question: “Does IBM promote security through obscurity?” would have been no!

Today? Full Disclosure @SecLists posted this tweet:


A working version of the URL:

I don’t suppose better software engineering practices and/or rapid repair of IBM’s software occurred to anyone?

George Carlin’s Seven Dirty Words in Podesta Emails – Discovered 981 Unindexed Documents

October 13th, 2016

While taking a break from serious crunching of the Podesta emails I discovered 981 unindexed documents at Wikileaks!

Try searching for Carlin’s seven dirty words at The Podesta Emails:

  • shit – 44
  • piss – 19
  • fuck – 13
  • cunt – 0
  • cocksucker – 0
  • motherfucker – 0 (?)
  • tits – 0

I have a ? after “motherfucker” because working with the raw files I show one (1) hit for “motherfucker” and one (1) hit for “motherfucking.” Separate emails.

For “motherfucker,” American Sniper–the movie, responded to by Chris Hedges – To: Podesta@Law.Georgetown.Edu

For “motherfucking,” H4A News Clips 5.31.15 – From/To:

“Motherfucker” and “motherfucking” occur in text attachments to emails, which Wikileaks does not search.
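The raw-file counts above came from a pass over the local .eml files. A minimal sketch of that kind of search, in the same Python 2 style as the script in the first post above, walks each message and its text parts (the search terms and directory layout are illustrative; binary attachments such as .docx or .pdf need separate text extraction and are not handled here):

import email
import glob

terms = ["motherfucker", "motherfucking"]
counts = dict((t, 0) for t in terms)

for name in glob.glob('*.eml'):
    f = open(name, 'r')
    msg = email.message_from_string(f.read())
    f.close()

    # gather the decoded text/* parts of the message, attachments included
    text = []
    for part in msg.walk():
        if part.get_content_maintype() == 'text':
            text.append((part.get_payload(decode=True) or "").lower())
    text = " ".join(text)

    for t in terms:
        if t in text:
            counts[t] += 1                  # number of emails containing the term

for t in terms:
    print(t + " " + str(counts[t]))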

If you do a blank search for file attachments, Wikileaks reports there are 2427 file attachments.

Searching the Podesta emails at Wikileaks excludes the contents of 2427 files from your search results.

How significant is that?

Hmmm, 302 pdf, 501 docx, 167 doc, 12 xls, 9 xlsx – 981 documents excluded from your searches at Wikileaks.

For 9,011 emails, as of this morning, my local time.

How comfortable are you with not searching those 981 documents? (Or additional documents that may follow?)
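If you want to reproduce the attachment tally from a bulk download, here is a minimal sketch that counts attachment file names by extension (same assumptions as above: Python 2, .eml files in the current directory):

import email
import glob
import os

counts = {}

for name in glob.glob('*.eml'):
    f = open(name, 'r')
    msg = email.message_from_string(f.read())
    f.close()

    for part in msg.walk():
        filename = part.get_filename()
        if not filename:
            continue                        # not a named attachment
        ext = os.path.splitext(filename)[1].lower().lstrip('.')
        counts[ext] = counts.get(ext, 0) + 1

for ext in sorted(counts, key=counts.get, reverse=True):
    print(ext + " " + str(counts[ext]))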

How-To Spot An Armchair Jihadist

October 12th, 2016

To efficiently use law enforcement resources against threats to civil order, the police must recognize the difference between an actual jihadist and an armchair jihadist.

An armchair jihadist is one that talks a good game, dreams of raining fire and death on infidels, etc., but in truth, is the Walter Mitty of terrorism.

Unfortunately, law enforcement disproportionately captures armchair jihadists, for example, the arrest of Samata Ullah, who was charged in part with possession of:

…a book about guided missiles and a PDF version of a book about advanced missile guidance and control for a purpose connected with the commission, preparation or instigation of terrorism”

Admitting the romanticism of building one’s own arsenal, how successful do you think an individual or even a large group of individuals would be at building and testing a guided missile?

Here’s a broad outline of the major steps to building a laser guided missile:

The Manufacturing Process

Constructing the body and attaching the fins

1 The steel or aluminum body is die cast in halves. Die casting involves pouring molten metal into a steel die of the desired shape and letting the metal harden. As it cools, the metal assumes the same shape as the die. At this time, an optional chromium coating can be applied to the interior surfaces of the halves that correspond to a completed missile’s cavity. The halves are then welded together, and nozzles are added at the tail end of the body after it has been welded.

2 Moveable fins are now added at predetermined points along the missile body. The fins can be attached to mechanical joints that are then welded to the outside of the body, or they can be inserted into recesses purposely milled into the body.

Casting the propellant

3 The propellant must be carefully applied to the missile cavity in order to ensure a uniform coating, as any irregularities will result in an unreliable burning rate, which in turn detracts from the performance of the missile. The best means of achieving a uniform coating is to apply the propellant by using centrifugal force. This application, called casting, is done in an industrial centrifuge that is well-shielded and situated in an isolated location as a precaution against fire or explosion.

Assembling the guidance system

4 The principal laser components—the photo detecting sensor and optical filters—are assembled in a series of operations that are separate from the rest of the missile’s construction. Circuits that support the laser system are then soldered onto pre-printed boards; extra attention is given to optical materials at this time to protect them from excessive heat, as this can alter the wavelength of light that the missile will be able to detect. The assembled laser subsystem is now set aside pending final assembly. The circuit boards for the electronics suite are also assembled independently from the rest of the missile. If called for by the design, microchips are added to the boards at this time.

5 The guidance system (laser components plus the electronics suite) can now be integrated by linking the requisite circuit boards and inserting the entire assembly into the missile body through an access panel. The missile’s control surfaces are then linked with the guidance system by a series of relay wires, also entered into the missile body via access panels. The photo detecting sensor and its housing, however, are added at this point only for beam riding missiles, in which case the housing is carefully bolted to the exterior diameter of the missile near its rear, facing backward to interpret the laser signals from the parent aircraft.

Final assembly

6 Insertion of the warhead constitutes the final assembly phase of guided missile construction. Great care must be exercised during this process, as mistakes can lead to catastrophic accidents. Simple fastening techniques such as bolting or riveting serve to attach the warhead without risking safety hazards. For guidance systems that home-in on reflected laser light, the photo detecting sensor (in its housing) is bolted into place at the tip of the warhead. On completion of this final phase of assembly, the manufacturer has successfully constructed one of the most complicated, sophisticated, and potentially dangerous pieces of hardware in use today.

Quality Control

Each important component is subjected to rigorous quality control tests prior to assembly. First, the propellant must pass a test in which examiners ignite a sample of the propellant under conditions simulating the flight of a missile. The next test is a wind tunnel exercise involving a model of the missile body. This test evaluates the air flow around the missile during its flight. Additionally, a few missiles set aside for test purposes are fired to test flight characteristics. Further work involves putting the electronics suite through a series of tests to determine the speed and accuracy with which commands get passed along to the missile’s control surfaces. Then the laser components are tested for reliability, and a test beam is fired to allow examiners to record the photo detecting sensor’s ability to “read” the proper wavelength. Finally, a set number of completed guided missiles are test fired from aircraft or helicopters on ranges studded with practice targets.

Did Samata Ullah have the expertise and/or access to the expertise or manufacturing capability for any of those steps?

Moreover, could Samata Ullah have tested and developed a guided missile without someone noticing?

Possession of first-principles reading material, such as chemistry, rocketry, or missile manuals and guides, is a clear sign an alleged jihadist is an armchair jihadist.

Another sign of an armchair jihadist, along with the possession of such reading materials, is their failure to obtain explosives, weapons, etc., in an effective way.

The United States, via the CIA and the US military, routinely distributes explosives and weapons around the world to various factions.

A serious jihadist need only travel to well known locations and get in line for explosives, RPGs (rocket-propelled grenades), mortars, etc.

Does the weapon in this photo look homemade?


Of course not! Anyone with a passport and a little imagination can possess a wide variety of harmful devices.

But then, they are not an armchair jihadist.

DIY missile/explosive reading clubs of jihadists are not threats to the public. Manufacturing explosives and missiles is difficult and dangerous, a task best left to professionals. Club members are more dangerous to each other than to the general public.

When allocating law enforcement resources, remember that the only thing easier to acquire than weapons is possibly marijuana. Anyone planning on building weapons can be ignored as an armchair jihadist.

In the United States and the United Kingdom, law enforcement resources would be better spent in the pursuit of wealthy and governmental pedophiles.

PS: I started to trim the steps for building a guided missile for length, but the full description highlights the absurdity of the charges in question. Melting steel or aluminum and pouring it into a metal die? Please, that’s not a backyard activity. Neither is casting rocket propellant in an industrial centrifuge.

British and Irish Legal Information Institute

October 11th, 2016

British and Irish Legal Information Institute

From the webpage:

Welcome to BAILII, where you can find British and Irish case law & legislation, European Union case law, Law Commission reports, and other law-related British and Irish material. BAILII thanks The Scottish Council of Law Reporting for their assistance in establishing the Historic Scottish Law Reports project. BAILII also thanks Sentral for provision of servers. For more information, see About BAILII.

I ran across this wonderful legal resource while researching a legal issue in another post.

Obviously a great resource for legal researchers and scholars, but I suspect also a great source of leisure reading, well, if you like that sort of thing.

The site also offered this handy list of world law resources:

When I said “leisure reading,” I was only partially joking. What we accept now as “the law” wasn’t always so.

The history of how rights and obligations have evolved over centuries of human interaction is recorded in legislation and case law.

It is a history with all the mis-steps, failures, betrayals and intrigue that are commonplace in any human enterprise.


Parsing Foreign Law From News Reports (Warning For Journalists)

October 11th, 2016

Cory Doctorow’s headline, “Scotland Yard charge: teaching people to use crypto is an act of terrorism,” red-lined my anti-government biases.

I tend towards “unsound” reactions when free speech is being infringed upon.

But my alarm, and perhaps yours as well, was needlessly provoked in this case.

Cory writes:

In other words, according to Scotland Yard, serving a site over HTTPS (as this one is) and teaching people to use crypto (as this site has done) and possessing a secure OS (as I do) are acts of terrorism or potential acts of terrorism. In some of the charges, the police have explicitly connected these charges with planning an act of terrorism, but in at least one of the charges (operating a site served over HTTPS and teaching people about crypto) the charge lacks this addendum — the mere act is considered worthy of terrorism charges.

The concern over:

but in at least one of the charges (operating a site served over HTTPS and teaching people about crypto) the charge lacks this addendum — the mere act is considered worthy of terrorism charges.

is misplaced.

Cory points to the original report here: Man arrested on Cardiff street to face six terror charges by Vikram Dodd.

Cory’s alarm is not repeated by Dodd:

Ullah has been charged with directing terrorism, providing training in encryption programs knowing the purpose was for terrorism, and using his blog site to provide such training. His activities are alleged to have “the intention of assisting another or others to commit acts of terrorism”.

Beyond that (I haven’t seen the charging document), be aware that under English Criminal Procedure, the “charge” on which Cory places so much weight is defined as:


Pay particular attention to 7.3(1)(a)(i) (page 65):

…describes the offense in ordinary language, and…

A “charge” isn’t a technical specification of an offense under English criminal procedure, which means you attach legal significance to charging language at your own peril, and to the detriment of your readers.

PS: I have contacted the Westminster Magistrates’ Court and requested a copy of the charging document. If and when that arrives, I will update this post with it.

Bias in Data Collection: A UK Example

October 10th, 2016

Kelly Fiveash’s story, UK’s chief troll hunter targets doxxing, virtual mobbing, and nasty images, starts off:

Trolls who hurl abuse at others online using techniques such as doxxing, baiting, and virtual mobbing could face jail, the UK’s top prosecutor has warned.

New guidelines have been released by the Crown Prosecution Service to help cops in England and Wales determine whether charges—under part 2, section 44 of the 2007 Serious Crime Act—should be brought against people who use social media to encourage others to harass folk online.

It even includes “encouraging” statistics:

According to the most recent publicly available figures—which cite data between May 2013 and December 2014—1,850 people were found guilty in England and Wales of offences under section 127 of the Communications Act 2003. But the numbers reveal a steady climb in charges against trolls. In 2007, there were a total of 498 defendants found guilty under section 127 in England and Wales, compared with 693 in 2008, 873 in 2009, 1,186 in 2010 and 1,286 in 2011.

But “the most recent publicly available figures” don’t ring true, do they?

Imagine that: 1,850 trolls out of a combined England and Wales population of 57 million. (England 53.9 million, Wales 3.1 million, mid-2013.)


Let’s look at the referenced government data, 25015 Table.xls.

For the months of May 2013 to December 2014, there are only monthly totals of convictions.
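
To check that for yourself, here is a minimal pandas sketch; the sheet layout and column headers are assumptions until you open the file (reading .xls files requires the xlrd engine):

import pandas as pd

# Peek at every sheet in the Ministry of Justice spreadsheet
sheets = pd.read_excel("25015 Table.xls", sheet_name=None)

for name, df in sheets.items():
    print(name)
    print(df.columns.tolist())  # expect monthly conviction totals, nothing case-level
    print(df.head())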

What data is not being collected?

Among other things:

  1. Offenses reported to law enforcement
  2. Offenses investigated by law enforcement (not the same as #1)
  3. Conduct in question
  4. Relationship, if any, between the alleged offender/victim
  5. Race, economic status, location, social connections of alleged offender/victim
  6. Law enforcement and/or prosecutors involved
  7. Disposition of cases without charges being brought
  8. Disposition of cases after charges brought but before trial
  9. Charges dismissed by courts and acquittals
  10. Judges who try and/or dismiss charges
  11. Penalties imposed upon guilty plea and/or conviction
  12. Appeals and results on appeal, judges, etc.

All that information exists for every reported case of “trolls,” and is recorded at some point in the criminal justice process or could be discerned from those records.

Can you guess who isn’t collecting that information?

The TheyWorkForYou site, at Communications Act 2003, reports Jeremy Wright, the Parliamentary Under-Secretary of State for Justice, as saying:

The Ministry of Justice Court Proceedings Database holds information on defendants proceeded against, found guilty and sentenced for criminal offences in England and Wales. This database holds information on offences provided by the statutes under which proceedings are brought but not the specific circumstances of each case. It is not possible to separately identify, in all cases brought under section 127 of the Communications Act 2003, whether a defendant sent or caused to send information to an individual or a small group of individuals or made the information widely available to the public. This detailed information may be held by the courts on individual case files which due to their size and complexity are not reported to Justice Analytical Services. As such this information can be obtained only at disproportionate cost.
… (emphasis added)

I was unaware that courts in England and Wales were still recording their proceedings on vellum. That would make it expensive to gather that data together manually. (NOT!)

How difficult would it be for any policy organization, whether seeking greater protection from trolls or opposing classes of prosecution on discrimination and free speech grounds, to gather the same data?

Here is a map of the Crown Prosecution Service districts:


Counting the sub-offices in each area, I get forty-three separate offices.

But those offices see only the cases considered for prosecution, which is unlikely to be the same number as the cases reported to the police.

Checking for police districts in England, I get thirty-nine.


Plus, another four areas for Wales:


The Wikipedia article List of law enforcement agencies in the United Kingdom, Crown dependencies and British Overseas Territories has links for all these police areas, which in the interest of space, I did not repeat here.

I wasn’t able to quickly find a map of English criminal courts, although you can locate them by postcode at: Find the right court or tribunal. My suspicion is that Crown Prosecution Service areas correspond to criminal courts. But verify that for yourself.

In order to collect the information already in the possession of the government, you would have to search records in 43 police districts, 43 Crown Prosecution Service offices, plus as many as 43 criminal courts in which defendants may be prosecuted. All over England and Wales. With unhelpful clerks all along the way.

All while the government offers the classic excuse:

As such this information can be obtained only at disproportionate cost.

Disproportionate because:

Abuse of discretion, lax enforcement, favoritism, and discrimination by police officers, Crown prosecutors, and judges could be demonstrated as statistical facts?

Governments are old hands at not collecting evidence they prefer to not see thrown back in their faces.

For example: FBI director calls lack of data on police shootings ‘ridiculous,’ ‘embarrassing’.

Non-collection of data is a source of bias.

What bias is behind the failure to collect troll data in the UK?

When 24 GB Of Physical RAM Pegs At 98% And Stays There

October 9th, 2016

Don’t panic! It has a happy ending but I’m too tired to write it up for posting today.

Tune in tomorrow for lessons learned on FOIA answers that don’t set the information free.

Chasing File Names – Check My Work

October 8th, 2016

I encountered a stream of tweets of which the following are typical:


Hmmm, is cf.7z a different set of files from ebd-cf.7z?

You could “eye-ball” the directory listings but that is tedious and error-prone.

Building on what we saw in Guccifer 2.0’s October 3rd 2016 Data Drop – Old News? (7 Duplicates out of 2085 files), let’s combine cf-7z-file-Sorted-Uniq.txt and ebd-cf-file-Sorted-Uniq.txt and sort the result into cf-7z-and-ebd-cf-files-Sorted.txt.


uniq -d cf-7z-and-ebd-cf-files-Sorted.txt | wc -l

Running uniq with “-d” (report only duplicate lines) on the combined file and piping the output into wc -l gives 2177 duplicates. (The total length of the file is 4354 lines.)


uniq -u cf-7z-and-ebd-cf-files-Sorted.txt

Running uniq with “-u” (report only unique lines) on the same file gives no output, i.e., no unique lines.

With experience, you will be able to check very large file archives for duplicates. In this particular case, despite circulating under different names, these two archives appear to contain the same files.
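
If you prefer to stay in Python, set arithmetic gives the same answer as the sort/uniq pipeline, assuming the two sorted-unique listings are in the current directory:

# Compare the two file listings with Python sets instead of sort/uniq
with open("cf-7z-file-Sorted-Uniq.txt") as f:
    cf = set(line.strip() for line in f)
with open("ebd-cf-file-Sorted-Uniq.txt") as f:
    ebd = set(line.strip() for line in f)

print("in both archives:", len(cf & ebd))
print("only in cf.7z:", len(cf - ebd))
print("only in ebd-cf.7z:", len(ebd - cf))

If the two “only in” counts come back zero, the two archives hold the same file names.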

BTW, do you think a similar technique could be applied to spreadsheets?