Fast search of thousands of short-read sequencing experiments [NEW! Sequence Bloom Tree]

February 8th, 2016

Fast search of thousands of short-read sequencing experiments by Brad Solomon & Carl Kingsford.

Abstract from the “official” version at Nature Biotechnology (2016):

The amount of sequence information in public repositories is growing at a rapid rate. Although these data are likely to contain clinically important information that has not yet been uncovered, our ability to effectively mine these repositories is limited. Here we introduce Sequence Bloom Trees (SBTs), a method for querying thousands of short-read sequencing experiments by sequence, 162 times faster than existing approaches. The approach searches large data archives for all experiments that involve a given sequence. We use SBTs to search 2,652 human blood, breast and brain RNA-seq experiments for all 214,293 known transcripts in under 4 days using less than 239 MB of RAM and a single CPU. Searching sequence archives at this scale and in this time frame is currently not possible using existing tools.

That will set you back $32 for the full text and PDF.

Or, you can try the unofficial version:


Enormous databases of short-read RNA-seq sequencing experiments such as the NIH Sequence Read Archive (SRA) are now available. However, these collections remain difficult to use due to the inability to search for a particular expressed sequence. A natural question is which of these experiments contain sequences that indicate the expression of a particular sequence such as a gene isoform, lncRNA, or uORF. However, at present this is a computationally demanding question at the scale of these databases.

We introduce an indexing scheme, the Sequence Bloom Tree (SBT), to support sequence-based querying of terabase-scale collections of thousands of short-read sequencing experiments. We apply SBT to the problem of finding conditions under which query transcripts are expressed. Our experiments are conducted on a set of 2652 publicly available RNA-seq experiments contained in the NIH for the breast, blood, and brain tissues, comprising 5 terabytes of sequence. SBTs of this size can be queried for a 1000 nt sequence in 19 minutes using less than 300 MB of RAM, over 100 times faster than standard usage of SRA-BLAST and 119 times faster than STAR. SBTs allow for fast identification of experiments with expressed novel isoforms, even if these isoforms were unknown at the time the SBT was built. We also provide some theoretical guidance about appropriate parameter selection in SBT and propose a sampling-based scheme for potentially scaling SBT to even larger collections of files. While SBT can handle any set of reads, we demonstrate the effectiveness of SBT by searching a large collection of blood, brain, and breast RNA-seq files for all 214,293 known human transcripts to identify tissue-specific transcripts.

The implementation used in the experiments below is in C++ and is available as open source at∼ckingsf/software/bloomtree.
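
For readers new to the idea, here is a toy Python sketch of how a tree of Bloom filters can answer "which experiments contain this sequence?" It is not the authors' C++ implementation. The k-mer length, filter size, hash count, the 0.8 match threshold, the simple pairwise tree construction and the experiment names are all assumptions made for illustration.

```python
import hashlib

K = 5            # k-mer length (toy value, assumed)
M = 1024         # bits per Bloom filter (toy value, assumed)
NUM_HASHES = 3   # hash functions per k-mer (assumed)
THETA = 0.8      # fraction of query k-mers that must be present (assumed)


def kmers(seq, k=K):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bit_positions(kmer):
    """Map a k-mer to NUM_HASHES bit positions in a filter of M bits."""
    positions = []
    for seed in range(NUM_HASHES):
        digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
        positions.append(int.from_bytes(digest[:8], "big") % M)
    return positions


class Node:
    def __init__(self, bits, name=None, left=None, right=None):
        self.bits = bits      # Bloom filter stored as an int bitmask
        self.name = name      # experiment name for leaves, None otherwise
        self.left = left
        self.right = right

    def contains(self, kmer):
        """True if every bit for this k-mer is set (may be a false positive)."""
        return all((self.bits >> p) & 1 for p in bit_positions(kmer))


def make_leaf(name, reads):
    """Build a leaf whose filter holds every k-mer of one experiment's reads."""
    bits = 0
    for read in reads:
        for kmer in kmers(read):
            for p in bit_positions(kmer):
                bits |= 1 << p
    return Node(bits, name=name)


def build_tree(leaves):
    """Pair nodes bottom-up; each parent filter is the OR of its children."""
    level = list(leaves)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                a, b = level[i], level[i + 1]
                nxt.append(Node(a.bits | b.bits, left=a, right=b))
            else:
                nxt.append(level[i])   # odd node out moves up unchanged
        level = nxt
    return level[0]


def query(node, query_kmers, theta=THETA):
    """Return names of leaves holding at least theta of the query k-mers."""
    hits = sum(1 for km in query_kmers if node.contains(km))
    if hits < theta * len(query_kmers):
        return []                      # prune this whole subtree
    if node.name is not None:
        return [node.name]
    return (query(node.left, query_kmers, theta)
            + query(node.right, query_kmers, theta))


if __name__ == "__main__":
    # Made-up reads standing in for three experiments.
    leaves = [
        make_leaf("exp_a", ["ACGTACGTAGCTAGCTAGGA"]),
        make_leaf("exp_b", ["TTTTGGGGCCCCAAAATTTT"]),
        make_leaf("exp_c", ["GATTACAGATTACAGATTACA"]),
    ]
    root = build_tree(leaves)
    print(query(root, kmers("ACGTACGTAGCTAG")))
```

The point of the tree is the pruning step: a subtree whose combined filter lacks enough of the query's k-mers is never descended into, so most experiments are never examined. Bloom filters can report false positives but not false negatives, so a real hit is never pruned.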

You will probably be interested in review comments by C. Titus Brown, Thoughts on Sequence Bloom Trees.

As of today, the exact string “Sequence Bloom Tree” gathers only 207 “hits” so the literature is still small enough to be read.

Don’t delay overlong pursuing this new search technique!

I first saw this in a tweet by Stephen Turner.

The 2016 cyber security roadmap – [Progress on Security B/C of Ransomware?]

February 8th, 2016

The 2016 cyber security roadmap by Chloe Green.

From the post:

2014 was heralded as the ‘year of the data breach’ – but we’d seen nothing yet. From unprecedented data theft to crippling hacktivism attacks and highly targeted state-sponsored hacks, 2015 has been the bleakest year yet for the cyber security of businesses and organisations.

High profile breaches at Ashley Madison, TalkTalk and JD Wetherspoons have brought the protection of personal and enterprise data into the public consciousness.

In the war against cybercrime, companies are facing off against ever more sophisticated and crafty approaches, while the customer data they hold grows in value, and those that fail to protect it find themselves increasingly in the media and legislative spotlight with nowhere to hide.

We asked a panel of leading industry experts to highlight the major themes for enterprise cyber security in 2016 and beyond.

There isn’t a lot of comfort coming from industry experts these days. Some advice on mitigating strategies and a warning that ransomware is about to come into its own in 2016. I believe the phrase was “…corporate and not consumer rates…” for ransoms.

A surge in ransomware may be a good thing for the software industry. It would fix a cost for insecure software and practices.

When ransomware extracts commercially unacceptable costs from users of software, users will demand better software from developers.

Financial incentives all the way around. Incentives for hackers to widely deploy ransomware, incentives for software users to watch their bottom line and last but not least, incentives for developers to implement more robust testing and development processes.

Ransomware may do what reams of turgid prose in journals, conference presentations, books and classrooms have failed to do. Ransomware can create financial incentives for software users to demand better software engineering and testing. Not to mention liability for defects in software.

Faced with financial demands, the software industry will be forced to adopt better software development processes. Those unable to produce sufficiently secure (no software being perfect) software will collapse under the weight of falling sales or liability litigation.

Hackers will be forced to respond to improvement in software quality, for their own financial gain, creating a virtuous circle of improving software security.

A Gentle Introduction to Category Theory (Feb 2016 version)

February 8th, 2016

A Gentle Introduction to Category Theory (Feb 2016 version) by Peter Smith.

From the preface:

This Gentle Introduction is work in progress, developing my earlier ‘Notes on Basic Category Theory’ (2014–15).

The gadgets of basic category theory fit together rather beautifully in multiple ways. Their intricate interconnections mean, however, that there isn’t a single best route into the theory. Different lecture courses, different books, can quite appropriately take topics in very different orders, all illuminating in their different ways. In the earlier Notes, I roughly followed the order of somewhat over half of the Cambridge Part III course in category theory, as given in 2014 by Rory Lucyshyn-Wright (broadly following a pattern set by Peter Johnstone; see also Julia Goedecke’s notes from 2013). We now proceed rather differently. The Cambridge ordering certainly has its rationale; but the alternative ordering I now follow has in some respects a greater logical appeal. Which is one reason for the rewrite.

Our topics, again in different arrangements, are also covered in (for example) Awodey’s good but uneven Category Theory and in Tom Leinster’s terrific – and appropriately titled – Basic Category Theory. But then, if there are some rightly admired texts out there, not to mention various sets of notes on category theory available online (see here), why produce another introduction to category theory?

I didn’t intend to! My goal all along has been to get to understand what light category theory throws on logic, set theory, and the foundations of mathematics. But I realized that I needed to get a lot more securely on top of basic category theory if I was eventually to pursue these more philosophical issues. So my earlier Notes began life as detailed jottings for myself, to help really fix ideas: and then – as can happen – the writing has simply taken on its own momentum. I am still concentrating mostly on getting the technicalities right and presenting them in a pleasing order: I hope later versions will contain more motivational/conceptual material.

What remains distinctive about this Gentle Introduction, for good or ill, is that it is written by someone who doesn’t pretend to be an expert who usually operates at the very frontiers of research in category theory. I do hope, however, that this makes me rather more attuned to the likely needs of (at least some) beginners. I go rather slowly over ideas that once gave me pause, spend more time than is always usual in motivating key ideas and constructions, and I have generally aimed to be as clear as possible (also, I assume rather less background mathematics than Leinster or even Awodey). We don’t get terribly far: however, I hope that what is here may prove useful to others starting to get to grips with category theory. My own experience certainly suggests that initially taking things at a rather gentle pace as you work into a familiarity with categorial ways of thinking makes later adventures exploring beyond the basics so very much more manageable.

Check the Category Theory – Reading List, also by Peter Smith, to make sure you have the latest version of this work.

Be an active reader!

If you spot issues with the text:

Corrections, please, to ps218 at cam dot ac dot uk.

At the category theory reading page Peter mentions having retired after forty years in academia.

Writing an introduction to category theory! What a great way to spend retirement!

(Well, different people have different tastes.)

International Conference on Learning Representations – Accepted Papers

February 8th, 2016

International Conference on Learning Representations – Accepted Papers

From the conference overview:

It is well understood that the performance of machine learning methods is heavily dependent on the choice of data representation (or features) on which they are applied. The rapidly developing field of representation learning is concerned with questions surrounding how we can best learn meaningful and useful representations of data. We take a broad view of the field, and include in it topics such as deep learning and feature learning, metric learning, kernel learning, compositional models, non-linear structured prediction, and issues regarding non-convex optimization.

Despite the importance of representation learning to machine learning and to application areas such as vision, speech, audio and NLP, there was no venue for researchers who share a common interest in this topic. The goal of ICLR has been to help fill this void.

That should give you an idea of the range of data representations/features that you will encounter in the eighty (80) papers accepted for the conference.

ICLR 2016 will be held May 2-4, 2016 in the Caribe Hilton, San Juan, Puerto Rico.

Time to review How To Read A Paper!


I first saw this in a tweet by Hugo Larochelle.

Governments Race To Bottom On Privacy Rights

February 8th, 2016

British spies want to be able to suck data out of US Internet giants by Cory Doctorow.

Cory points out a recent US/UK agreement subjects U.S. citizens to surveillance under British laws that no one understands and that don’t require even a fig leaf of judicial approval.

The people of the United States fought one war to free themselves of arbitrary and capricious British rule. Declaration of Independence.

Is the stage being set for a war to enforce the constitution that resulted from the last war the United States waged against the UK?

Data from the World Health Organization API

February 8th, 2016

Data from the World Health Organization API by Peter’s stats stuff – R.

From the post:

Eric Persson released yesterday a new WHO R package which allows easy access to the World Health Organization’s data API. He’s also done a nice vignette introducing its use.

I had a play and found it was easy access to some interesting data. Some time down the track I might do a comparison of this with other sources, the most obvious being the World Bank’s World Development Indicators, to identify relative advantages – there’s a lot of duplication of course. It’s a nice problem to have, too much data that’s too easy to get hold of. I wish we’d had that problem when I studied aid and development last century – I vividly remember re-keying numbers from almanac-like hard copy publications, and pleased we were to have them too!

Here’s a plot showing country-level relationships between the latest data of three indicators – access to contraception, adolescent fertility, and infant mortality – that help track the Millennium Development Goals.

With visualizations and R code!

A nice way to start off your data mining week!


I first saw this in a tweet by Christophe Lalanne.

Does Not Advertise With Google? (Rigging Search Results)

February 8th, 2016

I ask about because when I search on Google with the string:

honest society member

I get 82,100,000 “hits” and the first page is entirely honor society stuff.

No “did you mean,” or “displaying results for…”, etc.

Not a one.

Top of the second page of results did have a webpage that mentions, but not their home site.

I can’t recall seeing an Honestsociety ad with Google and thought perhaps one of you might.

Lacking such ads, my seat of the pants explanation for “honest society member” returning the non-responsive “honor society” listing isn’t very generous.

What anomalies have you observed in Google (or other) search results?

What searches would you use to test ranking in search results by advertiser with Google versus non-advertiser with Google?

Rigging Searches

For my part, it isn’t a question of whether search results are rigged or not, but rather are they rigged the way I or my client prefers?

Or to say it in a positive way: All searches are rigged. If you think otherwise, you haven’t thought very deeply about the problem.

Take library searches for example. Do you think they are “fair” in some sense of the word?

Hmmm, would you agree that the collection practices of a library will give a user an impression of the literature on a subject?

So the search itself isn’t “rigged,” but the data underlying the results certainly influences the outcome.

If you let me pick the data, I can guarantee whatever search result you want to present. Ditto for the search algorithms.

The best we can do is make our choices with regard to the data and algorithms explicit, so that others accept our “rigged” data or choose to “rig” it differently.

The Danger of Ad Hoc Data Silos – Discrediting Government Experts

February 8th, 2016

This Canadian Lab Spent 20 Years Ruining Lives by Tess Owen.

From the post:

Four years ago, Yvonne Marchand lost custody of her daughter.

Even though child services found no proof that she was a negligent parent, that didn’t count for much against the overwhelmingly positive results from a hair test. The lab results said she was abusing alcohol on a regular basis and in enormous quantities.

The test results had all the trappings of credible forensic science, and was presented by a technician from the Motherisk Drug Testing Laboratory at Toronto’s Sick Kids Hospital, Canada’s foremost children’s hospital.

“I told them they were wrong, but they didn’t believe me. Nobody would listen,” Marchand recalls.

Motherisk hair test results indicated that Marchand had been downing 48 drinks a day, for 90 days. “If you do the math, I would have died drinking that much” Marchand says. “There’s no way I could function.”

The court disagreed, and determined Marchand was unfit to have custody of her daughter.

Some parents, like Marchand, pursued additional hair tests from independent labs in a bid to fight their cases. Marchand’s second test showed up as negative. But, because the lab technician couldn’t testify as an expert witness, the second test was thrown out by the court.

Marchand says the entire process was very frustrating. She says someone should have noticed a pattern when parents repeatedly presented hair test results from independent labs which completely contradicted Motherisk results. Alarm bells should have gone off sooner.

Tess’ post and a 366-page report make it clear that Motherisk has impaired the fairness of a large number of child-protection service cases.

Child services, the courts and state representatives, the only ones who would have been aware of contradictions of Motherisk results over multiple cases, had no interest in “connecting the dots.”

Each case, with each attorney, was an ad hoc data silo that could not present the pattern necessary to challenge the systematic poor science from Motherisk.

The point is that not all data silos are in big data or nation-state sized intelligence services. Data silos can and do regularly have tragic impact upon ordinary citizens.

Privacy would be an issue, but mechanisms need to be developed so that lawyers and other advocates can share notice of contradictions of state agencies, allowing patterns such as Motherisk’s to be discovered, documented and hopefully ended sooner rather than later.

BTW, there is an obvious explanation for why:

“No forensic toxicology laboratory in the world uses ELISA testing the way [Motherisk] did.”

Child services did not send hair samples to Motherisk to decide whether or not to bring proceedings.

Child services had already decided to remove children and sent hair samples to Motherisk to bolster their case.

How bright did Motherisk need to be to realize that positive results were the expected outcome?

Does your local defense bar collect data on police/state forensic experts and their results?

Looking for suggestions?

Interpretation Under Ambiguity [First Cut Search Results]

February 7th, 2016

Interpretation Under Ambiguity by Peter Norvig.

From the paper:


This paper is concerned with the problem of semantic and pragmatic interpretation of sentences. We start with a standard strategy for interpretation, and show how problems relating to ambiguity can confound this strategy, leading us to a more complex strategy. We start with the simplest of strategies:

Strategy 1: Apply syntactic rules to the sentence to derive a parse tree, then apply semantic rules to get a translation into some logical form, and finally do a pragmatic interpretation to arrive at the final meaning.

Although this strategy completely ignores ambiguity, and is intended as a sort of strawman, it is in fact a commonly held approach. For example, it is approximately the strategy assumed by Montague grammar, where `pragmatic interpretation’ is replaced by `model theoretic interpretation.’ The problem with this strategy is that ambiguity can strike at the lexical, syntactic, semantic, or pragmatic level, introducing multiple interpretations. The obvious way to counter this problem is as follows:

Strategy 2: Apply syntactic rules to the sentence to derive a set of parse trees, then apply semantic rules to get a set of translations in some logical form, discarding any inconsistent formulae. Finally compute pragmatic interpretation scores for each possibility, to arrive at the `best’ interpretation (i.e. `most consistent’ or `most likely’ in the given context).

In this framework, the lexicon, grammar, and semantic and pragmatic interpretation rules determine a mapping between sentences and meanings. A string with exactly one interpretation is unambiguous, one with no interpretation is anomalous, and one with multiple interpretations is ambiguous. To enumerate the possible parses and logical forms of a sentence is the proper job of a linguist; to then choose from the possibilities the one “correct” or “intended” meaning of an utterance is an exercise in pragmatics or Artificial Intelligence.

One major problem with Strategy 2 is that it ignores the difference between sentences that seem truly ambiguous to the listener, and those that are only found to be ambiguous after careful analysis by the linguist. For example, each of (1-3) is technically ambiguous (with could signal the instrument or accompanier case, and port could be a harbor or the left side of a ship), but only (3) would be seen as ambiguous in a neutral context.

(1) I saw the woman with long blond hair.
(2) I drank a glass of port.
(3) I saw her duck.

Lotfi Zadeh (personal communication) has suggested that ambiguity is a matter of degree. He assumes each interpretation has a likelihood score attached to it. A sentence with a large gap between the highest and second ranked interpretation has low ambiguity; one with nearly-equal ranked interpretations has high ambiguity; and in general the degree of ambiguity is inversely proportional to the sharpness of the drop-off in ranking. So, in (1) and (2) above, the degree of ambiguity is below some threshold, and thus is not noticed. In (3), on the other hand, there are two similarly ranked interpretations, and the ambiguity is perceived as such. Many researchers, from Hockett (1954) to Jackendoff (1987), have suggested that the interpretation of sentences like (3) is similar to the perception of visual illusions such as the Necker cube or the vase/faces or duck/rabbit illusion. In other words, it is possible to shift back and forth between alternate interpretations, but it is not possible to perceive both at once. This leads us to Strategy 3:

Strategy 3: Do syntactic, semantic, and pragmatic interpretation as in Strategy 2. Discard the low-ranking interpretations, according to some threshold function. If there is more than one interpretation remaining, alternate between them.

Strategy 3 treats ambiguity seriously, but it leaves at least four problems untreated. One problem is the practicality of enumerating all possible parses and interpretations. A second is how syntactic and lexical preferences can lead the reader to an unlikely interpretation. Third, we can change our mind about the meaning of a sentence-“at first I thought it meant this, but now I see it means that.” Finally, our affectual reaction to ambiguity is variable. Ambiguity can go unnoticed, or be humorous, confusing, or perfectly harmonious. By `harmonious,’ I mean that several interpretations can be accepted simultaneously, as opposed to the case where one interpretation is selected. These problems will be addressed in the following sections.

Apologies for the long introduction quote but I want to entice you to read Norvig’s essay in full and if you have the time, the references that he cites.
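
To make Zadeh's "matter of degree" and Strategy 3 from the quote concrete, here is a small Python sketch. The ratio of the second-best to the best score is only one possible way to measure "sharpness of the drop-off," and the scores for sentences (2) and (3) are numbers I made up for illustration. They do not come from Norvig's paper.

```python
def degree_of_ambiguity(scores):
    """Ratio of second-best to best score: near 0 is clear, 1.0 is a tie.

    This ratio is an assumed stand-in for the drop-off in ranking.
    """
    ranked = sorted(scores.values(), reverse=True)
    if len(ranked) < 2 or ranked[0] <= 0:
        return 0.0
    return ranked[1] / ranked[0]


def surviving_interpretations(scores, threshold=0.5):
    """Strategy 3 sketch: discard readings scoring below threshold * best."""
    best = max(scores.values())
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score >= threshold * best]


# Hypothetical likelihood scores, invented for this example.
PORT = {"fortified wine": 0.97, "harbor": 0.02, "left side of a ship": 0.01}
DUCK = {"her bird": 0.55, "her lowering her head": 0.45}

if __name__ == "__main__":
    for label, scores in (("(2) port", PORT), ("(3) duck", DUCK)):
        print(label, round(degree_of_ambiguity(scores), 2),
              surviving_interpretations(scores))
```

With these invented numbers, sentence (2) keeps a single reading while sentence (3) keeps two, which is the case where Strategy 3 says the listener alternates between them.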

It’s the literature you will have to master to use search engines and develop indexing strategies.

At least for one approach to search and indexing.

That within a language there is enough commonality for automated indexing or searching to be useful has been proven over and over again by Internet search engines.

But at the same time, the first twenty or so results typically leave you wondering what interpretation the search engine put on your words.

As I said, Peter’s approach is useful, at least for a first cut at search results.

The problem is that the first cut has become the norm for “success” of search results.

That works if I want to pay lawyers, doctors, teachers and others to find the same results as others have found before (past tense).

That cost doesn’t appear as a line item in any budget but repetitive “finding” of the same information over and over again is certainly a cost to any enterprise.

First cut on semantic interpretation, follow Norvig.

Saving re-finding costs and the cost of not-finding requires something more robust than one model to find words and in the search darkness bind them to particular meanings.

PS: See for an extensive set of resources, papers, presentations, etc.

I first saw this in a tweet by James Fuller.

‘Avengers’ Comic Book Covers [ + MAD, National Lampoon]

February 7th, 2016

50 Years of ‘Avengers’ Comic Book Covers Through Color by Jon Keegan.

From the post:

When Marvel’s “Avengers: Age of Ultron” opens in theaters next month, a familiar set of iconic colors will be splashed across movie screens world-wide: The gamma ray-induced green of the Hulk, Iron Man’s red and gold armor, and Captain America’s red, white and blue uniform.

How the Avengers look today differs significantly from their appearance in classic comic-book versions, thanks to advancements in technology and a shift to a more cinematic aesthetic. As Marvel’s characters started to appear in big-budget superhero films such as “X-Men” in 2000, the darker, muted colors of the movies began to creep into the look of the comics. Explore this shift in color palettes and browse more than 50 years of “Avengers” cover artwork below. Read more about this shift in color.

The fifty years of palettes are a real treat and should be used alongside your collection of the Avenger comics for the same time period. ;-)

From what I could find quickly, you will have to purchase the forty year collection separately from more recent issues.

Of course, if you really want insight into American culture, you would order Absolutely MAD Magazine – 50+ Years.

MAD issues from 1952 to 2005 (17,500 pages in full color). Annotating those issues to include social context would be a massive but highly amusing project. And you would have to find a source for the following issues.

A more accessible collection that is easily as amusing as MAD would be the National Lampoon collection. Unfortunately, only 1970 – 1975 are online. :-(

One of my personal favorites:


Visualizing covers is a “different” way to view all of these collections and, with no promises, could yield interesting comparisons to contemporary events at the time of publication.

Mapping the commentaries you will find in MAD and National Lampoon to current events when they were published, say to articles in the New York Times historical archive, would be a great history project for students and an education in social satire as well.

If anyone objects to the lack of a “serious” nature of such a project, be sure to remind them that reading the leading political science journal of the 1960’s, the American Political Science Review would have left the casual reader with few clues that the United States was engaged in a war that would destroy the lives of millions in Vietnam.

In my experience, “serious” usually equates with “supports the current system of privilege and prejudice.”

You can be “serious” or you can choose to shape a new system of privilege and prejudice.

Your call.

Clojure for Data Science [Caution: Danger of Buyer’s Regret]

February 6th, 2016

Clojure for Data Science by Mike Anderson.

From the webpage:

Presentation given at the Jan 2016 Singapore Clojure Users’ Group

You will have to work at the presentation because there is no accompanying video, but the effort will be well spent.

Before you review these slides or pass them onto others, take fair warning that you may experience “buyer’s regret” with regard to your current programming language/paradigm (if not already Clojure).

However powerful and shiny your present language seems now, its luster will be dimmed after scanning over these slides.

Don’t say you weren’t warned ahead of time!

BTW, if you search for “clojure for data science” (with the quotes) you will find among other things:

Clojure for Data Science Progressing by Henry Garner (Packt)

Repositories for the Clojure for Data Science Processing book.

@cljds Clojure Data Science twitter feed (Henry Garner). VG!

Clojure for Data Science Some 151 slides by Henry Garner.


Planet Clojure, a metablog that collects posts from other Clojure blogs.

As a close friend says from time to time, “clojure for data science,”

G*****s well.;-)


Between the Words [Alternate Visualizations of Texts]

February 6th, 2016

Between the Words – Exploring the punctuation in literary classics by Nicholas Rougeux.

From the webpage:

Between the Words is an exploration of visual rhythm of punctuation in well-known literary works. All letters, numbers, spaces, and line breaks were removed from entire texts of classic stories like Alice’s Adventures in Wonderland, Moby Dick, and Pride and Prejudice—leaving only the punctuation in one continuous line of symbols in the order they appear in texts. The remaining punctuation was arranged in a spiral starting at the top center with markings for each chapter and classic illustrations at the center.
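
The extraction step described above is easy to try on any text of your own. A minimal Python sketch, assuming "punctuation" means every character that is not a letter, digit or whitespace (the function name is mine, not the artist's, and the spiral layout is not attempted):

```python
def punctuation_only(text):
    """Drop letters, numbers, spaces and line breaks; keep everything else in order."""
    return "".join(ch for ch in text
                   if not ch.isalnum() and not ch.isspace())


if __name__ == "__main__":
    sample = "Call me Ishmael. Some years ago--never mind how long precisely--"
    print(punctuation_only(sample))
```

Run over a full plain-text edition, the result is the one continuous line of symbols the posters are built from.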

The posters are 24″ X 36.”

Some small images to illustrate the concept:




I’m not an art critic but I can say that unusual or unexpected visualizations of data can lead to new insights. Or should I say different insights than you may have previously held.

Seeing this visualization reminded me of a presentation too many years ago at Cambridge that argued the cantillation marks (think crudely “accents”) in the Hebrew Bible were a reliable guide to clause boundaries and reading.

FYI, the versification and divisions in the oldest known witnesses to the Hebrew Bible were added centuries after the text stabilized. There are generally accepted positions on the text but at best, they are just that, generally accepted positions.

Any number of alternative presentations of texts suggest themselves.

I haven’t performed the experiment but for numeric data, reordering the data so as to force re-casting of formulas could be a way to explore presumptions that are glossed over by the “usual form.”

Not unlike copying a text by hand as opposed to typing or photocopying the text. Each step of performing the task with less deliberation increases the odds you will miss some decision that you are making unconsciously.

If you like these posters or know an English major/professor who may, pass this site along to them. (I have no interest, financial or otherwise, in this site but I like to encourage creative thinking.)

I first saw this in a tweet by Christopher Phipps.

Finding Roman Roads

February 6th, 2016

You (yes, you) can find Roman roads using data collected by lasers by Barbara Speed.

Barbara reports that using Lidar data available from the UK Survey portal, David Ratledge was able to discover a Roman road between Ribchester and Lancaster.

She closes with:

The Environment Agency is planning to release 11 Terabytes (for Luddites: that’s an awful lot of data) worth of LIDAR information as part of the Department for Environment, Food and Rural Affairs’ open data initiative, available through this portal. Which means that any of us could download it and dig about for more lost roads.

That seems a bit thin on the advice side, if you are truly interested in using the data to find Roman roads and other sites.

An article posted under ‘Lost’ Roman road is discovered, doesn’t provide more on the technique but does point to Roman Roads in Lancashire. Interesting site but no help on using the data.

I can’t comment on the ease of use or documentation but LiDAR tools are available at: Free LiDAR tools.

See also my post on the OpenTopography Project.

How To Profit from Human Trafficking – Become a Trafficker or NGO

February 6th, 2016

Special Report: Money and Lies in Anti-Human Trafficking NGOs by Anne Elizabeth Moore.

From the post:

The United States’ beloved – albeit disgraced – anti-trafficking advocate Somaly Mam has been waging a slow but steady return to glory since a Newsweek cover story in May 2014 led to her ousting from the Cambodian foundation that bore her name. The allegations in the article were not new; they’d been reported and corroborated in bits and pieces for years. The magazine simply pointed out that Mam’s personal narrative as a survivor of sex trafficking and the similar stories that emerged from both clients and staff at the non-governmental organization (NGO) she founded to assist survivors of sex trafficking, were often unverifiable, if not outright lies.

Panic ensued. Mam had helped establish, for US audiences, key plot points in the narrative of trafficking and its future eradication. Her story is that she was forced into labor early in life by someone she called “Grandfather,” who then sold off her virginity and forced her into a child marriage. Later she says she was sold to a brothel where she watched several contemporaries die in violence. Childhood friends and even family members couldn’t verify Mam’s recollection of events for Newsweek, but Mam has suggested that her story is typical of trafficking victims.

Mam has also cultivated a massive global network of anti-trafficking NGOs, funders and supporters, who have based their missions, donations and often life’s work on her emotional – but fabricated – tale. Some distanced themselves from the Cambodian activist last spring, including her long-time supporter at The New York Times, Nicholas Kristof, while others suggested that even if untrue, Mam’s stories were told in support of a worthy cause and were therefore true enough.

Moore characterizes NGOs organized to stop human trafficking as follows:

Considering their common mythical enemy – the nameless and faceless men portrayed in TV dramas who trade in nubile human girl stock – one would hope anti-trafficking organizations would unite in an effort to be less shady. With names reliant on metaphors of recovery, light and sanctuary, anti-trafficking groups project an image of transparency. Yet these groups have shown a remarkable lack of fiscal accountability and organizational consistency, often even eschewing an open acknowledgement of board members, professional affiliates and funding relationships. The problems with this evasion go beyond ethical considerations: A certain level of budgetary disclosure, for example, is a legal requirement for tax-exempt 501(c)(3) organizations. Yet anti-trafficking groups fold, move, restructure and reappear under new names with alarming frequency, making them almost as difficult to track as their supposed foes.

It is a very compelling article that will leave you with more questions about the finances of NGOs “opposing” human trafficking than answers.

The lack of answers isn’t Moore’s fault, the NGOs in question were designed to make obtaining answers difficult, if not impossible.

After you read the article, more than once to get the full impact, how would you:

  1. Track organizations in the article that: “…fold, move, restructure and reappear under new names with alarming frequency…”?
  2. How would you gather and share data on those organizations?
  3. How would you map what data is available on funding to Moore’s report?
  4. How would you make Moore’s snapshot of data subject to updating by later reporters?
  5. How would you track the individuals involved in the NGOs you track?

The answers to those questions are applicable to human traffickers as well.

Consider it to be a “two-for.”

The Vietnam War: A Non-U.S. Photo Essay

February 6th, 2016

1965-1975 Another Vietnam by Alex Q. Arbuckle.

From the post:

For much of the world, the visual history of the Vietnam War has been defined by a handful of iconic photographs: Eddie Adams’ image of a Viet Cong fighter being executed, Nick Ut’s picture of nine-year-old Kim Phúc fleeing a napalm strike, Malcolm Browne’s photo of Thích Quang Duc self-immolating in a Saigon intersection.

Many famous images of the war were taken by Western photographers and news agencies, working alongside American or South Vietnamese troops.

But the North Vietnamese and Viet Cong had hundreds of photographers of their own, who documented every facet of the war under the most dangerous conditions.

Almost all were self-taught, and worked for the Vietnam News Agency, the National Liberation Front, the North Vietnamese Army or various newspapers. Many sent in their film anonymously or under a nom de guerre, viewing themselves as a humble part of a larger struggle.

A timely reminder that Western media and government approved photographs are evidence for only one side of any conflict.

Efforts by Twitter and Facebook to censor any narrative other than a Western one on the Islamic State should be very familiar to anyone who remembers the “Western view only” from media reports in the 1960’s.

Censorship, whether during Vietnam or in opposition to the Islamic State, doesn’t make the “other” narrative go away. It cannot deny the facts known to residents in a war zone.

The only goal that censorship achieves, and not always, is to keep the citizens of the censoring powers in ignorance. So much for freedom of speech. You can’t talk about what you don’t know about.

The essay uses images from Another Vietnam: Pictures of the War from the Other Side. I checked at National Geographic, the publisher, and it isn’t listed in their catalog. Used/new the book is about $160.00 and contains 180 never before published photographs.

Questions come to mind:

Where are the other North Vietnam/Viet Cong photos now? Shouldn’t those be documented, digitized and placed online?

Where are the Islamic State photos and videos that are purged from Twitter and Facebook?

The media is repeating the same mistake with the Islamic State that it made during Vietnam.

No reader can decide between competing narratives in the face of only one narrative.

Nor can they avoid making the same mistakes as have been made in the past.

Vietnam is a very good example of such a mistake.

Replacing the choices of other cultures with our own is a mission doomed to failure (and defeat).

I first saw this in a tweet by Lars Marius Garshol.

Are You A Scientific Twitter User or Polluter?

February 6th, 2016

Realscientists posted this image to Twitter:


Self-Scoring Test:

In the last week, how often have you retweeted without “read[ing] the actual paper” pointed to by a tweet?

How many times did you retweet in total?

Formula: (retweets w/o reading / retweets in total) × 100 = % of retweets w/o reading.
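If you would rather let a script do the scoring, here is a minimal sketch of the formula in Python. The function name and the sample counts are my own placeholders, not part of the original test.

```python
def unread_retweet_percentage(unread: int, total: int) -> float:
    """Percentage of retweets made without reading the linked paper."""
    if total <= 0:
        raise ValueError("total retweets must be positive")
    if not 0 <= unread <= total:
        raise ValueError("unread retweets must be between 0 and total")
    # A ratio becomes a percentage only after multiplying by 100.
    return 100.0 * unread / total


# Hypothetical week: 12 retweets, 3 of them without reading the paper.
print(unread_retweet_percentage(3, 12))  # 25.0
```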

No scale with superlatives because I don’t have numbers to establish a baseline for the “average” Twitter user.

I do know that I see click-bait, out-dated and factually wrong material retweeted by people who know better. That’s Twitter pollution.

Ask yourself: Am I a scientific Twitter user or a polluter?

Your call.

Is Twitter A Global Town Censor? (Data Project)

February 5th, 2016

Twitter Steps Up Efforts to Thwart Terrorists’ Tweets by Mike Isaac.

From the post:

For years, Twitter has positioned itself as a “global town square” that is open to discourse from all. And for years, extremist groups like the Islamic State have taken advantage of that stance, using Twitter as a place to spread their messages.

Twitter on Friday made clear that it was stepping up its fight to stem that tide. The social media company said it had suspended 125,000 Twitter accounts associated with extremism since the middle of 2015, the first time it has publicized the number of accounts it has suspended. Twitter also said it had expanded the teams that review reports of accounts connected to extremism, to remove the accounts more quickly.

“As the nature of the terrorist threat has changed, so has our ongoing work in this area,” Twitter said in a statement, adding that it “condemns the use of Twitter to promote terrorism.” The company said its collective moves had already produced results, “including an increase in account suspensions and this type of activity shifting off Twitter.”

The disclosure follows intensifying pressure on Twitter and other technology companies from the White House, presidential candidates like Hillary Clinton and government agencies to take more action to combat the digital practices of terrorist groups. The scrutiny has grown after mass shootings in Paris and San Bernardino, Calif., last year, because of concerns that radicalizations can be accelerated by extremist postings on the web and social media.

Just so you know what the Twitter rule is:

Violent threats (direct or indirect): You may not make threats of violence or promote violence, including threatening or promoting terrorism. (The Twitter Rules)

Here’s your chance to engage in real data science and help decide the question of whether Twitter has changed from global town hall to global town censor.

Here’s the data gathering project:

Monitor all the Twitter streams for Republican and Democratic candidates for the U.S. presidency for tweets advocating violence/terrorism.

File requests with Twitter for those accounts to be suspended.

FYI: When you report a message (Reporting a Tweet or Direct Message for violations), it will disappear from Messages inbox.

You must copy every tweet you report (accounts disappear as well) if you want to keep a record of your report.

Keep track of your reports and the tweet you copied before reporting.

Post the record of your reports and the tweets reported, plus any response from Twitter.

Suggestions on how to format these reports?
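As one possible starting point, and only a sketch, each report could be kept as one JSON object per line so that records from different volunteers can be concatenated and compared later. The field names, the `TweetReport` class and the sample values below are all hypothetical; nothing here comes from Twitter’s own reporting system.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TweetReport:
    """One reported tweet, copied before the report was filed."""
    account: str                 # account that posted the tweet
    tweet_url: str               # link to the tweet as it existed
    tweet_text: str              # full text, copied before reporting
    reported_at: str             # ISO 8601 timestamp of the report
    reason: str                  # which rule the reporter cited
    twitter_response: Optional[str] = None  # filled in if Twitter replies


def to_json_line(report: TweetReport) -> str:
    """Serialize a report as a single JSON line with stable key order."""
    return json.dumps(asdict(report), ensure_ascii=False, sort_keys=True)


def from_json_line(line: str) -> TweetReport:
    """Rebuild a report from a line written by to_json_line."""
    return TweetReport(**json.loads(line))


# Placeholder values for illustration only.
example = TweetReport(
    account="@example_candidate",
    tweet_url="https://twitter.com/example_candidate/status/0",
    tweet_text="(text of the tweet, copied verbatim)",
    reported_at="2016-02-05T12:00:00+00:00",
    reason="Violent threats (direct or indirect)",
)
print(to_json_line(example))
```

One line per report keeps the log append-only, which matters when the reported tweet itself may vanish.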

Or would you rather not know what Twitter is deciding for you?

How much data needs to be collected to move onto part 2 of the project – data analysis?

Suggestions on who at Twitter to contact for a listing of the 125,000 accounts that were silenced along with the Twitter history for each one? (Or the entire history of silenced accounts at Twitter? Who gets censored by topic, race, gender, location, etc., are all open questions.)

That could change the Twitter process from a black box to having marginally more transparency. You would have to guess at why any particular account was silenced.

If Twitter wants to take credit for censoring public discourse then the least it can do is be honest about who was censored and what they were saying to be censored.


Ethical Data Scientists: Will You Support A False Narrative – “Community of Hope?”

February 5th, 2016

Google executive Anthony House advocates a false narrative, a “community of hope” as a counter to truthful content from the Islamic State:

We should get the bad stuff down [online], but it’s also extremely important that people are able to find good information, that when people are feeling isolated, that when they go online, they find a community of hope, not a community of harm. (Google plans to fight extremist propaganda with AdWords)

Islamic State media is offering a community of hope. One based on facts, not a fantasy of Western planners.

The more immediate, but no less intractable, challenge is to change the reality on the ground in Syria and Iraq, so that ISIS’s narrative of Sunni Muslim persecution at the hands of the Assad regime and Iranian-backed Shiite militias commands less resonance among Sunnis. One problem in countering that narrative is that some of it happens to be true: Sunni Muslims are being persecuted in Syria and Iraq. This blunt empirical fact, just as much as ISIS’s success on the battlefield, and the rhetorical amplification and global dissemination of that success via ISIS propaganda, helps explain why ISIS has been so effective in recruiting so many foreign fighters to its cause. (Why It’s So Hard to Stop ISIS Propaganda)

Persecution of Sunni Muslims isn’t the only fact in the Islamic State narrative. Consider the following:

  • Muslim governments exist at the sufferance of the West. Ex. Afghanistan, Iran, Libya, Syria
  • Existing “Muslim” leaders are vassals of the West.
  • For more than a century the West has dictated the fate of Muslims in the Middle East.
  • The West supports oppression of the Palestinian people.
  • The West opposes democratic results in Muslim countries that don’t accord with its wishes.

We might disagree on the phrasing of those facts but can an ethical data scientist say they are not true?

Whatever the motivation of the West in each case, the West wants to decide the fate of Muslims.

Is the “community of hope” Google portrays to be based on false hopes or new realities on the ground?

There’s a question for all the “ethical” data scientists at Google.

Will you support a false narrative by Google for a “community of hope” to deter terrorism?

Beating Body Scanners

February 4th, 2016

Just on the off chance that some government mandates wholly ineffectual full body scanners for security purposes, Jonathan Corbett has two videos that demonstrate the ease with which such scanners can be defeated, completely.

Oh, I forgot, the US government has mandated such scanners!

Jonathan maintains a great site, and you can follow him @_JonCorbett.

Jon is right about the scanners being ineffectual but being effective wasn’t part of the criteria for purchasing the systems. Scanners were purchased to give the impression of frenzied activity, even if it was totally ineffectual.

What would happen if a terrorist did attack an airport, through one of the hundreds of daily lapses in security? What would the government say if it weren’t engaged in non-stop but meaningless activity?

Someone would say, falsely, that it was inaction on the part of government that enabled the attack.

Stuff and nonsense.

“Terrorist” attacks, actually violence committed by criminals by another name, can and will happen no matter what measures are taken by the government. Short of having an all-nude policy beginning at the perimeter of the airport and prohibiting anything larger than a clear quart zip lock bag being shipped. With passengers or as cargo.

Even then it isn’t hard to imagine several dozen ways to carry out “terrorist” attacks at any airport.

The sooner government leaders begin to educate their citizens that some risks are simply unavoidable, the sooner money can stop being wasted on visible but ineffectual efforts like easily defeated body scanners.

Comodo Chromodo browser – Danger! Danger! – Discontinue Use

February 4th, 2016

Comodo Chromodo browser does not enforce same origin policy and is based on an outdated version of Chromium

From the overview:

Comodo Chromodo browser, version,, and possibly earlier, does not enforce same origin policy, which allows for the possibility of cross-domain attacks by malicious or compromised web hosts. Chromodo is based on an outdated release of Chromium with known vulnerabilities.


The CERT/CC is currently unaware of a practical solution to this problem and recommends the following workarounds.

Disable JavaScript

Disabling JavaScript may mitigate cross-domain scripting attacks. For instructions, refer to Comodo’s help page.

Note that disabling JavaScript may not protect against known vulnerabilities in the version of Chromium on which Chromodo is based. For this reason, users should prioritize implementing the following workaround.

Discontinue use

Until these issues are addressed, consider discontinuing use of Chromodo.

Discontinue use is about as extreme a workaround as I can imagine.
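For anyone unsure what the same origin policy mentioned in the overview is supposed to guarantee, here is a simplified sketch in Python. It assumes the usual definition of an origin as the (scheme, host, port) triple; the function names are mine, and real browsers apply many more rules than this comparison.

```python
from urllib.parse import urlsplit

# Default ports assumed when a URL does not spell one out.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url: str) -> tuple:
    """Return the (scheme, host, port) triple that identifies an origin."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return (scheme, host, port)


def same_origin(url_a: str, url_b: str) -> bool:
    """True only when scheme, host and port all match."""
    return origin(url_a) == origin(url_b)


print(same_origin("https://example.com/a", "https://example.com:443/b"))
print(same_origin("https://example.com/", "https://evil.example.net/"))
```

A browser that skips this check lets a script served from the second site read data belonging to the first, which is the cross-domain attack the overview describes.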

Too bad the Comodo site doesn’t say anything about refunds and/or compensation for damaged customers.

Would you say that without any penalty, there is no incentive for Comodo to produce better software?

Or to put it differently, where is the downside to Comodo producing buggy software?

Where does that impact their bottom line?

I first saw this in a tweet by SecuriTay.

Toneapi helps your writing pack an emotional punch [Not For The Ethically Sensitive]

February 4th, 2016

Toneapi helps your writing pack an emotional punch by Martin Bryant.

From the post:

Language analysis is a rapidly developing field and there are some interesting startups working on products that help you write better.

Take Toneapi, for example. This product from Northern Irish firm Adoreboard is a Web-based app that analyzes (and potentially improves) the emotional impact of your writing.

Paste in some text, and it will offer a detailed visualization of your writing.

If you aren’t overly concerned about manipulating, sorry, persuading your readers to your point of view, you might want to give Toneapi a spin. Martin reports that IBM’s Watson has Tone Analyzer and you should also consider Textio and Relative Insight.

Before this casts an Orwellian pall over your evening/day, remember that focus groups and testing messages have been the staple of advertising for decades.

What these software services do is make a crude form of that capability available to the average citizen.

Some people have a knack for emotional language, like Donald Trump, but I can’t force myself to write in incomplete sentences or with one syllable words. Maybe there’s an app for that? Suggestions?

The Ethical Data Scientist

February 4th, 2016

The Ethical Data Scientist by Cathy O’Neil.

From the post:

After the financial crisis, there was a short-lived moment of opportunity to accept responsibility for mistakes with the financial community. One of the more promising pushes in this direction was when quant and writer Emanuel Derman and his colleague Paul Wilmott wrote the Modeler’s Hippocratic Oath, which nicely sums up the list of responsibilities any modeler should be aware of upon taking on the job title.

The ethical data scientist would strive to improve the world, not repeat it. That would mean deploying tools to explicitly construct fair processes. As long as our world is not perfect, and as long as data is being collected on that world, we will not be building models that are improvements on our past unless we specifically set out to do so.

At the very least it would require us to build an auditing system for algorithms. This would be not unlike the modern sociological experiment in which job applications sent to various workplaces differ only by the race of the applicant—are black job seekers unfairly turned away? That same kind of experiment can be done directly to algorithms; see the work of Latanya Sweeney, who ran experiments to look into possible racist Google ad results. It can even be done transparently and repeatedly, and in this way the algorithm itself can be tested.

The ethics around algorithms is a topic that lives only partly in a technical realm, of course. A data scientist doesn’t have to be an expert on the social impact of algorithms; instead, she should see herself as a facilitator of ethical conversations and a translator of the resulting ethical decisions into formal code. In other words, she wouldn’t make all the ethical choices herself, but rather raise the questions with a larger and hopefully receptive group.
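The auditing experiment O’Neil describes, submitting inputs that differ only in one attribute and comparing the outcomes, can be sketched in a few lines of Python. This is my own toy illustration, not Sweeney’s method: the `paired_audit` function, the `group` and `years` fields and both stand-in models are hypothetical.

```python
from typing import Callable, Dict, Iterable

Applicant = Dict[str, object]


def paired_audit(
    decide: Callable[[Applicant], bool],
    applicants: Iterable[Applicant],
    attribute: str,
    value_a: object,
    value_b: object,
) -> float:
    """Fraction of applicants whose decision flips when only `attribute` changes."""
    applicants = list(applicants)
    if not applicants:
        raise ValueError("need at least one applicant to audit")
    changed = 0
    for applicant in applicants:
        # Two copies of the same application, identical except for one field.
        variant_a = {**applicant, attribute: value_a}
        variant_b = {**applicant, attribute: value_b}
        if decide(variant_a) != decide(variant_b):
            changed += 1
    return changed / len(applicants)


def fair_model(applicant: Applicant) -> bool:
    """Toy model that looks only at experience."""
    return applicant["years"] >= 3


def biased_model(applicant: Applicant) -> bool:
    """Toy model that also conditions on group membership."""
    return applicant["years"] >= 3 and applicant["group"] == "A"


pool = [{"years": 1}, {"years": 5}, {"years": 10}]
print(paired_audit(fair_model, pool, "group", "A", "B"))
print(paired_audit(biased_model, pool, "group", "A", "B"))
```

The audit treats the model as a black box, so it can be rerun by anyone with access to the decision function.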

First, the link for the Modeler’s Hippocratic Oath takes you to a splash page at Wiley for Derman’s book: My Life as a Quant: Reflections on Physics and Finance.

The Financial Modelers’ Manifesto (PDF) and The Financial Modelers’ Manifesto (HTML), are valid links as of today.

I commend the entire text of The Financial Modelers’ Manifesto to you for repeated reading but for present purposes, let’s look at the Modelers’ Hippocratic Oath:

~ I will remember that I didn’t make the world, and it doesn’t satisfy my equations.

~ Though I will use models boldly to estimate value, I will not be overly impressed by mathematics.

~ I will never sacrifice reality for elegance without explaining why I have done so.

~ Nor will I give the people who use my model false comfort about its accuracy. Instead, I will make explicit its assumptions and oversights.

~ I understand that my work may have enormous effects on society and the economy, many of them beyond my comprehension

It may just be me but I don’t see a charge being laid on data scientists to be the ethical voices in organizations using data science.

Do you see that charge?

To put it more positively, aren’t other members of the organization, accountants, engineers, lawyers, managers, etc., all equally responsible for spurring “ethical conversations?” Why is this a peculiar responsibility for data scientists?

I take a legal ethics view of the employer – employee/consultant relationship. The client is the ultimate arbiter of the goal and means of a project, once advised of their options.

Their choice may or may not be mine but I haven’t ever been hired to play the role of Jiminy Cricket.


It’s heady stuff to be responsible for bringing ethical insights to the clueless but sometimes the clueless have ethical insights on their own, or not.

Data scientists can and should raise ethical concerns but no more or less than any other member of a project.

As you can tell from reading this blog, I have very strong opinions on a wide variety of subjects. That said, unless a client hires me to promote those opinions, the goals of the client, by any legal means, are my only concern.

PS: Before you ask, no, I would not work for Donald Trump. But that’s not an ethical decision. That’s simply being a good citizen of the world.

Spontaneous Preference for their Own Theories (SPOT effect) [SPOC?]

February 4th, 2016

The SPOT Effect: People Spontaneously Prefer their Own Theories by Aiden P. Gregg, Nikhila Mahadevan, and Constantine Sedikides.


People often exhibit confirmation bias: they process information bearing on the truth of their theories in a way that facilitates their continuing to regard those theories as true. Here, we tested whether confirmation bias would emerge even under the most minimal of conditions. Specifically, we tested whether drawing a nominal link between the self and a theory would suffice to bias people towards regarding that theory as true. If, all else equal, people regard the self as good (i.e., engage in self-enhancement), and good theories are true (in accord with their intended function), then people should regard their own theories as true; otherwise put, they should manifest a Spontaneous Preference for their Own Theories (i.e., a SPOT effect). In three experiments, participants were introduced to a theory about which of two imaginary alien species preyed upon the other. Participants then considered in turn several items of evidence bearing on the theory, and each time evaluated the likelihood that the theory was true versus false. As hypothesized, participants regarded the theory as more likely to be true when it was arbitrarily ascribed to them as opposed to an “Alex” (Experiment 1) or to no one (Experiment 2). We also found that the SPOT effect failed to converge with four different indices of self-enhancement (Experiment 3), suggesting it may be distinctive in character.

I can’t give you the details on this article because it is fire-walled.

But the catch phrase, “Spontaneous Preference for their Own Theories (i.e., a SPOT effect)” certainly fits every discussion of semantics I have ever read or heard.

With a little funding you could prove the corollary, Spontaneous Preference for their Own Code (the SPOC effect) among programmers. ;-)

There are any number of formulations for how to fight confirmation bias but Jeremy Dean puts it this way:

The way to fight the confirmation bias is simple to state but hard to put into practice.

You have to try and think up and test out alternative hypothesis. Sounds easy, but it’s not in our nature. It’s no fun thinking about why we might be misguided or have been misinformed. It takes a bit of effort.

It’s distasteful reading a book which challenges our political beliefs, or considering criticisms of our favourite film or, even, accepting how different people choose to live their lives.

Trying to be just a little bit more open is part of the challenge that the confirmation bias sets us. Can we entertain those doubts for just a little longer? Can we even let the facts sway us and perform that most fantastical of feats: changing our minds?

I wonder if that includes imagining using JSON? (shudder) ;-)

Hard to do, particularly when we are talking about semantics and what we “know” to be the best practices.

Examples of trying to escape the confirmation bias trap and the results?

Perhaps we can encourage each other.

SQL Injection Hall-Of-Shame / Internet-of-Things Hall-Of-Shame

February 4th, 2016

SQL Injection Hall-Of-Shame by Arthur Hicken.

From the webpage:

In this day and age it’s ridiculous how frequently large organizations are falling prey to SQL Injection which is almost totally preventable as I’ve written previously.

Note that this is a work in progress. If I’ve missed something you’re aware of please let me know in the comments at the bottom of the page.

Don’t let this happen to you! For some simple tips see the OWASP SQL Injection Prevention Cheat Sheet. For more security info check out the security resources page and the book SQL Injection Attacks and Defense or Basics of SQL injection Analysis, Detection and Prevention: Web Security for more info.
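To make “almost totally preventable” concrete, here is a small sketch using Python’s built-in `sqlite3` module. The table, the function names and the payload are made up for illustration; the point is only the difference between building SQL by string formatting and passing values as bound parameters.

```python
import sqlite3

# In-memory database with a hypothetical users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)


def find_user_unsafe(conn, name):
    """Vulnerable: user input is pasted straight into the SQL text."""
    query = f"SELECT name, secret FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    """Parameterized: the driver treats `name` strictly as a value."""
    return conn.execute(
        "SELECT name, secret FROM users WHERE name = ?", (name,)
    ).fetchall()


payload = "nobody' OR '1'='1"
print(len(find_user_unsafe(conn, payload)))  # every row leaks
print(len(find_user_safe(conn, payload)))    # no rows match
```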


With the rise of internet enabled devices in the Internet of Things or IoT the need for software security is becoming even more important. Unfortunately many device makers seem to put security on the back burner or not even understand the basics of cybersecurity.

I am maintaining here a list of known hacks for “things”. The list is short at the moment but will grow, and is often more generic than it could be. It’s kind of in reverse-chronological order, based on the date that the hack was published. Please assist – if you’re aware of additional thing-hacks please let me know in the comments at the bottom of the page.

I assume you find “wall-of-shame” efforts as entertaining as I do.

I am aware of honor-shame debates from a biblical studies perspective, on which see: Complete Bibliography of Honor-Shame Resources

“Complete” is a relative term when used regarding any bibliography in biblical studies and this appears to have at least one resource from 2011, but none later. You can run the references forward to collect more recent literature.

But the question with shaming techniques is are they effective?

As a case in point, consider Researchers find it’s terrifyingly easy to hack traffic lights where the post points out:

In fact, the most upsetting passage in the entire paper is the dismissive response issued by the traffic controller vendor when the research team presented its findings. According to the paper, the vendor responsible stated that it “has followed the accepted industry standard and it is that standard which does not include security.”

We can entertain ourselves by shaming vendors all day but only the “P” word will drive greater security.

“P” as in penalty.

Vormetric found that to be the case in What Drives Compliance? Hint: The P Word Missing From Cybersecurity Discussions.

Be entertained by wall-of-shame efforts but lobby for compliance enforced by penalties. (Known to anthropologists as a fear culture.)

Truthful Paedophiles On The Darknet?

February 4th, 2016

There is a credibility flaw in Cryptopolitik and the Darknet by Daniel Moore & Thomas Rid that I overlooked yesterday (The Dark Web, “Kissing Cousins,” and Pornography). Perhaps it was just too obvious to attract attention.

Moore and Rid write:

The pornographic content was perhaps the most distressing. Websites dedicated to providing links to videos purporting to depict rape, bestiality and paedophilia were abundant. One such post at a supposedly nonaffiliated content-sharing website offered a link to a video of ‘a 12 year old girl … getting raped at school by 4 boys’.52 Other examples include a service that sold online video access to the vendor’s own family members:

My two stepsisters … will be pleased to show you their little secrets. Well, they are rather forced to show them, but at least that’s what they are used to.53

Several communities geared towards discussing and sharing illegitimate fetishes were readily available, and appeared to be active. Under the shroud of anonymity, various users appeared to seek vindication of their desires, providing words of support and comfort for one another in solidarity against what was seen as society’s unjust discrimination against non-mainstream sexual practices. Users exchanged experiences and preferences, and even traded content. One notable example from a website called Pedo List included a commenter freely stating that he would ‘Trade child porn. Have pics of my daughter.’54 There appears to be no fear of retribution or prosecution in these illicit communities, and as such users apparently feel comfortable enough to share personal stories about their otherwise stifled tendencies. (page 23)

Despite their description of hidden services as dens of iniquity and crime, those who use them are suddenly paragons of truthfulness, at least when it suits the authors’ purpose?

Doesn’t crediting the content of the Darknet as truthful, as opposed to being wishful, fantasy, or even police officers posing to investigate (some would say entrap) others, strain the imagination?

Some of the content is no doubt truthful but policy arguments need to be based on facts, not a collection of self-justifying opinions from like minded individuals.

A quick search on the string (without quotes):

police officers posing as children sex rings

Returns 9.7 million “hits.”

How many of those police officers appeared in the postings collected by Moore & Rid it isn’t possible to say.

But in science, there is this thing called the burden of proof. That is, simply asserting a conclusion, even citing equally non-evidence based conclusions, isn’t sufficient to prove a claim.

Moore & Rid had the burden to prove that the Darknet is a wicked place that poses all sorts of dangers and hazards.

As I pointed out yesterday, The Dark Web, “Kissing Cousins,” and Pornography, their “proof” is non-replicable conclusions about a small part of the Darkweb.

Earlier today I realized their conclusions depend upon a truthful criminal element using the Darkweb.

What do you think about the presumption that criminals are truthful?

Sounds doubtful to me!

The Dark Web, “Kissing Cousins,” and Pornography

February 3rd, 2016

Dark web is mostly illegal, say researchers by Lisa Vaas.

You can tell where Lisa comes out on the privacy versus law enforcement issue by the slant of her conclusion:

Users, what’s your take: are hidden services worth the political firestorm they generate? Are they worth criminals escaping justice?

Illegal is a slippery concept.

Marriage of first “kissing” cousins is “illegal” in:

Arkansas, Delaware, Idaho, Iowa, Kansas, Kentucky, Louisiana, Michigan, Minnesota, Mississippi, Missouri, Montana, Nebraska, Nevada, New Hampshire, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, South Dakota, Texas, Washington, West Virginia, and Wyoming.

Marriage of first “kissing” cousins is legal in:

Alabama, Alaska, California, Colorado, Connecticut, District of Columbia, Florida, Georgia, Hawaii, Maryland, Massachusetts, New Jersey, New Mexico, New York, North Carolina (first cousins but not double first), Rhode Island, South Carolina, Tennessee, Vermont, and Virginia.

There are some other nuances I didn’t capture and for those see: State Laws Regarding Marriages Between First Cousins.

If you read Cryptopolitik and the Darknet by Daniel Moore & Thomas Rid carefully, you will spot a number of problems with their methodology and reasoning.

First and foremost, no definitions were offered for their taxonomy (at page 20):

  • Arms
  • Drugs
  • Extremism
  • Finance
  • Hacking
  • Illegitimate pornography
  • Nexus
  • Other illicit
  • Social
  • Violence
  • Other
  • None

Readers and other researchers are left to wonder what was included or excluded from each of those categories.

In science, that would be called an inability to replicate the results. As if this were science.

Moore & Rid recite anecdotal accounts of particular pornography sites, calculated to shock the average reader, but that’s not the same thing as enabling replication of their research. Or a fair characterization of all the pornography encountered.

They presumed that text was equivalent to image content, so they discarded all images (pages 19-20). Which left them unable to test that presumption. Hmmm, untested assumptions in science?

The unknown basis for classification identified 122 sites (page 21) as pornographic out of the initial set of 5,205 sites.

If you accept Tor’s estimate of 30,000 hidden services that announce themselves every day, Moore & Rid have found that illegal pornography (whatever that means) is:

122 / 30000 = 0.004066667

Moore & Rid have established that “illegal” porn is about 0.41% (a fraction of 0.004066667) of the Dark Net.
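The division is easy to check, and it is worth being careful about the units: 122 / 30000 is a fraction, so it has to be multiplied by 100 before it is read as a percentage. A quick sketch, using the 122 sites from the paper and Tor’s estimate of 30,000 hidden services:

```python
pornographic_sites = 122   # sites classified as pornographic (page 21)
hidden_services = 30_000   # Tor's estimate of daily hidden services

fraction = pornographic_sites / hidden_services
percentage = 100 * fraction

print(f"{fraction:.9f}")     # 0.004066667
print(f"{percentage:.4f}%")  # 0.4067%
```

Either way you state it, well under one percent.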

I should be grateful Moore & Rid have so carefully documented the tiny part of the Dark Web concerned with their notion of “illegal” pornography.

But, when you encounter “reasoning” such as:

The other quandary is how to deal with darknets. Hidden services have already damaged Tor, and trust in the internet as a whole. To save Tor – and certainly to save Tor’s reputation – it may be necessary to kill hidden services, at least in their present form. Were the Tor Project to discontinue hidden services voluntarily, perhaps to improve the reputation of Tor browsing, other darknets would become more popular. But these Tor alternatives would lack something precious: a large user base. In today’s anonymisation networks, the security of a single user is a direct function of the number of overall users. Small darknets are easier to attack, and easier to de-anonymise. The Tor founders, though exceedingly idealistic in other ways, clearly appreciate this reality: a better reputation leads to better security.85 They therefore understand that the popularity of Tor browsing is making the bundled-in, and predominantly illicit, hidden services more secure than they could be on their own. Darknets are not illegal in free countries and they probably should not be. Yet these widely abused platforms – in sharp contrast to the wider public-key infrastructure – are and should be fair game for the most aggressive intelligence and law-enforcement techniques, as well as for invasive academic research. Indeed, having such clearly cordoned-off, free-fire zones is perhaps even useful for the state, because, conversely, a bad reputation leads to bad security. Either way, Tor’s ugly example should loom large in technology debates. Refusing to confront tough, inevitable political choices is simply irresponsible. The line between utopia and dystopia can be disturbingly thin. (pages 32-33)

it’s hard to say nothing and see public discourse soiled with this sort of publication.

First, there is no evidence presented that hidden services have damaged Tor and/or trust in the Internet as a whole. Even the authors concede that Tor is the most popular option for anonymous browsing and hidden services. That doesn’t sound like damage to me. You?

Second, the authors dump all hidden services in the “bad, very bad” basket, despite their own research classifying only 0.4066667% of the Dark Net as illicit pornography. They use stock “go to” examples to shock readers in place of evidence and reasoning.

Third, the charge that Tor has refused “to confront tough, inevitable political choices” is false. Demonstrably false, because the authors point out that Tor developers made a conscious choice not to take political considerations into account (page 25).

Since Moore & Rid disagree with that choice, they resort to name calling, terming the decision “simply irresponsible.” Moore & Rid are entitled to their opinions but they aren’t going to persuade even a semi-literate audience with name calling.

Take Cryptopolitik and the Darknet as an example of how not to write a well-researched and reasoned paper. Although, that isn’t a bar to publication, as you can see.

Cheating Cheaters [Honeypots for Government Agencies?]

February 3rd, 2016

Video Game Cheaters Outed By Logic Bombs by timothy.

From the post:

A Reddit user decided to tackle the issue of cheaters within Valve’s multiplayer shooter Counter Strike: Global Offensive in their own unique way: by luring them towards fake “multihacks” that promised a motherlode of cheating tools, but in reality, were actually traps designed to cause the users who installed them to eventually receive bans. The first two were designed as time bombs, which activated functions designed to trigger bans after a specific time of day. The third, which was downloaded over 3,500 times, caused instantaneous bans.

I wonder if anyone is running honeypots for intelligence agencies?

Or fake jihad sites for our friends in law enforcement?

Sort of a Spy vs. Spy situation, yes?


When cyber-dueling with government, you aren’t wearing protective gear and the tips aren’t blunted.

Unpublished Black History Photos (NYT)

February 3rd, 2016

The New York Times is unearthing unpublished photos from its archives for Black History Month by Shan Wang.

From the post:

In this black and white photo taken by a New York Times staff photographer, two unidentified second graders at Princeton’s Nassau Street Elementary School stand in front of a classroom blackboard. Some background text accompanies the image, pointing to a 1964 Times article about school integration and adding that the story “offered a caveat that still resonates, noting that in the search for a thriving and equal community, ‘good schooling is not enough.’”

Times readers wrote in to ask specifically about the second graders in the photo, so the Times updated the post with a comment form asking readers to share anything they might know about the girl and boy depicted.

Great background on the Unpublished Black History project at the Times.

Public interfaces enable contribution of information on selected images along with comments.

Unlike the US Intelligence community, the Times is willing to admit that its prior conduct may not reflect its values, then or now.

If a private, for-profit organization can be that honest, what’s the deal with government agencies?

Must be that accountability thing that Republicans are always trying to foist off onto public school teachers and public school teachers alone.

No accountability for elected officials and/or their appointees and cronies.

They are deadly serious about crypto backdoors [And of the CIA and Chinese Underwear]

February 3rd, 2016

They are deadly serious about crypto backdoors by Robert Graham.

From the post:

Julian Sanchez (@normative) has an article questioning whether the FBI is serious about pushing crypto backdoors, or whether this is all a ploy pressuring companies like Apple to give them access. I think they are serious — deadly serious.

The reason they are only half-heartedly pushing backdoors at the moment is that they believe we, the opposition, aren’t serious about the issue. After all, the 4rth Amendment says that a “warrant of probable cause” gives law enforcement unlimited power to invade our privacy. Since the constitution is on their side, only irrelevant hippies could ever disagree. There is no serious opposition to the proposition. It’ll all work itself out in the FBI’s favor eventually. Among the fascist class of politicians, like the Dianne Feinsteins and Lindsay Grahams of the world, belief in this principle is rock solid. They have absolutely no doubt.

But the opposition is deadly serious. By “deadly” I mean this is an issue we are willing to take up arms over. If congress were to pass a law outlawing strong crypto, I’d move to a non-extradition country, declare the revolution, and start working to bring down the government. You think the “Anonymous” hackers were bad, but you’ve seen nothing compared to what the tech community would do if encryption were outlawed.

On most policy questions, there are two sides to the debate, where reasonable people disagree. Crypto backdoors isn’t that type of policy question. It’s equivalent to techies what trying to ban guns would be to the NRA.

What he says.

Crypto backdoors are a choice between a policy that benefits government at the expense of everyone (crypto backdoors) versus a policy that benefits everyone at the expense of the government (no crypto backdoors). It’s really that simple.

When I say crypto backdoors benefit the government, I mean that quite literally. Collecting data, via crypto backdoors and otherwise, enables government functionaries to pretend to be engaged in meaningful responses to serious issues.

Collecting and shoveling data from desk to desk is about as useless an activity as can be imagined.

Basis for that claim? Glad you asked!

If you haven’t read: Chinese Underwear and Presidential Briefs: What the CIA Told JFK and LBJ About Mao by Steve Usdin, do so.

Steve covers the development of the “presidential brief” and its long failure to provide useful information about China and Mao in particular. The CIA long opposed declassification of historical presidential briefs based on the need to protect “sources and methods.”

The presidential briefs for the Kennedy and Johnson administrations have been released and here is what Steve concludes:

In any case, at least when it comes to Mao and China, the PDBs released to date suggest that the CIA may have fought hard to keep the these documents secret not to protect “sources and methods,” but rather to conceal its inability to recruit sources and failure to provide sophisticated analyses.

Past habits of the intelligence community explain rather well why they have no, repeat no, examples of how strong encryption has interfered with national security. There are none.

The paranoia about “crypto backdoors” is another way to engage in “known to be useless” action. It puts butts in seats and inflates agency budgets.

Unlike Robert, should Congress ban strong cryptography, I won’t be moving to a non-extradition country. Some of us need to be here when local police come to their senses and defect.

Google Paywall Loophole Going Bye-Bye [Fair Use Driving Pay-Per-View Traffic]

February 3rd, 2016

The Wall Street Journal tests closing the Google paywall loophole by Lucia Moses.

From the post:

The Wall Street Journal has long had a strict paywall — unless you simply copy and paste the headline into Google, a favored route for those not wanting to pony up $200 a year. Some users have noticed in recent days that the trick isn’t working.

A Journal spokesperson said the publisher was running a test to see if doing so would entice would-be subscribers to pay up. The rep wouldn’t elaborate on how long and extensive the experiment was and if permanently closing the loophole was a possible outcome.

“We are experimenting with a number of different trial mechanics at the moment to provide a better subscription taster for potential new customers,” the rep said. “We are a subscription site and we are always looking at better ways to optimize The Wall Street Journal experience for our members.”

The Wall Street Journal can deprive itself of the benefits of “fair use” if it wants to, but is that a sensible position?

Fair Use Benefits the Wall Street Journal

Rather than a total ban on copying, what if the amount of an article that can be copied is set by algorithm? Such that at a minimum, the first two or three paragraphs of any story can be copied, whether you arrive from Google or directly on the WSJ site.

Think about it. Wall Street Journal readers aren’t paying to skim the lead paragraphs in the WSJ. They are paying to see the full story and analysis in particular subject areas.

Bloggers, such as myself, cannot drive content seekers to the WSJ because the first sentence or two isn’t enough for readers to develop an interest in the WSJ report.

If I could quote the first 2 or 3 paragraphs, add in some commentary and perhaps other links, then a visitor to the WSJ is visiting to see the full content the Wall Street Journal has to offer.

The story lead is acting, as it should, to drive traffic to the Wall Street Journal, possibly from readers who won’t otherwise think of the Wall Street Journal. Some of my readers on non-American/European continents for example.
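For the curious, the “set by algorithm” idea could be as small as the sketch below. It is my own illustration, not anything the WSJ runs: the function name, the two-paragraph floor, and the 10% share are all assumptions, and paragraphs are assumed to be separated by blank lines.

```python
def fair_use_excerpt(article_text, min_paragraphs=2, share=0.10):
    """Return the leading paragraphs of an article as a quotable excerpt.

    Hypothetical rule: quote `share` of the article's paragraphs,
    but never fewer than `min_paragraphs`.
    """
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = [p.strip() for p in article_text.split("\n\n") if p.strip()]
    # Longer stories allow longer excerpts; short ones still get the floor.
    allowed = max(min_paragraphs, int(len(paragraphs) * share))
    return "\n\n".join(paragraphs[:allowed])
```

The same rule would apply whether the reader arrives from Google or directly, and the floor and share are exactly the knobs a publisher would tune.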

Bloggers Driving Readers to Wall Street Journal Pay-Per-View Content

Developing algorithmic fair use as I describe would enlist an army of bloggers in spreading notice of pay-per-view content of the Wall Street Journal, at no expense to the Wall Street Journal. As a matter of fact, bloggers would be alerting readers to pay-per-view WSJ content, at the blogger’s own expense.

It may just be me but if someone were going to drive viewers to pay-per-view content on my site, at their own expense, with fair use of content, I would be insane to prevent that. But, I’m not the one grasping at dimes while $100 bills are flying overhead.

Close the Loophole, Open Up Fair Use

Full disclosure, I don’t have any evidence for fair use driving traffic to the Wall Street Journal because that evidence doesn’t exist. The Wall Street Journal would have to enable fair use and track appearance of fair use content and the traffic originating from it. Along with conversions from that additional traffic.

Straightforward data analytics, but it won’t happen by itself. When the WSJ succeeds with such a model, you can be sure that other paywall publishers will be quick to follow suit.
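A rough sketch of the tracking involved, assuming each visit is logged as a hypothetical (referrer, subscribed) pair. The record format and function name are mine, and, as disclosed above, no real data of this kind exists yet.

```python
from collections import defaultdict

def conversions_by_referrer(visits):
    """Summarize visits as {referrer: (visit_count, conversions, rate)}.

    `visits` is an iterable of (referrer, subscribed) pairs, a made-up
    log format used only for illustration.
    """
    counts = defaultdict(lambda: [0, 0])  # referrer -> [visits, conversions]
    for referrer, subscribed in visits:
        counts[referrer][0] += 1
        if subscribed:
            counts[referrer][1] += 1
    # Every referrer in counts has at least one visit, so n is never zero.
    return {ref: (n, c, c / n) for ref, (n, c) in counts.items()}
```

Comparing the rate for traffic arriving from fair use excerpts against other referrers is the evidence the WSJ would need to collect.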

Caveat: Yes, there will be people who will only ever consume the free use content. And your question? If they aren’t ever going to be paying customers and the same fair use is delivering paying customers, will you lose the latter in order to spite the former?

Isn’t that like cutting off your nose to spite your face?

Historical PS:

I once worked for a publisher that felt a “moral obligation,” their words, not mine, to prevent anyone from claiming a missing journal issue to which they might not be entitled. Yeah. Journal issues that were as popular as the Watchtower is among non-Jehovah’s Witnesses. Cost to the publisher, about $3.00 per issue, cost to verify entitlement, a full time position at the publisher.

I suspect claims ran less than 200 per year. My suggestion was to answer any request with thanks, here’s your missing copy. End of transaction. Track claims only to prevent abuse. Moral outrage followed.

Is morality the basis for your pay-per-view access policy? I thought pay-per-view was a way to make money.

Pass this post along to the WSJ if you know anyone there. Free suggestion. Perhaps they will be interested in other, non-free suggestions.