SIGIR 2015 Technical Track

May 4th, 2015

SIGIR 2015 Technical Track

The list of accepted papers for SIGIR 2015 Technical Track has been published!

As if you need any further justification to attend the conference in Santiago, Chile, August 9-13, 2015.

Curious, would anyone be interested in a program listing that links the authors to their DBLP listings? Just in case you want to catch up on their recent publications before the conference?


Notes on Theory of Distributed Systems

May 4th, 2015

Notes on Theory of Distributed Systems by James Aspnes.

From the preface:

These are notes for the Spring 2014 semester version of the Yale course CPSC 465/565 Theory of Distributed Systems. This document also incorporates the lecture schedule and assignments, as well as some sample assignments from previous semesters. Because this is a work in progress, it will be updated frequently over the course of the semester.

Notes from Fall 2011 can be found at

Notes from earlier semesters can be found at

Much of the structure of the course follows the textbook, Attiya and Welch’s Distributed Computing [AW04], with some topics based on Lynch’s Distributed Algorithms [Lyn96] and additional readings from the research literature. In most cases you’ll find these materials contain much more detail than what is presented here, so it is better to consider this document a supplement to them than to treat it as your primary source of information.

When something exceeds three hundred (> 300) pages, I have trouble calling it “notes.” ;-)

A treasure trove of information on distributed computing.

I first saw this in a tweet by Henry Robinson.

Breaking the Silence – Gaza – “There were no rules”

May 4th, 2015

New report details how Israeli soldiers killed civilians in Gaza: “There were no rules” by William Booth.

From the post:

On Monday, an organization of Israeli soldiers known as “Breaking the Silence” released a report containing testimonies from more than 60 officers and soldiers from the Israel Defense Forces who served during the 50-day war against Hamas militants last summer in the Gaza Strip.

An Israel Defense Forces spokesman declined to respond to details in the report, saying Breaking the Silence refuses to share information with the IDF “in a manner which would allow a proper response, and if required, investigation.” The spokesman added that “contrary to their claims, this organization does not act with the intention of correcting any wrongdoings they allegedly uncovered.”

The soldiers who testified received guarantees of anonymity from Breaking the Silence. The 240-page book in English can be found online here.

Don’t you like the IDF response:

in a manner which would allow a proper response, and if required, investigation.

Meaning, of course, not anonymously but with names, including who was with you (other people who could be pressured), plus the attendant damage to your career or future job prospects, etc.

Not to single out the IDF for criticism. Virtually the same response has been given by the U.S. military for a variety of issues.

Governments and their military services fear transparency because transparency could lead to accountability. Civilians should not second-guess decisions made in the heat of battle by combat troops. Their leaders, who made decisions for political gain, should certainly be called to account.

Running Spark GraphX algorithms on Library of Congress subject heading SKOS

May 4th, 2015

Running Spark GraphX algorithms on Library of Congress subject heading SKOS by Bob DuCharme.

From the post:

Well, one algorithm, but a very cool one.

Last month, in Spark and SPARQL; RDF Graphs and GraphX, I described how Apache Spark has emerged as a more efficient alternative to MapReduce for distributing computing jobs across clusters. I also described how Spark’s GraphX library lets you do this kind of computing on graph data structures and how I had some ideas for using it with RDF data. My goal was to use RDF technology on GraphX data and vice versa to demonstrate how they could help each other, and I demonstrated the former with a Scala program that output some GraphX data as RDF and then showed some SPARQL queries to run on that RDF.

Today I’m demonstrating the latter by reading in a well-known RDF dataset and executing GraphX’s Connected Components algorithm on it. This algorithm collects nodes into groupings that connect to each other but not to any other nodes. In classic Big Data scenarios, this helps applications perform tasks such as the identification of subnetworks of people within larger networks, giving clues about which products or cat videos to suggest to those people based on what their friends liked.

As so typically happens when you are reading one Bob DuCharme post, you see another one that requires reading!

Bob covers storing RDF in RDD (Resilient Distributed Dataset), the basic Spark data structure, creating the report on connected components and ends with heavily commented code for his program.
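If connected components are new to you, here is a minimal sketch of the idea in plain Python using union-find. It is not Bob's Scala program and not GraphX's implementation; the `related` pairs are hypothetical, loosely modeled on the subject headings discussed in this post.

```python
from collections import defaultdict


def connected_components(edges):
    """Group nodes so that each group is connected internally and to nothing else."""
    parent = {}

    def find(node):
        # Follow parent links to the group's representative, compressing the path.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two groups

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical "related" pairs (assumed for illustration, not taken from the LoC data).
related = [
    ("Privacy", "Solitude"),
    ("Solitude", "Loneliness"),
    ("Privacy", "Hiding places"),
    ("Cats", "Kittens"),
]

components = connected_components(related)
for component in components:
    print(component)
```

Note that the grouping says only that the headings are linked, not how or why, which is the limitation discussed next.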

Sadly the “related” values assigned by the Library of Congress don’t say how or why the values are related, such as:

“Hiding places”





Related values could be useful in some cases but if I am searching on “privacy,” as in the sense of being free from government intrusion, then “solitude,” “loneliness,” and “hiding places” aren’t likely to be helpful.

That’s not a problem with Spark or SKOS, but a limitation of the data being provided.

SPARQL in 11 minutes (Bob DuCharme)

May 4th, 2015

From the description:

An introduction to the W3C query language for RDF. See for more.

I first saw this in Bob DuCharme’s post: SPARQL: the video.

Nothing new for old hands but useful to pass on to newcomers.

I say nothing new, I did learn that Bob has a Korg Monotron synthesizer. Looking forward to more “accompanied” blog posts. ;-)

FOIA and 5,000 Blank Pages

May 4th, 2015

FBI replies to Stingray Freedom of Information request with 5,000 blank pages by Cory Doctorow.

Cory has a great post on FBI stonewalling on information about “Stingrays,” devices that act as cell phone towers to gather information from cell phone users.

The FBI response illustrates the issue I raised in Debating Public Policy, On The Basis of Fictions, which was:

To hold government accountable, its citizens need to know what government has been doing, to whom and why.

There is no place in the Constitution that says citizens are entitled only to some information, to a little information, to the information the executive branch decides to share (or the legislative branch for that matter), etc.

Every blank page in that FOIA answer diminishes your right as a citizen to control your government. That’s the part the FBI keeps overlooking. It’s not their government, it is not the government of the NSA, it is the government of every voting citizen.

“The ultimate goal is evidence-based data analysis”

May 4th, 2015

Statistics: P values are just the tip of the iceberg by Jeffrey T. Leek & Roger D. Peng.

From the summary:

Ridding science of shoddy statistics will require scrutiny of every step, not merely the last one, say Jeffrey T. Leek and Roger D. Peng.

From the post:


Leek and Peng are right but I would shy away from ever claiming “…evidence-based data analysis.”

You can disclose the choices you make at every stage of the data pipeline but the result isn’t “…evidence-based data analysis.”

I say that because “…evidence-based data analysis” implies that whatever the result, human agency wasn’t a factor in it. On the contrary, an ineffable part of human judgement is a part of every data analysis.

The purpose of documenting the details of each step is to enable discussion and debate about the choices made in the process.

Just as I object to politicians wrapping themselves in national flags, I equally object to anyone wrapping themselves in “evidence/facts” as though they and only they possess them.

Montage Mosaics The Pillars Of Creation!

May 4th, 2015

Montage Mosaics The Pillars Of Creation!

From the post:

The Pillars of Creation in the Eagle Nebula (M16) remain one of the iconic images of the Hubble Space Telescope. Three pillars rise from a molecular cloud into an enormous HII region, powered by the massive young cluster NGC 6611. Such pillars are common in regions of massive star formation, where they form as a result of ionization and stellar winds.

In a paper that will shortly be published in MNRAS, entitled “The Pillars of Creation revisited with MUSE: gas kinematics and high-mass stellar feedback traced by optical spectroscopy,” McLeod et al (2015) analyze new data acquired with the Multi Unit Spectroscopy Explorer (MUSE) instrument on the VLT. They used Montage to create integrated line maps of the single pointings obtained at the telescope. The figure below shows an example of these maps:


The images were too spectacular to pass without reposting.

Also a reminder that “national security” posturing has all the significance of a peacock spreading its feathers. Of interest to other peacocks, possibly female ones, not of much interest to anyone else.


It’s too bad that Hieronymus Bosch isn’t still around. I can easily imagine him painting “The Garden of Paranoid Delights,” for the security establishment.

Association Discovery Beyond Ten Variables – Prioritized Chordalysis

May 4th, 2015

Scaling log-linear analysis to datasets with thousands of variables by François Petitjean and Geoffrey I. Webb.


Association discovery is a fundamental data mining task. The primary statistical approach to association discovery between variables is log-linear analysis. Classical approaches to log-linear analysis do not scale beyond about ten variables. We have recently shown that, if we ensure that the graph supporting the log-linear model is chordal, log-linear analysis can be applied to datasets with hundreds of variables without sacrificing the statistical soundness [21]. However, further scalability remained limited, because state-of-the-art techniques have to examine every edge at every step of the search. This paper makes the following contributions: 1) we prove that only a very small subset of edges has to be considered at each step of the search; 2) we demonstrate how to efficiently find this subset of edges and 3) we show how to efficiently keep track of the best edges to be subsequently added to the initial model. Our experiments, carried out on real datasets with up to 2000 variables, show that our contributions make it possible to gain about 4 orders of magnitude, making log-linear analysis of datasets with thousands of variables possible in seconds instead of days.

The authors reduce the number of edges required to be examined in one example from 10,000,000 to 10,000, with corresponding savings in computation time. That was not an artifact of the data set but has been generalized by the authors and released as open source code: Chordalysis (GitHub).

If you prefer numbers, the analysis of a data set with 10,000,000 edges went from 39 hours to 27 seconds, a speedup of more than 5200X.
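A quick check of that arithmetic, using only the rounded figures quoted above (39 hours and 27 seconds); the variable names are mine.

```python
import math

before_seconds = 39 * 60 * 60  # 39 hours expressed in seconds
after_seconds = 27

speedup = before_seconds / after_seconds
orders_of_magnitude = math.log10(speedup)

print(f"speedup: {speedup:.0f}x")
print(f"orders of magnitude: {orders_of_magnitude:.2f}")
```

With these rounded inputs the ratio comes out at 5,200, a little under four orders of magnitude, which is in line with the "about 4 orders of magnitude" claimed in the abstract.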

Definitely an addition to your data mining toolkit!

Distributed Machine Learning with Apache Mahout

May 4th, 2015

Distributed Machine Learning with Apache Mahout by Ian Pointer and Dr. Ir. Linda Terlouw.

The Refcard for Mahout takes a different approach from many other DZone Refcards.

Instead of a plethora of switches and commands, it covers two basic tasks:

  • Training and testing a Random Forest for handwriting recognition using Amazon Web Services EMR
  • Running a recommendation engine on a standalone Spark cluster

Different style from the usual Refcard but a welcome addition to the documentation available for Apache Mahout!


How fear and self-preservation are driving a cyber arms race disaster

May 3rd, 2015

How fear and self-preservation are driving a cyber arms race by Max Taves.

From the post:

When a man was fired from his job in Minneapolis, Minn., last May, he inadvertently touched off a boom in Silicon Valley.

Gregg Steinhafel, then a 35-year veteran of Target and its CEO, was shown the door after hackers infiltrated the retailer’s computer systems, stealing 70 million shoppers’ information and 40 million credit and debit card numbers. It turned out the hack might have been prevented, had the company not ignored warnings from its own security systems.

It happened again in December, when Amy Pascal, one of the most powerful women in Hollywood, was fired from her job heading up Sony Pictures after hackers exposed thousands of financial documents and emails revealing the film studio’s inner secrets. The hack captured the world’s attention and elicited criticism from customers, industry leaders and even the president of the United States.

Pascal’s and Steinhafel’s exits sent shockwaves through corporate America. The message was clear: Top executives will be held responsible for their companies’ cybersecurity failings.

The result, venture capitalists say, has been a boom for cybersecurity startups. In ways that previous attacks on consumers never did, the firings have sparked a scramble for new security technology by companies desperate to head off the next costly, embarrassing cyberattack. And venture capitalists are responding, pouring unprecedented billions into a dizzying array of young companies and their, largely, untested products.

Last year, these companies received an aggregate $2.39 billion in funding, a 35 percent increase over 2013, according to venture capital data firm CB Insights. That’s the most money that’s been funneled into cybersecurity companies ever. Silicon Valley is betting companies have woken up to the real dangers of living in the Internet age.

(emphasis added)

Wait! Do you remember the graphic for point-of-sales systems?


The security faults of these systems are in software.

So, $2.39 billion is being invested in software (which will have vulnerabilities) to sit on top of already vulnerable systems.

Somehow, that fails to fill me with warm fuzzies.

Funding research on better software engineering techniques, research on and adoption of standard software practices, funding dissemination of security research and information, etc., would all be positive contributions to improving computer security.

Using techniques known to produce vulnerable software and expecting an improvement in security is, by definition, insanity.

Advisers to venture capitalists need to check their E&O policies before advising clients to invest in security software.

Debating Public Policy, On The Basis of Fictions

May 3rd, 2015

Striking a Balance—Whistleblowing, Leaks, and Security Secrets by Cody Poplin.

From the post:

Last weekend, the New York Times published an article outlining the strength of congressional support for the CIA targeted killing program. In the story, the Times also purported to reveal the identities of three covert CIA operatives who now hold senior leadership roles within the Agency.

As you might expect, the decision generated a great deal of controversy, which Lawfare covered here and here. Later in the week, Jack Goldsmith interviewed Executive Editor of the New York Times Dean Baquet to discuss the decision. That conversation also prompted responses from Ben, Mark Mazzetti (one of the authors of the piece), and an anonymous intelligence community reader.

Following the Times’ story, the Johns Hopkins University Center for Advanced Governmental Studies, along with the James Madison Project and our friends at Just Security, hosted a timely conference on Secrecy, Openness and National Security: Lessons and Issues for the Next Administration. In a panel entitled Whistleblowing and America’s Secrets: Ensuring a Viable Balance, Bob Litt, General Counsel for the Office of the Director of National Intelligence, blasted the Times, saying that the paper had “disgraced itself.”

However, the panel—which with permission from the Center for Advanced Governmental Studies, we now present in full—covered much more than the latest leak published in the Times. In a conversation moderated by Mark Zaid, the Executive Director of the James Madison Project, Litt, along with Ken Dilanian, Dr. Gabriel Schoenfeld, and Steve Vladeck, tackled a vast array of important legal and policy questions surrounding classified leak prosecutions, the responsibilities of the press, whistleblower protections, and the future of the Espionage Act.

It’s a jam-packed discussion full of candid exchanges—some testy, most cordial—that greatly raises the dialogue on the recent history of leaks, prosecutions, and future lessons for the next Administration.

Spirited debate but on the basis of known fictions.

For example, Bob Litt, General Counsel for the Office of the Director of National Intelligence, poses a hypothetical question that compares an alleged suppression of information about the Bay of Pigs invasion to whether a news organization would be justified in leaking the details of plans to assassinate Osama bin Laden.

The premise of the hypothetical is flawed. It is based on an alleged statement by President Kennedy wishing the New York Times had published the details in their possession. One assumes so that public reaction would have prevented the ensuing disaster.

The story of President Kennedy suppressing a story in the New York Times about the Bay of Pigs is a myth.

Busting the NYTimes suppression myth, 50 years on reports:

Indeed, the Times’ purported spiking has been called the “symbolic journalistic event of the 1960s.”

Only the Times didn’t censor itself.

It didn’t kill, spike, or otherwise emasculate the news report published 50 years ago tomorrow that lies at the heart of this media myth.

That article was written by a veteran Times correspondent named Tad Szulc, who reported that 5,000 to 6,000 Cuban exiles had received military training for a mission to topple Fidel Castro’s regime; the actual number of invaders was about 1,400.

The story, “Anti-Castro Units Trained At Florida Bases,” ran on April 7, 1961, above the fold on the front page of the New York Times.

The invasion of the Bay of Pigs happened ten days later, April 17, 1961.

Hardly sounds like suppression of the story does it?

That is just one fiction that formed the basis for part of the discussion in this podcast.

Another fiction is that leaked national security information, some of Edward Snowden’s materials for example, was damaging to national security. Except that those who claim to know can’t say what information or how it was damaging.

Without answers to what information and how it was damaging to national security, their claims of “damage to national security” should go straight into the myth bin. The unbroken record of leaks shows illegal activity, incompetence, waste and avoidance of responsibility. None of those are in the national interest.

If the media does want to act in the “public interest,” then it should stop repeating unsubstantiated claims of damage to the “national interest” by the security community. Repeating falsehoods does not make them useful for debates of public policy. When advanced, such claims should be challenged and then excluded from further discussion unless accompanied by sufficient details for the public to reach its own conclusion about the claim.

Another myth in this discussion is the assumption that the media has an in loco parentis role vis-a-vis the public. That media representatives should act on the public’s behalf in determining what is or is not in the “public interest.” Complete surprise to me and I have read the Constitution more than once or twice.

I don’t remember seeing the media called out in the Constitution as guardians for a public too stupid to decide matters of public policy for itself.

That is the central flaw with national security laws and the rights of leakers and leakees. The government of the United States, for those unfamiliar with the Constitution, is answerable under the Constitution to the citizens of the United States. Not any branch of government or its agencies but to the citizens.

There are no exceptions to the United States government being accountable to its citizens. Not one. To hold government accountable, its citizens need to know what government has been doing, to whom and why. The government has labored long and hard, especially its security services, to avoid accountability to its citizens. Starting shortly after its inception.

There should be no penalties for leakers or leakees. Leaks will cause hardships, such as careers ending due to dishonesty, incompetence, waste and covering for others engaged in the same. If you don’t like that, move to a country where the government isn’t answerable to its citizens. May I suggest Qatar?

You Can Help Keep Others Secure (Use Tor)

May 3rd, 2015

Tor Browser 4.5 released by Mike Perry.

From the post:

The Tor Browser Team is proud to announce the first stable release in the 4.5 series. This release is available from the Tor Browser Project page and also from our distribution directory.

The 4.5 series provides significant usability, security, and privacy enhancements over the 4.0 series. Because these changes are significant, we will be delaying the automatic update of 4.0 users to the 4.5 series for one week.

Time to upgrade!

Why use Tor?

The Tor network is a group of volunteer-operated servers that allows people to improve their privacy and security on the Internet. Tor’s users employ this network by connecting through a series of virtual tunnels rather than making a direct connection, thus allowing both organizations and individuals to share information over public networks without compromising their privacy. Along the same line, Tor is an effective censorship circumvention tool, allowing its users to reach otherwise blocked destinations or content. Tor can also be used as a building block for software developers to create new communication tools with built-in privacy features.

Individuals use Tor to keep websites from tracking them and their family members, or to connect to news sites, instant messaging services, or the like when these are blocked by their local Internet providers. Tor’s hidden services let users publish web sites and other services without needing to reveal the location of the site. Individuals also use Tor for socially sensitive communication: chat rooms and web forums for rape and abuse survivors, or people with illnesses.

Journalists use Tor to communicate more safely with whistleblowers and dissidents. Non-governmental organizations (NGOs) use Tor to allow their workers to connect to their home website while they’re in a foreign country, without notifying everybody nearby that they’re working with that organization.

Groups such as Indymedia recommend Tor for safeguarding their members’ online privacy and security. Activist groups like the Electronic Frontier Foundation (EFF) recommend Tor as a mechanism for maintaining civil liberties online. Corporations use Tor as a safe way to conduct competitive analysis, and to protect sensitive procurement patterns from eavesdroppers. They also use it to replace traditional VPNs, which reveal the exact amount and timing of communication. Which locations have employees working late? Which locations have employees consulting job-hunting websites? Which research divisions are communicating with the company’s patent lawyers?

A branch of the U.S. Navy uses Tor for open source intelligence gathering, and one of its teams used Tor while deployed in the Middle East recently. Law enforcement uses Tor for visiting or surveilling web sites without leaving government IP addresses in their web logs, and for security during sting operations.

The variety of people who use Tor is actually part of what makes it so secure. Tor hides you among the other users on the network, so the more populous and diverse the user base for Tor is, the more your anonymity will be protected. (From

If you are concerned about privacy, yours and that of others, use a Tor browser by default.
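To put a rough number on "hiding among other users," here is a toy model, not anything Tor itself computes: treat an observer's uncertainty about which user is behind a connection as Shannon entropy over the candidate users. The function names and the user counts are assumptions chosen for illustration.

```python
import math


def anonymity_bits(probabilities):
    """Shannon entropy, in bits, of an observer's guess over candidate users."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def uniform(n):
    """Every one of n users is an equally likely suspect."""
    return [1 / n] * n


small = anonymity_bits(uniform(8))      # a handful of users
large = anonymity_bits(uniform(1024))   # many more users to hide among
# Same 1024 users, but one kind of user stands out as 90% likely.
skewed = anonymity_bits([0.9] + [0.1 / 1023] * 1023)

print(f"8 users:             {small:.2f} bits")
print(f"1024 users:          {large:.2f} bits")
print(f"1024 users (skewed): {skewed:.2f} bits")
```

In this toy model more users means more uncertainty for the observer, and a crowd where one user stands out offers far less cover than its size suggests. That is the point about a populous and diverse user base.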

Sony Emails and Dilbert Cartoons

May 2nd, 2015

WikiLeaks Adds More Hacked Emails From Sony Pictures Entertainment by Sohini Auddy.

From the post:

WikiLeaks has added thousands more of Sony Pictures Entertainment’s hacked emails in its database, as mentioned in a Twitter post on Thursday.

Sony has yet to develop a sense of humor over the hack attack late last year.

Suggestion: Search the Sony emails at WikiLeaks and then the Dilbert archives for a matching Dilbert cartoon.

Tweet the link for the Sony email and your matching Dilbert cartoon, #sonydilbert.

Let’s try that for a week, ending May 9, 2015.

Tweet with the most retweets will be declared the winner by acclamation. (Contest not open to Sony managers.)


Homonyms on EOL

May 2nd, 2015

Homonyms on EOL [Encyclopedia of Life]

From the webpage:

Please join the Homonym Hunters community and help us find all the homonyms on EOL!

This collection is for all kinds of homonyms:

Cross-code homonyms

Homonyms across nomenclatural codes (ICBN, ICZN, ICNB, ICTV) are allowed, so there are plenty of them. Example: Satyrium, the orchid genus and Satyrium, the butterfly genus.

Cross-rank homonyms

At least in zoological nomenclature, homonyms are allowed if they refer to groups at different ranks. Example: Polyphaga, the roach genus and Polyphaga, the beetle suborder.

Invalid homonyms

Within codes and ranks, homonyms are not allowed, so only one of the homonymous names can be valid/accepted. If EOL gets these invalid names from a provider, we will have a page for it. Example: Acanthurus, the surgeon fish genus and Acanthurus, the weevil genus.

Comprehensive lists of homonyms have also been compiled elsewhere:

Systema Naturae 2000: Homonyms

Wikispecies: List of valid homonyms

In topic map parlance, the identification of homonyms across nomenclatural codes and across different ranks translates into setting the scope on a homonym.

That helps both people and machines in distinguishing homonyms.
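A minimal sketch of what scope buys you when merging, in plain Python rather than any topic map engine. The records and the `merge_by` helper are hypothetical; the names come from the EOL examples above, and the nomenclatural code assigned to each is my assumption.

```python
from collections import defaultdict

# Hypothetical records: (name, nomenclatural code, rank, description).
# The code assigned to each record is an assumption for illustration.
records = [
    ("Satyrium", "ICBN", "genus", "orchid genus"),
    ("Satyrium", "ICZN", "genus", "butterfly genus"),
    ("Polyphaga", "ICZN", "genus", "roach genus"),
    ("Polyphaga", "ICZN", "suborder", "beetle suborder"),
]


def merge_by(records, key):
    """Merge records that share the same key, returning key -> descriptions."""
    merged = defaultdict(list)
    for name, code, rank, description in records:
        merged[key(name, code, rank)].append(description)
    return dict(merged)


# Keying on the bare name conflates unrelated taxa.
by_name = merge_by(records, lambda name, code, rank: name)

# Using code and rank as scope keeps each homonym separate.
by_scoped_name = merge_by(records, lambda name, code, rank: (name, code, rank))

print(len(by_name))
print(len(by_scoped_name))
```

Merging on the bare name lumps the orchid with the butterfly; merging on the scoped name keeps all four subjects distinct.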

For merging purposes, that also helps merge homonyms correctly. For example, Aaron Black tweeted:


As seen in the Washington Post.

Close to being a homonym anyway. ;-) I could distinguish Kirstie Alley from any possible Christie ally, even on a bad day. Our machines, not so much.

HT: Sam Hunting for the tweet.

On The Bleeding Edge – PySpark, DataFrames, and Cassandra

May 2nd, 2015

On The Bleeding Edge – PySpark, DataFrames, and Cassandra.

From the post:

A few months ago I wrote a post on Getting Started with Cassandra and Spark.

I’ve worked with Pandas for some small personal projects and found it very useful. The key feature is the data frame, which comes from R. Data Frames are new in Spark 1.3 and were covered in this blog post. Till now I’ve had to write Scala in order to use Spark. This has resulted in me spending a lot of time looking for libraries that would normally take me less than a second to recall the proper Python library (JSON being an example) since I don’t know Scala very well.

If you need help deciding whether to read this post, take a look at Spark SQL and DataFrame Guide to see what you stand to gain.


Intelligent Life in Congress!

May 2nd, 2015

Congressman with computer science degree: Encryption back-doors are ‘technologically stupid’ by Andrea Peterson.

Two tidbits from a must-read story:

“It is clear to me that creating a pathway for decryption only for good guys is technologically stupid,” said Rep. Ted Lieu (D-Calif.), who has a bachelor’s in computer science from Stanford University. “You just can’t do that.”

Subcommittee Chair Will Hurd (R-Tex.), who also has a computer science degree and worked in information security after nearly a decade at the CIA, shared Lieu’s skepticism of the security of such back doors. As did Rep. Blake Farenthold (R-Tex.), who asked the panel of witnesses to raise their hands if they thought it was possible to build a technically secure back-door — often mockingly called a “golden key” — into modern encryption systems.

None of them did — including Amy Hess, executive assistant director of the FBI’s Science and Technology Branch, and Daniel F. Conley, the district attorney for Suffolk County in Massachusetts. Conley at one point argued that companies like Apple are protecting “those who rape, defraud, assault, or even kill” with their encryption policies. (Lieu later said he took “great offense” at this comment, which he called a “fundamental misunderstanding of the problem.”)

Since I am so quick to point out dumb things that members of Congress do or say, it’s only appropriate that I highlight when one or more of them does something right.

No promises you will be able to contact either representative, given the provincialism of members of Congress who communicate only with members of their own districts, but it’s worth a shot.

Rep. Ted Lieu (D-Calif)

Washington, DC Office
415 Cannon House Office Building
Washington, DC 20515
Phone: (202) 225-3976

Rep. Will Hurd (R-Tex.)

Washington, DC Office
317 Cannon House Office Building
Washington, DC 20515
Phone: (202) 225-4511

Cheer them on and send money if you can.

New Natural Language Processing and NLTK Videos

May 2nd, 2015

Natural Language Processing With Python and NLTK p.1 Tokenizing words and Sentences and Stop Words – Natural Language Processing With Python and NLTK p.2 by Harrison Kinsley.

From part 1:

Natural Language Processing is the task we give computers to read and understand (process) written text (natural language). By far, the most popular toolkit or API to do natural language processing is the Natural Language Toolkit for the Python programming language.

The NLTK module comes packed full of everything from trained algorithms to identify parts of speech to unsupervised machine learning algorithms to help you train your own machine to understand a specific bit of text.

NLTK also comes with a large corpora of data sets containing things like chat logs, movie reviews, journals, and much more!

Bottom line, if you’re going to be doing natural language processing, you should definitely look into NLTK!

Playlist link:…

sample code:

Use the Playlist link:… link as I am sure more videos will be appearing in the near future.
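To give a flavor of what the first two videos cover, here is a rough sketch of sentence tokenizing, word tokenizing and stop word removal using only the Python standard library. It is not the code from the videos and does not use NLTK; the regular expressions are naive and the stop word list is a tiny hypothetical one.

```python
import re

# A tiny, hypothetical stop word list; NLTK ships a much longer one.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "this", "to"}


def sentence_tokenize(text):
    """Split text after ., ! or ? followed by whitespace (a naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def word_tokenize(sentence):
    """Return lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


text = "This is an example of tokenizing. The stop words are removed!"
sentences = sentence_tokenize(text)
filtered = [remove_stop_words(word_tokenize(s)) for s in sentences]

print(sentences)
print(filtered)
```

A real tokenizer has to cope with abbreviations, contractions and the like, which is a good reason to reach for NLTK instead of rolling your own.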


The Power of Symmetry

May 2nd, 2015

The Power of Symmetry by Felienne Hermans.

From the description:

This presentation by @Felienne presents programming problems, and how they can be solved efficiently and elegantly using symmetry.

The description is true but fails to capture the elegance of Felienne’s presentation as she uses symmetry to dramatically reduce the number of states in classic programming problems.

Highly recommended if you need to “wow” a student or class with what is possible by looking just a bit deeper at a problem.

Replication in Psychology?

May 1st, 2015

First results from psychology’s largest reproducibility test by Monya Baker.

From the post:

An ambitious effort to replicate 100 research findings in psychology ended last week — and the data look worrying. Results posted online on 24 April, which have not yet been peer-reviewed, suggest that key findings from only 39 of the published studies could be reproduced.

But the situation is more nuanced than the top-line numbers suggest (See graphic, ‘Reliability test’). Of the 61 non-replicated studies, scientists classed 24 as producing findings at least “moderately similar” to those of the original experiments, even though they did not meet pre-established criteria, such as statistical significance, that would count as a successful replication.

The project, known as the “Reproducibility Project: Psychology”, is the largest of a wave of collaborative attempts to replicate previously published work, following reports of fraud and faulty statistical analysis as well as heated arguments about whether classic psychology studies were robust. One such effort, the ‘Many Labs’ project, successfully reproduced the findings of 10 of 13 well-known studies.
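The numbers in that excerpt are easy to lose track of, so here they are worked through using only the figures quoted above; the variable names are mine.

```python
total_studies = 100
replicated = 39
moderately_similar = 24  # counted among the non-replicated studies

not_replicated = total_studies - replicated
replicated_or_similar = replicated + moderately_similar
many_labs_rate = 10 / 13

print(not_replicated)
print(replicated_or_similar)
print(f"{many_labs_rate:.0%}")
```

So 63 of the 100 studies either replicated or came out at least "moderately similar," against roughly three in four for the Many Labs effort.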

Replication is a “hot” issue and likely to get hotter if peer review shifts to be “open.”

Do you really want to be listed as a peer reviewer for a study that cannot be replicated?

Perhaps open peer review will lead to more accountability of peer reviewers.
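
A quick tally of the figures quoted above, as a sketch. The variable names are mine; the numbers come straight from the quoted post.

```python
# Figures from the quoted Nature post.
total_studies = 100
replicated = 39
moderately_similar = 24   # non-replicated, but judged at least "moderately similar"

# Studies that failed the pre-established replication criteria.
not_replicated = total_studies - replicated

# Studies whose findings were at least moderately similar to the originals.
at_least_similar = replicated + moderately_similar

# Non-replicated studies that were not even moderately similar.
clearly_different = not_replicated - moderately_similar

# The 'Many Labs' project: 10 of 13 studies reproduced.
many_labs_rate = 10 / 13

print(f"not replicated:       {not_replicated} of {total_studies}")
print(f"at least similar:     {at_least_similar} of {total_studies}")
print(f"clearly different:    {clearly_different} of {total_studies}")
print(f"Many Labs success:    {many_labs_rate:.0%}")
```

Read that way, the count of studies that were at least “moderately similar” is 63 of 100, which is the nuance the top-line 39 hides.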


Security Incentives With Bite?

May 1st, 2015

SEC Releases Cybersecurity Guidance, Highlights Compliance Role

From the post:

The SEC’s Division of Investment Management recently released cybersecurity guidance highlighting best practices and warning that cybersecurity breaches and deficiencies in cybersecurity programs could cause funds and advisers to run afoul of securities laws. Importantly, the guidance places significant obligations on compliance officers to ensure that funds have adopted adequate cybersecurity policies and procedures.

The guidance recommends that funds and advisers conduct periodic cybersecurity assessments; create a strategy to prevent, identify, and respond to cyber threats; and implement the strategy through policies, procedures, and training that help to guide officers and employees and monitor compliance. According to the guidance, periodic assessments should include attention to internal and external vulnerabilities as well as the likely effects of a breach so that funds and advisers can better assess and mitigate risk. With respect to cybersecurity strategies, funds and advisers should consider exerting tighter control over data access, ramping up encryption, limiting the use of removable storage media to prevent data theft, monitoring system access, backing up data, developing an incident response plan, and implementing routine testing.

First step, make cybersecurity breaches into violations of something important, like securities laws.

Second step, prosecute violations of securities laws rooted in cybersecurity breaches.

Third step, defendants in securities actions take an interest in spreading the joy of securities liabilities.

Fourth step, software liability doctrines develop in the context of securities litigation.

Liability for software defects is coming.

The question is whether it will develop piecemeal and unexpectedly, or will it develop in a comprehensive and moderated fashion?

How’s your appetite for risk?

I first saw this in a tweet by Milo Camacho.

Large-Scale Social Phenomena – Data Mining Demo

May 1st, 2015

Large-Scale Social Phenomena – Data Mining Demo by Artemy Kolchinsky.

From the webpage:

For your mid-term hack-a-thons, you will be expected to quickly acquire, analyze and draw conclusion from some real-world datasets. The goal of this tutorial is to provide you with some tools that will hopefully enable you to spend less time debugging and more time generating and testing interesting ideas.

Here, I chose to focus on Python. It is a beautiful language that is quickly developing an ecosystem of powerful and free scientific computing and data mining tools (e.g. the Homogenization of scientific computing, or why Python is steadily eating other languages’ lunch). For this reason, as well as my own familiarity with it, I encourage (though certainly not require) you to use it for your mid-term hack-a-thons. From my own experience, getting comfortable with these tools will pay off in terms of making many future data analysis projects (including perhaps your final projects) easier & more enjoyable.

Just in time for the weekend! I first saw this in a tweet by Lynn Cherny.

Suggestions of odd data sources for mining?


May 1st, 2015

OPenn: Primary Digital Resources Available to All through Penn Libraries’ New Online Platform by Jessie Dummer.

From the post:

The Penn Libraries and the Schoenberg Institute for Manuscript Studies are thrilled to announce the launch of OPenn: Primary Resources Available to Everyone, a new website that makes digitized cultural heritage material freely available and accessible to the public. OPenn is a major step in the Libraries’ strategic initiative to embrace open data, with all images and metadata on this site available as free cultural works to be freely studied, applied, copied, or modified by anyone, for any purpose. It is crucial to the mission of SIMS and the Penn Libraries to make these materials of great interest and research value easy to access and reuse. The OPenn team at SIMS has been working towards launching the website for the past year. Director Will Noel’s original idea to make our Medieval and Renaissance manuscripts open to all has grown into a space where the Libraries can collaborate with other institutions who want to open their data to the world.

Images of the manuscripts are currently available on OPenn at full resolution, with derivatives also provided for easy reuse on the web. Downloading, whether several select images or the entire dataset, is easily accomplished by following instructions or recipes posted in the Technical Read Me on OPenn. The website is designed to be machine-readable, but easy for individuals to use, too.

Oh, the manuscripts themselves?

Licensing is a real treat:

All images and their contents from the Lawrence J. Schoenberg Collection are free of known copyright restrictions and in the public domain. See the Creative Commons Public Domain Mark page for more information on terms of use:

Unless otherwise stated, all manuscript descriptions and other cataloging metadata are ©2015 The University of Pennsylvania Libraries. They are licensed for use under a Creative Commons Attribution License version 4.0 (CC-BY-4.0):

For a description of the terms of use, see the Creative Commons Deed:

In substance and licensing, this is such a departure from academic societies that still consider comping travel and hotel rooms to be “fostering scholarship.” “Ye shall know them by their fruits.” (Matthew 7:16)

Practical Text Analysis using Deep Learning

May 1st, 2015

Practical Text Analysis using Deep Learning by Michael Fire.

From the post:

Deep Learning has become a household buzzword these days, and I have not stopped hearing about it. In the beginning, I thought it was another rebranding of Neural Network algorithms or a fad that will fade away in a year. But then I read Piotr Teterwak’s blog post on how Deep Learning can be easily utilized for various image analysis tasks. A powerful algorithm that is easy to use? Sounds intriguing. So I decided to give it a closer look. Maybe it will be a new hammer in my toolbox that can later assist me to tackle new sets of interesting problems.

After getting up to speed on Deep Learning (see my recommended reading list at the end of this post), I decided to try Deep Learning on NLP problems. Several years ago, Professor Moshe Koppel gave a talk about how he and his colleagues succeeded in determining an author’s gender by analyzing his or her written texts. They also released a dataset containing 681,288 blog posts. I found it remarkable that one can infer various attributes about an author by analyzing the text, and I’ve been wanting to try it myself. Deep Learning sounded very versatile. So I decided to use it to infer a blogger’s personal attributes, such as age and gender, based on the blog posts.

If you haven’t gotten into deep learning, here’s another opportunity focused on natural language processing. You can follow Michael’s general directions to learn on your own or follow more detailed instructions in his IPython notebook.


Point-of-Sale (PoS) RAM Scrapers (And Security Incentives)

May 1st, 2015

This graphic speaks volumes about Point-of-Sale (PoS) systems:


From Defending Against PoS RAM Scrapers: Current Strategies and Next-Gen Technologies by Trend Micro.

All is not entirely lost. PoS RAM Scraper Malware: Past, Present, and Future by Numaan Huq, also of Trend Micro.

If you want to understand PoS RAM scrapers at a deeper level than “malware, bad,” this report should meet your needs. It runs ninety-three (93) pages with seventy (70) references.

In terms of security policy to encourage better cybersecurity, losses from bugs in no longer supported software should not be eligible for insurance coverage as business losses, nor tax deductible.

Baltimore Burning and Verification

April 30th, 2015

Baltimore ‘looting’ tweets show importance of quick and easy image checks by Eoghan mac Suibhne.

From the post:

Anyone who has ever asked me for tips on content verification and debunking of fakes knows one of the first things I always mention is reverse image search. It’s one of the simplest and most powerful tools at your disposal. This week provided another good example of how overlooked it is.

Unrest in Baltimore, like any other dramatic event these days, created a surge of activity on social media. In the age of the selfie and ubiquitous cameras, many people have become compulsive chroniclers of all their activities — sometimes unwisely so.

Reactions ranged from shock and disgust to disbelief and amusement when a series of images started to circulate showing looters proudly displaying their ill-gotten gains. Not all, however, was as it seemed.

(emphasis in original)

I often get asked about the fundamentals of verification, and one of the first things I always mention is the ability — and indeed the reflex — to always perform a reverse image search. I also mention, only half-jokingly, that this should possibly even be added to the school curriculum. It’s not as if it would take up much of the school year; it can be taught in approximately 30 seconds.

In the case of the trashed KFC above, a quick check via Google reverse image search or Tineye showed that the photo was taken in Karachi, Pakistan, in 2012.


Don’t be confused by the “reverse image search” terminology. What you see on Google Images is the standard search box, which includes camera and microphone icons. Choose the camera icon and you will be given the opportunity to search using an image. Paste in an image URL and search. Simple as that.

Imagine describing a standard Google search as a “Google reverse word search.” Confusion and hilarity would ensue pretty quickly.

Develop a habit of verification.

You will have fewer occasions to say, “That’s my opinion and I am entitled to it,” in the face of contrary evidence.

The NYT and Your Security Guardians At Work

April 30th, 2015

Mark Liberman, in R.I.P. Jack Ely, quotes rather extensively from Sam Roberts, “Jack Ely, Who Sang the Kingsmen’s ‘Louie Louie’, Dies at 71“, NYT 4/29/2015, which includes this snippet:

High school and college students who thought they understood what Mr. Ely was singing traded transcripts of their meticulously researched translations of the lyrics. The F.B.I. began investigating after an Indiana parent wrote to Attorney General Robert F. Kennedy in 1964: “My daughter brought home a record of ‘LOUIE LOUIE’ and I, after reading that the record had been banned on the air because it was obscene, proceeded to try to decipher the jumble of words. The lyrics are so filthy that I cannot enclose them in this letter.”

The F.B.I. Laboratory’s efforts at decryption were less fruitful. After more than two years and a 455-page report, the bureau concluded that “three governmental agencies dropped their investigations because they were unable to determine what the lyrics of the song were, even after listening to the records at speeds ranging from 16 r.p.m. to 78 r.p.m.”

It is true that Louie Louie was recorded by the Kingsmen, with Jack Ely as lead singer. It is also true that the FBI, which currently protects you from domestic terrorists and emotionally disturbed teenagers, did an obscenity investigation of the song, but they concluded the lyrics were incomprehensible.

Where the NYT drops the ball is in attributing a 455-page report to the FBI. You can view the FBI report at: FBI Records: The Vault, under SUBJECT: LOUIE, LOUIE (THE 60’s SONG).

Like the Internet of Things, PDF viewers don’t lie and the page count for the FBI report comes to one hundred and nineteen (119) pages. Of course, the NYT did not have a link to the FBI report or else one of its proofreaders could have verified that claim.

The lack of accuracy doesn’t impact the story, except the NYT doesn’t share where it saw the 455-page report from the FBI. Anything is possible and there may be such a report. But without a hyperlink, you know, those things that point to locations on the web, we won’t ever know.

What does the NYT gain by not gracing its readers with links to original materials? There are numerous NYT articles that do, so you have to wonder why it doesn’t happen in all cases?

Suggested rule for New York Times reporters: If you cite a publicly available document or written statement, include a link to the original at the first mention of the document or statement in your story. (Some of us want to know more than will fit into your story.)

What Should Remain After Retraction?

April 30th, 2015

Antony Williams asks in a tweet:

If a paper is retracted shouldn’t it remain up but watermarked PDF as retracted? More than this?

Here is what you get instead of the front page:


A retraction should appear in bibliographic records maintained by the publisher as well as on any online version maintained by the publisher.

The Journal of the American Chemical Society (JACS) method of retraction, removal of the retracted content:

  • Presents a false view of the then current scientific context. Prior to retraction such an article is part of the overall scientific context in a field. Editing that context post-publication is historical revisionism at its worst.
  • Interrupts the citation chain of publications cited in the retracted publication.
  • Leaves dangling citations of the retracted publication in later publications.
  • Places authors who cited the retracted publication in an untenable position. Their citations of a retracted work are suspect with no opportunity to defend their citations.
  • Falsifies the memories of every reader who read the retracted publication. They cannot search for and retrieve that paper in order to revisit an idea, process or result sparked by the retracted publication.
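
The citation-chain points above can be sketched with a toy citation graph. This is only an illustration under my own assumptions: the paper names are hypothetical and the graph is a plain dictionary, not any publisher’s actual data model. Removing the retracted paper leaves a dangling citation and cuts the path to earlier work; keeping it with a retraction flag preserves both.

```python
# Hypothetical papers: each key cites the papers in its list.
citations = {
    "later_paper": ["retracted_paper"],
    "retracted_paper": ["earlier_paper"],
    "earlier_paper": [],
}


def remove_paper(graph, paper):
    """Model retraction by removal: the record simply disappears."""
    return {p: list(targets) for p, targets in graph.items() if p != paper}


def dangling_citations(graph):
    """Citations that point at a paper no longer present in the graph."""
    return sorted((src, dst)
                  for src, targets in graph.items()
                  for dst in targets
                  if dst not in graph)


def reachable(graph, start):
    """Papers that can still be reached by following citations from start."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for target in graph.get(node, []):
            if target in graph and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


# Retraction by removal: the chain is broken.
removed = remove_paper(citations, "retracted_paper")
print(dangling_citations(removed))
print(reachable(removed, "later_paper"))

# Retraction by marking: the record stays, flagged as retracted.
retracted = {"retracted_paper"}
print(dangling_citations(citations))
print(reachable(citations, "later_paper"), "retracted:", retracted)
```

With the watermark approach, a reader of the later paper can still follow the chain back to the earlier work and can see for themselves that a link in that chain was retracted.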

Sound off to: Antony Williams (@ChemConnector) and @RetractionWatch

Let’s leave the creation of false histories to professionals, such as politicians.

New Survey Technique! Ask Village Idiots

April 30th, 2015

I was deeply disappointed to see Scientific Computing with the headline: ‘Avengers’ Stars Wary of Artificial Intelligence by Ryan Pearson.

The respondents are all talented movie stars, but acting talent and even celebrity don’t give them insight into issues such as artificial intelligence. You might as well ask football coaches about the radiation hazards of a possible mission to Mars. Football coaches, the winning ones anyway, are bright and intelligent folks, but as a class, aren’t the usual suspects to ask about interplanetary radiation hazards.

President Reagan was known to confuse movies with reality but that was under extenuating circumstances. Confusing people acting in movies with people who are actually informed on a subject doesn’t make for useful news reporting.

Asking Chris Hemsworth, who plays Thor in Avengers: Age of Ultron, what the residents of Asgard think about relief efforts for victims of the recent earthquake in Nepal would be as meaningful.

They still publish the National Enquirer. A much better venue for “surveys” of the uninformed.

Pwning a thin client in less than two minutes

April 30th, 2015

Pwning a thin client in less than two minutes by Roberto Suggi Liverani

From the post:

Have you ever encountered a zero client or a thin client? It looks something like this…


If yes, keep reading below, if not, then if you encounter one, you know what you can do if you read below…

The model above is a T520, produced by HP – this model and other similar models are typically employed to support a medium/large VDI (Virtual Desktop Infrastructure) enterprise.

These clients run a Linux-based HP ThinPro OS by default and I had a chance to play with image version T6X44017 in particular, which is fun to play with, since you can get a root shell in a very short time without knowing any password…

Normally, HP ThinPro OS interface is configured in a kiosk mode, as the concept of a thin/zero client is based on using a thick client to connect to another resource. For this purpose, a standard user does not need to authenticate to the thin client per se and would just need to perform a connection – e.g. VMware Horizon View. The user will eventually authenticate through the connection.

The point of this blog post is to demonstrate that a malicious actor can compromise such thin clients in a trivial and quick way provided physical access, a standard prerequisite in an attack against a kiosk.

During my testing, I have tried to harden as much as possible the thin client, with the following options:

Physical security is a commonly overlooked aspect of network security. That was true almost twenty (20) years ago when I was a Novell CNE and that hasn’t changed since. (Physical & Network Security: Better Together In 2014)

You don’t have to take my word for it. Take a walk around your office and see what network equipment or cables could be physically accessed for five minutes or less by any casual visitor. (Don’t forget unattended workstations.)

Don’t spend time and resources on popular “threats” such as China and North Korea when the pizza delivery guy can plug a wireless hub into an open Ethernet port inside your firewall. Yes?

For PR purposes the FBI would describe such a scheme as evidence of advanced networking and computer protocol knowledge. It may be from their perspective. ;-) It shouldn’t be from yours.