## Deep Learning (MIT Press Book) – Update

May 22nd, 2015

Deep Learning (MIT Press Book) by Yoshua Bengio, Ian Goodfellow and Aaron Courville.

I last mentioned this book last August. A new draft appeared on 19/05/2015.

Typos and opportunities for improvement still exist! Now is your chance to help the authors make this a great book!

Enjoy!

## The Unreasonable Effectiveness of Recurrent Neural Networks

May 22nd, 2015

The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy.

From the post:

There’s something magical about Recurrent Neural Networks (RNNs). I still remember when I trained my first recurrent network for Image Captioning. Within a few dozen minutes of training my first baby model (with rather arbitrarily-chosen hyperparameters) started to generate very nice looking descriptions of images that were on the edge of making sense. Sometimes the ratio of how simple your model is to the quality of the results you get out of it blows past your expectations, and this was one of those times. What made this result so shocking at the time was that the common wisdom was that RNNs were supposed to be difficult to train (with more experience I’ve in fact reached the opposite conclusion). Fast forward about a year: I’m training RNNs all the time and I’ve witnessed their power and robustness many times, and yet their magical outputs still find ways of amusing me. This post is about sharing some of that magic with you.

We’ll train RNNs to generate text character by character and ponder the question “how is that even possible?”

By the way, together with this post I am also releasing code on Github that allows you to train character-level language models based on multi-layer LSTMs. You give it a large chunk of text and it will learn to generate text like it one character at a time. You can also use it to reproduce my experiments below. But we’re getting ahead of ourselves; What are RNNs anyway?

I try to blog or reblog about worthy posts by others but every now and again, I encounter a post that is stunning in its depth and usefulness.

This post by Andrej Karpathy is one of the stunning ones.

In addition to covering RNNs in general, he takes the reader on a tour of “Fun with RNNs,” which covers the application of RNNs to:

• A Paul Graham generator
• Shakespeare
• Wikipedia
• Algebraic Geometry (Latex)
• Linux Source Code

Along with source code, Andrej provides a list of further reading.
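To make “generate text one character at a time” concrete, here is a minimal sketch of the sampling loop. It is not Andrej's released code, which trains multi-layer LSTMs. It is a single-layer vanilla RNN without biases, and its weights are random and untrained, so the output is gibberish. The class name `CharRNN`, the hidden size and the toy vocabulary are my own assumptions for illustration.

```python
import math
import random

def make_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a real model would learn these by training."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

class CharRNN:
    """Hypothetical toy model: one recurrent layer, no biases, no training."""

    def __init__(self, vocab, hidden_size=16, seed=0):
        self.vocab = vocab
        self.hidden_size = hidden_size
        self.rng = random.Random(seed)
        n = len(vocab)
        self.w_xh = make_matrix(hidden_size, n, self.rng)            # input -> hidden
        self.w_hh = make_matrix(hidden_size, hidden_size, self.rng)  # hidden -> hidden
        self.w_hy = make_matrix(n, hidden_size, self.rng)            # hidden -> output

    def step(self, char_index, h):
        """One time step: new hidden state and next-character probabilities."""
        x = [0.0] * len(self.vocab)
        x[char_index] = 1.0  # one-hot encoding of the current character
        pre = [a + b for a, b in zip(matvec(self.w_xh, x), matvec(self.w_hh, h))]
        h_new = [math.tanh(p) for p in pre]
        return h_new, softmax(matvec(self.w_hy, h_new))

    def sample(self, seed_char, length):
        """Feed each sampled character back in as the next input."""
        h = [0.0] * self.hidden_size
        idx = self.vocab.index(seed_char)
        out = [seed_char]
        for _ in range(length):
            h, probs = self.step(idx, h)
            idx = self.rng.choices(range(len(self.vocab)), weights=probs)[0]
            out.append(self.vocab[idx])
        return "".join(out)

vocab = sorted(set("hello world"))
model = CharRNN(vocab)
text = model.sample("h", 20)
print(text)
```

Training is what turns this loop from noise into Shakespeare or Linux source. The sampling mechanics, a hidden state carried forward and one character drawn per step, stay the same.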

What’s your example of using RNNs?

## Harvesting Listicles

May 22nd, 2015

From the post:

Copying tables or lists from a website is not only a painful and dull activity but it’s error prone and not easily reproducible. Thankfully there are packages in Python and R to automate the process. In a previous post we described using Python’s Beautiful Soup to extract information from web pages. In this post we take advantage of a new R package called rvest to extract addresses from an online list. We then use ggmap to geocode those addresses and create a Leaflet map with the leaflet package. In the interest of coding local, we opted to use, as the example, data on wineries and breweries here in the Finger Lakes region of New York.

Lists and listicles are a common form of web content. Unfortunately, both are difficult to improve without harvesting the content and recasting it.

This post will put you on the right track to harvesting with rvest!
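The post works in R with rvest. As a rough Python analogue, here is a minimal sketch that harvests list items using only the standard library's `html.parser`. The class `ListItemHarvester`, the function `harvest_list` and the sample HTML (with made-up names and addresses) are my own assumptions, not anything from the post.

```python
from html.parser import HTMLParser

class ListItemHarvester(HTMLParser):
    """Collects the text of every top-level <li> element, in document order.

    Text from nested <li> elements is folded into the enclosing item.
    """

    def __init__(self):
        super().__init__()
        self.items = []
        self._depth = 0
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._depth += 1
            if self._depth == 1:
                self._buffer = []  # start collecting a new item

    def handle_endtag(self, tag):
        if tag == "li" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace and line breaks to single spaces.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.items.append(text)

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)

def harvest_list(html):
    """Return the cleaned text of each list item found in html."""
    parser = ListItemHarvester()
    parser.feed(html)
    parser.close()
    return parser.items

# Made-up sample markup standing in for a real listicle page.
SAMPLE_HTML = """
<ul>
  <li><b>Example Winery</b>, 1 Lake Road, Sampleville, NY</li>
  <li><b>Example Brewery</b>, 2 Main Street,
      Testtown, NY</li>
</ul>
"""

rows = harvest_list(SAMPLE_HTML)
for row in rows:
    print(row)
```

Real pages are messier than this sample, so expect to add selection logic for the particular list you are after.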

BTW, as a benefit to others, post data that you clean/harvest in a clean format. Yes?

## First experiments with Apache Spark at Snowplow

May 22nd, 2015

First experiments with Apache Spark at Snowplow by Justin Courty.

From the post:

As we talked about in our May post on the Spark Example Project release, at Snowplow we are very interested in Apache Spark for three things:

1. Data modeling i.e. applying business rules to aggregate up event-level data into a format suitable for ingesting into a business intelligence / reporting / OLAP tool
2. Real-time aggregation of data for real-time dashboards
3. Running machine-learning algorithms on event-level data

We’re just at the beginning of our journey getting familiar with Apache Spark. I’ve been using Spark for the first time over the past few weeks. In this post I’ll share back with the community what I’ve learnt, and will cover:

I’ve tried to write the post in a way that’s easy to follow-along for other people interested in getting up the Spark learning curve.

What a great post to find just before the weekend!

You will enjoy this one and others in this series.

Have you ever considered aggregation into a business dashboard to include what is known about particular subjects? We have all seen the dashboards with increasing counts, graphs, charts, etc. but what about non-tabular data?

A non-tabular dashboard?

## Rosetta’s Way Back to the Source

May 22nd, 2015

From the webpage:

The Rosetta project, funded by the EU in the form of an ERC grant, aims to develop techniques to enable reverse engineering of complex software that is available only in binary form. To the best of our knowledge we are the first to start working on a comprehensive and realistic solution for recovering the data structures in binary programs (which is essential for reverse engineering), as well as techniques to recover the code. The main success criterion for the project will be our ability to reverse engineer a realistic, complex binary. Additionally, we will show the immediate usefulness of the information that we extract from the binary code (that is, even before full reverse engineering), by automatically hardening the software to make it resilient against memory corruption bugs (and attacks that exploit them).

In the Rosetta project, we target common processors like the x86, and languages like C and C++ that are difficult to reverse engineer, and we aim for full reverse engineering rather than just decompilation (which typically leaves out data structures and semantics). However, we do not necessarily aim for fully automated reverse engineering (which may well be impossible in the general case). Rather, we aim for techniques that make the process straightforward. In short, we will push reverse engineering towards ever more complex programs.

Our methodology revolves around recovering data structures, code and semantic information iteratively. Specifically, we will recover data structures not so much by statically looking at the instructions in the binary program (as others have done), but mainly by observing how the data is used

Research question. The project addresses the question whether the compilation process that translates source code to binary code is irreversible for complex software. Irreversibility of compilation is an assumed property that underlies most of the commercial software today. Specifically, the project aims to demonstrate that the assumption is false.

Herman gives a great thumbnail sketch of the difficulties and potential for this project.

Looking forward to news of a demonstration that “irreversibility of compilation” is false.

One important use case is verifying that software claiming to use buffer overflow prevention techniques has in fact done so. Not the sort of thing I would entrust to statements in marketing materials.

## MINIX 3

May 22nd, 2015

MINIX 3

MINIX 3 is a free open-source operating system that can be used for studying operating systems, as a base for research projects, or for commercial (embedded) systems where microkernel systems dominate the market. Much of the focus on the project is on achieving high reliability through fault tolerance and self-healing techniques.

MINIX is based on a small (about 12K lines of code) microkernel that runs in kernel mode. The rest of the operating system runs as a collection of server processes, each one protected by the hardware MMU. These processes include the virtual file system, one or more actual file systems, the memory manager, the process manager, the reincarnation server, and the device drivers, each one running as a separate user-mode process.

One consequence of this design is that failures of the system due to bugs or attacks are isolated. For example, a failure or takeover of the audio driver due to a bug or exploit can lead to strange sounds but cannot lead to a full takeover of the operating system. Similarly, crashes of a system component can in many cases be automatically and transparently recovered without human intervention. Few, if any, other operating systems are as self-healing as MINIX 3.

Are you still using an insecure and bug-riddled OS?

## Authorea

May 22nd, 2015

Authorea is the collaborative typewriter for academia.

From the website:

Write on the web.
Writing a scientific article should be as easy as writing a blog post. Every document you create becomes a beautiful webpage, which you can share.
Collaborate.
More and more often, we write together. A recent paper coauthored on Authorea by a CERN collaboration counts over 200 authors. If we can solve collaboration for CERN, we can solve it for you too!
Version control.
Authorea uses Git, a robust versioning control system to keep track of document changes. Every edit you and your colleagues make is recorded and can be undone at any time.
Use many formats.
Authorea lets you write in LaTeX, Markdown, HTML, Javascript, and more. Different coauthors, different formats, same document.
Data-rich science.
Did you ever wish you could share with your readers the data behind a figure? Authorea documents can take data alongside text and images, such as IPython notebooks and d3.js plots to make your articles shine with beautiful data-driven interactive visualizations.

Authorea uses a gentle form of open source persuasion. You can have one (1) private article for free but unlimited public articles. As your monthly rate goes up, you can have an increased number of private articles. Works for me because most if not all of my writing/editing is destined to be public anyway.

Standards are most useful when they are writ LARGE so private or “secret” standards have never made sense to me.

## Sharif University

May 22nd, 2015

Treadstone 71 continues to act as an unpaid (so far as I know) advertising agent for Sharif University.

From the university homepage:

Sharif University of Technology is one of the largest engineering schools in the Islamic Republic of Iran. It was established in 1966 under the name of Aryamehr University of Technology and, at that time, there were 54 faculty members and a total of 412 students who were selected by national examination. In 1980, the university was renamed Sharif University of Technology. SUT now has a total of 300 full-time faculty members, approximately 430 part-time faculty members and a student body of about 12,000.

Treadstone 71 has published course notes from an Advanced Network Security course on honeypots.

There are many documents available on honeypot detection. Not too many are found as a Master’s course at University levels. Sharif University as part of the Iranian institutionalized efforts to build a cyber warfare capability for the government in conjunction with AmnPardaz, Ashiyane, and shadowy groups such as Ajax and the Iranian Cyber Army is highly focused on such an endeavor. With funding coming from the IRGC, infiltration of classes and as members of academia with Basij members, Sharif University is the main driver of information security and cyber operations in Iran. Below is another of many such examples. Honeypots and how to detect them is available for your review.

It is difficult to find a Master’s degree in CS that doesn’t include coursework on network security in general and honeypots in particular. I spot checked some of the degrees offered by schools listed at: Best Online Master’s Degrees in Computer Science and found no shortage of information on honeypots.

I recognize the domestic (U.S.) political hysteria surrounding Iran but security decisions based on rumor and unfounded fears aren’t the best ones.

## Project Naptha

May 22nd, 2015

From the webpage:

Project Naptha automatically applies state-of-the-art computer vision algorithms on every image you see while browsing the web. The result is a seamless and intuitive experience, where you can highlight as well as copy and paste and even edit and translate the text formerly trapped within an image.

The homepage has examples of Project Naptha being used on comics, scans, photos, diagrams, Internet memes, screenshots, along with sneak peeks at beta features, such as translation, erase text (from images) and change text. (You can select multiple regions with the shift key.)

This should be especially useful for journalists, bloggers, researchers, basically anyone who spends a lot of time looking for content on the Web.

If the project needs a slogan, I would suggest:

Naptha Frees Information From Image Prisons!

## FBI Director Comey As Uninformed or Not Fair-Minded

May 21st, 2015

From the post:

Earlier today, FBI Director James Comey implied that a broad coalition of technology companies, trade associations, civil society groups, and security experts were either uninformed or were not “fair-minded” in a letter they sent to the President yesterday urging him to reject any legislative proposals that would undermine the adoption of strong encryption by US companies. The letter was signed by dozens of organizations and companies in the latest part of the debate over whether the government should be given built-in access to encrypted data (see, for example, here, here, here, and here for previous iterations).

The comments were made at the Third Annual Cybersecurity Law Institute held at Georgetown University Law Center. The transcript of his encryption-related discussion is below (emphasis added).

A group of tech companies and some prominent folks wrote a letter to the President yesterday that I frankly found depressing. Because their letter contains no acknowledgment that there are societal costs to universal encryption. Look, I recognize the challenges facing our tech companies. Competitive challenges, regulatory challenges overseas, all kinds of challenges. I recognize the benefits of encryption, but I think fair-minded people also have to recognize the costs associated with that. And I read this letter and I think, “Either these folks don’t see what I see or they’re not fair-minded.” And either one of those things is depressing to me. So I’ve just got to continue to have the conversation.

Governments have a long history of abusing citizens and data entrusted to them.

Director Comey is very uninformed if he is unaware of the role that Hollerith machines and census data played in the Holocaust.

Holland embraced Hollerith machines for its 1930’s census, with the same good intentions as FBI Director Comey for access to encrypted data.

France never made effective use of Hollerith machines prior to or during WWII.

What difference did that make?

| Country | Jewish Population | Deported | Murdered | Death Ratio |
|---|---|---|---|---|
| Holland | 140,000 | 107,000 | 102,000 | 73% |
| France | 300,000 to 350,000 | 85,000 | 82,000 | 25% |

(IBM and the Holocaust : the strategic alliance between Nazi Germany and America’s most powerful corporation by Edwin Black, at page 332.)

A difference of nearly fifty (50) percentage points in the effectiveness of government oppression due to technology sounds like a lot to me, bearing in mind that Comey wants to make government collection of data even more widespread and efficient.

If invoking the Holocaust seems like a reflex action, consider the Hollywood blacklist, the McCarthy era, government infiltration of the peace movement of the 1960’s, oppression of the Black Panther Party, long term discrimination against homosexuals, ongoing discrimination based on race, gender, religion, ethnicity, to name just a few instances of where government intrusion has been facilitated by its data gathering capabilities.

If anyone is being “not fair-minded” in the debate over encryption, it’s FBI Director Comey. The pages of history, past, recent and contemporary, bleed from the gathering and use of data by government at different levels.

Even making a huge leap of faith and saying the current government is acting in good faith in no way guarantees that a future government will do the same.

Strong encryption won’t save us from a corrupt government, but it may give us a fighting chance of a better government.

## Machine-Learning Algorithm Mines Rap Lyrics, Then Writes Its Own

May 21st, 2015

Machine-Learning Algorithm Mines Rap Lyrics, Then Writes Its Own

From the post:

The ancient skill of creating and performing spoken rhyme is thriving today because of the inexorable rise in the popularity of rapping. This art form is distinct from ordinary spoken poetry because it is performed to a beat, often with background music.

And the performers have excelled. Adam Bradley, a professor of English at the University of Colorado has described it in glowing terms. Rapping, he says, crafts “intricate structures of sound and rhyme, creating some of the most scrupulously formal poetry composed today.”

The highly structured nature of rap makes it particularly amenable to computer analysis. And that raises an interesting question: if computers can analyze rap lyrics, can they also generate them?

Today, we get an affirmative answer thanks to the work of Eric Malmi at the University of Aalto in Finland and a few pals. These guys have trained a machine-learning algorithm to recognize the salient features of a few lines of rap and then choose another line that rhymes in the same way on the same topic. The result is an algorithm that produces rap lyrics that rival human-generated ones for their complexity of rhyme.

The review is a fun read but I rather like the original paper title as well: DopeLearning: A Computational Approach to Rap Lyrics Generation by Eric Malmi, Pyry Takala, Hannu Toivonen, Tapani Raiko, Aristides Gionis.

Abstract:

Writing rap lyrics requires both creativity, to construct a meaningful and an interesting story, and lyrical skills, to produce complex rhyme patterns, which are the cornerstone of a good flow. We present a method for capturing both of these aspects. Our approach is based on two machine-learning techniques: the RankSVM algorithm, and a deep neural network model with a novel structure. For the problem of distinguishing the real next line from a randomly selected one, we achieve an 82 % accuracy. We employ the resulting prediction method for creating new rap lyrics by combining lines from existing songs. In terms of quantitative rhyme density, the produced lyrics outperform best human rappers by 21 %. The results highlight the benefit of our rhyme density metric and our innovative predictor of next lines.

You should also visit BattleBot (a rap engine):

BattleBot is a rap engine which allows you to “spit” any line that comes to your mind after which it will respond to you with a selection of rhyming lines found among 0.5 million lines from existing rap songs. The engine is based on a slightly improved version of the Raplyzer algorithm and the eSpeak speech synthesizer.

You can try out BattleBot simply by hitting “Spit” or “Random”. The latter will randomly pick a line among the whole database of lines and find the rhyming lines for that. The underlined part shows approximately the rhyming part of a result. To understand better, why it’s considered as a rhyme, you can click on the result, see the phonetic transcriptions of your line and the result, and look for matching vowel sequences starting from the end.
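The “matching vowel sequences starting from the end” idea can be sketched in a few lines. This is a crude stand-in, not the Raplyzer algorithm: BattleBot compares phonetic transcriptions from eSpeak, while this sketch just compares vowel letters in the spelling, which is an assumption on my part and will misjudge many English words. The function names and sample lines are hypothetical.

```python
VOWELS = set("aeiouy")

def vowel_sequence(line):
    """Vowel letters of a line in order, ignoring everything else."""
    return [ch for ch in line.lower() if ch in VOWELS]

def rhyme_length(line_a, line_b):
    """Count matching vowels, comparing from the end of both lines."""
    a, b = vowel_sequence(line_a), vowel_sequence(line_b)
    count = 0
    while count < min(len(a), len(b)) and a[-1 - count] == b[-1 - count]:
        count += 1
    return count

def best_rhymes(query, candidates, top=3):
    """Rank candidate lines by how long their vowel match with query is."""
    ranked = sorted(candidates, key=lambda c: rhyme_length(query, c), reverse=True)
    return ranked[:top]

# Made-up lines, not taken from any rap database.
lines = ["green tree", "broken bone", "holding phone"]
for line in best_rhymes("rolling stone", lines):
    print(rhyme_length("rolling stone", line), line)
```

Swapping the spelling-based `vowel_sequence` for a real phonetic transcription is the step that would bring this closer to what BattleBot shows when you click on a result.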

BTW, the MIT review concludes with:

What’s more, this and other raps generated by DeepBeat have a rhyming density significantly higher than any human rapper. “DeepBeat outperforms the top human rappers by 21% in terms of length and frequency of the rhymes in the produced lyrics,” they point out.

I can’t help but wonder when DeepBeat is going to hit the charts!

## Top 10 data mining algorithms in plain English

May 21st, 2015

Top 10 data mining algorithms in plain English by Raymond Li.

From the post:

Today, I’m going to explain in plain English the top 10 most influential data mining algorithms as voted on by 3 separate panels in this survey paper.

Once you know what they are, how they work, what they do and where you can find them, my hope is you’ll have this blog post as a springboard to learn even more about data mining.

What are we waiting for? Let’s get started!

Raymond covers:

Would be nice if we all had a similar ability to explain algorithms!

Enjoy!

## Social Media, Financial Algorithms and the Hack Crash

May 21st, 2015

Social Media, Financial Algorithms and the Hack Crash by Tero Karppi and Kate Crawford.

Abstract:

‘@AP: Breaking: Two Explosions in the White House and Barack Obama is injured’. So read a tweet sent from a hacked Associated Press Twitter account @AP, which affected financial markets, wiping out $136.5 billion of the Standard & Poor’s 500 Index’s value. While the speed of the Associated Press hack crash event and the proprietary nature of the algorithms involved make it difficult to make causal claims about the relationship between social media and trading algorithms, we argue that it helps us to critically examine the volatile connections between social media, financial markets, and third parties offering human and algorithmic analysis. By analyzing the commentaries of this event, we highlight two particular currents: one formed by computational processes that mine and analyze Twitter data, and the other being financial algorithms that make automated trades and steer the stock market. We build on sociology of finance together with media theory and focus on the work of Christian Marazzi, Gabriel Tarde and Tony Sampson to analyze the relationship between social media and financial markets. We argue that Twitter and social media are becoming more powerful forces, not just because they connect people or generate new modes of participation, but because they are connecting human communicative spaces to automated computational spaces in ways that are affectively contagious and highly volatile.

Social sciences lag behind the computer sciences in making their publications publicly accessible and still publish behind paywalls, so all I can report on is the abstract.

On the other hand, I’m not sure how much practical advice you could gain from the article as opposed to the volumes of commentary following the incident itself.

The research reminds me of Malcolm Gladwell, author of The Tipping Point and similar works.

While I have greatly enjoyed several of Gladwell’s books, including The Tipping Point, it is one thing to look back and say: “Look, there was a tipping point.” It is quite another to be in the present and successfully say: “Look, there is a tipping point and we can make it tip this way or that.”

In retrospect, we all credit ourselves with near omniscience when our plans succeed and we invent fanciful explanations about what we knew or realized at the time. Others, equally skilled, dedicated and competent, who started at the same time, did not succeed. Of course, the conservative media (and ourselves if we are honest), invent narratives to explain those outcomes as well.

Of course, deliberate manipulation of the market with false information, via Twitter or not, is illegal. The best you can do is look for a pattern of news and/or tweets that result in downward changes in a particular stock, which then recovers, and then apply that pattern more broadly. You won’t make $millions off of any one transaction but that is the sort of thing that draws regulatory attention.

## LogJam – Postel’s Law In Action

May 21st, 2015

The seriousness of the LogJam vulnerability was highlighted by John Leyden in Average enterprise ‘using 71 services vulnerable to LogJam’:

Based on analysis of 10,000 cloud applications and data from more than 17 million global cloud users, cloud visibility firm Skyhigh Networks reckons that 575 cloud services are potentially vulnerable to man-in-the-middle attacks. The average company uses 71 potentially vulnerable cloud services.

[Details from Skyhigh Networks]

The historical source of LogJam?

James Maude, security engineer at Avecto, said that the LogJam flaw shows how internet regulations and architecture decisions made more than 20 years ago are continuing to throw up problems.

“The LogJam issue highlights how far back the long tail of security stretches,” Maude commented. “As new technologies emerge and cryptography hardens, many simply add on new solutions without removing out-dated and vulnerable technologies. This effectively undermines the security model you are trying to build. Several recent vulnerabilities such as POODLE and FREAK have harnessed this type of weakness, tricking clients into using old, less secure forms of encryption,” he added.

Graham Cluley in Logjam vulnerability – what you need to know has better coverage of the history of weak encryption that resulted in the LogJam vulnerability.

What does that have to do with Postel’s Law?

TCP implementations should follow a general principle of robustness: be conservative in what you do, be liberal in what you accept from others. [RFC761]

As James Maude noted earlier:

As new technologies emerge and cryptography hardens, many simply add on new solutions without removing out-dated and vulnerable technologies.

Probably not what Postel intended at the time, but certainly more “robust” in one sense of the word: technologies remain compatible with other technologies that use vulnerable technologies.

In other words, robustness is responsible for the maintenance of weak encryption and hence the current danger from LogJam.
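Here is a deliberately simplified sketch of how “liberal in what you accept” keeps weak encryption alive. It is not the TLS handshake and not the actual LogJam attack, which is more subtle. The suite labels, the `STRENGTH` ranking and the `downgrade` helper are hypothetical names I made up for illustration.

```python
# Hypothetical suite labels with a made-up strength ranking (higher is stronger).
STRENGTH = {"ECDHE": 3, "DHE_2048": 2, "DHE_EXPORT_512": 1}
KNOWN_WEAK = {"DHE_EXPORT_512"}

def negotiate(client_offers, server_supports, strict):
    """Pick the strongest suite both sides support.

    strict=False mimics "be liberal in what you accept": weak suites stay
    enabled for compatibility. strict=True removes known-weak suites first.
    """
    allowed = set(server_supports)
    if strict:
        allowed -= KNOWN_WEAK
    common = [suite for suite in client_offers if suite in allowed]
    if not common:
        return None  # refuse the connection rather than fall back
    return max(common, key=STRENGTH.__getitem__)

def downgrade(client_offers):
    """A man-in-the-middle that strips everything but weak suites from the offer."""
    return [suite for suite in client_offers if suite in KNOWN_WEAK]

client = ["ECDHE", "DHE_2048", "DHE_EXPORT_512"]
server = ["ECDHE", "DHE_2048", "DHE_EXPORT_512"]

honest = negotiate(client, server, strict=False)
attacked_liberal = negotiate(downgrade(client), server, strict=False)
attacked_strict = negotiate(downgrade(client), server, strict=True)

print(honest)            # strongest common suite
print(attacked_liberal)  # the weak suite wins because it is still accepted
print(attacked_strict)   # None: the strict server refuses
```

The liberal server is the more “robust” one in Postel's sense, and it is also the one an attacker can talk down to weak encryption.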

This isn’t an entirely new idea. Eric Allman (Sendmail), warns of security issues with Postel’s Law in The Robustness Principle Reconsidered: Seeking a middle ground:

In 1981, Jon Postel formulated the Robustness Principle, also known as Postel’s Law, as a fundamental implementation guideline for the then-new TCP. The intent of the Robustness Principle was to maximize interoperability between network service implementations, particularly in the face of ambiguous or incomplete specifications. If every implementation of some service that generates some piece of protocol did so using the most conservative interpretation of the specification and every implementation that accepted that piece of protocol interpreted it using the most generous interpretation, then the chance that the two services would be able to talk with each other would be maximized. Experience with the Arpanet had shown that getting independently developed implementations to interoperate was difficult, and since the Internet was expected to be much larger than the Arpanet, the old ad-hoc methods needed to be enhanced.

Although the Robustness Principle was specifically described for implementations of TCP, it was quickly accepted as a good proposition for implementing network protocols in general. Some have applied it to the design of APIs and even programming language design. It’s simple, easy to understand, and intuitively obvious. But is it correct?

For many years the Robustness Principle was accepted dogma, failing more when it was ignored rather than when practiced. In recent years, however, that principle has been challenged. This isn’t because implementers have gotten more stupid, but rather because the world has become more hostile. Two general problem areas are impacted by the Robustness Principle: orderly interoperability and security.

Eric doesn’t come to a definitive conclusion with regard to Postel’s Law but the general case is always difficult to decide.

However, the specific case, supporting encryption known to be vulnerable, shouldn’t be.

If there were a programming principles liability checklist, one of the tick boxes should read:

___ Supports (list of encryption schemes), Date:_________

Lawyers doing discovery can compare lists of known vulnerabilities as of the date given for liability purposes.

Programmers would be on notice that supporting encryption with known vulnerabilities is opening the door to legal liability.
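The comparison a lawyer would make against such a checklist could be sketched as follows. The scheme labels are my own shorthand, and the dates are what I understand to be the public disclosure dates of POODLE, FREAK and LogJam. Treat both as assumptions to verify, not as an authoritative vulnerability database.

```python
from datetime import date

# Assumed public disclosure dates; verify before relying on them.
KNOWN_VULNERABLE = {
    "SSLv3 (POODLE)": date(2014, 10, 14),
    "RSA_EXPORT (FREAK)": date(2015, 3, 3),
    "DHE_EXPORT (LogJam)": date(2015, 5, 20),
}

def known_vulnerable_as_of(supported, declared_on):
    """Schemes on the checklist whose weakness was public by declared_on."""
    return sorted(
        scheme
        for scheme in supported
        if scheme in KNOWN_VULNERABLE and KNOWN_VULNERABLE[scheme] <= declared_on
    )

# A hypothetical checklist entry: what the software supports, and when it said so.
checklist = ["TLS 1.2 ECDHE", "SSLv3 (POODLE)", "DHE_EXPORT (LogJam)"]

flagged_early = known_vulnerable_as_of(checklist, date(2015, 1, 1))
flagged_late = known_vulnerable_as_of(checklist, date(2015, 6, 1))

print(flagged_early)
print(flagged_late)
```

The same checklist draws more flags the later its date, which is the point: the date fixes what the programmer could have known.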

## Format String Bug Exploration

May 20th, 2015

Format String Bug Exploration by AJ Kumar.

From the post:

Abstract

The Format String vulnerability significantly introduced in year 2000 when remote hackers gain root access on host running FTP daemon which had anonymous authentication mechanism. This was an entirely new tactics of exploitation the common programming glitches behind the software, and now this deadly threat for the software is everywhere because programmers inadvertently used to make coding loopholes which are targeting none other than Format string attack. The format string vulnerability is an implication of misinterpreting the stack for handling functions with variable arguments especially in Printf function, since this article demonstrates this subtle bug in C programming context on windows operating system. Although, this class of bug is not operating system–specific as with buffer overflow attacks, you can detect vulnerable programs for Mac OS, Linux, and BSD. This article drafted to delve deeper at what format strings are, how they are operate relative to the stack, as well as how they are manipulated in the perspective of C programming language.

Essentials

To be cognizance with the format string bug explained in this article, you will require to having rudimentary knowledge of the C family of programming languages, as well as a basic knowledge of IA32 assembly over window operating system, by mean of visual studio development editor. Moreover, know-how about ‘buffer overflow’ exploitation will definitely add an advantage.

Format String Bug

The format string bug was first explained in June 2000 in a renowned journal. This notorious exploitation tactics enable a hacker to subvert memory stack protections and allow altering arbitrary memory segments by unsolicited writing over there. Overall, the sole cause behind happening is not to handle or properly validated the user-supplied input. Just blindly trusting the used supplied arguments that eventually lead to disaster. Subsequently, when hacker controls arguments of the Printf function, the details in the variable argument lists enable him to analysis or overwrite arbitrary data. The format string bug is unlike buffer overrun; in which no memory stack is being damaged, as well as any data are being corrupted at large extents. Hackers often execute this attack in context of disclosing or retrieving sensitive information from the stack for instance pass keys, cryptographic privates keys etc.

Now the curiosity around here is how exactly the hackers perform this deadly attack. Consider a program where we are trying to produce some string as “kmaraj” over the screen by employing the simple C language library Printf method as;

A bit deeper than most of my posts on bugs but the lesson isn’t just the bug, but that it has persisted for going on fifteen (15) years now.

As a matter of fact, Karl Chen and David Wagner in Large-Scale Analysis of Format String Vulnerabilities in Debian Linux (2007) found:

We successfully analyze 66% of C/C++ source packages in the Debian 3.1 Linux distribution. Our system finds 1,533 format string taint warnings. We estimate that 85% of these are true positives, i.e., real bugs; ignoring duplicates from libraries, about 75% are real bugs.

We suggest that the technology exists to render format string vulnerabilities extinct in the near future. (emphasis added)

“…[N]ear future?” Maybe not, because Mathias Payer and Thomas R. Gross report in 2013, in String Oriented Programming: When ASLR is not Enough:

One different class of bugs has not yet received adequate attention in the context of DEP, stack canaries, and ASLR: format string vulnerabilities. If an attacker controls the first parameter to a function of the printf family, the string is parsed as a format string. Using such a bug and special format markers result in arbitrary memory writes. Existing exploits use format string vulnerabilities to mount stack or heap-based code injection attacks or to set up return oriented programming. Format string vulnerabilities are not a vulnerability of the past but still pose a significant threat (e.g., CVE-2012-0809 reports a format string bug in sudo and allows local privilege escalation; CVE-2012-1152 reports multiple format string bugs in perl-YAML and allows remote exploitation, CVE-2012-2369 reports a format string bug in pidgin-otr and allows remote exploitation) and usually result in full code execution for the attacker.
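The bug class described here belongs to C's printf family, where the usual fix is to pass user input as an argument (`printf("%s", input)`) rather than as the format string. As a loose analogue only, here is a sketch of the same mistake in Python, where user text becomes part of a `str.format` template. It leaks data but does not give the arbitrary memory writes described for C. The `Account` class, function names and key value are hypothetical.

```python
class Account:
    """Hypothetical object passed to the formatter."""

    def __init__(self, user, api_key):
        self.user = user
        self.api_key = api_key  # should never reach the user

def greet_unsafe(user_text, account):
    # BUG: user_text becomes part of the format string itself, so any
    # replacement fields it contains are evaluated against account.
    template = "Hello " + user_text + ", logged in as {0.user}"
    return template.format(account)

def greet_safe(user_text, account):
    # The format string is a constant; user_text is only ever data.
    return "Hello {}, logged in as {}".format(user_text, account.user)

acct = Account("kmaraj", "s3cr3t-key")

normal = greet_unsafe("kmaraj", acct)
leaked = greet_unsafe("{0.api_key}", acct)
harmless = greet_safe("{0.api_key}", acct)

print(normal)
print(leaked)    # the secret attribute is disclosed
print(harmless)  # the braces are printed literally
```

Both functions behave the same on ordinary input, which is part of why this kind of bug survives testing.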

Should I assume that in computer literature six (6) years doesn’t qualify as the “…near future?”

Would liability for format string bugs result in greater effort to avoid them?

Hard to say in the abstract, but the results could hardly be worse than fifteen (15) years of format string bugs.

Not to mention that liability would put the burden of avoiding the bug squarely on the shoulders of those best able to avoid it.

## Math for Journalists Made Easy:…

May 20th, 2015

Math for Journalists Made Easy: Understanding and Using Numbers and Statistics – Sign up now for new MOOC

From the post:

Journalists who squirm at the thought of data calculation, analysis and statistics can arm themselves with new reporting tools during the new Massive Open Online Course (MOOC) from the Knight Center for Journalism in the Americas: “Math for Journalists Made Easy: Understanding and Using Numbers and Statistics” will be taught from June 1 to 28, 2015.

“Math is crucial to things we do every day. From covering budgets to covering crime, we need to understand numbers and statistics,” said course instructor Jennifer LaFleur, senior editor for data journalism for the Center for Investigative Reporting, one of the instructors of the MOOC.

Two other instructors will be teaching this MOOC: Brant Houston, a veteran investigative journalist who is a professor and the Knight Chair in Journalism at the University of Illinois; and freelance journalists Greg Ferenstein, who specializes in the use of numbers and statistics in news stories.

The three instructors will teach journalists “how to be critical about numbers, statistics and research and to avoid being improperly swayed by biased researchers.” The course will also prepare journalists to relay numbers and statistics in ways that are easy for the average reader to understand.

“It is true that many of us became journalists because sometime in our lives we wanted to escape from mathematics, but it is also true that it has never been so important for journalists to overcome any fear or intimidation to learn about numbers and statistics,” said professor Rosental Alves, founder and director of the Knight Center. “There is no way to escape from math anymore, as we are nowadays surrounded by data and we need at least some basic knowledge and tools to understand the numbers.”

The MOOC will be taught over a period of four weeks, from June 1 to 28. Each week focuses on a particular topic taught by a different instructor. The lessons feature video lectures and are accompanied by readings, quizzes and discussion forums.

This looks excellent.

I will be looking forward to very tough questions of government and corporate statistical reports from anyone who takes this course.

## A Call for Collaboration: Data Mining in Cross-Border Investigations

May 20th, 2015

A Call for Collaboration: Data Mining in Cross-Border Investigations by Jonathan Stray and Drew Sullivan.

From the post:

Over the past few years we have seen the huge potential of data and document mining in investigative journalism. Tech savvy networks of journalists such as the Organized Crime and Corruption Reporting Project (OCCRP) and the International Consortium of Investigative Journalists (ICIJ) have teamed together for astounding cross-border investigations, such as OCCRP’s work on money laundering or ICIJ’s offshore leak projects. OCCRP has even incubated its own tools, such as VIS, Investigative Dashboard and Overview.

But we need to do better. There is enormous duplication and missed opportunity in investigative journalism software. Many small grants for technology development have led to many new tools, but very few have become widely used. For example, there are now over 70 tools just for social network analysis. There are other tools for other types of analysis, document handling, data cleaning, and on and on. Most of these are open source, and in various states of completeness, usability, and adoption. Developer teams lack critical capacities such as usability testing, agile processes, and business development for sustainability. Many of these tools are beautiful solutions in search of a problem.

The fragmentation of software development for investigative journalism has consequences: Most newsrooms still lack capacity for very basic knowledge management tasks, such as digitally filing new documents where they can be searched and found later. Tools do not work or do not inter-operate. Ultimately the reporting work is slower, or more expensive, or doesn’t get done. Meanwhile, the commercial software world has so far ignored investigative journalism because it is a small, specialized user-base. Tools like Nuix and Palantir are expensive, not networked, and not extensible for the inevitable story-specific needs.

But investigative journalists have learned how to work in cross-border networks, and investigative journalism developers can too. The experience gained from collaborative data-driven journalism has led OCCRP and other interested organizations to focus on the following issues:

The issues:

• Usability
• Delivery
• Networked Investigation
• Sustainability
• Interoperability and extensibility

The next step is reported to be:

The next step for us is a small meeting: the very first conference on Knowledge Management in Investigative Journalism. This event will bring together key developers and journalists to refine the problem definition and plan a way forward. OCCRP and the Influence Mappers project have already pledged support. Stay tuned…

Jonathan Stray (jonathanstray@gmail.com) and Drew Sullivan (drew@occrp.org) want to know if you are interested too.

See the original post, email Jonathan and Drew if you are interested. It sounds like a very good idea to me.

PS: You already know one of the technologies that I think is important for knowledge management: topic maps!

## H2O 3.0

May 20th, 2015

H2O 3.0

From the webpage:

Why H2O?

H2O is for data scientists and business analysts who need scalable and fast machine learning. H2O is an open source predictive analytics platform. Unlike traditional analytics tools, H2O provides a combination of extraordinary math and high performance parallel processing with unrivaled ease of use. H2O speaks the language of data science with support for R, Python, Scala, Java and a robust REST API. Smart business applications are powered by H2O’s NanoFast™ Scoring Engine.

Get H2O!

What is H2O?

H2O makes it possible for anyone to easily apply math and predictive analytics to solve today’s most challenging business problems. It intelligently combines unique features not currently found in other machine learning platforms including:

• Best of Breed Open Source Technology – Enjoy the freedom that comes with big data science powered by OpenSource technology. H2O leverages the most popular OpenSource products like Apache™ Hadoop® and Spark™ to give customers the flexibility to solve their most challenging data problems.
• Easy-to-use WebUI and Familiar Interfaces – Set up and get started quickly using either H2O’s intuitive Web-based user interface or familiar programming environments like R, Java, Scala, Python, JSON, and through our powerful APIs.
• Data Agnostic Support for all Common Database and File Types – Easily explore and model big data from within Microsoft Excel, R Studio, Tableau and more. Connect to data from HDFS, S3, SQL and NoSQL data sources. Install and deploy anywhere
• Massively Scalable Big Data Analysis – Train a model on complete data sets, not just small samples, and iterate and develop models in real-time with H2O’s rapid in-memory distributed parallel processing.
• Real-time Data Scoring – Use the Nanofast Scoring Engine to score data against models for accurate predictions in just nanoseconds in any environment. Enjoy 10X faster scoring and predictions than the next nearest technology in the market.

Note the caveat near the bottom of the page:

With H2O, you can:

• Make better predictions. Harness sophisticated, ready-to-use algorithms and the processing power you need to analyze bigger data sets, more models, and more variables.
• Get started with minimal effort and investment. H2O is an extensible open source platform that offers the most pragmatic way to put big data to work for your business. With H2O, you can work with your existing languages and tools. Further, you can extend the platform seamlessly into your Hadoop environments.

The operative word being “can.” Your results with H2O depend upon your knowledge of machine learning, knowledge of your data and the effort you put into using H2O, among other things.

May 20th, 2015

From the webpage:

On May 20, 2015, the ODNI released a sizeable tranche of documents recovered during the raid on the compound used to hide Usama bin Ladin. The release, which followed a rigorous interagency review, aligns with the President’s call for increased transparency–consistent with national security prerogatives–and the 2014 Intelligence Authorization Act, which required the ODNI to conduct a review of the documents for release.

The release contains two sections. The first is a list of non-classified, English-language material found in and around the compound. The second is a selection of now-declassified documents.

The Intelligence Community will be reviewing hundreds more documents in the near future for possible declassification and release. An interagency taskforce under the auspices of the White House and with the agreement of the DNI is reviewing all documents which supported disseminated intelligence cables, as well as other relevant material found around the compound. All documents whose publication will not hurt ongoing operations against al-Qa‘ida or their affiliates will be released.

From the website:

The one expected work missing from Bin Laden’s library?

Possession of the same books as Bin Laden will be taken as a sign of terrorist sympathies. Weed your collection responsibly.

## Political Futures Tracker

May 20th, 2015

From the webpage:

The Political Futures Tracker tells us the top political themes, how positive or negative people feel about them, and how far parties and politicians are looking to the future.

This software will use ground breaking language analysis methods to examine data from Twitter, party websites and speeches. We will also be conducting live analysis on the TV debates running over the next month, seeing how the public respond to what politicians are saying in real time. Leading up to the 2015 UK General Election we will be looking across the political spectrum for emerging trends and innovation insights.

If that sounds interesting, consider the following from: Introducing… the Political Futures Tracker:

We are exploring new ways to analyse a large amount of data from various sources. It is expected that both the amount of data and the speed that it is produced will increase dramatically the closer we get to election date. Using a semi-automatic approach, text analytics technology will sift through content and extract the relevant information. This will then be examined and analysed by the team at Nesta to enable delivery of key insights into hotly debated issues and the polarisation of political opinion around them.

The team at the University of Sheffield has extensive experience in the area of social media analytics and Natural Language Processing (NLP). Technical implementation has started already, firstly with data collection which includes following the Twitter accounts of existing MPs and political parties. Once party candidate lists become available, data harvesting will be expanded accordingly.

In parallel, we are customising the University of Sheffield’s General Architecture for Text Engineering (GATE); an open source text analytics tool, in order to identify sentiment-bearing and future thinking tweets, as well as key target topics within these.

One thing we’re particularly interested in is future thinking. We describe this as making statements concerning events or issues in the future. Given these measures and the views expressed by a certain person, we can model how forward thinking that person is in general, and on particular issues, also comparing this with other people. Sentiment, topics, and opinions will then be aggregated and tracked over time.

Personally I suspect that “future thinking” is used in different senses by the general population and political candidates. For a political candidate, however the rhetoric is worded, the “future” consists of reaching election day with 50% plus 1 vote. For the general population, the “future” probably includes a longer time span.

I mention this in case you can sell someone on the notion that what political candidates say today has some relevance to what they will do after the election. President Obama has been in office for six (6) years, yet the Guantanamo Bay detention camp remains open, no one has been held accountable for years of illegal spying on U.S. citizens, and banks and other corporate interests have all but been granted keys to the U.S. Treasury, to name a few items inconsistent with his previous “future thinking.”

Unless you accept my suggestion that “future thinking” for a politician means election day and no further.

## Analysis of named entity recognition and linking for tweets

May 20th, 2015

Abstract:

Applying natural language processing for mining and intelligent information access to tweets (a form of microblog) is a challenging, emerging research area. Unlike carefully authored news text and other longer content, tweets pose a number of new challenges, due to their short, noisy, context-dependent, and dynamic nature. Information extraction from tweets is typically performed in a pipeline, comprising consecutive stages of language identification, tokenisation, part-of-speech tagging, named entity recognition and entity disambiguation (e.g. with respect to DBpedia). In this work, we describe a new Twitter entity disambiguation dataset, and conduct an empirical analysis of named entity recognition and disambiguation, investigating how robust a number of state-of-the-art systems are on such noisy texts, what the main sources of error are, and which problems should be further investigated to improve the state of the art.

The questions addressed by the paper are:

RQ1 How robust are state-of-the-art named entity recognition and linking methods on short and noisy microblog texts?

RQ2 What problem areas are there in recognising named entities in microblog posts, and what are the major causes of false negatives and false positives?

RQ3 Which problems need to be solved in order to further the state-of-the-art in NER and NEL on this difficult text genre?

The ultimate conclusion is that entity recognition in microblog posts falls short of what has been achieved for newswire text, but if you need results now, or at least by tomorrow, this is a good guide to what is possible and where improvements can be made.

## Detecting Deception Strategies [Godsend for the 2016 Election Cycle]

May 20th, 2015

Discriminative Models for Predicting Deception Strategies by Scott Appling, Erica Briscoe, C.J. Hutto.

Abstract:

Although a large body of work has previously investigated various cues predicting deceptive communications, especially as demonstrated through written and spoken language (e.g., [30]), little has been done to explore predicting kinds of deception. We present novel work to evaluate the use of textual cues to discriminate between deception strategies (such as exaggeration or falsification), concentrating on intentionally untruthful statements meant to persuade in a social media context. We conduct human subjects experimentation wherein subjects were engaged in a conversational task and then asked to label the kind(s) of deception they employed for each deceptive statement made. We then develop discriminative models to understand the difficulty between choosing between one and several strategies. We evaluate the models using precision and recall for strategy prediction among 4 deception strategies based on the most relevant psycholinguistic, structural, and data-driven cues. Our single strategy model results demonstrate as much as a 58% increase over baseline (random chance) accuracy and we also find that it is more difficult to predict certain kinds of deception than others.

The deception strategies studied in this paper:

• Falsification
• Exaggeration
• Omission

especially omission, will form the bulk of the content in the 2016 election cycle in the United States. Only deceptive statements were included in the test data, so the models were tested on correctly recognizing the deception strategy in a known deceptive statement.

The test data is remarkably similar to political content, which, aside from the candidates’ names and the names of their opponents (mostly), is composed entirely of deceptive statements, albeit not marked for the strategy used in each one.

A web interface for loading pointers to video, audio or text with political content that emits tagged deception with pointers to additional information would be a real hit for the next U.S. election cycle. Monetize with ads, the sources of additional information, etc.

I first saw this in a tweet by Leon Derczynski.

## New Computer Bug Exposes Broad Security Flaws [Trust but Verify]

May 20th, 2015

From the post:

A dilemma this spring for engineers at big tech companies, including Google Inc., Apple Inc. and Microsoft Corp., shows the difficulty of protecting Internet users from hackers.

Internet-security experts crafted a fix for a previously undisclosed bug in security tools used by all modern Web browsers. But deploying the fix could break the Internet for thousands of websites.

“It’s a twitchy business, and we try to be careful,” said Richard Barnes, who worked on the problem as the security lead for Mozilla Corp., maker of the Firefox Web browser. “The question is: How do you come up with a solution that gets as much security as you can without causing a lot of disruption to the Internet?”

Engineers at browser makers traded messages for two months, ultimately choosing a fix that could make more than 20,000 websites unreachable. All of the browser makers have released updates including the fix or will soon, company representatives said.

No links or pointers to further resources.

The name of this new bug is “Logjam.”

I saw Jennifer’s story on Tuesday evening, about 19:45 ESDT, and tried to verify the story with some of the standard bug reporting services.

No “hits” at CERT, IBM’s X-Force, or the Internet Storm Center as of 20:46 ESDT on May 19, 2015.

The problem being that Jennifer did not include any links to any source that would verify the existence of this new bug. Not one.

The only story that kept popping up in searches was Jennifer’s.

So, I put this post to one side, returning to it this morning.

As of this morning, now about 6:55 ESDT, the Internet Storm Center returns:

Logjam – vulnerabilities in Diffie-Hellman key exchange affect browsers and servers using TLS by Brad Duncan, ISC Handler and Security Researcher at Rackspace, with a pointer to: The Logjam Attack, which reads in part:

We have published a technical report, Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice, which has specifics on these attacks, details on how we broke the most common 512-bit Diffie-Hellman Group, and measurements of who is affected. We have also published several proof of concept demos and a Guide to Deploying Diffie-Hellman for TLS

This study was performed by computer scientists at Inria Nancy-Grand Est, Inria Paris-Rocquencourt, Microsoft Research, Johns Hopkins University, University of Michigan, and the University of Pennsylvania: David Adrian, Karthikeyan Bhargavan, Zakir Durumeric, Pierrick Gaudry, Matthew Green, J. Alex Halderman, Nadia Heninger, Drew Springall, Emmanuel Thomé, Luke Valenta, Benjamin VanderSloot, Eric Wustrow, Santiago Zanella-Beguelin, and Paul Zimmermann. The team can be contacted at weak-dh@umich.edu.

As of 7:06 ESDT on May 20, 2015, neither CERT nor IBM’s X-Force returns any “hits” on “Logjam.”

It is one thing to “trust” a report of a bug, but please verify before replicating a story based upon insider gossip. Look for links to third-party materials, for example.

## Fighting Cybercrime at IBM

May 19th, 2015

15/05/15 – More than 1,000 Organizations Join IBM to Battle Cybercrime

From the post:

ARMONK, NY – 14 May 2015: IBM (NYSE: IBM) today announced that more than 1,000 organizations across 16 industries are participating in its X-Force Exchange threat intelligence network, just one month after its launch. IBM X-Force Exchange provides open access to historical and real-time data feeds of threat intelligence, including reports of live attacks from IBM’s global threat monitoring network, enabling enterprises to defend against cybercrime.

IBM’s new cloud-based cyberthreat network, powered by IBM Cloud, is designed to foster broader industry collaboration by sharing actionable data to defend against these very real threats to businesses and governments. The company provided free access last month, via the X-Force Exchange, to its 700 terabyte threat database – a volume equivalent to all data that flows across the internet in two days. This includes two decades of malicious cyberattack data from IBM, as well as anonymous threat data from the thousands of organizations for which IBM manages security operations. Participants have created more than 300 new collections of threat data in the last month alone.

“Cybercrime has become the equivalent of a pandemic — no company or country can battle it alone,” said Brendan Hannigan, General Manager, IBM Security. “We have to take a collective and collaborative approach across the public and private sectors to defend against cybercrime. Sharing and innovating around threat data is central to battling highly organized cybercriminals; the industry can no longer afford to keep this critical resource locked up in proprietary databases. With X-Force Exchange, IBM has opened access to our extensive threat data to advance collaboration and help public and private enterprises safeguard themselves.”

Think about the numbers for a moment: 1,000 organizations and 300 new collections of threat data in a month. Not bad by anyone’s yardstick.

As I titled my first post on the X-Force Exchange: Being Thankful IBM is IBM.

## Civil War Navies Bookworm

May 19th, 2015

Civil War Navies Bookworm by Abby Mullen.

From the post:

If you read my last post, you know that this semester I engaged in building a Bookworm using a government document collection. My professor challenged me to try my system for parsing the documents on a different, larger collection of government documents. The collection I chose to work with is the Official Records of the Union and Confederate Navies. My Barbary Bookworm took me all semester to build; this Civil War navies Bookworm took me less than a day. I learned things from making the first one!

This collection is significantly larger than the Barbary Wars collection—26 volumes, as opposed to 6. It encompasses roughly the same time span, but 13 times as many words. Though it is still technically feasible to read through all 26 volumes, this collection is perhaps a better candidate for distant reading than my first corpus.

The document collection is broken into geographical sections, the Atlantic Squadron, the West Gulf Blockading Squadron, and so on. Using the Bookworm allows us to look at the words in these documents sequentially by date instead of having to go back and forth between different volumes to get a sense of what was going on in the whole navy at any given time.

The earlier post: Text Analysis on the Documents of the Barbary Wars

More details on Bookworm.

As with all ngram viewers, exercise caution in assuming a text string has uniform semantics across historical, ethnic, or cultural fault lines.

## Mile High Club

May 19th, 2015

From the post:

A very elite club was just created by Chris Roberts, if his allegations of commandeering an airplane are true. Modern day transportation relies heavily on remote access to the outside world…and consumer trust. These two things have been at odds recently, ever since the world read a tweet from Chris Roberts, in which he jokingly suggested releasing oxygen masks while aboard a commercial flight. Whether or not Roberts was actually joking about hacking the aircraft is up for debate, but the move led the Government Accountability Office to issue a warning about potential vulnerabilities to aircraft systems via in-flight Wi-Fi.

Chris has a great suggestion:

While I agree that we don’t want every 16-year-old script kiddie trying to tamper with people’s lives at 35,000 feet, we do wonder if United or any of the other major carriers would be willing to park a plane at Black Hat. Surely if they were certain that there is no way to exploit the pilot’s aviation systems, they would be willing to allow expert researchers to have a look while the plane is on the ground? Tremendous insight and overall global information security could only improve if a major carrier or manufacturer hosted a hack week on a Dreamliner on the tarmac at McCarran international.

A couple of candidate Black Hat conferences:

BLACK HAT | USA August 1-6, 2015 | Mandalay Bay | Las Vegas, NV

BLACK HAT | EUROPE November 10-13, 2015 | Amsterdam RAI | The Netherlands

Do you think the conference organizers would comp registration for the people who come with the plane?

As for airlines, the top ten (10) in income (US) for 2014:

When you register for a major Black Hat conference, ask the organizers to stage an airline hacking event. Especially on:

Big Black Hat Conference signs with the make/model on them for the tarmac.

Would make an interesting press event to have a photo of the conference sign with no plane.

Sorta speaks for itself. Yes?

## The Back-to-Basics Readings of 2012

May 19th, 2015

The Back-to-Basics Readings of 2012 by Werner Vogels (CTO – Amazon.com).

From the post:

After the AWS re: Invent conference I spent two weeks in Europe for the last customer visits of the year. I have since returned and am now in New York City enjoying a few days of winding down the last activities of the year before spending the holidays here with family. Do not expect too many blog posts or twitter updates. Although there are still a few very exciting AWS news updates to happen this year.

I thought this was a good moment to collect all the readings I suggested this year in one summary post. It was not until later in the year that I started to recording the readings here on the blog, so I hope this is indeed the complete list. I am pretty sure some if not all of these papers deserved to be elected to the hall of fame of best papers in distributed systems.

My count is twenty-four (24) papers. More than enough for a weekend at the beach!

I first saw this in a tweet by Computer Science.

## NY police commissioner wants 450 more cops to hunt Unicorns

May 19th, 2015

OK, not exactly what Police Commissioner Bratton said but it may as well be.

Bratton is quoted in the post as saying:

As the fear over the threat of terrorism continues to swell around the world, New York City becomes increasingly on edge that it’s time to take extra security precautions.

Although we have not experienced the caliber such as the attacks in Paris, we have a history of being a major target and ISIS has already begun to infiltrate young minds through the use of video games and social media.

Since the New Year there have been numerous arrests in Brooklyn and Queens for people attempting to assist ISIS from afar, building homemade bombs and laying out plans of attack.

This is called no-evidence police policy.

Particularly when you examine the “facts” behind:

…numerous arrests in Brooklyn and Queens for people attempting to assist ISIS from afar, building homemade bombs and laying out plans of attack.

Queens? Oh, yes, the two women from Queens whom an FBI informant befriended by finding a copy of the Anarchist Cookbook online and printing it out for them. Not to mention taking one of them shopping for bomb components. The alleged terrorists were going to educate themselves on bomb making. More a threat to themselves than anyone else. See e451 and Senator Dianne Feinstein. The focus of that post is on the Anarchist Cookbook, but you can get the drift of how silly the complaint was in fact.

As for Brooklyn, you can read the complaint for yourself, but the gist of it was that one of the three defendants could not travel because his mother would not give him his passport. Serious terrorist people we are dealing with here. The other two were terrorist wannabes who were long on boasting skills, but there isn’t a shortage of boasters in ISIS. Had they been able to connect with ISIS by some happenstance, it would have degraded the operational capabilities of ISIS, not assisted it.

A recent estimate for the Muslim population of New York puts the total number of Muslims at 600,000. 175 Mosques in New York City and counting. Muslims in New York City, Part II [2010]

The FBI was able to assist and prompt five (5) people out of an estimated 600,000 into making boasts about assisting ISIS and/or traveling to join ISIS.

I won’t even bother doing the math. Anyone who is looking for facts will know that five (5) arrests in a city of 12 million people don’t qualify as “numerous.” Especially when those arrests were for thought crimes and amateurish boasting more than any attempt at an actual crime.

Support more cops to hunt Unicorns. Unicorn hunting doesn’t require military tactics or weapons, thereby making the civilian population safer.

## Fast parallel computing with Intel Phi coprocessors

May 19th, 2015

Fast parallel computing with Intel Phi coprocessors by Andrew Ekstrom.

Andrew tells a tale of going from more than a week processing a 10,000×10,000 matrix raised to 10^17 to 6-8 hours and then substantially shorter times. Sigh, using Windows but still an impressive feat! As you might expect, using Revolution Analytics RRO, Intel’s Math Kernel Library (MKL), Intel Phi coprocessors, etc.

There’s enough detail (I suspect) for you to duplicate this feat on your own Windows box, or perhaps more easily on Linux.
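Andrew's own code is not shown here, so what follows is only a sketch of one standard technique for large matrix powers: exponentiation by repeated squaring, on a toy 2×2 integer matrix. Whether his R and MKL setup works this way is an assumption on my part, and `Mat`, `mat_mul`, `mat_identity` and `mat_pow` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DIM 2  /* toy size; the post works with 10,000 x 10,000 */

typedef struct { uint64_t v[DIM][DIM]; } Mat;

/* Plain triple-loop product; entries wrap modulo 2^64 on overflow. */
static Mat mat_mul(const Mat *a, const Mat *b)
{
    Mat c;
    assert(a != NULL && b != NULL);
    memset(&c, 0, sizeof c);
    for (int i = 0; i < DIM; i++)
        for (int k = 0; k < DIM; k++)
            for (int j = 0; j < DIM; j++)
                c.v[i][j] += a->v[i][k] * b->v[k][j];
    return c;
}

static Mat mat_identity(void)
{
    Mat m;
    memset(&m, 0, sizeof m);
    for (int i = 0; i < DIM; i++)
        m.v[i][i] = 1;
    return m;
}

/* Raise `base` to `exp` by scanning the bits of the exponent:
 * multiply into the result for each set bit, square the base each step. */
Mat mat_pow(Mat base, uint64_t exp)
{
    Mat result = mat_identity();
    while (exp > 0) {
        if (exp & 1)
            result = mat_mul(&result, &base);
        base = mat_mul(&base, &base);
        exp >>= 1;
    }
    return result;
}
```

An exponent of 10^17 fits in 57 bits, so this loop performs at most 57 squarings and at most 57 further multiplications, rather than 10^17 of them. The cost of each multiplication on a 10,000×10,000 matrix is where libraries like MKL and the Phi coprocessors would come in.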

Enjoy!

I first saw this in a tweet by David Smith.

## The Applications of Probability to Cryptography

May 19th, 2015

The underlying manuscript is held by the National Archives in the UK and can be accessed at www.nationalarchives.gov.uk using reference number HW 25/37. Readers are encouraged to obtain a copy.

The original work was under Crown copyright, which has now expired, and the work is now in the public domain.

You can go directly to the record page: http://discovery.nationalarchives.gov.uk/details/r/C11510465.