Archive for September, 2015

Christmas in October? (Economics of Cybersecurity)

Tuesday, September 22nd, 2015

From the post:

If there’s something which is in high demand from both the common internet criminals and intelligence agencies around the world, it’s a way of easily infecting the iPhones and iPads of individuals.

The proof that there is high demand for a way to remotely and reliably exploit iOS devices, in order to install malware that can spy upon communications and snoop upon a user’s whereabouts, is proven by a staggering $1 million reward being offered by one firm for exclusive details of such a flaw. In an announcement on its website, newly-founded vulnerability broker Zerodium, offers the million dollar bounty to “each individual or team who creates and submits an exclusive, browser-based, and untethered jailbreak for the latest Apple iOS 9 operating system and devices.” There’s no denying – that’s a lot of cash. And Zerodium says it won’t stop there. In fact, it says that it will offer a grand total of $3 million in rewards for iOS 9 exploits and jailbreaks.

Graham says the most likely buyers from Zerodium are governments, which are more likely to pay large sums than Microsoft or Apple.

There is a reason for that. Microsoft, Apple, Cisco, etc., face no economic downside from zero-day exploits.

Zero-day exploits tarnish reputations or so it is claimed. For most vendors it would be hard to find another black mark in addition to all the existing ones.

If zero-day exploits had an impact on sales, the current vendor landscape would be far different than it is today.

With no economic impact on sales or reputations, it is easy to understand the complacency of vendors in the face of zero-day exploits and contests to create the same.

I keep using the phrase “economic impact on” to distinguish economic consequences from all the hand wringing and tough talk you hear from vendors about cybersecurity. Unless and until something impacts the bottom line on a balance sheet, all the talk is just cant.

If some legislative body, Congress (in the U.S.) comes to mind, were to pass legislation that:

• Imposes strict liability for all code-level vulnerabilities
• Establishes a minimum level of presumed damages plus court costs and attorneys’ fees
• Establishes an expedited process for resolving claims within six months
• Establishes tax credits for zero-day exploits purchased by vendors

the economics of cybersecurity would change significantly.

Vendors would have economic incentives to both write cleaner code and to purchase zero-day exploits on the open market.

Hackers would have economic incentives to find hacks because there is automatic liability on the part of software vendors for their exploits.

The time has come to end the free ride for software vendors on the issue of liability for software exploits.

The result will be a safer world for everyone.

Python & R codes for Machine Learning

Monday, September 21st, 2015

While I am thinking about machine learning, I wanted to mention: Cheatsheet – Python & R codes for common Machine Learning Algorithms by Manish Saraswat.

From the post:

In his famous book – Think and Grow Rich, Napolean Hill narrates story of Darby, who after digging for a gold vein for a few years walks away from it when he was three feet away from it!

Now, I don’t know whether the story is true or false. But, I surely know of a few Data Darby around me. These people understand the purpose of machine learning, its execution and use just a set 2 – 3 algorithms on whatever problem they are working on. They don’t update themselves with better algorithms or techniques, because they are too tough or they are time consuming.

Like Darby, they are surely missing from a lot of action after reaching this close! In the end, they give up on machine learning by saying it is very computation heavy or it is very difficult or I can’t improve my models above a threshold – what’s the point? Have you heard them?

Today’s cheat sheet aims to change a few Data Darby’s to machine learning advocates. Here’s a collection of 10 most commonly used machine learning algorithms with their codes in Python and R. Considering the rising usage of machine learning in building models, this cheat sheet is good to act as a code guide to help you bring these machine learning algorithms to use. Good Luck!

Here’s a very good idea! Whether you want to learn these algorithms or a new Emacs mode. 😉

Sure, you can always look up the answer but that breaks your chain of thought, over and over again.

Enjoy!

Machine-Learning-Cheat-Sheet [Cheating Machine Learning?]

Monday, September 21st, 2015

From the Preface:

This cheat sheet contains many classical equations and diagrams on machine learning, which will help you quickly recall knowledge and ideas in machine learning.

This cheat sheet has three significant advantages:

1. Strong typed. Compared to programming languages, mathematical formulas are weakly typed. For example, X can be a set, a random variable, or a matrix. This causes difficulty in understanding the meaning of formulas. In this cheat sheet, I try my best to standardize symbols used, see section §.

2. More parentheses. In machine learning, authors are prone to omit parentheses, brackets and braces, this usually causes ambiguity in mathematical formulas. In this cheat sheet, I use parentheses(brackets and braces) at where they are needed, to make formulas easy to understand.

3. Less thinking jumps. In many books, authors are prone to omit some steps that are trivial in his option. But it often makes readers get lost in the middle way of derivation.

Two other advantages of this “cheat-sheet” are that it resides on Github and is written using the Springer LaTeX template.

Neural networks can be easily fooled (see Deep Neural Networks are Easily Fooled:…), so the question becomes: how easy is it to fool the machine learning algorithms summarized by Frank Dai?

Or to put it another way, if I know the machine learning algorithm most likely to be used, what steps, if any, can I take to shape data to influence the likely outcome?

Excluding outright false data because that would be too easily detected and possibly trip too many alarms.

The more you know about how an algorithm can be cheated, the safer you will be in evaluating the machine learning results of others.
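To make the question concrete, here is a minimal sketch of shaping data without adding anything outright false. It assumes a nearest-centroid classifier, picked only because it is easy to follow (I am not claiming it is the algorithm anyone would actually use), and every data point and label below is invented for illustration. Three extra "approve" records, each identical to a value already in the training set, are enough to flip a borderline case.

```python
def centroid(points):
    """Mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(x, classes):
    """Nearest-centroid rule: pick the label whose class mean is closest to x."""
    centroids = {label: centroid(pts) for label, pts in classes.items()}
    return min(centroids, key=lambda label: squared_distance(x, centroids[label]))


# Hypothetical training data, invented for this sketch.
training = {
    "approve": [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)],
    "deny": [(4.0, 4.0), (5.0, 4.0), (4.0, 5.0), (5.0, 5.0)],
}
borderline = (3.1, 3.1)

before = classify(borderline, training)

# "Shape" the data: add three records that duplicate a value already seen,
# so no single record looks out of place.
shaped = {label: list(pts) for label, pts in training.items()}
shaped["approve"] += [(2.0, 2.0)] * 3

after = classify(borderline, shaped)
print(before, "->", after)
```

The added records pull the "approve" centroid from (1.5, 1.5) to roughly (1.71, 1.71), which moves the decision boundary just past the borderline point. Asking how the training data was assembled is the obvious defense.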

I first saw this in a tweet by Kirk Borne.

Are You Deep Mining Shallow Data?

Monday, September 21st, 2015

Do you remember this verse of Simple Simon?

Simple Simon went a-fishing,

For to catch a whale;

All the water he had got,

Was in his mother’s pail.

Shallow data?

To illustrate, fill in the following statement:

My mom makes the best _____.

Before completing that statement, you resolved the common noun, “mom,” differently than I did.

The string carries no clue as to the resolution of “mom” by any reader.

The string also gives no clues as to how it would be written in another language.

With a string, all you get is the string, or in other words:

All strings are shallow.

That applies to the strings we use to add depth to strings but we will reach that issue shortly.

One of the few things that RDF got right was:

…RDF puts the information in a formal way that a machine can understand. The purpose of RDF is to provide an encoding and interpretation mechanism so that resources can be described in a way that particular software can understand it; in other words, so that software can access and use information that it otherwise couldn’t use. (quote from Wikipedia on RDF)

In addition to the string, RDF posits an identifier in the form of a URI which you can follow to discover more information about that portion of string.
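A minimal sketch of that idea, in plain Python rather than RDF syntax. The class name and the example.org URIs are hypothetical, invented only to show the difference between comparing strings and comparing identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentifiedString:
    """A string paired with an identifier for what it refers to (hypothetical type)."""
    text: str
    uri: str


# Two people write the same string but mean different subjects.
writer_mom = IdentifiedString("mom", "http://example.org/people/writers-mother")
reader_mom = IdentifiedString("mom", "http://example.org/people/readers-mother")

same_string = writer_mom.text == reader_mom.text   # the strings alone agree
same_subject = writer_mom.uri == reader_mom.uri    # the identifiers do not
print(same_string, same_subject)
```

The string comparison says the two are the same; only the identifier reveals that they are not.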

Unfortunately RDF was burdened by the need for all new identifiers to replace those already in place, an inability to easily distinguish identifier URIs from URIs that lead to subjects of conversation, and encoding requirements that reduced the population of potential RDF authors to a righteous remnant.

Despite its limitations and architectural flaws, RDF is evidence that strings are indeed shallow. Not to mention that if we could give strings depth, their usefulness would be greatly increased.

One method for imputing more depth to strings is natural language processing (NLP). Modern NLP techniques are based on statistical analysis of large data sets and are the most accurate for very common cases. The statistical nature of NLP makes applying those techniques to very small amounts of text, or to text with unusual styles of usage, problematic.

Noting the limits of statistical techniques isn’t a criticism of NLP but rather an observation that, depending on the level of accuracy desired and your data, such techniques may or may not be useful.

What is acceptable for imputing depth to strings in movie reviews is unlikely to be thought so when deciphering a manual for disassembling an atomic weapon. The question isn’t whether NLP can impute depth to strings but whether that imputation is sufficiently accurate for your use case.

Of course, RDF and NLP aren’t the only two means for imputing depth to strings.

We will take up another method for giving strings depth tomorrow.

Announcing Spark 1.5

Sunday, September 20th, 2015

Announcing Spark 1.5 by Reynold Xin and Patrick Wendell.

From the post:

Today we are happy to announce the availability of Apache Spark’s 1.5 release! In this post, we outline the major development themes in Spark 1.5 and some of the new features we are most excited about. In the coming weeks, our blog will feature more detailed posts on specific components of Spark 1.5. For a comprehensive list of features in Spark 1.5, you can also find the detailed Apache release notes below.

Many of the major changes in Spark 1.5 are under-the-hood changes to improve Spark’s performance, usability, and operational stability. Spark 1.5 ships major pieces of Project Tungsten, an initiative focused on increasing Spark’s performance through several low-level architectural optimizations. The release also adds operational features for the streaming component, such as backpressure support. Another major theme of this release is data science: Spark 1.5 ships several new machine learning algorithms and utilities, and extends Spark’s new R API.

One interesting tidbit is that in Spark 1.5, we have crossed the 10,000 mark for JIRA number (i.e. more than 10,000 tickets have been filed to request features or report bugs). Hopefully the added digit won’t slow down our development too much!

Enjoy!

10 Misconceptions about Neural Networks [Update to car numberplate game?]

Saturday, September 19th, 2015

From the post:

Neural networks are one of the most popular and powerful classes of machine learning algorithms. In quantitative finance neural networks are often used for time-series forecasting, constructing proprietary indicators, algorithmic trading, securities classification and credit risk modelling. They have also been used to construct stochastic process models and price derivatives. Despite their usefulness neural networks tend to have a bad reputation because their performance is “temperamental”. In my opinion this can be attributed to poor network design owing to misconceptions regarding how neural networks work. This article discusses some of those misconceptions.

The car numberplate game was a game where passengers in a car, usually children, would compete to find license plates from different states (in the US). That was prior to children being entombed in intellectual isolation bubbles with iPads, Gameboys, DVD players and wireless access, while riding.

Hard to believe but some people used to look outside the vehicle in which they were riding. Now of course what little attention they have is captured by cellphones and not other occupants of the same vehicle.

Rather than rail against that trend, may I suggest we update the car numberplate game to “mistakes about neural networks?”

Using Stuart’s post as a baseline, send a text message to each passenger pointing to Stuart’s post and requesting a count of the number of “mistakes about neural networks” they can find in an hour.

Personally I would put popular media off limits for post-high school players to keep the scores under four digits.

When discussing the scores, after sharing browsing histories, each player has to analyze the claimed error and match it to one on Stuart’s list.

I realize that will require full bandwidth communication with others in your physical presence but with practice, that won’t seem so terribly odd.

I first saw this in a tweet by Kirk Borne.

Tor relay turned back on after unanimous library vote

Friday, September 18th, 2015

From the post:

Live free or die.

That, possibly the most well-known of US state mottos, is declared on vehicle license plates throughout the verdant, mountainous, cantankerous state of New Hampshire.

True to that in-your-face independence, on Tuesday evening, in the New Hampshire town of Lebanon, the Lebanon Libraries board unanimously seized freedom and privacy by flipping the bird to the Department of Homeland Security (DHS) and its grudge against the Tor network.

Dozens of community members had come to the meeting to chime in on whether the Kilton Public Library should go ahead with a project to set up a Tor relay: a project that was shelved after a DHS agent reached out to warn New Hampshire police – or, as some classify it, spread FUD – that Tor shields criminals.

Boston librarian Alison Macrina, the mastermind behind the Library Freedom Project (LFP) and its plan to install exit nodes in libraries in collaboration with the Tor Project, said in an article on Slate (co-authored with digital rights activist April Glaser) that the unanimous vote to reinstate the library’s Tor relay was greeted with enthusiasm:

When library director Sean Fleming declared that the relay would go back online, a huge round of applause rang out. The citizens of Lebanon fought to protect privacy and intellectual freedom from the Department of Homeland Security’s intimidation tactics - and they won.

One bright spot of news in the flood of paranoid reports concerning terrorism and government demands for greater surveillance of everyone.

If you aren’t running Tor you should be.

Privacy is everyone’s concern.

Introduction to d3.js [including maps!]

Friday, September 18th, 2015

Introduction to d3.js with Mike Taptich.

From part one of a two-part introduction to d3.js:

Don’t know Javascript? No problem.

This two-part introduction to d3.js is intended for beginners, even those with limited exposure to JavaScript — the language used by web browsers. Regardless of your background, I put together a few examples to get everyone (re)oriented to coding in the browser.

Sounds like your typical introduction to …. type tutorial until you reach:

My Coding Philosophy

Now, I am a firm believer in coding with a purpose. When you are first starting off learning JavaScript, you may want to learn everything there is to learn about the language (broad). I suggest that you instead only pick up the pieces of code you need in order to complete some bigger project (narrow). To give you a glimpse of what you could do in d3, check out Mike Bostock’s github repo here or Christopher Viau’s well-curated repository here.

You don’t know at that point but Part 2 is going to focus on maps and displaying data on maps! Some simple examples and some not quite so simple.

D3.js will do a lot of things but you can get a tangible result for many use cases by mapping data to maps.

Some random examples:

• Where in your city have the police shot civilians?
• Are there any neighborhoods where no one has been shot by the police?
• Restaurants by ethnic cuisine.
• Pothole incidence rates.
• Frequency of police patrols.

This rocks!

I first saw this in a tweet by Christophe Viau.

Office of Personnel Management – Update

Thursday, September 17th, 2015

Interim Status Report on OPM’s Responses to the Flash Audit Alert by Patrick E. McFarland, Inspector General.

You may have missed the interim report from Inspector General Patrick McFarland on the OPM’s efforts to correct its ongoing cybersecurity issues.

The question that wasn’t asked during the most recent Republican clown car debate was how each contender would respond to an agency such as the OPM, which:

…rejects the recommendation to adopt project management best practices, and states that it adheres to its own system development lifecycle policy which is based on federal standards.

Does that sound familiar in your organization?

Solving the command and control issues over OPM is a necessary first step towards removing the OPM as a government-wide security risk.

Constitution Day – The Annotated Constitution Celebrated

Thursday, September 17th, 2015

From the post:

Thursday, September 17th is Constitution Day and on this date we commemorate the signing of the Constitution. This day also recognizes those who have become citizens of the United States by coming of age or by naturalization. The Law Library frequently celebrates this auspicious day with a lecture or scholarly debate. Over the years we have written about different aspects of the Constitution, its history and various Constitutional amendments. This year I thought it would be helpful to highlight one of our most important resources in answering questions about the Constitution and its history. What is this invaluable resource? It is The Constitution of the United States of America: Analysis and Interpretation.

This publication, which celebrated its centennial in 2013, is available both in print and online. At the direction of the Librarian of Congress, this publication is prepared by staff from the Congressional Research Service and, since at least 1964, it has been published as a Senate document. Many of the staff here at the Law Library have an older edition of the print publication in our offices, and there are always two or three current editions available in the Law Library Reading Room.

Despite my antipathy for some government departments and activities, I have a weakness for the Library of Congress in general and the Congressional Research Service in particular.

The 1972 edition of The Constitution of the United States of America: Analysis and Interpretation was my first exposure to a Congressional Research Service publication, the first of many.

You won’t have to spend long with a current edition to discover that “interpreting” the Constitution isn’t nearly as straightforward and unambiguous as many claim.

The 2014 edition runs 2814 pages long. A bit unwieldy in print so I will be reading my next copy on an e-reader.

Elliptic Curve Cryptography: a gentle introduction

Wednesday, September 16th, 2015

Elliptic Curve Cryptography: a gentle introduction by Andrea Corbellini.

From the post:

Those of you who know what public-key cryptography is may have already heard of ECC, ECDH or ECDSA. The first is an acronym for Elliptic Curve Cryptography, the others are names for algorithms based on it.

Today, we can find elliptic curves cryptosystems in TLS, PGP and SSH, which are just three of the main technologies on which the modern web and IT world are based. Not to mention Bitcoin and other cryptocurrencies.

Before ECC become popular, almost all public-key algorithms were based on RSA, DSA, and DH, alternative cryptosystems based on modular arithmetic. RSA and friends are still very important today, and often are used alongside ECC. However, while the magic behind RSA and friends can be easily explained, is widely understood, and rough implementations can be written quite easily, the foundations of ECC are still a mystery to most.

With a series of blog posts I’m going to give you a gentle introduction to the world of elliptic curve cryptography. My aim is not to provide a complete and detailed guide to ECC (the web is full of information on the subject), but to provide a simple overview of what ECC is and why it is considered secure, without losing time on long mathematical proofs or boring implementation details. I will also give helpful examples together with visual interactive tools and scripts to play with.

Specifically, here are the topics I’ll touch:

In order to understand what’s written here, you’ll need to know some basic stuff of set theory, geometry and modular arithmetic, and have familiarity with symmetric and asymmetric cryptography. Lastly, you need to have a clear idea of what an “easy” problem is, what a “hard” problem is, and their roles in cryptography.

Whether you can make it through this series of posts or not, it remains a great URL to have show up in a public terminal’s web browsing history.

Even if you aren’t planning on “going dark,” you can do your part to create noise that will cover those who do.

Take the opportunity to visit this site and other cryptography resources. Like the frozen North, they may not be around for your grandchildren to see.

Theoretical Encryption Horror Stories

Wednesday, September 16th, 2015

Jenna reports the best quote I have seen from FBI Director James Comey on the criminals “going dark:”

Previous examples provided by FBI Director James Comey in October to illustrate the dangers of “going dark” turned out to be almost laughable. Comey acknowledged at the time that he had “asked my folks just to canvas” for examples he could use, “but I don’t think I’ve found that one yet.” Then he immediately added: “I’m not looking.”

Jenna’s post should be read verbatim into every committee, sub-committee and other hearing conducted by Congress on encryption issues.

What is more disturbing than the FBI lacking evidence for its position on encryption, and neglecting even to see if it exists, is that FBI representatives still appear as witnesses in court and before Congress, and are taken seriously by the news media.

What other group could admit that their “facts” were in truth fantasies and still be taken seriously by anyone?

The FBI should return to the pursuit of legitimate criminals (of which there appears to be no shortage) or be ignored and disbelieved by everyone starting with judges and ending with the news media.

Creating a genetic algorithm for beginners

Wednesday, September 16th, 2015

Creating a genetic algorithm for beginners by Lee Jacobson.

From the post:

A genetic algorithm (GA) is great for finding solutions to complex search problems. They’re often used in fields such as engineering to create incredibly high quality products thanks to their ability to search a through a huge combination of parameters to find the best match. For example, they can search through different combinations of materials and designs to find the perfect combination of both which could result in a stronger, lighter and overall, better final product. They can also be used to design computer algorithms, to schedule tasks, and to solve other optimization problems. Genetic algorithms are based on the process of evolution by natural selection which has been observed in nature. They essentially replicate the way in which life uses evolution to find solutions to real world problems. Surprisingly although genetic algorithms can be used to find solutions to incredibly complicated problems, they are themselves pretty simple to use and understand.

How they work

As we now know they’re based on the process of natural selection, this means they take the fundamental properties of natural selection and apply them to whatever problem it is we’re trying to solve.

The basic process for a genetic algorithm is:

1. Initialization – Create an initial population. This population is usually randomly generated and can be any desired size, from only a few individuals to thousands.
2. Evaluation – Each member of the population is then evaluated and we calculate a ‘fitness’ for that individual. The fitness value is calculated by how well it fits with our desired requirements. These requirements could be simple, ‘faster algorithms are better’, or more complex, ‘stronger materials are better but they shouldn’t be too heavy’.
3. Selection – We want to be constantly improving our populations overall fitness. Selection helps us to do this by discarding the bad designs and only keeping the best individuals in the population.  There are a few different selection methods but the basic idea is the same, make it more likely that fitter individuals will be selected for our next generation.
4. Crossover – During crossover we create new individuals by combining aspects of our selected individuals. We can think of this as mimicking how sex works in nature. The hope is that by combining certain traits from two or more individuals we will create an even ‘fitter’ offspring which will inherit the best traits from each of it’s parents.
5. Mutation – We need to add a little bit randomness into our populations’ genetics otherwise every combination of solutions we can create would be in our initial population. Mutation typically works by making very small changes at random to an individuals genome.
6. And repeat! – Now we have our next generation we can start again from step two until we reach a termination condition.

Termination

There are a few reasons why you would want to terminate your genetic algorithm from continuing its search for a solution. The most likely reason is that your algorithm has found a solution which is good enough and meets a predefined minimum criteria. Other reasons for terminating could be constraints such as time or money.

A bit old, 2012, but it is a good introduction to genetic algorithms and if you read the comments (lots of those), you will find ports into multiple languages.

The important point to remember: when presented with genetic algorithm results, be sure to ask for the fitness criteria, selection method, termination condition and the number of generations run.

Personally I would ask for the starting population and code as well.
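Here is a minimal sketch of a run that answers those questions for you. It is not Lee Jacobson's code. The toy problem (maximize the number of 1 bits), truncation selection, single-point crossover, and every parameter value are my assumptions, chosen to keep the example short. The run is seeded, so anyone with the code and the seed can reproduce the starting population and the result.

```python
import random


def fitness(genome):
    """Fitness criterion (assumed for this sketch): the number of 1 bits."""
    return sum(genome)


def run_ga(seed=0, pop_size=20, genome_len=16, mutation_rate=0.05,
           max_generations=200):
    rng = random.Random(seed)  # seeded so the run can be repeated exactly

    # Initialization: a random starting population.
    population = [[rng.randint(0, 1) for _ in range(genome_len)]
                  for _ in range(pop_size)]
    starting_population = [genome[:] for genome in population]

    generations = 0
    # Termination: a perfect genome is found or the generation cap is reached.
    while (max(fitness(g) for g in population) < genome_len
           and generations < max_generations):
        # Evaluation and selection: the fitter half survives unchanged.
        population.sort(key=fitness, reverse=True)
        parents = population[:pop_size // 2]

        children = []
        while len(children) < pop_size - len(parents):
            # Crossover: splice two distinct parents at a random cut point.
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, genome_len)
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)

        population = parents + children
        generations += 1

    best = max(population, key=fitness)
    return {
        "seed": seed,
        "fitness_criterion": "count of 1 bits",
        "selection_method": "truncation (top half survives)",
        "termination": "all bits set, or max_generations reached",
        "generations_run": generations,
        "starting_population": starting_population,
        "best": best,
        "best_fitness": fitness(best),
    }


report = run_ga(seed=0)
print(report["generations_run"], report["best_fitness"])
```

A result delivered without something like this report is a card you have not been allowed to cut.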

There are any number of ways to produce an “objective” result from simply running a genetic algorithm so adopt that Heinlein adage: “Always cut cards.”

Applies in data science as it does in moon colonies.

Value of Big Data Depends on Identities in Big Data

Tuesday, September 15th, 2015

Intel Exec: Extracting Value From Big Data Remains Elusive by George Leopold.

From the post:

Intel Corp. is convinced it can sell a lot of server and storage silicon as big data takes off in the datacenter. Still, the chipmaker finds that major barriers to big data adoption remain, most especially what to do with all those zettabytes of data.

“The dirty little secret about big data is no one actually knows what to do with it,” Jason Waxman, general manager of Intel’s Cloud Platforms Group, asserted during a recent company datacenter event. Early adopters “think they know what to do with it, and they know they have to collect it because you have to have a big data strategy, of course. But when it comes to actually deriving the insight, it’s a little harder to go do.”

Put another way, industry analysts rate the difficulty of determining the value of big data as far outweighing considerations like technological complexity, integration, scaling and other infrastructure issues. Nearly two-thirds of respondents to a Gartner survey last year cited by Intel stressed they are still struggling to determine the value of big data.

“Increased investment has not led to an associated increase in organizations reporting deployed big data projects,” Gartner noted in its September 2014 big data survey. “Much of the work today revolves around strategy development and the creation of pilots and experimental projects.”

It may just be me, but “determining value,” “risk and governance,” and “integrating multiple data sources,” the top three barriers to use of big data, all depend on knowing the identities represented in big data.

The trivial data integration demos that share “customer-ID” fields don’t inspire a lot of confidence about data integration when “customer-ID” may be identified in as many ways as there are data sources. And that is a minor example.
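A minimal sketch of even that minor example. The source names, field names, and records are all hypothetical, and the normalization rule (trim whitespace, drop leading zeros) is an assumption that would have to be checked against each real source.

```python
# Which field carries the customer identity in each source (hypothetical names).
CUSTOMER_ID_FIELDS = {
    "billing": "customer-ID",
    "support": "cust_no",
    "web": "CustomerNumber",
}


def customer_key(source, record):
    """Reduce each source's notion of a customer identifier to one form.

    Assumes identifiers differ only in type, padding and whitespace.
    """
    raw = record[CUSTOMER_ID_FIELDS[source]]
    return str(raw).strip().lstrip("0")


def integrate(sources):
    """Group records from every source under the customer they describe."""
    merged = {}
    for source, records in sources.items():
        for record in records:
            merged.setdefault(customer_key(source, record), {})[source] = record
    return merged


sources = {
    "billing": [{"customer-ID": "00042", "balance": 10.5}],
    "support": [{"cust_no": 42, "tickets": 3}],
    "web": [{"CustomerNumber": "42 ", "visits": 7}],
}
merged = integrate(sources)
print(merged)
```

The code is the easy part. The table at the top, which says which field in which source identifies a customer and how, is the captured identity, and someone had to know enough to write it down.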

It would be very hard to determine the value you can extract from data when you don’t know what the data represents, its accuracy (risk and governance), and what may be necessary to integrate it with other data sources.

More processing power from Intel is always welcome but churning poorly understood big data faster isn’t going to create value. Quite the contrary, investment in more powerful hardware isn’t going to be favorably reflected on the bottom line.

Investment in capturing the diverse identities in big data will empower easier valuation of big data, evaluation of its risks and uncovering how to integrate diverse data sources.

Capturing diverse identities won’t be easy, cheap or quick. But not capturing them will leave the value of Big Data unknown, its risks uncertain and integration a crap shoot when it is ever attempted.

Graphs in the world: Modeling systems as networks

Tuesday, September 15th, 2015

Graphs in the world: Modeling systems as networks by Russel Jurney.

From the post:

Networks of all kinds drive the modern world. You can build a network from nearly any kind of data set, which is probably why network structures characterize some aspects of most phenomenon. And yet, many people can’t see the networks underlying different systems. In this post, we’re going to survey a series of networks that model different systems in order to understand different ways networks help us understand the world around us.

We’ll explore how to see, extract, and create value with networks. We’ll look at four examples where I used networks to model different phenomenon, starting with startup ecosystems and ending in network-driven marketing.

Loaded with successful graph modeling stories Russel’s post will make you anxious to find a data set to model as a graph.

Which is a good thing.

Combining two inboxes (Russel’s and his brother’s) works because you can presume that identical email addresses belong to the same user. But what about different email addresses that belong to the same user?
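A minimal sketch of what answering that question takes. The addresses and the alias table are hypothetical. The point is that the merge only works because someone recorded, outside the data itself, that two addresses identify the same person.

```python
from collections import Counter

# Hypothetical alias table: alternate address -> canonical address (lowercase).
ALIASES = {
    "pat@work.example": "pat@home.example",
}


def canonical(address):
    """Map an address to the one chosen to stand for that person."""
    address = address.lower()
    return ALIASES.get(address, address)


def merge_graphs(*edge_lists):
    """Combine (sender, recipient) edge lists, counting messages per edge."""
    weights = Counter()
    for edges in edge_lists:
        for sender, recipient in edges:
            weights[(canonical(sender), canonical(recipient))] += 1
    return weights


inbox_a = [("ann@example.org", "pat@home.example"),
           ("bob@example.org", "pat@home.example")]
inbox_b = [("ann@example.org", "Pat@work.example")]

graph = merge_graphs(inbox_a, inbox_b)
print(graph)
```

Without the alias table, Pat becomes two nodes and Ann's two messages land on two separate edges.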

For data points that will become nodes in your graph, what “properties” do you see in them that make them separate nodes? Have you captured those properties on those nodes? Ditto for relationships that will become arcs in your graph.

How easy is it for someone other than yourself to combine a graph you make with a graph made by a third person?

Data, whether represented as a graph or not, is nearly always “transparent” to its creator. Beyond modeling, the question for graphs is have you enabled transparency for others?

I first saw this in a tweet by Kirk Borne.

Most Significant Barriers to Achieving a Strong Cybersecurity Posture

Tuesday, September 15th, 2015

Cyber-Security Stat of the Day is sponsored by Grid Cyber Sec, and is a window into cyber-security practices/thinking.

For September 14, 2015, we find Most Significant Barriers to Achieving a Strong Cybersecurity Posture:

Does the omission of “more secure software” shock you? (You know the difference between “shock” and “surprise.” Yes?)

If we keep layering buggy software on top of buggy software, then we are no smarter than most of the members of Congress who think legislation can determine behavior. Legislation can influence behavior, mostly in ways not intended, but determine it?

Buggy software + more buggy software = cyber insecurity.

BTW, do subscribe to Cyber-Security Stat of the Day. Sometimes funny, sometimes helpful, sometimes dismaying but it’s never boring.

Pope Francis: Target of FBI Terror Farce

Tuesday, September 15th, 2015

ABC News has revealed that Pope Francis was the target of an FBI terror farce.

Melissa Chan reports in FBI arrests teen for plotting ISIS-inspired attack on Pope that:

The FBI has arrested a 15-year-old boy near Philadelphia for allegedly plotting to attack Pope Francis and unleash ISIS-inspired hell during the pontiff’s upcoming U.S. visit, it was revealed Tuesday.

The 15-year-old “obtained explosives instructions and further disseminated these instructions through social media,” according to the bulletin.

He was charged with attempting to provide material support to a terrorist organization and attempting to provide material support to terrorist activity.

His “aspirational” threats were not imminent, sources told ABC.

The drought of terrorists in the United States began on September 12, 2001 and continues to this day. The FBI has been hard pressed to find anything that even looks like potential terrorism. To the point that the FBI gins up terrorism cases by supplying support to Walter Mitty type terrorists.

While details are sketchy, the Pope Francis terror farce appears to be another one of those cases.

For example, obtaining “explosives instructions” is certainly not a crime. You may be curious, you may want to experiment, you may want to know what to look for in terms of someone constructing explosives. All of which are perfectly innocent under the US Constitution, prior to 9/11.

Dissemination of “explosive instructions” over “social media” is also not a crime.

Well, thanks to Sen. Dianne Feinstein, also known as the Wicked Witch of the West in First Amendment circles, we did have 18 U.S. Code § 842 – Unlawful acts, which reads in part:

(p) Distribution of Information Relating to Explosives, Destructive Devices, and Weapons of Mass Destruction.—

(2)Prohibition.—It shall be unlawful for any person—

(A) to teach or demonstrate the making or use of an explosive, a destructive device, or a weapon of mass destruction, or to distribute by any means information pertaining to, in whole or in part, the manufacture or use of an explosive, destructive device, or weapon of mass destruction, with the intent that the teaching, demonstration, or information be used for, or in furtherance of, an activity that constitutes a Federal crime of violence; or

(B) to teach or demonstrate to any person the making or use of an explosive, a destructive device, or a weapon of mass destruction, or to distribute to any person, by any means, information pertaining to, in whole or in part, the manufacture or use of an explosive, destructive device, or weapon of mass destruction, knowing that such person intends to use the teaching, demonstration, or information for, or in furtherance of, an activity that constitutes a Federal crime of violence.

Consider that 18 U.S. Code § 844 – Penalties provides that:

(2) violates subsection (p)(2) of section 842, shall be fined under this title, imprisoned not more than 20 years, or both.

For completeness, 18 U.S. Code § 3571 – Sentence of fine provides the fine in such cases:

(b)Fines for Individuals.—Except as provided in subsection (e) of this section, an individual who has been found guilty of an offense may be fined not more than the greatest of—

(3) for a felony, not more than $250,000;

Even so, distribution of explosives instructions via social media is unlawful only if done:

(A) … with the intent that the teaching, demonstration, or information be used for, or in furtherance of, an activity that constitutes a Federal crime of violence; or

(B) … knowing that such person intends to use the teaching, demonstration, or information for, or in furtherance of, an activity that constitutes a Federal crime of violence.

Of course, if your website regularly features photos of government officials or others in rifle scope cross-hairs, and similar rhetoric, you may have difficulty asserting your First Amendment rights to disseminate such information.

The FBI doesn’t fare much better under the unconstitutionally broad and vague:

…material support to a terrorist organization and attempting to provide material support for terrorist activity.

(a) Prohibited Activities.—

(1)Unlawful conduct.—

Whoever knowingly provides material support or resources to a foreign terrorist organization, or attempts or conspires to do so, shall be fined under this title or imprisoned not more than 20 years, or both, and, if the death of any person results, shall be imprisoned for any term of years or for life. To violate this paragraph, a person must have knowledge that the organization is a designated terrorist organization (as defined in subsection (g)(6)), that the organization has engaged or engages in terrorist activity (as defined in section 212(a)(3)(B) of the Immigration and Nationality Act), or that the organization has engaged or engages in terrorism (as defined in section 140(d)(2) of the Foreign Relations Authorization Act, Fiscal Years 1988 and 1989).

It doesn’t take much to see where this fails.

The 15-year-old would have to:

• provide material support or resources to
• foreign terrorist organization
• knowing
• that the organization is a designated terrorist organization (as defined in subsection (g)(6))
• that the organization has engaged or engages in terrorist activity (as defined in section 212(a)(3)(B) of the Immigration and Nationality Act),
• or that the organization has engaged or engages in terrorism (as defined in section 140(d)(2) of the Foreign Relations Authorization Act, Fiscal Years 1988 and 1989).

The simple defense being, name the organization. Yes? Social media by its very nature is public and open so posting any information is hardly directed at anyone.

The government doesn’t fare much better under 18 U.S. Code § 2339A – Providing material support to terrorists, which reads in part:

(a)Offense.—

Whoever provides material support or resources or conceals or disguises the nature, location, source, or ownership of material support or resources, knowing or intending that they are to be used in preparation for, or in carrying out, a violation of section 32, 37, 81, 175, 229, 351, 831, 842(m) or (n), 844(f) or (i), 930(c), 956, 1091, 1114, 1116, 1203, 1361, 1362, 1363, 1366, 1751, 1992, 2155, 2156, 2280, 2281, 2332, 2332a, 2332b, 2332f, 2340A, or 2442 of this title, section 236 of the Atomic Energy Act of 1954 (42 U.S.C. 2284), section 46502 or 60123(b) of title 49, or any offense listed in section 2332b(g)(5)(B) (except for sections 2339A and 2339B) or in preparation for, or in carrying out, the concealment of an escape from the commission of any such violation, or attempts or conspires to do such an act, shall be fined under this title, imprisoned not more than 15 years, or both, and, if the death of any person results, shall be imprisoned for any term of years or for life. A violation of this section may be prosecuted in any Federal judicial district in which the underlying offense was committed, or in any other Federal judicial district as provided by law. (emphasis added)

You could disseminate bomb making instructions along with:

(hypothetical) meet me at the intersection of Highway 61 and Route 666 with your bomb made according to these instructions for a concerted attack on (target)

but I can’t imagine a 15-year-old, unassisted by the FBI at any rate, being that dumb.

Sen. Dianne Feinstein should be voted out of office by California voters. It is difficult to imagine anyone more disconnected from national priorities than her.

Given the near non-existence of terrorism in the United States, fear of terrorism is an emotional or mental disorder, from which Senator Feinstein suffers greatly.

Fear of terrorism has resulted in a grave distortion of the government from providing services and opportunities to its citizens to cutting those services and opportunities in order to fight a fictional enemy.

If federal budget transparency is ever achieved, you will be able to list who drove and profited from that fear.

Apologies for the length but I do tire of largely fictional terror threats that fuel the fear – spend cycle in government.

I would post on the ease of real terrorist activities but then, as you know, some FBI agent would take offense at proof of the futility of their efforts and the consequences could be severe. That’s called “chilling of free speech” by the way.

Monday, September 14th, 2015

From the post:

A baby bobs up and down in a kitchen, as a Prince song plays in the background. His mother laughs in the background and his older sister zooms in and out of the frame.

This innocuous 29-second home video clip was posted to YouTube in 2007 and sparked a long legal proceeding on copyright and fair use law.

In the case, Lenz v. Universal — which has gained notoriety as the “dancing baby” lawsuit — Universal Music Group sent YouTube a warning to take the video down, claiming copyright infringement under the Digital Millennium Copyright Act. Then, Stephanie Lenz, poster of the video and mother of the baby, represented by Electronic Frontier Foundation, sued Universal for wrongly targeting lawful fair use.

Today, eight years later, a federal appeals court has sided with the dancing baby.

If you need more legal background on the issues, consider the EFF page on Lenz v. Universal (links to original court documents), or the Digital Media Law page, Universal Music v. Lenz.

The DMCA (Digital Millennium Copyright Act) should be amended to presume fair use unless and until the complaining party convinces a court that someone else is profiting from the use of their property. No profit, no foul. No more non-judicial demands for takedowns of any content.

Getting started with open source machine learning

Monday, September 14th, 2015

From the post:

Despite all the flashy headlines from Musk and Hawking on the impending doom to be visited on us mere mortals by killer robots from the skies, machine learning and artificial intelligence are here to stay. More importantly, machine learning (ML) is quickly becoming a critical skill for developers to enhance their applications and their careers, better understand data, and to help users be more effective.

What is machine learning? It is the use of both historical and current data to make predictions, organize content, and learn patterns about data without being explicitly programmed to do so. This is typically done using statistical techniques that look for significant events like co-occurrences and anomalies in the data and then factoring in their likelihood into a model that is queried at a later time to provide a prediction for some new piece of data.

Common machine learning tasks include classification (applying labels to items), clustering (grouping items automatically), and topic detection. It is also commonly used in natural language processing. Machine learning is increasingly being used in a wide variety of use cases, including content recommendation, fraud detection, image analysis and ecommerce. It is useful across many industries and most popular programming languages have at least one open source library implementing common ML techniques.

Reflecting the broader push in software towards open source, there are now many vibrant machine learning projects available to experiment with as well as a plethora of books, articles, tutorials, and videos to get you up to speed. Let’s look at a few projects leading the way in open source machine learning and a few primers on related ML terminology and techniques.

Grant rounds up a starting list of primers and projects if you need an introduction to machine learning.

Enjoy!

Data Science from Scratch

Monday, September 14th, 2015

Data Science from Scratch by Joel Grus.

Joel provides a whirlwind tour of Python that is part of the employee orientation at DataSciencester. Not everything you need to know about Python but a good sketch of why it is important to data scientists.

I first saw this in a tweet by Kirk Borne.

Open Data: Big Benefits, 7 V’s, and Thousands of Repositories [But Who Pays?]

Sunday, September 13th, 2015

From the post:

Open data repositories are fantastic for many reasons, including: (1) they provide a source of insight and transparency into the domains and organizations that are represented by the data sets; (2) they enable value creation across a variety of domains, using the data as the “fuel” for innovation, government transformation, new ideas, and new businesses; (3) they offer a rich variety of data sets for data scientists to sharpen their data mining, knowledge discovery, and machine learning modeling skills; (4) they allow many more eyes to look at the data and thereby to see things that might have been missed by the creators and original users of the data; and (5) they enable numerous “data for social good” activities (hackathons, citizen-focused innovations, public development efforts, and more).

The following seven V’s represent characteristics and challenges of open data:

1. Validity: data quality, proper documentation, and data usefulness are always an imperative, but it is even more critical to pay attention to these data validity concerns when your organization’s data are exposed to scrutiny and inspection by others.
2. Value: new ideas, new businesses, and innovations can arise from the insights and trends that are found in open data, thereby creating new value both internal and external to the organization.
3. Variety: the number of data types, formats, and schema are as varied as the number of organizations who collect data. Exposing this enormous variety to the world is a scary proposition for any data scientist.
5. Vocabulary: the semantics and schema (data models) that describe your data are more critical than ever when you provide the data for others to use. Search, discovery, and proper reuse of data all require good metadata, descriptions, and data modeling.
6. Vulnerability: the frequency of data theft and hacking incidents has increased dramatically in recent years — and this is for data that are well protected. The likelihood that your data will be compromised is even greater when the data are released “into the wild”. Open data are therefore much more vulnerable to misuse, abuse, manipulation, or alteration.
7. proVenance (okay, this is a “V” in the middle, but provenance is absolutely central to data curation and validity, especially for Open Data): maintaining a formal permanent record of the lineage of open data is essential for its proper use and understanding. Provenance includes ownership, origin, chain of custody, transformations that been made to it, processing that has been applied to it (including which versions of processing software were used), the data’s uses and their context, and more.

Open Data has many benefits when the 7 V’s are answered!

Kirk doesn’t address who pays the cost of the 7 V’s being answered.

The most obvious one for topic maps:

#5 Vocabulary: the semantics and schema (data models) that describe your data are more critical than ever when you provide the data for others to use….

Yes, “…when you provide the data for others to use.” If I can use my data without documenting the semantics and schema (data models), who covers the cost of my creating that documentation and schemas?

In any sufficiently large enterprise, when you ask for assistance, the response will ask for the contract number to which the assistance should be billed.

If you know your Heinlein, then you know the acronym TANSTAAFL (“There ain’t no such thing as a free lunch”) and its application here is obvious.

Or should I say its application is obvious from the repeated calls for better documentation and models and the continued absence of the same?

Who do you think should be paying for better documentation and data models?

Posts from 140 #DataScience Blogs

Sunday, September 13th, 2015

Kirk Borne posted a link to:

Recent posts from 150+ #DataScience Blogs worldwide, curated by @dsguidebiz http://dsguide.biz/reader/ #BigData #Analytics

By count of the sources listed on http://dsguide.biz/reader/sources, the number of sources is 140, as of September 13, 2015.

A wealth of posts and videos!

Everyone who takes advantage of this listing, however, will have to go through the same lists of posts by category.

That repetition, even with searching, seems like a giant time sink to me.

You?

Big Data Never Sleeps 3.0

Saturday, September 12th, 2015

Kirk Borne posted this to Twitter:

Now, ask yourself how much of that data is relevant to any query you made yesterday? Or within the last week?

There are some legitimately large data sets: genomic, astronomical, oceanographic, Large Hadron Collider data and so many more.

The analysis of some big data sets requires processing the entire data set, but even with the largest data sets, say astronomical data sets, you may only be interested in a small portion of the data for heavy analysis.

The overall amount of data keeps increasing to be sure, making the skill of selecting the right data for analysis all the more important.

The size of your data set matters far less than the importance of your results.

Let’s see a list in 2016 of the most important results from data analysis, skipping the size of the data sets as a qualifier.

Statistical Analysis Model Catalogs the Universe

Friday, September 11th, 2015

Statistical Analysis Model Catalogs the Universe by Kathy Kincade.

From the post:

The roots of tradition run deep in astronomy. From Galileo and Copernicus to Hubble and Hawking, scientists and philosophers have been pondering the mysteries of the universe for centuries, scanning the sky with methods and models that, for the most part, haven’t changed much until the last two decades.

Now a Berkeley Lab-based research collaboration of astrophysicists, statisticians and computer scientists is looking to shake things up with Celeste, a new statistical analysis model designed to enhance one of modern astronomy’s most time-tested tools: sky surveys.

A central component of an astronomer’s daily activities, surveys are used to map and catalog regions of the sky, fuel statistical studies of large numbers of objects and enable interesting or rare objects to be studied in greater detail. But the ways in which image datasets from these surveys are analyzed today remains stuck in, well, the Dark Ages.

“There are very traditional approaches to doing astronomical surveys that date back to the photographic plate,” said David Schlegel, an astrophysicist at Lawrence Berkeley National Laboratory and principal investigator on the Baryon Oscillation Spectroscopic Survey (BOSS, part of SDSS) and co-PI on the DECam Legacy Survey (DECaLS). “A lot of the terminology dates back to that as well. For example, we still talk about having a plate and comparing plates, when obviously we’ve moved way beyond that.”

Surprisingly, the first electronic survey — the Sloan Digital Sky Survey (SDSS) — only began capturing data in 1998. And while today there are multiple surveys and high-resolution instrumentation operating 24/7 worldwide and collecting hundreds of terabytes of image data annually, the ability of scientists from multiple facilities to easily access and share this data remains elusive. In addition, practices originating a hundred years ago or more continue to proliferate in astronomy — from the habit of approaching each survey image analysis as though it were the first time they’ve looked at the sky to antiquated terminology such as “magnitude system” and “sexagesimal” that can leave potential collaborators outside of astronomy scratching their heads.

It’s conventions like these in a field he loves that frustrate Schlegel.

Does 500 terabytes strike you as “big data?”

The Celeste project, described by Kathy in her post and in greater detail in Celeste: Variational inference for a generative model of astronomical images by Jeff Regier, et al., is an attempt to change how optical telescope image sets are thought about and processed. Its initial project, sky surveys, will involve 500 terabytes of data.

Given the wealth of historical astronomical terminology, such as magnitude, the opportunities for mapping to new techniques and terminologies will abound. (Think topic maps.)

Capturing Quotes from Video

Friday, September 11th, 2015

From the post:

“As I went through these articles and came across a text quote, I kept thinking, ‘Why can’t I just click on it and see the corresponding part of the video and get the full experience of how they said it?’”

Surfacing the latest Donald Trump gem from a long, rambling video to share it in a story can be a chore. A new tool from The Times of London called quickQuote, recently open sourced, allows users to upload a video, search for and select words and sentences from an automatically generated transcription of that video, and then embed the chosen quote with the accompanying video excerpt into any article.

Users can then highlight a quote they want to use, edit the quote on the same page to correct for any errors made by the automated transcription service, and then preview and export an embeddable quote/video clip package.

At Github: http://times.github.io/quickQuote/

This is awesome!

I know what I am going to be doing this weekend!

Friday, September 11th, 2015

From the post:

In July, the Kilton Public Library in Lebanon, New Hampshire, was the first library in the country to become part of the anonymous Web surfing service Tor. The library allowed Tor users around the world to bounce their Internet traffic through the library, thus masking users’ locations.

Soon after state authorities received an email about it from an agent at the Department of Homeland Security.

“The Department of Homeland Security got in touch with our Police Department,” said Sean Fleming, the library director of the Lebanon Public Libraries.

After a meeting at which local police and city officials discussed how Tor could be exploited by criminals, the library pulled the plug on the project.

“Right now we’re on pause,” said Fleming. “We really weren’t anticipating that there would be any controversy at all.”

He said that the library board of trustees will vote on whether to turn the service back on at its meeting on Sept. 15.

See Julia’s post for the details but this was just the first library in what was planned to be a series of public libraries across the United States offering Tor. An article about that plan in ArsTechnica tipped off law enforcement before nationwide Tor services could be established.

The public statements by law enforcement sound reasonable (we need all the issues on the table, etc.), but make no mistake, this is an effort to cripple a project that would make the Tor service far more effective than it is today.

There isn’t any middle ground where citizens can have privacy and yet criminals can be prevented from having privacy. After all, unless and until you are convicted in a court of law, you are a citizen, not a criminal.

There is a certain cost to the presumption of innocence and that cost has been present since the Constitution was adopted. Guilty people may go free or perhaps not even be caught because of your rights under the U.S. Constitution.

If you are in Lebanon, New Hampshire, attend the library board of trustees’ meeting and voice support for Tor!

If you can’t make the meeting, ask your library for Tor. (See the ArsTechnica post for more details on the project.)

Corpus of American Tract Society Publications

Friday, September 11th, 2015

Corpus of American Tract Society Publications by Lincoln Mullen.

From the post:

I’ve created a small to mid-sized corpus of publications by the American Tract Society up to the year 1900 in plain text. This corpus has been gathered from the Internet Archive. It includes 641 documents containing just under sixty million words, along with a CSV file containing metadata for each of the files. I don’t make any claims that this includes all of the ATS publications from that time period, and it is pretty obvious that the metadata from the Internet Archive is not much good. The titles are mostly correct; the dates are pretty far off in cases.

This corpus was created for the purpose of testing document similarity and text reuse algorithms. I need a corpus for testing the textreuse, which is in very early stages of development. From reading many, many of these tracts, I already know the patterns of text reuse. (And of course, the documents are historically interesting in their own right, and might be a good candidate for text mining.) The ATS frequently republished tracts under the same title. Furthermore, they published volumes containing the entire series of tracts that they had published individually. So there are examples of entire documents which are reprinted, but also some documents which are reprinted inside others. Then as a extra wrinkle, the corpus contains the editions of the Bible published by the ATS, plus their edition of Cruden’s concordance and a Bible dictionary. Likely all of the tracts quote the Bible, some at great length, so there are many examples of borrowing there.

Here is the corpus and its repository:

With the described repetition, the corpus must compress well. 😉

Makes me wonder how much near-repetition occurs in CS papers?

Graph papers that repeat graph fundamentals, in nearly the same order, in paper after paper.

At what level would you measure re-use? Sentence? Paragraph? Larger divisions?
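
One way to experiment with that question is sketched below. To be clear, this is not Lincoln Mullen's textreuse package. It is a minimal stand-in that assumes word n-gram "shingles" compared with Jaccard similarity, applied at whatever unit a splitter function yields. The two sample documents and all function names are invented for illustration.

```python
import re


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in a text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def paragraphs(text):
    """Split on blank lines."""
    return [p for p in re.split(r"\n\s*\n", text) if p.strip()]


def reuse_by_unit(doc_a, doc_b, splitter, n=3):
    """For each unit of doc_a, report its best overlap with any unit of doc_b."""
    units_b = [shingles(u, n) for u in splitter(doc_b)]
    scores = []
    for unit in splitter(doc_a):
        s = shingles(unit, n)
        best = max((jaccard(s, other) for other in units_b), default=0.0)
        scores.append((unit.strip(), best))
    return scores


# Hypothetical "papers" that share a boilerplate opening sentence.
doc_a = "A graph is a set of nodes and edges. We propose a new index."
doc_b = "A graph is a set of nodes and edges. Our results are fast."

by_sentence = reuse_by_unit(doc_a, doc_b, sentences)
by_paragraph = reuse_by_unit(doc_a, doc_b, paragraphs)
```

At the sentence level the shared opening scores as a complete match and the second sentence as no match. At the paragraph level the same reuse is diluted into a single partial score. The unit you pick changes what you see, which is why the question matters.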

Clojure Remote – Coming February 2016

Friday, September 11th, 2015

Clojure Remote – Coming February 2016

From the webpage:

Clojure Remote will be Clojure’s first exclusively remote conference. While I firm up the details, sign up to get news as it happens, and take an opportunity to provide any feedback you have on how you’d like to see the conference run.

Sounds interesting!

I’ve signed up for more details as they arrive.

You?

CIA to Release Declassified President’s Daily Brief Articles

Thursday, September 10th, 2015

CIA to Release Declassified President’s Daily Brief Articles

From the post:

Previously classified President’s Daily Brief (PDB) articles from the John F. Kennedy and Lyndon B. Johnson administrations produced by CIA are scheduled to be released on Wednesday, September 16 at the LBJ Library in Austin, Texas, at a public symposium entitled The President’s Daily Brief: Delivering Intelligence to the First Customer. The event will be livestreamed by the LBJ Library via their website http://www.lbjlibrary.org/events/cia-sept16/

CIA Director John O. Brennan will present the event’s keynote speech and Director of National Intelligence James R. Clapper will deliver closing remarks. In addition, the event will feature a panel discussion and remarks by other leaders from the academic, archivist, and intelligence communities, including William H. McRaven, Chancellor of the University of Texas System, former CIA Director Porter Goss, former CIA Deputy Director Bobby Inman, and others.

The President’s Daily Brief (PDB) contains intelligence analysis on key national security issues for the President and other senior policymakers. Only the President, the Vice President, and a select group of officials designated by the President receive the briefing, which represents the Intelligence Community’s best insights on issues the President must confront when dealing with threats as well as opportunities related to our national security.

This public release highlights the role of the PDB in foreign and national security policy making. This collection includes the President’s Intelligence Checklists (PICLs) — which preceded the PDB — published from June 1961 to November 1964, and the PDBs published from December 1964 through the end of President Johnson’s term in January 1969. These documents offer insight on intelligence that informed presidential decisions during critical historical events such as: the Cuban Missile Crisis, the 1967 Six-Day War, the 1968 Soviet invasion of Czechoslovakia, and Vietnam.

The documents will be posted on the CIA website the day of the symposium at http://www.foia.cia.gov. This collection was assembled as part of the CIA’s Historical Review Program, which identifies, reviews, and declassifies documents on historically significant events or topics. Previous releases can be viewed at: http://www.foia.cia.gov/historical-collections.

Only forty-six (46) years too late to hold anyone responsible for decisions made on the basis of “information” withheld from the voting public.

Comparing the “facts as known by the President” and the “facts as reported to the American people” will require a cast of thousands, but will be well worth the effort.

Albeit dated, a comparison of the public and private record should establish that continuing secrecy serves only those who wish to manipulate decisions of a democratic state on the basis of secret information.

Those who insist on the type of secrecy that conceals corruption and incompetency should be given free transport to any place outside the continental United States, have their U.S. passports revoked, and be left there to enjoy life under non-democratic governments.

Government secrecy is the antithesis of democracy and should be seen as a grave and direct threat to even an appearance of democratic processes.

PS: How are you going to line up the “facts” in these daily briefings with “facts” as reported by the White House to the public? Needs to be simple, auditable and fast.
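
As a starting point only, here is a sketch of the simplest alignment I can think of: pairing records by date while keeping a source identifier on every statement so each pairing can be audited. The record layout (date, source id, statement), the identifiers and the placeholder text are all hypothetical. Nothing here reflects the actual format of the released documents.

```python
from collections import defaultdict
from datetime import date


def index_by_date(records):
    """Group (date, source_id, statement) records by date."""
    index = defaultdict(list)
    for day, source_id, statement in records:
        index[day].append((source_id, statement))
    return index


def line_up(briefings, public_statements):
    """Pair private and public records date by date, keeping source ids for audit."""
    private = index_by_date(briefings)
    public = index_by_date(public_statements)
    rows = []
    for day in sorted(set(private) | set(public)):
        rows.append({
            "date": day.isoformat(),
            "private": private.get(day, []),
            "public": public.get(day, []),
        })
    return rows


# Placeholder records; ids and text are invented, not real documents.
briefings = [
    (date(1965, 1, 4), "PDB-0001", "(briefing text)"),
    (date(1965, 1, 5), "PDB-0002", "(briefing text)"),
]
public_statements = [
    (date(1965, 1, 4), "PRESS-0001", "(public statement text)"),
]

rows = line_up(briefings, public_statements)
```

Date is a crude key, since a public statement may lag a briefing by days or weeks, but a date with no public counterpart is at least a place to start looking.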

14 innovative journalism courses…

Thursday, September 10th, 2015

14 innovative journalism courses to follow this Fall by Aleszu Bajak.

From the post:

With classes back in session, we wanted to highlight a few forward-looking courses being taught at journalism schools across the country. But first, to introduce these syllabi, we recommend “Those Who Do, Also Teach: David Carr’s Gift to Journalism Schools,” by Molly Wright Steenson for Storybench. It’s a look at Carr’s inspiring syllabus, Press Play, and why it resonates today more than ever. Below, an excerpt from Carr’s syllabus:

While writing, shooting, and editing are often solitary activities, great work emerges in the spaces between people. We will be working in groups with peer and teacher edits. There will be a number of smaller assignments, but the goal is that you will leave here with a single piece of work that reflects your capabilities as a maker of media. But remember, evaluations will be based not just on your efforts, but on your ability to bring excellence out of the people around you.

So take a look at the following J-school courses. Check out Robert Hernandez’s experiments in VR journalism, Molly Wright Steenson’s exploration of information architecture and the media landscape, Dan Nguyen’s data reporting class, Catherine D’Ignazio’s projects melding civic art and design, and our own Jeff Howe’s media innovation studio at Northeastern University, among many others. You don’t have to be a journalism student to dig deep into the readings, try out some assignments, and learn something new.

If you get a thrill from discovering new information or mastering a new skill, you will have a field day with this course listing.

You want to follow Storybench, which self-describes as:

Storybench is a collaboration between the Media Innovation track at Northeastern University’s School of Journalism and Esquire magazine.

At Storybench, we want to reinvigorate and reimagine what digital journalism can be. This means providing an “under the hood” look at the latest and most inventive examples of digital creativity—from data visualization projects to interactive documentaries—as well as the tools and innovators behind them.

Whether you are a veteran newsroom editor, web designer, budding coder or journalism student, Storybench will help you learn what is being built and how, so you can find your way to what might be built next.

Storybench’s editor is Aleszu Bajak, a science journalist and former Knight Science Journalism Fellow at MIT. He is an alum of Science Friday, the founder of LatinAmericanScience.org and is passionate about breaking down the divide between journalists, developers and designers. He can be reached at aleszubajak [at] gmail or at aleszu.com.

Enjoy!