What Does Probability Mean in Your Profession? [Divergences in Meaning]

September 27th, 2015

What Does Probability Mean in Your Profession? by Ben Orlin.

Impressive drawings that illustrate the divergence in meaning of “probability” for various professions.

I’m not sold on the “actual meaning” drawing because if everyone in a discipline understands “probability” to mean something else, on what basis can you argue for the “actual meaning?”

If I am reading a paper by someone who subscribes to a different meaning than the claimed “actual” one, and I read it assuming the “actual” meaning, then I am going to reach erroneous conclusions about their paper. Yes?

That is, in order to understand a paper, I have to understand the words as they are being used by the author. Yes?

If I understand “democracy and freedom” to mean “serves the interest of U.S.-based multinational corporations,” then calls for “democracy and freedom” in other countries aren’t going to impress me all that much.

Enjoy the drawings!

1150 Free Online Courses from Top Universities (update) [Collating Content]

September 27th, 2015

1150 Free Online Courses from Top Universities (update).

From the webpage:

Get 1150 free online courses from the world’s leading universities — Stanford, Yale, MIT, Harvard, Berkeley, Oxford and more. You can download these audio & video courses (often from iTunes, YouTube, or university web sites) straight to your computer or mp3 player. Over 30,000 hours of free audio & video lectures, await you now.

An ever improving resource!

As of last January (2015), it listed 1100 courses.

Another fifty courses have been added and I discovered a course in Hittite!

The same problem with collating content across resources that I mentioned for data science books obtains here, as you take courses in the same discipline or read the primary and secondary literature.

What if I find references in the image PDFs of the Chicago Assyrian Dictionary that are helpful for the Hittite course? How do I combine them with the information from the Hittite course so that if you take Hittite, you don’t have to duplicate my search?

That’s the ticket, isn’t it? Not having different users performing the same task over and over again? One user finds the answer and for all other users, it is simply “there.”

Quite a different view of the world of information than the repetitive, non-productive, ad-laden and often irrelevant results from the typical search engine.

The World’s First $9 Computer is Shipping Today!

September 27th, 2015

The World’s First $9 Computer is Shipping Today! by Khyati Jain.

From the post:

Remember Project: C.H.I.P. ?

A $9 Linux-based, super-cheap computer that raised some $2 Million beyond a pledge goal of just $50,000 on Kickstarter will be soon in your pockets.

Four months ago, Dave Rauchwerk, CEO of Next Thing Co., utilized the global crowd-funding corporation ‘Kickstarter’ for backing his project C.H.I.P., a fully functioning computer that offers more than what you could expect for just $9.

See Khyati’s post for technical specifications.

Security by secrecy is meaningless when potential hackers (ages 14-64) number 4.8 billion.

With enough hackers, all bugs can be found.

Writing “Python Machine Learning”

September 26th, 2015

Writing “Python Machine Learning” by Sebastian Raschka.

From the post:

It’s been about time. I am happy to announce that “Python Machine Learning” was finally released today! Sure, I could just send an email around to all the people who were interested in this book. On the other hand, I could put down those 140 characters on Twitter (minus what it takes to insert a hyperlink) and be done with it. Even so, writing “Python Machine Learning” really was quite a journey for a few months, and I would like to sit down in my favorite coffeehouse once more to say a few words about this experience.

A delightful tale for those of us who have authored books and an inspiration (with some practical suggestions) for anyone who hopes to write a book.

Sebastian’s productivity hints will ring familiar for those with similar habits and bear study by those who hope to become more productive.

Sebastian never comes out and says it but his writing approach breaks each stage of the book into manageable portions. It is far easier to say (and do) “write an outline” than to “write the complete and fixed outline for an almost 500 page book.”

If the task is too large (a complete and immutable outline, say), you won’t build up enough momentum to make a reasonable start.

After reading Sebastian’s post, what book are you thinking about writing?

Free Data Science Books (Update, + 53 books, 117 total)

September 26th, 2015

Free Data Science Books (Update).

From the post:

Pulled from the web, here is a great collection of eBooks (most of which have a physical version that you can purchase on Amazon) written on the topics of Data Science, Business Analytics, Data Mining, Big Data, Machine Learning, Algorithms, Data Science Tools, and Programming Languages for Data Science.

While every single book in this list is provided for free, if you find any particularly helpful consider purchasing the printed version. The authors spent a great deal of time putting these resources together and I’m sure they would all appreciate the support!

Note: Updated books as of 9/21/15 are post-fixed with an asterisk (*). Scroll to updates

Great news but also more content.

Unlike big data, you have to read this content in detail to obtain any benefit from it.

And books in the same area are going to have overlapping content as well as some unique content.

Imagine how useful it would be to compose a free standing work with the “best” parts from several works.

Copyright law would be a larger barrier, but no more so than if you cut-and-pasted your own version for personal use.

If such an approach could be made easy enough, the resulting value would drown out dissenting voices.

I think PDF is the principal practical barrier.

Do you suspect others?

I first saw this in a tweet by Kirk Borne.

Data Science Glossary

September 26th, 2015

Data Science Glossary by Bob DuCharme.

From the about page:

Terms included in this glossary are the kind that typically come up in data science discussions and job postings. Most are from the worlds of statistics, machine learning, and software development. A Wikipedia entry icon links to the corresponding Wikipedia entry, although these are often quite technical. Email corrections and suggestions to bob at this domain name.

Is your favorite term included?

You can follow Bob on Twitter @bobdc.

Or read his blog at: bobdc.blog.

Thanks Bob!

Attention Law Students: You Can Change the Way People Interact with the Law…

September 25th, 2015

Attention Law Students: You Can Change the Way People Interact with the Law…Even Without a J.D. by Katherine Anton.

From the post:

A lot of people go to law school hoping to change the world and make their mark on the legal field. What if we told you that you could accomplish that, even as a 1L?

Today we’re launching the WeCite contest: an opportunity for law students to become major trailblazers in the legal field. WeCite is a community effort to explain the relationship between judicial cases, and will be a driving force behind making the law free and understandable.

To get involved, all you have to do is go to http://www.casetext.com/wecite and choose the treatment that best describes a newer case’s relationship with an older case. Law student contributors, as well as the top contributing schools, will be recognized and rewarded for their contributions to WeCite.

Read on to learn why WeCite will quickly become your new favorite pastime and how to get started!

Shepard’s Citations began publication in 1873 and, by modern times, had such an insurmountable lead that the cost of creating a competing service was a barrier to anyone else entering the field.

To be useful to lawyers, a citation index can’t index just some of the citations; it must index all of them.

The WeCite project, based on crowd-sourcing, is poised to demonstrate that creating a public law citation index is doable.

While the present project is focused on law students, I am hopeful that the project opens up for contributions from more senior survivors of law school, practicing or not.

Three Reasons You May Not Want to Learn Clojure [One of these reasons applies to XQuery]

September 25th, 2015

Three Reasons You May Not Want to Learn Clojure by Mark Bastian.

From the post:

I’ve been coding in Clojure for over a year now and not everything is unicorns and rainbows. Whatever language you choose will affect how you think and work and Clojure is no different. Read on to see what some of these effects are and why you might want to reconsider learning Clojure.

If you are already coding in Clojure, you will find this amusing.

If you are not already coding in Clojure, you may find this compelling.

I won’t say which one of these reasons applies to XQuery, at least not today. Watch this blog on Monday of next week.

Apache Lucene 5.3.1, Solr 5.3.1 Available

September 24th, 2015

Apache Lucene 5.3.1, Solr 5.3.1 Available

From the post:

The Lucene PMC is pleased to announce the release of Apache Lucene 5.3.1 and Apache Solr 5.3.1

Lucene can be downloaded from http://www.apache.org/dyn/closer.lua/lucene/java/5.3.1
and Solr can be downloaded from http://www.apache.org/dyn/closer.lua/lucene/solr/5.3.1

Highlights of this Lucene release include:

Bug Fixes

  • Remove classloader hack in MorfologikFilter
  • UsageTrackingQueryCachingPolicy no longer caches trivial queries like MatchAllDocsQuery
  • Fixed BoostingQuery to rewrite wrapped queries

Highlights of this Solr release include:

Bug Fixes

  • security.json is not loaded on server start
  • RuleBasedAuthorization plugin does not work for the collection-admin-edit permission
  • VelocityResponseWriter template encoding issue. Templates must be UTF-8 encoded
  • SimplePostTool (also bin/post) -filetypes “*” now works properly in ‘web’ mode
  • example/files update-script.js to be Java 7 and 8 compatible.
  • SolrJ could not make requests to handlers with ‘/admin/’ prefix
  • Use of timeAllowed can cause incomplete filters to be cached and incorrect results to be returned on subsequent requests
  • VelocityResponseWriter’s $resource.get(key,baseName,locale) to use specified locale.
  • Resolve XSS issue in Admin UI stats page

Time to upgrade!


Data Analysis for the Life Sciences… [No Ur-Data Analysis Book?]

September 24th, 2015

Data Analysis for the Life Sciences – a book completely written in R markdown by Rafael Irizarry.

From the post:

Data analysis is now part of practically every research project in the life sciences. In this book we use data and computer code to teach the necessary statistical concepts and programming skills to become a data analyst. Following in the footsteps of Stat Labs, instead of showing theory first and then applying it to toy examples, we start with actual applications and describe the theory as it becomes necessary to solve specific challenges. We use simulations and data analysis examples to teach statistical concepts. The book includes links to computer code that readers can use to program along as they read the book.

It includes the following chapters: Inference, Exploratory Data Analysis, Robust Statistics, Matrix Algebra, Linear Models, Inference for High-Dimensional Data, Statistical Modeling, Distance and Dimension Reduction, Practical Machine Learning, and Batch Effects.

Have you ever wondered about the growing proliferation of data analysis books?

The absence of one Ur-Data Analysis book that everyone could read and use?

I have a longer post coming on this idea, but if each discipline needs its own view of data analysis, is it really surprising that no one system of semantics satisfies all communities?

In other words, is the evidence of heterogeneous semantics so strong that we should abandon attempts at uniform semantics and focus on communicating across systems of semantics?

I’m sure there are other examples of where every niche has its own vocabulary: tables in relational databases or column headers in spreadsheets, for example.

What is your favorite example of heterogeneous semantics?

Assuming heterogeneous semantics are here to stay (they have been around since the start of human to human communication, possibly earlier), what solution do you suggest?

I first saw this in a tweet by Christophe Lalanne.

Guesstimating the Future

September 24th, 2015

I ran across some introductory slides on Neo4j with the line:

Forrester estimates that over 25% of enterprises will be using graph databases by 2017.

Well, Forrester also predicted that tablet sales would overtake laptop sales in 2015: Forrester: Tablet Sales Will Eclipse Laptop Sales by 2015.

You might want to check that prediction against: Laptop sales ‘stronger than ever’ versus tablets – PCR Retail Advisory Board.

The adage “It is difficult to make predictions, especially about the future” remains appropriate.

Neo4j doesn’t need lemming-like behavior among consumers of technology to make a case for itself.

Compare Neo4j and its query language, Cypher, to your use cases and I think you will agree.
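
If you want to try that comparison quickly, here is a minimal sketch (mine, not from the slides) that runs a Cypher query from Python using the official Neo4j driver. The bolt URL, credentials, and the :Person/:FOLLOWS data model are assumptions for illustration only:

```python
from neo4j import GraphDatabase  # official Neo4j Python driver

# Assumed connection details for a local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# A Cypher query over a hypothetical follower graph.
query = """
MATCH (follower:Person)-[:FOLLOWS]->(target:Person {name: $name})
RETURN follower.name AS name
"""

with driver.session() as session:
    for record in session.run(query, name="Alice"):
        print(record["name"])

driver.close()
```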

A review of learning vector quantization classifiers

September 23rd, 2015

A review of learning vector quantization classifiers by David Nova, Pablo A. Estevez.

Abstract:

In this work we present a review of the state of the art of Learning Vector Quantization (LVQ) classifiers. A taxonomy is proposed which integrates the most relevant LVQ approaches to date. The main concepts associated with modern LVQ approaches are defined. A comparison is made among eleven LVQ classifiers using one real-world and two artificial datasets.

From the introduction:

Learning Vector Quantization (LVQ) is a family of algorithms for statistical pattern classification, which aims at learning prototypes (codebook vectors) representing class regions. The class regions are defined by hyperplanes between prototypes, yielding Voronoi partitions. In the late 80’s Teuvo Kohonen introduced the algorithm LVQ1 [36, 38], and over the years produced several variants. Since their inception LVQ algorithms have been researched by a small but active community. A search on the ISI Web of Science in November, 2013, found 665 journal articles with the keywords “Learning Vector Quantization” or “LVQ” in their titles or abstracts. This paper is a review of the progress made in the field during the last 25 years.

Heavy sledding but if you want to review the development of a classification algorithm with a manageable history, this is a likely place to start.
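
To see how simple the core LVQ1 update rule is, here is a minimal NumPy sketch (my own illustration, not code from the paper): the nearest prototype is pulled toward a training sample when the labels agree and pushed away when they do not.

```python
import numpy as np

def train_lvq1(X, y, prototypes, proto_labels, lr=0.05, epochs=20):
    """Minimal LVQ1: adjust the nearest prototype toward/away from each sample."""
    P = prototypes.astype(float).copy()
    for _ in range(epochs):
        for x, label in zip(X, y):
            i = np.argmin(np.linalg.norm(P - x, axis=1))  # nearest prototype
            if proto_labels[i] == label:
                P[i] += lr * (x - P[i])   # same class: pull closer
            else:
                P[i] -= lr * (x - P[i])   # different class: push away
    return P

def predict(P, proto_labels, X):
    # Each sample gets the label of its nearest prototype (Voronoi regions).
    nearest = [np.argmin(np.linalg.norm(P - x, axis=1)) for x in X]
    return np.asarray(proto_labels)[nearest]
```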


Fast k-NN search

September 23rd, 2015

Fast k-NN search by Ville Hyvönen, Teemu Pitkänen, Sotiris Tasoulis, Liang Wang, Teemu Roos, Jukka Corander.

Abstract:

Random projection trees have proven to be effective for approximate nearest neighbor searches in high dimensional spaces where conventional methods are not applicable due to excessive usage of memory and computational time. We show that building multiple trees on the same data can improve the performance even further, without significantly increasing the total computational cost of queries when executed in a modern parallel computing environment. Our experiments identify suitable parameter values to achieve accurate searches with extremely fast query times, while also retaining a feasible complexity for index construction.

Not a quick read but an important one if you want to use multiple dimensions for calculation of similarity or sameness between two or more topics.

The technique requires you to choose a degree of similarity that works for your use case.
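
To make the underlying idea concrete, here is a toy sketch of a single random projection tree; it is illustrative only, not the authors’ implementation, and the data below is random stand-in data:

```python
import numpy as np

def build_rp_tree(points, indices, leaf_size=16, rng=np.random):
    """Recursively split point indices at the median of a random projection."""
    if len(indices) <= leaf_size:
        return {"leaf": indices}
    direction = rng.randn(points.shape[1])
    direction /= np.linalg.norm(direction)
    proj = points[indices] @ direction
    median = np.median(proj)
    left, right = indices[proj <= median], indices[proj > median]
    if len(left) == 0 or len(right) == 0:   # degenerate split: stop here
        return {"leaf": indices}
    return {"dir": direction, "split": median,
            "left": build_rp_tree(points, left, leaf_size, rng),
            "right": build_rp_tree(points, right, leaf_size, rng)}

def query(tree, points, q, k=5):
    """Route the query to one leaf, then do exact k-NN among that leaf's points."""
    node = tree
    while "leaf" not in node:
        node = node["left"] if q @ node["dir"] <= node["split"] else node["right"]
    cand = node["leaf"]
    dists = np.linalg.norm(points[cand] - q, axis=1)
    return cand[np.argsort(dists)[:k]]

points = np.random.randn(10000, 50)
tree = build_rp_tree(points, np.arange(len(points)))
print(query(tree, points, points[0]))   # approximate neighbors of point 0
```

The paper’s contribution, per the abstract, is that building multiple such trees on the same data and combining their candidate sets improves accuracy without blowing up query cost on parallel hardware.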

This paper makes a nice jumping-off point for discussing how much precision a particular topic map application needs. Absolute precision is possible, but only in a limited number of cases and, I suspect, at high cost.

For some use cases, such as searching for possible suspects in crimes, some lack of precision is necessary to build up a large enough pool of suspects to include the guilty parties.

Any examples of precision and topic maps that come to mind?


SymbolHound

September 23rd, 2015


From the about page:

SymbolHound is a search engine that doesn’t ignore special characters. This means you can easily search for symbols like &, %, and π. We hope SymbolHound will help programmers find information about their chosen languages and frameworks more easily.

SymbolHound was started by David Crane and Thomas Feldtmose while students at the University of Connecticut. Another project by them is Toned Ear, a website for ear training.

I first saw SymbolHound mentioned in a discussion of delimiter options for a future XQuery feature.

For syntax drafting you need to have SymbolHound on your toolbar, not just bookmarked.

Government Travel Cards at Casinos or Adult Entertainment Establishments

September 23rd, 2015

Audit of DoD Cardholders Who Used Government Travel Cards at Casinos or Adult Entertainment Establishments by Michael J. Roark, Assistant Inspector General, Contract Management and Payments, Department of Defense.

From the memorandum:

We plan to begin the subject audit in September 2015. The Senate Armed Services Committee requested this audit as a follow-on review of transactions identified in Report No. DODIG-2015-125, “DoD Cardholders Used Their Government Travel Cards for Personal Use at Casinos and Adult Entertainment Establishments,” May 19, 2015. Our objective is to determine whether DoD cardholders who used government travel cards at casinos and adult entertainment establishments for personal use sought or received reimbursement for the charges. In addition, we will determine whether disciplinary actions have been taken in cases of personal use and if the misuse was reported to the appropriate security office. We will consider suggestions from management on additional or revised objectives.

This project is a follow up to: Report No. DODIG-2015-125, “DoD Cardholders Used Their Government Travel Cards for Personal Use at Casinos and Adult Entertainment Establishments” (May 19, 2015), which summarizes its findings as:

We are providing this report for your review and comment. We considered management comments on a draft of this report when preparing the final report. DoD cardholders improperly used their Government Travel Charge Card for personal use at casinos and adult entertainment establishments. From July 1, 2013, through June 30, 2014, DoD cardholders had 4,437 transactions totaling $952,258, where they likely used their travel cards at casinos for personal use and had 900 additional transactions for $96,576 at adult entertainment establishments. We conducted this audit in accordance with generally accepted government auditing standards.

Let me highlight that for you:

July 1, 2013 through June 30, 2014, DoD cardholders:

4,437 transactions at casinos for $952,258

900 transactions at adult entertainment establishments for $96,576

Are lap dances that cheap? ;-)

Almost no one goes to a casino or adult entertainment establishment alone, so topic maps would be a perfect fit for finding “associations” between DoD personnel.

The current project is to track the outcome of the earlier report, that is, what actions, if any, resulted.

What do you think?

Will the DoD personnel claim they were doing off the record surveillance of suspected information leaks? Or just checking their resistance to temptation?

Before I forget, here is the breakdown by service (from the May 19, 2015 report, page 6):

[Breakdown-by-service table omitted; see page 6 of the May 19, 2015 report.]

I don’t know what to make of the distribution of “adult transactions” between the services.


5.6 Million Fingerprints Stolen in OPM Hack [Still No Competence or Transparency]

September 23rd, 2015

5.6 Million Fingerprints Stolen in OPM Hack by Chris Brook.

The management follies continue at the Office of Personnel Management (OPM), which I mentioned the other day had declined to use modern project management practices.

A snippet from Chris’ post, which you should read in its entirety:

OPM said at the beginning of September that it would begin sending letters to victims of the breach “in a few weeks,” yet the agency’s recent statement reiterates that an interagency team is still working in tandem with the Department of Defense to prep the letters.

“An interagency team will continue to analyze and refine the data as it prepares to mail notification letters to impacted individuals,” Schumach wrote.

Did you read between the lines to intuit the cause of the delay in letter preparation?

The next big shoe to drop, either on prodding by Congress or news media:

The Office of Personnel Management doesn’t have current addresses on all 21.5 million government workers.

When a data breach occurs at a major bank, credit card company, etc., sending the breach letter is a matter of composing it and hiring a mail house to do the mailing.

This is going on four months after OPM admitted the hack and still no letters?

I may be overestimating the competency of OPM management when it comes to letter writing, but my bet would be on a lack of current addresses for a large portion of the employees impacted.

FYI, hiring former OPM staff has a name. It’s called assumption of risk.

Sharing Economy – Repeating the Myth of Code Reuse

September 23rd, 2015

Bitcoin and sharing economy pave the way for new ‘digital state’ by Sophie Curtis.

Sophie quotes Cabinet Office minister Matthew Hancock MP as saying:

For example, he said that Cloud Foundry, a Californian company that provides platform-as-a-service technology, could help to create a code library for digital public services, helping governments around the world to share their software.

“Governments across the world need the same sorts of services for their citizens, and if we write them in an open source way, there’s no need to start from scratch each time,” he said.

“So if the Estonian government writes a program for licensing, and we do loads of licensing in the UK, it means we’ll be able to pull that code down and build the technology cheaper. Local governments deliver loads of services too and they can base their services on the same platforms.”

However, he emphasised that this is about sharing programs, code and techniques – not about sharing data. Citizens’ personal data will remain the responsibility of the government in question, and will not be shared across borders, he said.

I’m guessing that “The Rt Hon Matt Hancock MP” hasn’t read:

The code reuse myth: why internal software reuse initiatives tend to fail by Ben Morris

The largest single barrier to effective code reuse is that it is difficult. It doesn’t just happen by accident. Reusable code has to be specifically designed for a generalised purpose and it is unlikely to appear spontaneously as a natural by-product of development projects.

Reusable components are usually designed to serve an abstract purpose and this absence of tangible requirements can make them unusually difficult to design, develop and test. Their development requires specific skills and knowledge of design patterns that is not commonly found in development teams. Developing for reuse is an art in itself and it takes experience to get the level of abstraction right without making components too specific for general use or too generalised to add any real value.

These design challenges can be exasperated by organisational issues in larger and more diffused development environments. If you are going to develop common components then you will need a very deep understanding of a range of constantly evolving requirements. As the number of projects and teams involved in reuse grow it can be increasingly difficult to keep track of these and assert any meaningful consistency.

Successful code reuse needs continuous effort to evolve shared assets in step with the changing business and technical landscape. This demands ownership and governance to ensure that assets don’t fall into disrepair after the initial burst of effort that established them. It also requires a mature development process that provides sufficient time to design, test, maintain and enhance reusable assets. Above all, you need a team of skilled architects and developers who are sufficiently motivated and empowered to take a lead in implementing code reuse.

Reuse Myth – can you afford reusable code? by Allan Kelly

In my Agile Business Conference present (“How much quality can we afford?”) I talked about the Reuse Myth, this is something always touch on when I deliver a training course but I’ve never taken time to write it down. Until now.

Lets take as our starting point Kevlin Henney’s observation that “there is no such thing as reusable code, only code that is reused.” Kevlin (given the opportunity!) goes on to examine what constitutes “reuse” over simple “use.” A good discussion itself but right now I want to suggest that an awful lot of code which is “designed for reuse” is never actually re-used.

In effect that design effort is over engineering, waste in other words. One of the reasons developers want to “design for reuse” is not so much because the code will be reused but rather because they desire a set of properties (modularity, high cohesion, low coupling, etc.) which are desirable engineering properties but sound a bit abstract.

In other words, striving for “re-usability” is a developers way of striving for well engineered code. Unfortunately in striving for re-usability we lose focus which brings us to the second consideration…. cost of re-usability.

In Mythical Man Month (1974) Fred Brooks suggests that re-usable code costs three times as much to develop as single use code. I haven’t seen any better estimates so I tend to go with this one. (If anyone has any better estimates please send them over.)

Think about this. This means that you have to use your “reusable” code three times before you break even. And it means you only see a profit (saving) on the fourth reuse.

How much code which is built for reuse is reused four times?

Those are two “hits” out of 393,000 that I got this afternoon searching on (with the quotes) “code reuse.”

Let’s take The Rt Hon Matt Hancock MP statement and re-write it a bit:

Hypothetical Statement – Not an actual statement by The Rt Hon Matt Hancock MP, he’s not that well informed:

“So if the Estonian government spends three (3) times as much to write a program for licensing, and we do loads of licensing in the UK, it means we’ll be able to pull that code down and build the technology cheaper. Local governments deliver loads of services too and they can base their services on the same platforms.”

Will the Estonian government, which is like other governments, spend three (3) times as much developing software on the off chance that the UK may want to use it?

Would any government undertake software development on that basis?

Do you have an answer other than NO! to either of those questions?

There are lots of competent computer people in the UK but none of them are advising The Rt Hon Matt Hancock MP. Or he isn’t listening. Amounts to the same thing.

Public Terminal on Your Network or Computer?

September 23rd, 2015

Update Flash now! Adobe releases patch, fixing critical security holes by Graham Cluley.

Graham details the latest in a series of patches for critical flaws in Flash and instead of completely removing Flash from your computer recommends:

Instead, I would suggest that Adobe Flash users consider enabling “Click to Play” in their browser.


And how are you going to decide whether Flash content is malicious or not before you “click to play”?

To be honest, I can’t.

Flash on your computer is the equivalent of a public terminal to your network or computer on a street corner.

My recommendation? Remove Flash completely from your computer.

What about Flash content?

If I really want to view something that requires Flash, I write to the source saying I won’t install public access to my computer in order to view their content.

If enough of us do that, perhaps Flash will die the sort of death it deserves.

Coursera Specialization in Machine Learning:…

September 22nd, 2015

Coursera Specialization in Machine Learning: A New Way to Learn Machine Learning by Emily Fox.

From the post:

Machine learning is transforming how we experience the world as intelligent applications have become more pervasive over the past five years. Following this trend, there is an increasing demand for ML experts. To help meet this demand, Carlos and I were excited to team up with our colleagues at the University of Washington and Dato to develop a Coursera specialization in Machine Learning. Our goal is to avoid the standard prerequisite-heavy approach used in other ML courses. Instead, we motivate concepts through intuition and real-world applications, and solidify concepts with a very hands-on approach. The result is a self-paced, online program targeted at a broad audience and offered through Coursera with the first course available today.

Change how people learn about machine learning?

Do they mean to depart from simply replicating static textbook content in another medium?

Horrors! (NOT!)

Education has been evolving since the earliest days online and will continue to do so.

Still, it is encouraging to see people willing to admit to being different.


I first saw this in a tweet by Dato.

King – Man + Woman = Queen:…

September 22nd, 2015

King – Man + Woman = Queen: The Marvelous Mathematics of Computational Linguistics.

From the post:

Computational linguistics has dramatically changed the way researchers study and understand language. The ability to number-crunch huge amounts of words for the first time has led to entirely new ways of thinking about words and their relationship to one another.

This number-crunching shows exactly how often a word appears close to other words, an important factor in how they are used. So the word Olympics might appear close to words like running, jumping, and throwing but less often next to words like electron or stegosaurus. This set of relationships can be thought of as a multidimensional vector that describes how the word Olympics is used within a language, which itself can be thought of as a vector space.

And therein lies this massive change. This new approach allows languages to be treated like vector spaces with precise mathematical properties. Now the study of language is becoming a problem of vector space mathematics.

Today, Timothy Baldwin at the University of Melbourne in Australia and a few pals explore one of the curious mathematical properties of this vector space: that adding and subtracting vectors produces another vector in the same space.

The question they address is this: what do these composite vectors mean? And in exploring this question they find that the difference between vectors is a powerful tool for studying language and the relationship between words.

A great lay introduction to:

Take and Took, Gaggle and Goose, Book and Read: Evaluating the Utility of Vector Differences for Lexical Relation Learning by Ekaterina Vylomova, Laura Rimell, Trevor Cohn, Timothy Baldwin.

Abstract:

Recent work on word embeddings has shown that simple vector subtraction over pre-trained embeddings is surprisingly effective at capturing different lexical relations, despite lacking explicit supervision. Prior work has evaluated this intriguing result using a word analogy prediction formulation and hand-selected relations, but the generality of the finding over a broader range of lexical relation types and different learning settings has not been evaluated. In this paper, we carry out such an evaluation in two learning settings: (1) spectral clustering to induce word relations, and (2) supervised learning to classify vector differences into relation types. We find that word embeddings capture a surprising amount of information, and that, under suitable supervised training, vector subtraction generalises well to a broad range of relations, including over unseen lexical items.

The authors readily admit, much to their credit, that this isn’t a one-size-fits-all solution.

But, a line of research that merits your attention.
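
If you want to play with the arithmetic yourself, here is a small, self-contained sketch of the “king - man + woman” style query over a word-vector dictionary. The tiny 3-dimensional vectors below are made up purely to show the mechanics; in practice you would load pre-trained embeddings such as word2vec or GloVe:

```python
import numpy as np

def analogy(vectors, a, b, c, topn=3):
    """Words closest to vec(b) - vec(a) + vec(c), e.g. king - man + woman."""
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)
    scores = {}
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        scores[word] = float(vec @ target / np.linalg.norm(vec))  # cosine similarity
    return sorted(scores, key=scores.get, reverse=True)[:topn]

# Made-up vectors, purely to show the mechanics of the offset method.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.5, 0.9, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "apple": np.array([0.1, 0.2, 0.2]),
}

print(analogy(vectors, "man", "king", "woman"))  # 'queen' should rank first
```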

Security Alert! Have You Seen This Drive?

September 22nd, 2015


The Ministry of Education, British Columbia, Canada posted MISSING DRIVE CONTENTS:

Despite extensive physical and electronic searches, the Ministry of Education has been unable to locate an unencrypted external hard drive with a variety of reports, databases, and some information detailed below.

The missing external drive is a black Western Digital drive about 7-inches high, 5.5 inches deep, and two inches thick. The disk has 437 GB worth of material made up of 8,766 folders with 138,830 files.

Inside some of the files is information on a total of 3.4 million individuals from between 1986-2009

The red color was in the original.

I’m not sure how listing the contents in detail is going to help find this drive but I do have a local copy should the online version disappear.

If I had to guess, someone converted the drive to home use and formatted it, losing the data of concern unless you want to pay for expensive data recovery efforts.

But, in the event it was stolen and sold along with other equipment, check any second-hand Western Digital drives you have purchased. Could be worth more than you paid for it.

I first saw this in a tweet by Dissent Doe today and I have no date for the actual data loss.

Text Making A Comeback As Interface?

September 22nd, 2015

Who Needs an Interface Anyway? Startups are piggybacking on text messaging to launch services. by Joshua Brustein.

From the post:

In his rush to get his latest startup off the ground, Ethan Bloch didn’t want to waste time designing a smartphone app. He thought people would appreciate the convenience of not having to download an app and then open it every time they wanted to use Digit, a tool that promotes savings. Introduced in February, it relies on text messaging to communicate with users. To sign up for the service, users go to Digit’s website and key in their cell number and checking account number. The software analyzes spending patterns and automatically sets money aside in a savings account. To see how much you’ve socked away, text “tell me my balance.” Key in “save more,” and Digit will do as you command. “A lot of the benefit of Digit takes place in the background. You don’t need to do anything,” says Bloch.

Conventional wisdom holds that intricately designed mobile apps are an essential part of most new consumer technology services. But there are signs people are getting apped out. While the amount of time U.S. smartphone users spend with apps continues to increase, the number of apps the average person uses has stayed pretty much flat for the last two years, according to a report Nielsen published in June. Some 200 apps account for more than 70 percent of total usage.

Golden Krishna, then a designer at Cooper, a San Francisco consulting firm that helps businesses create user experiences, anticipated the onset of app fatigue. In a 2012 blog post, “The best interface is no interface,” he argued that digital technology should strive to be invisible. It sparked a wide-ranging debate, and Krishna has spent the past several years making speeches, promoting a book with the same title as his essay, and doing consulting work for Silicon Valley companies.

Remembering the near ecstasy when visual interfaces replaced green screens, it goes against experience to credit text as the best interface.

However, you should start with Golden Krishna’s essay, “The best interface is no interface,” then move on to his keynote address: “The best interface is no interface” at SXSW 2013 and of course, his website, http://www.nointerface.com/book/, which has many additional resources, including his book by the same name.

It is way cool to turn a blog post into a cottage industry. Not just any blog post, but a very good blog post on a critical issue for every user facing software application.

To further inspire you to consider text as an interface, take special note of the line that reads:

“Some 200 apps account for more than 70 percent of total usage.”

In order to become a top app, you not only have to displace one of the top 200 apps, but your app has to be chosen to replace it. That sounds like an uphill battle.

Not to say that making a text interface is going to be easy, it’s not. You will have to think about the interface more than grabbing some stock widgets in order to build a visual interface.

On the upside, you may avoid the design clunkers that litter Krishna’s presentations and book.

An even better upside, you may avoid authoring one of the design clunkers that litter Krishna’s presentations.

I first saw this in a tweet by Bob DuCharme.

Python for Scientists [Warning – Sporadic Content Ahead]

September 22nd, 2015

Python for Scientists: A Curated Collection of Chapters from the O’Reilly Data and Programming Libraries

From the post:

More and more, scientists are seeing tech seep into their work. From data collection to team management, various tools exist to make your lives easier. But, where to start? Python is growing in popularity in scientific circles, due to its simple syntax and seemingly endless libraries. This free ebook gets you started on the path to a more streamlined process. With a collection of chapters from our top scientific books, you’ll learn about the various options that await you as you strengthen your computational thinking.

This free ebook includes chapters from:

  • Python for Data Analysis
  • Effective Computation in Physics
  • Bioinformatics Data Skills
  • Python Data Science Handbook

Warning: You give your name and email to the O’Reilly marketing machine and get:

Python for Data Analysis

Python Language Essentials Appendix

Effective Computation in Physics

Chapter 1: Introduction to the Command Line
Chapter 7: Analysis and Visualization
Chapter 20: Publication

Bioinformatics Data Skills

Chapter 4: Working with Remote Machines
Chapter 5: Git for Scientists

Python Data Science Handbook

Chapter 3: Introduction to NumPy
Chapter 4: Introduction to Pandas

The content present is very good. The content missing is vast.

Topic Modeling and Twitter

September 22nd, 2015

Alex Perrier has two recent posts of interest to Twitter users and topic modelers:

Topic Modeling of Twitter Followers

In this post, we explore LDA an unsupervised topic modeling method in the context of twitter timelines. Given a twitter account, is it possible to find out what subjects its followers are tweeting about?

Knowing the evolution or the segmentation of an account’s followers can give actionable insights to a marketing department into near real time concerns of existing or potential customers. Carrying topic analysis of followers of politicians can produce a complementary view of opinion polls.

Segmentation of Twitter Timelines via Topic Modeling

Following up on our first post on the subject, Topic Modeling of Twitter Followers, we compare different unsupervised methods to further analyze the timelines of the followers of the @alexip account. We compare the results obtained through Latent Semantic Analysis and Latent Dirichlet Allocation and we segment Twitter timelines based on the inferred topics. We find the optimal number of clusters using silhouette scoring.

Alex has Python code, an interesting topic, great suggestions for additional reading, what is there not to like?
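
If you want to try something similar on your own followers’ timelines, a minimal LDA sketch with gensim looks roughly like this. It is my illustration, not Alex’s code, and the toy timelines below stand in for real, already-tokenized tweets:

```python
from gensim import corpora, models

# Hypothetical stand-in data: one bag of tokens per follower timeline,
# already lower-cased and stripped of URLs, mentions, and stopwords.
timelines = [
    ["python", "machine", "learning", "model", "data"],
    ["privacy", "surveillance", "encryption", "policy"],
    ["python", "data", "pandas", "visualization"],
    ["election", "policy", "privacy", "vote"],
]

dictionary = corpora.Dictionary(timelines)
corpus = [dictionary.doc2bow(tokens) for tokens in timelines]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=20)
for topic in lda.print_topics(num_words=5):
    print(topic)
```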

LDA and machine learning types follow @alexip, but privacy advocates should as well.

Consider this recent tweet by Alex:

In the end the best way to protect your privacy is to behave erratically so that the Machine Learning algo will detect you as an outlier!

Perhaps, perhaps, but I suspect outliers/outsiders are classed as dangerous by several government agencies in the US.

Christmas in October? (Economics of Cybersecurity)

September 22nd, 2015

Tell us how to infect an iPhone remotely, and we’ll give you $1,000,000 USD by Graham Cluley.

From the post:

If there’s something which is in high demand from both the common internet criminals and intelligence agencies around the world, it’s a way of easily infecting the iPhones and iPads of individuals.

The proof that there is high demand for a way to remotely and reliably exploit iOS devices, in order to install malware that can spy upon communications and snoop upon a user’s whereabouts, is proven by a staggering $1 million reward being offered by one firm for exclusive details of such a flaw.

In an announcement on its website, newly-founded vulnerability broker Zerodium, offers the million dollar bounty to “each individual or team who creates and submits an exclusive, browser-based, and untethered jailbreak for the latest Apple iOS 9 operating system and devices.”

There’s no denying – that’s a lot of cash. And Zerodium says it won’t stop there. In fact, it says that it will offer a grand total of $3 million in rewards for iOS 9 exploits and jailbreaks.

Graham says the most likely buyers from Zerodium are governments, which are more likely than Microsoft or Apple to pay large sums.

There’s a reason for that. Microsoft, Apple, Cisco, etc., face no economic down side from zero-day exploits.

Zero-day exploits tarnish reputations or so it is claimed. For most vendors it would be hard to find another black mark in addition to all the existing ones.

If zero-day exploits had an impact on sales, the current vendor landscape would be far different than it is today.

With no economic impact on sales or reputations, it is easy to understand the complacency of vendors in the face of zero-day exploits and contests to create the same.

I keep using the phrase “economic impact on” to distinguish economic consequences from all the hand wringing and tough talk you hear from vendors about cybersecurity. Unless and until something impacts the bottom line on a balance sheet, all the talk is just cant.

If some legislative body, Congress (in the U.S.) comes to mind, were to pass legislation that:

  • Imposes strict liability for all code level vulnerabilities
  • Establishes a minimum level of presumed damages plus court costs and attorneys fees
  • Establishes an expedited process for resolving claims within six months
  • Establishes tax credits for zero-day exploits purchased by vendors

the economics of cybersecurity would change significantly.

Vendors would have economic incentives to both write cleaner code and to purchase zero-day exploits on the open market.

Hackers would have economic incentives to find hacks because there is automatic liability on the part of software vendors for their exploits.

The time has come to end the free ride for software vendors on the issue of liability for software exploits.

The result will be a safer world for everyone.

Python & R codes for Machine Learning

September 21st, 2015

While I am thinking about machine learning, I wanted to mention: Cheatsheet – Python & R codes for common Machine Learning Algorithms by Manish Saraswat.

From the post:

In his famous book – Think and Grow Rich, Napolean Hill narrates story of Darby, who after digging for a gold vein for a few years walks away from it when he was three feet away from it!

Now, I don’t know whether the story is true or false. But, I surely know of a few Data Darby around me. These people understand the purpose of machine learning, its execution and use just a set 2 – 3 algorithms on whatever problem they are working on. They don’t update themselves with better algorithms or techniques, because they are too tough or they are time consuming.

Like Darby, they are surely missing from a lot of action after reaching this close! In the end, they give up on machine learning by saying it is very computation heavy or it is very difficult or I can’t improve my models above a threshold – what’s the point? Have you heard them?

Today’s cheat sheet aims to change a few Data Darby’s to machine learning advocates. Here’s a collection of 10 most commonly used machine learning algorithms with their codes in Python and R. Considering the rising usage of machine learning in building models, this cheat sheet is good to act as a code guide to help you bring these machine learning algorithms to use. Good Luck!

Here’s a very good idea, whether you want to learn these algorithms or a new Emacs mode. ;-)

Sure, you can always look up the answer but that breaks your chain of thought, over and over again.
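
As a taste of the kind of snippet such a cheat sheet collects, here is one of the common algorithms in a few lines of Python with scikit-learn. This is a generic example of mine, not copied from the cheat sheet:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Fit a random forest on the iris data and check held-out accuracy.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```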


Machine-Learning-Cheat-Sheet [Cheating Machine Learning?]

September 21st, 2015

Machine-Learning-Cheat-Sheet by Frank Dai.

From the Preface:

This cheat sheet contains many classical equations and diagrams on machine learning, which will help you quickly recall knowledge and ideas in machine learning.

This cheat sheet has three significant advantages:

1. Strong typed. Compared to programming languages, mathematical formulas are weakly typed. For example, X can be a set, a random variable, or a matrix. This causes difficulty in understanding the meaning of formulas. In this cheat sheet, I try my best to standardize symbols used, see section §.

2. More parentheses. In machine learning, authors are prone to omit parentheses, brackets and braces, this usually causes ambiguity in mathematical formulas. In this cheat sheet, I use parentheses(brackets and braces) at where they are needed, to make formulas easy to understand.

3. Less thinking jumps. In many books, authors are prone to omit some steps that are trivial in his option. But it often makes readers get lost in the middle way of derivation.

Two other advantages of this “cheat-sheet” are that it resides on Github and is written using the Springer LaTeX template.

Neural networks can be easily fooled (see Deep Neural Networks are Easily Fooled:…), so the question becomes: how easy is it to fool the machine learning algorithms summarized by Frank Dai?

Or to put it another way, if I know the machine algorithm most likely to be used, what steps, if any, can I take to shape data to influence the likely outcome?

Excluding outright false data because that would be too easily detected and possibly trip too many alarms.

The more you know about how an algorithm can be cheated, the safer you will be in evaluating the machine learning results of others.
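
As a toy illustration of “shaping data” (mine, not from the cheat sheet): for a linear model such as logistic regression, nudging a point along the learned weight vector is enough to flip its predicted class without introducing obviously false values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two synthetic, well-separated classes.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
clf = LogisticRegression().fit(X, y)

x = np.array([-2.0, -2.0])              # starts clearly in class 0
w = clf.coef_[0]
step = 0.25 * w / np.linalg.norm(w)     # small moves along the weight vector
while clf.predict([x])[0] == 0:
    x = x + step
print("point", x, "is now classified as", clf.predict([x])[0])
```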

I first saw this in a tweet by Kirk Borne.

Are You Deep Mining Shallow Data?

September 21st, 2015

Do you remember this verse of Simple Simon?

Simple Simon went a-fishing,
For to catch a whale;
All the water he had got,
Was in his mother’s pail.


Shallow data?

To illustrate, fill in the following statement:

My mom makes the best _____.

Before completing that statement, you resolved the common noun, “mom,” differently than I did.

The string carries no clue as to the resolution of “mom” by any reader.

The string also gives no clues as to how it would be written in another language.

With a string, all you get is the string, or in other words:

All strings are shallow.

That applies to the strings we use to add depth to strings but we will reach that issue shortly.

One of the few things that RDF got right was:

…RDF puts the information in a formal way that a machine can understand. The purpose of RDF is to provide an encoding and interpretation mechanism so that resources can be described in a way that particular software can understand it; in other words, so that software can access and use information that it otherwise couldn’t use. (quote from Wikipedia on RDF)

In addition to the string, RDF posits an identifier in the form of a URI which you can follow to discover more information about that portion of string.

Unfortunately RDF was burdened by the need for all new identifiers to replace those already in place, an inability to easily distinguish identifier URIs from URIs that lead to subjects of conversation, and encoding requirements that reduced the population of potential RDF authors to a righteous remnant.

Despite its limitations and architectural flaws, RDF is evidence that strings are indeed shallow. Not to mention that if we could give strings depth, their usefulness would be greatly increased.
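
A tiny rdflib sketch makes the point: the string “mom” stays shallow, but hanging it off a URI gives software something it can follow and merge on. The example.org namespace and properties here are invented purely for illustration:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/")   # hypothetical vocabulary for illustration

g = Graph()
mom = EX["my-mom"]                      # an identifier, not just the string "mom"
g.add((mom, RDF.type, EX.Person))
g.add((mom, EX.label, Literal("mom", lang="en")))
g.add((mom, EX.makesBest, Literal("apple pie")))

print(g.serialize(format="turtle"))
```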

One method for imputing more depth to strings is natural language processing (NLP). Modern NLP techniques are based on statistical analysis of large data sets and are the most accurate for very common cases. The statistical nature of NLP makes application of those techniques to very small amounts of text or ones with unusual styles of usage problematic.

The limits of statistical techniques aren’t a criticism of NLP but rather an observation that, depending on the level of accuracy desired and your data, such techniques may or may not be useful.

What is acceptable for imputing depth to strings in movie reviews is unlikely to be thought so when deciphering a manual for disassembling an atomic weapon. The question isn’t whether NLP can impute depth to strings but whether that imputation is sufficiently accurate for your use case.
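
For the NLP route, even something as simple as part-of-speech tagging adds a layer of depth to a bare string. A minimal NLTK sketch, assuming the standard tokenizer and tagger models can be downloaded:

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "My mom makes the best apple pie."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))   # e.g. [('My', 'PRP$'), ('mom', 'NN'), ...]
```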

Of course, RDF and NLP aren’t the only two means for imputing depth to strings.

We will take up another method for giving strings depth tomorrow.

Announcing Spark 1.5

September 20th, 2015

Announcing Spark 1.5 by Reynold Xin and Patrick Wendell.

From the post:

Today we are happy to announce the availability of Apache Spark’s 1.5 release! In this post, we outline the major development themes in Spark 1.5 and some of the new features we are most excited about. In the coming weeks, our blog will feature more detailed posts on specific components of Spark 1.5. For a comprehensive list of features in Spark 1.5, you can also find the detailed Apache release notes below.

Many of the major changes in Spark 1.5 are under-the-hood changes to improve Spark’s performance, usability, and operational stability. Spark 1.5 ships major pieces of Project Tungsten, an initiative focused on increasing Spark’s performance through several low-level architectural optimizations. The release also adds operational features for the streaming component, such as backpressure support. Another major theme of this release is data science: Spark 1.5 ships several new machine learning algorithms and utilities, and extends Spark’s new R API.

One interesting tidbit is that in Spark 1.5, we have crossed the 10,000 mark for JIRA number (i.e. more than 10,000 tickets have been filed to request features or report bugs). Hopefully the added digit won’t slow down our development too much!

It’s time to upgrade your Spark installation again!
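
A quick post-upgrade sanity check from Python (a generic sketch of mine, not tied to any specific 1.5 feature):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "upgrade-check")
print(sc.version)                              # expect 1.5.x after upgrading
print(sc.parallelize(range(1000)).sum())       # trivial job to confirm it runs
sc.stop()
```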


10 Misconceptions about Neural Networks [Update to car numberplate game?]

September 19th, 2015

10 Misconceptions about Neural Networks by Stuart Reid.

From the post:

Neural networks are one of the most popular and powerful classes of machine learning algorithms. In quantitative finance neural networks are often used for time-series forecasting, constructing proprietary indicators, algorithmic trading, securities classification and credit risk modelling. They have also been used to construct stochastic process models and price derivatives. Despite their usefulness neural networks tend to have a bad reputation because their performance is “temperamental”. In my opinion this can be attributed to poor network design owing to misconceptions regarding how neural networks work. This article discusses some of those misconceptions.

The car numberplate game was a game where passengers in a car, usually children, would compete to find license plates from different states (in the US). That was prior to children being entombed in intellectual isolation bubbles with iPads, Gameboys, DVD players and wireless access, while riding.

Hard to believe but some people used to look outside the vehicle in which they were riding. Now of course what little attention they have is captured by cellphones and not other occupants of the same vehicle.

Rather than rail against that trend, may I suggest we update the car numberplate game to “mistakes about neural networks?”

Using Stuart’s post as a baseline, send a text message to each passenger pointing to Stuart’s post and requesting a count of the number of “mistakes about neural networks” they can find in an hour.

Personally I would put popular media off limits for post-high school players to keep the scores under four digits.

When discussing the scores, after sharing browsing histories, each player has to analyze the claimed error and match it to one on Stuart’s list.

I realize that will require full bandwidth communication with others in your physical presence but with practice, that won’t seem so terribly odd.

I first saw this in a tweet by Kirk Borne.