## Archive for March, 2015

### Polyglot Data Management – Big Data Everywhere Recap

Monday, March 23rd, 2015

From the post:

At the Big Data Everywhere conference held in Atlanta, Senior Software Engineer Mike Davis and Senior Solution Architect Matt Anderson from Liaison Technologies gave an in-depth talk titled “Polyglot Data Management,” where they discussed how to build a polyglot data management platform that gives users the flexibility to choose the right tool for the job, instead of being forced into a solution that might not be optimal. They discussed the makeup of an enterprise data management platform and how it can be leveraged to meet a wide variety of business use cases in a scalable, supportable, and configurable way.

Matt began the talk by describing the three components that make up a data management system: structure, governance and performance. “Person data” was presented as a good example when thinking about these different components, as it includes demographic information, sensitive information such as social security numbers and credit card information, as well as public information such as Facebook posts, tweets, and YouTube videos. The data management system components include:

It’s a vendor pitch so read with care but it comes closer than any other pitch I have seen to capturing the dynamic nature of data. Data isn’t the same from every source and you treat it the same at your peril.

If I had to say the pitch has a theme it is to adapt your solutions to your data and goals, not the other way around.

The one place where I may depart from the pitch is on the meaning of “normalization.” True enough, we may want to normalize data a particular way this week or this month, but that should not preclude us from other “normalizations” should our data or requirements change.

The danger I see in “normalization” is that the cost of changing static ontologies, schemas, etc., leads to their continued use long after they have passed their discard dates. If you are as flexible with regard to your information structures as you are your data, then new data or requirements are easier to accommodate.

Or to put it differently, what is the use of being flexible with data if you intend to imprison it in a fixed labyrinth?

### Using scikit-learn Pipelines and FeatureUnions

Monday, March 23rd, 2015

From the post:

Since I posted a postmortem of my entry to Kaggle's See Click Fix competition, I've meant to keep sharing things that I learn as I improve my machine learning skills. One that I've been meaning to share is scikit-learn's pipeline module. The following is a moderately detailed explanation and a few examples of how I use pipelining when I work on competitions.

The pipeline module of scikit-learn allows you to chain transformers and estimators together in such a way that you can use them as a single unit. This comes in very handy when you need to jump through a few hoops of data extraction, transformation, normalization, and finally train your model (or use it to generate predictions).

When I first started participating in Kaggle competitions, I would invariably get started with some code that looked similar to this:

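    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.naive_bayes import MultinomialNB

    # read_file, extract_targets, extract_essays, get_tokens and
    # extract_features are the post author's own helpers (not shown here).
    train = read_file('data/train.tsv')
    train_y = extract_targets(train)
    train_essays = extract_essays(train)
    train_tokens = get_tokens(train_essays)
    train_features = extract_features(train)
    classifier = MultinomialNB()

    scores = []
    for train_idx, cv_idx in KFold().split(train_features):
        classifier.fit(train_features[train_idx], train_y[train_idx])
        scores.append(classifier.score(train_features[cv_idx], train_y[cv_idx]))

    print("Score: {}".format(np.mean(scores)))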


Often, this would yield a pretty decent score for a first submission. To improve my ranking on the leaderboard, I would try extracting some more features from the data. Let's say that instead of text n-gram counts, I wanted tf–idf. In addition, I wanted to include overall essay length. I might as well throw in misspelling counts while I'm at it. Well, I can just tack those into the implementation of extract_features. I'd extract three matrices of features–one for each of those ideas and then concatenate them along axis 1. Easy.
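For a rough idea of where that workflow leads, here is a minimal sketch of the same steps expressed with scikit-learn's `Pipeline` and `FeatureUnion`. It assumes a recent scikit-learn; the toy essays, the labels, and the `EssayLength` transformer are invented for illustration and are not taken from Zac's post.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline


class EssayLength(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: one column holding each essay's length."""

    def fit(self, X, y=None):
        # Nothing to learn; present so the object works inside a Pipeline.
        return self

    def transform(self, X):
        return np.array([[len(text)] for text in X], dtype=float)


# Toy data, made up for this sketch.
essays = [
    "great clear essay with strong argument",
    "good essay clear and well argued",
    "strong clear writing great structure",
    "bad essay weak and unclear",
    "weak argument poor unclear writing",
    "poor structure bad weak essay",
]
labels = [1, 1, 1, 0, 0, 0]

# FeatureUnion concatenates the feature matrices column-wise (axis 1);
# Pipeline then feeds the combined matrix to the classifier.
model = Pipeline([
    ("features", FeatureUnion([
        ("tfidf", TfidfVectorizer()),
        ("length", EssayLength()),
    ])),
    ("classifier", MultinomialNB()),
])

# The whole chain is cross-validated as a single estimator.
scores = cross_val_score(model, essays, labels, cv=3)
print("Score: {}".format(np.mean(scores)))
```

Because the feature extraction lives inside the pipeline, each cross-validation fold refits the tf–idf vocabulary on its own training split, which is harder to get right with hand-rolled loops.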

Zac has quite a bit of practical advice for how to improve your use of scikit-learn. Just what you need to start a week in the Spring!

Enjoy!

I first saw this in a tweet by Vineet Vashishta.

### MapR Sandbox Fastest On-Ramp to Hadoop

Monday, March 23rd, 2015

MapR Sandbox Fastest On-Ramp to Hadoop

From the webpage:

The MapR Sandbox for Hadoop provides tutorials, demo applications, and browser-based user interfaces to let developers and administrators get started quickly with Hadoop. It is a fully functional Hadoop cluster running in a virtual machine. You can try our Sandbox now – it is completely free and available as a VMware or VirtualBox VM.

If you are a business intelligence analyst or a developer interested in self-service data exploration on Hadoop using SQL and BI Tools, the MapR Sandbox including Apache Drill will get you started quickly. You can download the Drill Sandbox here.

You of course know about the Hortonworks and Cloudera (at the very bottom of the page) sandboxes as well.

Don’t expect a detailed comparison of all three because the features and distributions change too quickly for that to be useful. And my interest is more in capturing the style or approach that may make a difference to a starting user.

Enjoy!

I first saw this in a tweet by Kirk Borne.

### Classifying Plankton With Deep Neural Networks

Monday, March 23rd, 2015

Classifying Plankton With Deep Neural Networks by Sander Dieleman.

From the post:

The National Data Science Bowl, a data science competition where the goal was to classify images of plankton, has just ended. I participated with six other members of my research lab, the Reservoir lab of prof. Joni Dambre at Ghent University in Belgium. Our team finished 1st! In this post, we’ll explain our approach.

The ≋ Deep Sea ≋ team consisted of Aäron van den Oord, Ira Korshunova, Jeroen Burms, Jonas Degrave, Lionel Pigou, Pieter Buteneers and myself. We are all master students, PhD students and post-docs at Ghent University. We decided to participate together because we are all very interested in deep learning, and a collaborative effort to solve a practical problem is a great way to learn.

There were seven of us, so over the course of three months, we were able to try a plethora of different things, including a bunch of recently published techniques, and a couple of novelties. This blog post was written jointly by the team and will cover all the different ingredients that went into our solution in some detail.

## Overview

This blog post is going to be pretty long! Here’s an overview of the different sections. If you want to skip ahead, just click the section title to go there.

## Introduction

### The problem

The goal of the competition was to classify grayscale images of plankton into one of 121 classes. They were created using an underwater camera that is towed through an area. The resulting images are then used by scientists to determine which species occur in this area, and how common they are. There are typically a lot of these images, and they need to be annotated before any conclusions can be drawn. Automating this process as much as possible should save a lot of time!

The images obtained using the camera were already processed by a segmentation algorithm to identify and isolate individual organisms, and then cropped accordingly. Interestingly, the size of an organism in the resulting images is proportional to its actual size, and does not depend on the distance to the lens of the camera. This means that size carries useful information for the task of identifying the species. In practice it also means that all the images in the dataset have different sizes.

Participants were expected to build a model that produces a probability distribution across the 121 classes for each image. These predicted distributions were scored using the log loss (which corresponds to the negative log likelihood or equivalently the cross-entropy loss).

This loss function has some interesting properties: for one, it is extremely sensitive to overconfident predictions. If your model predicts a probability of 1 for a certain class, and it happens to be wrong, the loss becomes infinite. It is also differentiable, which means that models trained with gradient-based methods (such as neural networks) can optimize it directly – it is unnecessary to use a surrogate loss function.

Interestingly, optimizing the log loss is not quite the same as optimizing classification accuracy. Although the two are obviously correlated, we paid special attention to this because it was often the case that significant improvements to the log loss would barely affect the classification accuracy of the models.
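To make the difference between log loss and accuracy concrete, here is a small sketch in plain Python. The three-class toy predictions are invented, and the clipping constant `eps` is an assumption of this sketch (a common guard against taking the log of zero), not something stated in the post.

```python
import math


def log_loss(true_labels, predicted, eps=1e-15):
    """Mean negative log probability assigned to the true class.

    predicted[i] is a list of class probabilities for sample i.
    """
    total = 0.0
    for label, probs in zip(true_labels, predicted):
        # Clip so that a probability of exactly 0 gives a huge, finite loss.
        p = min(max(probs[label], eps), 1.0)
        total += -math.log(p)
    return total / len(true_labels)


def accuracy(true_labels, predicted):
    """Fraction of samples whose most probable class is the true class."""
    hits = 0
    for label, probs in zip(true_labels, predicted):
        best = max(range(len(probs)), key=probs.__getitem__)
        hits += int(best == label)
    return hits / len(true_labels)


labels = [0, 1, 2]

# Both models get the third sample wrong, so their accuracy is identical.
cautious = [[0.6, 0.2, 0.2], [0.2, 0.6, 0.2], [0.4, 0.3, 0.3]]
overconfident = [[0.98, 0.01, 0.01], [0.01, 0.98, 0.01], [0.98, 0.01, 0.01]]

for name, preds in [("cautious", cautious), ("overconfident", overconfident)]:
    print(name, accuracy(labels, preds), log_loss(labels, preds))
```

Both toy models classify the same two samples correctly, yet the overconfident one pays a much larger log loss for its single confident mistake.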

This rocks!

Code is coming soon to Github!

Certainly of interest to marine scientists but also to anyone in bio-medical imaging.

The problem of too much data and too few experts is a common one.

What I don’t recall seeing are releases of pre-trained classifiers. Is the art developing too quickly for that to be a viable product? Just curious.

I first saw this in a tweet by Angela Zutavern.

### ICDM ’15: The 15th IEEE International Conference on Data Mining

Monday, March 23rd, 2015

ICDM ’15: The 15th IEEE International Conference on Data Mining November 14-17, 2015, Atlantic City, NJ, USA

Important dates:

All deadlines are at 11:59PM Pacific Daylight Time
* Workshop notification:                             Mar 29, 2015
* ICDM contest proposals:                            Mar 29, 2015
* Full paper submissions:                            Jun 03, 2015
* Demo proposals:                                    Jul 13, 2015
* Workshop paper submissions:                        Jul 20, 2015
* Tutorial proposals:                                Aug 01, 2015
* Conference paper, tutorial, demo notifications:    Aug 25, 2015
* Workshop paper notifications:                      Sep 01, 2015
* Conference dates:                                  Nov 14-17, 2015


From the post:

The IEEE International Conference on Data Mining series (ICDM) has established itself as the world’s premier research conference in data mining. It provides an international forum for presentation of original research results, as well as exchange and dissemination of innovative, practical development experiences. The conference covers all aspects of data mining, including algorithms, software and systems, and applications. ICDM draws researchers and application developers from a wide range of data mining related areas such as statistics, machine learning, pattern recognition, databases and data warehousing, data visualization, knowledge-based systems, and high performance computing. By promoting novel, high quality research findings, and innovative solutions to challenging data mining problems, the conference seeks to continuously advance the state-of-the-art in data mining. Besides the technical program, the conference features workshops, tutorials, panels and, since 2007, the ICDM data mining contest.

Topics of Interest
******************

Topics of interest include, but are not limited to:

* Foundations, algorithms, models, and theory of data mining
* Machine learning and statistical methods for data mining
* Mining text, semi-structured, spatio-temporal, streaming, graph, web, multimedia data
* Data mining systems and platforms, their efficiency, scalability, and privacy
* Data mining in modeling, visualization, personalization, and recommendation
* Applications of data mining in all domains including social, web, bioinformatics, and finance

An excellent conference but unlikely to be as much fun as Balisage. The IEEE conference will be the pocket protector crowd whereas Balisage features a number of wooly-pated truants (think Hobbits), some of whom don’t even wear shoes. Some of them wear hats though. Large colorful hats. Think Mad Hatter and you are close.

If your travel schedule permits do both Balisage and this conference.

Enjoy!

### Unstructured Topic Map-Like Data Powering AI

Monday, March 23rd, 2015

From the post:

Such mining of digitized information has become more effective and powerful as more info is “tagged” and as analytics engines have gotten smarter. As Dario Gil, Director of Symbiotic Cognitive Systems at IBM Research, told me:

“Data is increasingly tagged and categorized on the Web – as people upload and use data they are also contributing to annotation through their comments and digital footprints. This annotated data is greatly facilitating the training of machine learning algorithms without demanding that the machine-learning experts manually catalogue and index the world. Thanks to computers with massive parallelism, we can use the equivalent of crowdsourcing to learn which algorithms create better answers. For example, when IBM’s Watson computer played ‘Jeopardy!,’ the system used hundreds of scoring engines, and all the hypotheses were fed through the different engines and scored in parallel. It then weighted the algorithms that did a better job to provide a final answer with precision and confidence.”

Granted, the tagging and annotation are unstructured, unlike a topic map, but they are also unconstrained by first order logic and other crippling features of RDF and OWL. Out of that mass of annotations, algorithms can construct useful answers.

Imagine what non-experts (Stanford logic refugees need not apply) could author about your domain, to be fed into an AI algorithm. That would take more effort than relying upon users chancing upon subjects of interest but it would also give you greater precision in the results.

Perhaps, just perhaps, one of the errors in the early topic maps days was the insistence on high editorial quality at the outset, as opposed to allowing editorial quality to emerge out of data.

As an editor I’m far more in favor of the former than the latter, but seeing the latter work makes me doubt that stringent editorial control is the only path to an acceptable degree of editorial quality.

What would a rough-cut topic map authoring interface look like?

Suggestions?

### Pwn2Own +1!

Monday, March 23rd, 2015

Paul details the results from Pwn2Own 2015 and gives a great rundown on the background of the contest. A must read if you are interested in cybersecurity competitions. Here the targets were:

• Windows
• Microsoft IE 11
• Mozilla Firefox
• Apple Safari

Bugs were found in all and system access obtained in four cases.

I mention this in part to ask you to participate in Paul’s poll on whether Pwn2Own contests are a good idea.

As you can imagine, I think they rock!

Assuming the winners did devote a substantial amount of time prior to the contest, a $110,000 prize (by one winner) is no small matter.

Paul cites critics as saying:

it makes security molehills into theatrical mountains.

I don’t know who the critics are but system level access sounds like more than a molehill to me.

Critics of Pwn2Own are dour-faced folks who want bugs reported to vendors, with vendors given unlimited time to fix them, whether they acknowledge the report or not. And if they do acknowledge it, you should be satisfied with an “atta boy/girl” and maybe a free year’s subscription to a PC gaming zine.

Let’s see, vendors sell buggy software for a profit, accept no liability for it, abuse/neglect reporters of bugs, and then want reporters of bugs to contribute their work for free. Plus keep your knowledge secret for the “good of the community.”

Do you see a pattern there?

Screw that!

Vote in favor of Pwn2Own and organize similar events!

### From Nand to Tetris / Part I [“Not for everybody.”]

Monday, March 23rd, 2015

From Nand to Tetris / Part I April 11 – June 7 2015

From the webpage:

Build a modern computer system, starting from first principles. The course consists of six weekly hands-on projects that take you from constructing elementary logic gates all the way to building a fully functioning general purpose computer. In the process, you will learn — in the most direct and intimate way — how computers work, and how they are designed.

This course is a fascinating 7-week voyage of discovery in which you will go all the way from Boolean algebra and elementary logic gates to building a central processing unit, a memory system, and a hardware platform, leading up to a general-purpose computer that can run any program that you fancy. In the process of building this computer you will become familiar with many important hardware abstractions, and you will implement them, hands on. But most of all, you will enjoy the tremendous thrill of building a complex and useful system from the ground up.

You will build all the hardware modules on your home computer, using a Hardware Description Language (HDL), learned in the course, and a hardware simulator, supplied by us. A hardware simulator is a software system that enables building and simulating gates and chips before actually committing them to silicon. This is exactly what hardware engineers do in practice: they build and test computers in simulation, using HDL and hardware simulators.
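The course uses its own HDL and hardware simulator, neither of which is shown here. As a rough Python analogy only, the sketch below builds a few elementary gates and a half adder out of nothing but a `nand` function; every name in it is made up for illustration.

```python
def nand(a, b):
    """The only primitive: output is 0 only when both inputs are 1."""
    return 0 if (a and b) else 1


def not_(a):
    # NOT is a NAND with both inputs tied together.
    return nand(a, a)


def and_(a, b):
    # AND is a NAND followed by a NOT.
    return not_(nand(a, b))


def or_(a, b):
    # De Morgan: a OR b == NOT(NOT a AND NOT b).
    return nand(not_(a), not_(b))


def xor(a, b):
    # True when at least one input is 1, but not both.
    return and_(or_(a, b), nand(a, b))


def half_adder(a, b):
    """Add two bits; returns (sum_bit, carry_bit)."""
    return xor(a, b), and_(a, b)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, half_adder(a, b))
```

Chaining adders like this one is the first step from logic gates toward an arithmetic unit, which gives a feel for the layering the course description promises.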

Do you trust locks?

Do you know how locks work?

I don’t and yet I trust locks to work. But then a lock requires physical presence to be opened and locks do have a history of defeating attempts to unlock them without the key. Not always but a high percentage of the time.

Do you trust computers?

Do you know how computers work?

I don’t, not really. Not at the level of silicon.

So why would I trust computers? We know computers are as faithful as a napkin at a party and have no history of being secure, for anyone.

Necessity seems like a weak answer doesn’t it? Trusting computers to be insecure seems like a better answer.

Not that everyone wants or needs to delve into computers at the level of silicon but exposure to the topic doesn’t hurt.

Might even help when you hear of hardware hacks like rowhammer. You don’t really think that is the last of the hardware hacks do you? Seriously?

BTW, I first read about this course in the Clojure Gazette, which is a great read, whether you are a Clojure programmer or not. Take a look and consider subscribing. Another reason to subscribe is that it lists a smail address of New Orleans, Louisiana.

Even the fast food places have good food in New Orleans. The non-fast food has to be experienced. Words are not enough. It would be like trying to describe sex to someone who has only read about it. Just not the same. Every conference should be in New Orleans every two or three years.

After you get through day-dreaming about New Orleans, go ahead and register for From Nand to Tetris / Part I April 11 – June 7 2015

### A Well Regulated Militia

Sunday, March 22nd, 2015

From the post:

The National Security Agency want to be able to hack more people, vacuum up even more of your internet records and have the keys to tech companies’ encryption – and, after 18 months of embarrassing inaction from Congress on surveillance reform, the NSA is now lobbying it for more powers, not less.

NSA director Mike Rogers testified in front of a Senate committee this week, lamenting that the poor ol’ NSA just doesn’t have the “cyber-offensive” capabilities (read: the ability to hack people) it needs to adequately defend the US. How cyber-attacking countries will help cyber-defense is anybody’s guess, but the idea that the NSA is somehow hamstrung is absurd.

Like everyone else I like reading hacking stories, particularly the more colorful ones! But for me, at least until now, hacking has been like debugging core dumps, it’s an interesting technical exercise but not much more than that.

I am incurious about the gossip the NSA is sweeping up for code word access, but I am convinced that we all need a strong arm to defend our digital privacy and the right to tools to protect ourselves.

The dangers to citizens have changed since James Madison wrote in the Bill of Rights:

“A well regulated Militia, being necessary to the security of a free State, the right of the people to keep and bear Arms, shall not be infringed.”

In 1789, oppression and warfare were conducted with muzzle loaders and swords. Guns are still a common means of oppression, but the tools of oppression have grown since 1789. Then there was no mass surveillance of phone traffic, bank accounts, camera feeds, not to mention harvesting of all network traffic. Now all of those things exist.

Our reading of the Second Amendment needs to be updated to include computers, software developed for hacking, training for hackers and research on hacking. Knowing how to break encryption isn’t the same thing as illegally breaking encryption. It is a good way to test whether the promised encryption will exclude prying government eyes.

I’m not interested in feel-good victories that come years after overreaching by the government. It’s time for someone to take up the gage that the NSA has flung down in the street. Someone who traffics in political futures and isn’t afraid to get their hands dirty.

The NRA has been a long term and successful advocate for Second Amendment rights. And they have political connections that would take years to develop. When was the last time you heard of the NRA winning symbolic victories for someone after they had been victimized? Or do you hear of victories by the NRA before their membership is harmed by legislation? Such as anti-hacking legislation.

Since the NRA is an established defender of the Second Amendment, with a lot of political clout, let’s work on expanding the definition of “arms” in the Second Amendment to include computers, knowledge of how to break encryption and security systems, etc.

The first step is to join the NRA (like everybody, they listen to paying members first).

The second step is to educate other NRA members and the public about the dangers posed by unchecked government cyberpower. Current NRA members may die with their guns in hand but government snoops know what weapons they have, ammunition, known associates, and all of that is without gun registration. A machine pistol is a real mismatch against digital government surveillance. As in the losing side.

The third step is to start training yourself as a hacker. Set up a small network at home so you can educate yourself, off of public networks, about the weaknesses of hardware and software. Create or join computer clubs dedicated to learning hacking arts.

BTW, the people urging you to hack Y12 (a nuclear weapons facility), Chase and the White House are all FBI plants. Privately circulate their biometrics to other clubs. Better informants that have been identified than unknowns. Promptly report all illegal suggestions from plants. You will have the security agencies chasing their own tails.

Take this as a warm-up. I need to dust off some of my Second Amendment history. Suggestions and comments are always welcome.

Looking forward to the day when even passive government surveillance sets off alarms all over the net.

### Balisage submissions are due on April 17th

Saturday, March 21st, 2015

Balisage submissions are due on April 17th!

Yeah, that’s what I thought when I saw the email from Tommie Usdin earlier this week!

Tommie writes:

Just a friendly reminder: Balisage submissions are due on April 17th! That’s just under a month.

Do you want to speak at Balisage? Participate in the pre-conference symposium on Cultural Heritage Markup? Then it is time to put some work in on your paper!

See the Call for Participations at:

http://www.balisage.net/Call4Participation.html

http://www.balisage.net/CulturalHeritage/index.html

Instructions for authors: http://www.balisage.net/authorinstructions.html

Do you need help with the mechanics of your Balisage submission? If we can help please send email to info@balisage.net

It can’t be the case that the deep learning, GPU toting AI folks have had all the fun this past year. After all, without data they would not have anything to be sexy about. Or is that with? Never really sure with those folks.

What I am sure about is that the markup folks at Balisage are poised to save Big Data from becoming Big Dark Data without any semantics.

But they can’t do it without your help! Will you stand by and let darkness cover all of Big Data or will you fight to preserve markup and the semantics it carries?

Sharpen your markup! Back to back, our transparency against the legions of darkness.

Well, it may not get that radical because Tommie is such a nice person but she has to sleep sometime. 😉 After she’s asleep, then we rumble.

Be there!

### FaceNet: A Unified Embedding for Face Recognition and Clustering

Saturday, March 21st, 2015

Abstract:

Despite significant recent advances in the field of face recognition, implementing face verification and recognition efficiently at scale presents serious challenges to current approaches. In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.

Our method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches. To train, we use triplets of roughly aligned matching / non-matching face patches generated using a novel online triplet mining method. The benefit of our approach is much greater representational efficiency: we achieve state-of-the-art face recognition performance using only 128-bytes per face.

On the widely used Labeled Faces in the Wild (LFW) dataset, our system achieves a new record accuracy of 99.63%. On YouTube Faces DB it achieves 95.12%. Our system cuts the error rate in comparison to the best published result by 30% on both datasets. (emphasis in the original)
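To illustrate what "distances directly correspond to a measure of face similarity" means in practice, here is a minimal sketch of a standard triplet loss and a distance-threshold check in plain Python. The 3-dimensional toy vectors, the `margin` and the `threshold` values are all assumptions made for illustration; the real system learns its embeddings with a deep network that is not reproduced here.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so distances are comparable."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is closer to the anchor than the negative
    by at least `margin`; positive otherwise."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


def same_person(u, v, threshold=1.0):
    """Verification by thresholding the embedding distance (threshold is arbitrary here)."""
    return squared_distance(u, v) < threshold


# Toy embeddings: two images of one person and one image of someone else.
anchor = l2_normalize([1.0, 0.0, 0.0])
positive = l2_normalize([0.9, 0.1, 0.0])
negative = l2_normalize([0.0, 1.0, 0.0])

print(triplet_loss(anchor, positive, negative))
print(same_person(anchor, positive), same_person(anchor, negative))
```

Once embeddings behave this way, recognition and clustering reduce to nearest-neighbour search and ordinary clustering over the vectors, which is the point the abstract makes.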

With accuracy at 99.63%, the possibilities are nearly endless. 😉

How long will it be before some start-up is buying ATM feeds from banks? Fast and accurate location information would be of interest to process servers, law enforcement, debt collectors, various government agencies, etc.

Looking a bit further ahead, ATM surrogate services will become a feature of better hotels and escort services.

### GCHQ May Be Spying On You!

Saturday, March 21st, 2015

GCHQ, like many similar agencies, has been given carte blanche to snoop around the world.

Dave reports that GCHQ has responded to this disclosure not with denial but protesting that it would never ever snoop without following all of the rules, except for those against snooping of course.

What fails in almost every government scandal isn’t the safeguards against wrongdoing, but rather the safeguards against anyone discovering the wrongdoing. Yes? So it isn’t that the government doesn’t lie, cheat, abuse, etc., but that they are seldom caught. Safeguards against government violating its own restrictions seem particularly weak.

The UK and other governments fail to realize every retreat from the rule of law damages the legitimacy of that government. If they think governing is difficult now, imagine the issues when the average citizen obeys the law only with due regard to the proximity of a police officer. People joke about that now but watch people obey even mindless traffic rules. To say nothing of more serious rules.

The further and further governments retreat into convenience of the moment decision making, the less and less call they will have on the average citizen to “do the right thing.” Why should they? Their leadership has set the example that whether it is lying to get elected (Benjamin Netanyahu) or lying to start a war (George W. Bush) or lying to get funding (Michael Rogers), it’s OK.

Since GCHQ has decided it isn’t subject to the law, would you report a plot against GCHQ or the UK government? (Assume you just overheard it and weren’t involved.)

### Memantic: A Medical Knowledge Discovery Engine

Saturday, March 21st, 2015

Abstract:

We present a system that constructs and maintains an up-to-date co-occurrence network of medical concepts based on continuously mining the latest biomedical literature. Users can explore this network visually via a concise online interface to quickly discover important and novel relationships between medical entities. This enables users to rapidly gain contextual understanding of their medical topics of interest, and we believe this constitutes a significant user experience improvement over contemporary search engines operating in the biomedical literature domain.

Alexei takes advantage of prior work on medical literature to index and display searches of medical literature in an “economical” way that can enable researchers to discover new relationships in the literature without being overwhelmed by bibliographic detail.

You will need to check my summary against the article but here is how I would describe Memantic:

Memantic indexes medical literature and records the co-occurrences of terms in every text. Those terms are mapped into a standard medical ontology (which reduces screen clutter). When a search is performed, the results are displayed as nodes based on the medical ontology and include relationships established by the co-occurrences found during indexing. This enables users to find relationships without the necessity of searching through multiple articles or deduping their search results manually.
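As a rough illustration of the co-occurrence idea only (not Memantic's actual implementation, which I have not seen), here is a small Python sketch. It assumes each document has already been reduced to a list of ontology concepts; the concept lists and function names are invented.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count how many documents mention each unordered pair of concepts."""
    counts = Counter()
    for concepts in documents:
        # Deduplicate within a document and sort so (a, b) == (b, a).
        for pair in combinations(sorted(set(concepts)), 2):
            counts[pair] += 1
    return counts


def neighbours(counts, concept):
    """Concepts linked to `concept`, most frequently co-occurring first."""
    linked = Counter()
    for (a, b), n in counts.items():
        if a == concept:
            linked[b] += n
        elif b == concept:
            linked[a] += n
    return linked.most_common()


# Toy "literature": each inner list stands for the concepts found in one article.
documents = [
    ["aspirin", "headache", "fever"],
    ["aspirin", "headache"],
    ["ibuprofen", "fever"],
]

counts = cooccurrence_counts(documents)
print(neighbours(counts, "aspirin"))
```

A search for one concept then becomes a lookup of its neighbours in the network, which is what spares the user from reading and deduping multiple articles by hand.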

As I understand it, Memantic is as much an effort at efficient visualization as it is an improvement in search technique.

Very much worth a slow read over the weekend!

I first saw this in a tweet by Sami Ghazali.

PS: I tried viewing the videos listed in the paper but wasn’t able to get any sound. Maybe you will have better luck.

### Where’s the big data?

Saturday, March 21st, 2015

Alex Woodie in Can’t Ignore the Big Data Revolution draws our attention to: Big Data Revolution by Rob Thomas and Patrick McSharry.

Not the first nor likely the last book on “big data,” but it did draw these comments from Thomas Hale:

Despite all the figures, though, the revolution is not entirely quantified after all. The material costs to businesses implied by installing data infrastructure, outsourcing data management to other companies, or storing data, are rarely enumerated. Given the variety of industries the authors tackle, this is understandable. But it seems the cost of the revolution (something big data itself might be inclined to predict) remains unknown.

The book is perhaps most interesting as a case study of the philosophical assumptions that underpin the growing obsession with data. Leaders of the revolution will have “the ability to suspend disbelief of what is possible, and to create their own definition of possible,” the authors write.

Their prose draws heavily on similar invocations of technological idealism, with the use of words such as “enlightenment”, “democratise”, “knowledge-based society” and “inspire”.

Part of their idea of progress implies a need to shift from opinion to fact. “Modern medicine is being governed by human judgment (opinion and bias), instead of data-based science,” state the authors.

Hale comes close but strikes short of the mark when he excuses the lack of data to justify the revolution.

The principal irony of this book and others in the big data orthodoxy is the lack of big data to justify the claims made on behalf of big data. If the evidence is lacking because big data isn’t in wide use, then the claims for big data are not “data-based” are they?

The claims for big data take on a more religious tinge, particularly when readers are urged to “suspend disbelief,” create new definitions of possible, to seek “enlightenment,” etc.

You may remember the near religious hysteria around intelligent agents and the Semantic Web, the remnants of which are still entangling libraries and government projects who haven’t gotten the word that it failed. In part because information issues are indifferent to the religious beliefs of humans.

The same is the case with both the problems and benefits of big data. Whatever you believe them to be, those problems and benefits are deeply indifferent to your beliefs. What is more, your beliefs can’t change the nature of those problems and benefits.

Shouldn’t a “big data” book be data-driven and not the product of “human judgment (opinion and bias)”?

The question careful readers will ask, hopefully before purchasing a copy of Big Data Revolution and thereby encouraging more publications on “big data,” is:

Where’s the big data?

You can judge whether to purchase the volume on the basis of the answer to that question.

PS: Make no mistake, data can have value. But, spontaneous generation of value by piling data into ever increasing piles is just as bogus as spontaneous generation of life.

PPS: Your first tip off that there is no “big data” is the appearance of the study in book form. If there were “big data” to support their conclusions, you would need cloud storage to host it and tools to manipulate it. In that case, why do you need the print book?

### Turning the MS Battleship

Saturday, March 21st, 2015

Improving interoperability with DOM L3 XPath by Thomas Moore.

From the post:

As part of our ongoing focus on interoperability with the modern Web, we’ve been working on addressing an interoperability gap by writing an implementation of DOM L3 XPath in the Windows 10 Web platform. Today we’d like to share how we are closing this gap in Project Spartan’s new rendering engine with data from the modern Web.

Some History

Prior to IE’s support for DOM L3 Core and native XML documents in IE9, MSXML provided any XML handling and functionality to the Web as an ActiveX object. In addition to XMLHttpRequest, MSXML supported the XPath language through its own APIs, selectSingleNode and selectNodes. For applications based on and XML documents originating from MSXML, this works just fine. However, this doesn’t follow the W3C standards for interacting with XML documents or exposing XPath.

To accommodate a diversity of browsers, sites and libraries wrap XPath calls to switch to the right implementation. If you search for XPath examples or tutorials, you’ll immediately find results that check for IE-specific code to use MSXML for evaluating the query in a non-interoperable way:

It seems like a long time ago that a relatively senior Microsoft staffer told me that turning a battleship like MS takes time. No change, however important, is going to happen quickly. Just the way things are in a large organization.

The important thing to remember is that once change starts, that too takes on a certain momentum and so is more likely to continue, even though it was hard to get started.

Yes, I am sure the present steps towards greater interoperability could have gone further, in another direction, etc. but they didn’t. Rather than complain about the present change for the better, why not use that as a wedge to push for greater support for more recent XML standards?

For my part, I guess I need to get a copy of Windows 10 on a VM so I can volunteer as a beta tester for full XPath (XQuery?/XSLT?) support in a future web browser. MS as a full XML competitor and possible source of open source software would generate some excitement in the XML community!

### NSA Chief Cries Wolf! (Again)

Saturday, March 21st, 2015

Cyber Attackers Leaving Warning ‘Messages’: NSA Chief

From the post:

Admiral Michael Rogers, director of the National Security Agency and head of the Pentagon’s US Cyber Command, made the comments to a US Senate panel as he warned about the growing sophistication of cyber threats.

“Private security researchers over the last year have reported on numerous malware finds in the industrial control systems of energy sector organizations,” Rogers said in written testimony.

Of particular risk is so-called critical infrastructure networks — power grids, transportation, water and air traffic control, for example — where a computer outage could be devastating.

Rogers added that the military is about halfway toward building its new cyber defense corps of 6,200 which could help in defending the nation against cyber attacks.

Wait for it…

But he told the lawmakers on the Armed Services Committee that any budget cuts or delays in authorizing funds “will slow the build of our cyber teams” and hurt US defense efforts in cyberspace. (emphasis added)

So, the real issue is that Admiral Rogers doesn’t want to lose funding. Why didn’t he just say that and skip lying about the threat to infrastructure?

The Naval Academy Honor Concept doesn’t back Rogers on this point:

They tell the truth and ensure that the full truth is known. They do not lie.

Ted G. Lewis in Critical Infrastructure Protection in Homeland Security notes:

Digital Pearl Harbors are unlikely. Infrastructure systems, because they have to deal with failure on a routine basis, are also more flexible and responsive in restoring service than early analysts realized. Cyber attacks, unless accompanied by a simultaneous physical attack that achieves physical damage, are short-lived and ineffective.

Everyone in the United States has experienced loss of electrical power or telephone communications due to bad weather. Moreover, industrial control systems aren’t part of the Internet.

Rogers is training “cyber-warriors” for the wrong battlefield. Rogers can’t get access to the private networks where Stuxnet, etc., might be a problem so he is training “cyber-warriors” to fight where they can get access.

Huh? Isn’t that rather dumb? Training to fight on the Internet when the attack will come by invasion of private networks? That doesn’t sound like a winning strategy to me. Maybe Rogers doesn’t know the difference between the Internet and private networks. They do both use network cabling.

It’s not just me that disagrees with Admiral Rogers’ long face about critical infrastructure. James Clapper (you remember, the habitual liar to Congress, and also Director of National Intelligence) disagrees with Rogers:

If there is good news, he said, it is that a catastrophic destruction of infrastructure appears unlikely.

“Cyber threats to U.S. national and economic security are increasing in frequency, scale, sophistication, and severity of impact,” the written assessment says. “Rather than a ‘Cyber Armageddon’ scenario that debilitates the entire US infrastructure, we envision something different. We foresee an ongoing series of low-to-moderate level cyber attacks from a variety of sources over time, which will impose cumulative costs on U.S. economic competitiveness and national security.”

Of course, Clapper may be lying again. But he could be accidentally telling the truth. Picked up the wrong briefing paper on his way out of the office. Mistakes do happen.

Unless and until Admiral Rogers specifies the “…numerous malware finds in the industrial control systems….” and specifies how his “cyber-warriors” have the ability to stop such malware attacks, all funding for the program should cease.

Connecting the dots in procurement of cybersecurity services could provide more protection to United States infrastructure than stopping every cyber attack over the next several years.

Friday, March 20th, 2015

Hacking Your Neighbor’s Wi-Fi: Practical Attacks Against Wi-Fi Security

From the post:

While the access points in organizations are usually under the protection of organization-wide security policies, home routers are less likely to be appropriately configured by their owners in absence of such central control. This provides a window of opportunity to neighboring Wi-Fi hackers. We talk about hacking a neighbor’s Wi-Fi since proximity to the access point is a must for wireless hacking—which is not an issue for a neighbor with an external antenna. With abundance of automated Wi-Fi hacking tools such as ‘Wifite’, it no longer takes a skilled attacker to breach Wi-Fi security. Chances are high that one of your tech-savvy neighbors would eventually exploit a poorly configured access point. The purpose may or may not be malicious; sometimes it may simply be out of curiosity. However, it is best to be aware of and secure your Wi-Fi against attacks from such parties.

For all the attention that bank and insurance company hacks get, having your own Wi-Fi hacked would be personally annoying.

Take the opportunity to check and correct any Wi-Fi security issues with your network. If you aren’t easy, it may encourage script kiddies to go elsewhere. And could make life more difficult for the alphabet agencies, which is always an added plus.

I first saw this in a tweet by NuHarbor Security.

### Convolutional Neural Networks for Visual Recognition

Friday, March 20th, 2015

From the description:

Computer Vision has become ubiquitous in our society, with applications in search, image understanding, apps, mapping, medicine, drones, and self-driving cars. Core to many of these applications are visual recognition tasks such as image classification, localization and detection. Recent developments in neural network (aka “deep learning”) approaches have greatly advanced the performance of these state-of-the-art visual recognition systems. This course is a deep dive into details of the deep learning architectures with a focus on learning end-to-end models for these tasks, particularly image classification. During the 10-week course, students will learn to implement, train and debug their own neural networks and gain a detailed understanding of cutting-edge research in computer vision. The final assignment will involve training a multi-million parameter convolutional neural network and applying it on the largest image classification dataset (ImageNet). We will focus on teaching how to set up the problem of image recognition, the learning algorithms (e.g. backpropagation), practical engineering tricks for training and fine-tuning the networks and guide the students through hands-on assignments and a final course project. Much of the background and materials of this course will be drawn from the ImageNet Challenge.

Be sure to check out the course notes!

A very nice companion for your DIGITS experiments over the weekend.

I first saw this in a tweet by Lasse.

### Landsat-live goes live

Friday, March 20th, 2015

From the post:

Today we’re releasing the first edition of Landsat-live, a map that is constantly refreshed with the latest satellite imagery from NASA’s Landsat 8 satellite. Landsat 8 data is now publicly available on Amazon S3 via the new Landsat on AWS Public Data Set, making our live pipeline possible. We’re ingesting the data directly from Amazon S3, which is how we’re able to go from satellite to Mapbox map faster than ever. With every pixel captured within the past 32 days, Landsat-live features the freshest imagery possible around the entire planet.

With a 30 meter resolution, a 16 day revisit rate, and 10 multispectral bands, this imagery can be used to check the health of agricultural fields, the latest update on a natural disaster, or the progression of deforestation. Interact with the map above to see the freshest imagery anywhere in the world. Be sure to check back often and observe the constantly changing nature of our planet as same day imagery hits this constantly updating map. Scroll down the page to see some of our favorite stills of the earth from Landsat’s latest collection.

See Camilla’s post, you will really like the images.

Even with 30 meter resolution you will be able to document the impact of mapping projects that are making remote areas more accessible to exploitation.

### DIGITS: Deep Learning GPU Training System

Friday, March 20th, 2015

From the post:

The hottest area in machine learning today is Deep Learning, which uses Deep Neural Networks (DNNs) to teach computers to detect recognizable concepts in data. Researchers and industry practitioners are using DNNs in image and video classification, computer vision, speech recognition, natural language processing, and audio recognition, among other applications.

The success of DNNs has been greatly accelerated by using GPUs, which have become the platform of choice for training these large, complex DNNs, reducing training time from months to only a few days. The major deep learning software frameworks have incorporated GPU acceleration, including Caffe, Torch7, Theano, and CUDA-Convnet2. Because of the increasing importance of DNNs in both industry and academia and the key role of GPUs, last year NVIDIA introduced cuDNN, a library of primitives for deep neural networks.

Today at the GPU Technology Conference, NVIDIA CEO and co-founder Jen-Hsun Huang introduced DIGITS, the first interactive Deep Learning GPU Training System. DIGITS is a new system for developing, training and visualizing deep neural networks. It puts the power of deep learning into an intuitive browser-based interface, so that data scientists and researchers can quickly design the best DNN for their data using real-time network behavior visualization. DIGITS is open-source software, available on GitHub, so developers can extend or customize it or contribute to the project.

Apologies for the delay in seeing Allison’s post but at least I saw it before the weekend!

In addition to a great write-up, Allison walks through how she has used DIGITS. In terms of “onboarding” to software, it doesn’t get any better than this.

What are you going to apply DIGITS to?

I first saw this in a tweet by Christian Rosnes.

### Split opens up on Capitol Hill over science funding

Friday, March 20th, 2015

From the post:

A conflict several years in the making between Republican leaders in Congress and US science agencies has reached boiling point. Science advocates and researchers that depend on government grants are particularly worried now that Republicans control both chambers of Congress. They fear that science budgets will be cut and the independence of research agencies curtailed.

Their concerns have been sparked by two simultaneous developments: increasing public criticism by key Republicans of research funded by agencies like the National Science Foundation (NSF) and a congressional power shift that has placed many vocal so-called climate change sceptics and opponents of environmental regulations in positions of power. This shift has been marked by a number of top Republicans publicly questioning the value of some research based on a reading of the grant’s title and abstract.

But the problem appears to go beyond mocking apparently silly-sounding research. ‘It is not only that politicians are making fun of scientific projects that sound outlandish or impractical, they are literally rejecting science in order to gain political advantage,’ says Sean Carroll, a theoretical physicist at the California Institute of Technology, US. This could have to do with pleasing campaign contributors or constituencies, he suggests.

‘There is an attack on the actual substance of the science being done in an attempt to limit the type of science that federal agencies can do because the results of that investigation would be politically inconvenient,’ says Will White, a marine biology professor at the University of North Carolina-Wilmington, US, who has received science agency grants in the past.

An important story and one where you have a direct interest in coming to the aid of science. Science has been generating important and big data for decades. Now it is about to step up its game and the political elites want to put their oar in.

Not that science was ever the genteel and clean pursuit of truth myth that we were taught in elementary school, on that see: The raw politics of science by Judith Curry. Judith isn’t talking about politics as with the government but inside science itself.

That’s important to remember because despite their experience with internal politics of science, scientists as a class don’t seem to get that pointing out politicians could be replaced by sump pumps isn’t a winning strategy.

Not that I disagree, in fact I had to invent the comparison to a “sump pump” to have something I was comfortable publishing on this blog. My actual opinion is a good deal more colorful and quite a bit less generous.

The other reason why scientists are at a disadvantage is that as a rule, politicians may have attended Ivy League colleges but they have a bar maid’s understanding of the world. The sole and only question of interest to a politician is what are you doing right now to further their interests. Well, or what it would take for you to further their interests.

Scientists may discover things unknown to their constituents, things that may save the lives of their constituents or help reshape the future, but none of those are values in a politician’s universe. What matters are the uninformed opinions of people benighted enough to elect them to public office. So long as those opinions total up to be more than 50% of the voters in their district, what else is there to value?

None of what I have just said is new, original or surprising to anyone. You will hear politer and coarser versions of it as debates over the funding of science heat up.

NIH only so long as he has people to donate to him, volunteer for him, vote for him, do deals with him, etc. What if by using big data supporters of science could reach out to every cancer survivor who survived because of the NIH? Or reached out to the survivors who lost a loved one because NIH funded research found a cure too late due to budget cuts?

Do you think they would be as sympathetic to Rand Paul as before? When the blood on the knife in the back of the NIH is that of a family member? Somehow I doubt they would keep donating to Sen. Paul.

Won’t it be ironic if big data makes big government personal for everyone? Not just the top 1%.

Let’s use big data to make the 2016 election personal, very personal for everyone.

PS: I’m thinking about data sets that could be used together to create a “personal” interest in the 2016 elections. Suggestions more than welcome!

I first saw this in a tweet by Chemistry World.

### How to install Spark 1.2 on Azure HDInsight clusters

Friday, March 20th, 2015

How to install Spark 1.2 on Azure HDInsight clusters by Maxim Lukiyanov.

From the post:

Today we are pleased to announce the refresh of the Apache Spark support on Azure HDInsight clusters. Spark is available on HDInsight through custom script action and today we are updating it to support the latest version of Spark 1.2. The previous version supported version 1.0. This update also adds Spark SQL support to the package.

Spark 1.2 script action requires latest version of HDInsight clusters 3.2. Older HDInsight clusters will get previous version of Spark 1.0 when customized with Spark script action.

Follow the below steps to create Spark cluster using Azure Portal:

The only remaining questions are: How good are you with Spark? and How big of a Spark cluster do you need (or can you afford)?

Enjoy!

### Tamr Catalog Tool (And Email Harvester)

Friday, March 20th, 2015

Tamr to Provide Free, Standalone Version of Tamr Catalog Tool

From the webpage:

Tamr Catalog was announced in February as part of the Tamr Platform for enterprise data unification. Using Tamr Catalog, enterprises can quickly inventory all the data that exists in the enterprise, regardless of type, platform or source. With today’s announcement of a free, standalone version of Tamr Catalog, enterprises can now speed and standardize data inventorying, making more data visible and readily usable for analytics.

Tamr Catalog is a free, standalone tool that allows businesses to logically map the attributes and records of a given data source with the entity it actually represents. This speeds time-to- analytics by reducing the amount of time spent searching for data.

That all sounds interesting but rather short on how the Tamr Catalog Tool will make that happen.

Download the whitepaper? It’s all of two (2) pages long. Genuflects to 90% of data being dark, etc. but not a whisper on how the Tamr Catalog Tool will cure that darkness.

Let’s all hope they discover how to make the Tamr Catalog Tool perform these miracles before it is released this coming summer.

I do think the increasing interest in “dark data” bodes well for topic maps.

### New Bios Implant, …

Friday, March 20th, 2015

From the post:

When the National Security Agency’s ANT division catalog of surveillance tools was disclosed among the myriad of Snowden revelations, its desire to implant malware into the BIOS of targeted machines was unquestionable.

While there’s little evidence of BIOS bootkits in the wild, the ANT catalog and the recent disclosure of the Equation Group’s cyberespionage platform, in particular module NLS_933W.DLL that reprograms firmware from most major vendors, leave little doubt that attacks against hardware are leaving the realm of academics and white hats.

Tomorrow at the CanSecWest conference in Vancouver, researchers Corey Kallenberg and Xeno Kovah, formerly of MITRE and founders of startup LegbaCore, will deliver research on new BIOS vulnerabilities and present a working rootkit implant into BIOS.

“Most BIOS have protections from modifications,” Kallenberg told Threatpost. “We found a way to automate the discovery of vulnerabilities in this space and break past those protections.”

Take good notes and blog extensively if you are at the conference. Please!

Certainly good news on the BIOS front. At least in the sense that the more insecure government computers are, the safer the rest of us are from government overreaching. Think of it as a parity of discovery/disclosure. When J Edgar wasn’t in drag, he had scores of agents to tap your phone.

Now, thanks to the Internet and gullible government employees, the question isn’t when will information leak but from whom and about what? Still need more leaks but that’s a topic for a separate post.

The real danger lies in government becoming disproportionately secure vis-a-vis its citizens.

Details on CanSecWest.

### Detecting potential typos using EXPLAIN

Thursday, March 19th, 2015

Detecting potential typos using EXPLAIN by Mark Needham.

Mark illustrates use of EXPLAIN (in Neo4j 2.2.0 RC1) to detect typos (not potential, actual typos) to debug a query.

Now if I could just find a way to incorporate EXPLAIN into documentation prose.

PS: I say that in jest but using a graph model, it should be possible to create a path through documentation that highlights the context of a particular point in the documentation. Trivial example: I find “setting margin size” but don’t know how that relates to menus in an application. “Explain” in that context displays a graph with the nodes necessary to guide me to other parts of the documentation. Each of those nodes might have additional information at each of their “contexts.”
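To make the “Explain” idea a little more concrete, here is a minimal sketch, assuming the documentation is already modeled as a graph of topics with “relates to” links. The topic names, the `DOC_GRAPH` table and the `explain` function are all invented for illustration; a breadth-first search simply returns the shortest chain of topics connecting where I am to where I want to be.

```python
from collections import deque

# Hypothetical documentation graph: each node is a topic, each edge a
# "relates to" link. All topic names are invented for illustration.
DOC_GRAPH = {
    "setting margin size": ["page setup dialog"],
    "page setup dialog": ["setting margin size", "file menu"],
    "file menu": ["page setup dialog", "application menus"],
    "application menus": ["file menu", "edit menu"],
    "edit menu": ["application menus"],
}


def explain(graph, start, goal):
    """Return the shortest chain of topics linking start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


path = explain(DOC_GRAPH, "setting margin size", "application menus")
print(" -> ".join(path))
# setting margin size -> page setup dialog -> file menu -> application menus
```

Each node on the returned path is a place to hang the additional “context” information mentioned above.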

### Jump-Start Big Data with Hortonworks Sandbox on Azure

Thursday, March 19th, 2015

Jump-Start Big Data with Hortonworks Sandbox on Azure by Saptak Sen.

From the post:

We’re excited to announce the general availability of Hortonworks Sandbox for Hortonworks Data Platform 2.2 on Azure.

Hortonworks Sandbox is already a very popular environment in which developers, data scientists, and administrators can learn and experiment with the latest innovations in the Hortonworks Data Platform.

The hundreds of innovations span Hadoop, Kafka, Storm, Hive, Pig, YARN, Ambari, Falcon, Ranger, and other components of which HDP is composed. Now you can deploy this environment for your learning and experimentation in a few clicks on Microsoft Azure.

Follow the guide to Getting Started with Hortonworks Sandbox with HDP 2.2 on Azure to set up your own dev-ops environment on the cloud in a few clicks.

We also provide step by step tutorials to help you get a jump-start on how to use HDP to implement a Modern Data Architecture at your organization.

The Hadoop Sandbox is an excellent way to explore the Hadoop ecosystem. If you trash the setup, just open another sandbox.

Add Hortonworks tutorials to the sandbox and you are less likely to do something really dumb. Or at least you will understand what happened and how to avoid it before you go into production. Always nice to keep the dumb mistakes on your desktop.

Now the Hortonworks Sandbox is on Azure. Same safe learning environment but the power to scale when you are ready to go live!

### UI Events (Formerly DOM Level 3 Events) Draft Published

Thursday, March 19th, 2015

UI Events (Formerly DOM Level 3 Events) Draft Published

From the post:

The Web Applications Working Group has published a Working Draft of UI Events (formerly DOM Level 3 Events). This specification defines UI Events which extend the DOM Event objects defined in DOM4. UI Events are those typically implemented by visual user agents for handling user interaction such as mouse and keyboard input. Learn more about the Rich Web Client Activity.

If you are planning on building rich web clients, now would be the time to start monitoring W3C drafts in this area. To make sure your use cases are met.

People have different expectations with regard to features and standards quality. Make sure your expectations are heard.

### GPU-Accelerated Graph Analytics in Python with Numba

Thursday, March 19th, 2015

Abstract:

Numba is an open-source just-in-time (JIT) Python compiler that generates native machine code for X86 CPU and CUDA GPU from annotated Python Code. (Mark Harris introduced Numba in the post “NumbaPro: High-Performance Python with CUDA Acceleration”.) Numba specializes in Python code that makes heavy use of NumPy arrays and loops. In addition to JIT compiling NumPy array code for the CPU or GPU, Numba exposes “CUDA Python”: the CUDA programming model for NVIDIA GPUs in Python syntax.

By speeding up Python, we extend its ability from a glue language to a complete programming environment that can execute numeric code efficiently.

Python enthusiasts, I would not take the “…from a glue language to a complete programming environment…” comment to heart.

The author also says:

Numba helps by letting you write pure Python code and run it with speed comparable to a compiled language, like C++. Your development cycle shortens when your prototype Python code can scale to process the full dataset in a reasonable amount of time.

and then summarizes the results of code in the post:

Our GPU PageRank implementation completed in just 163 seconds on the full graph of 623 million edges and 43 million nodes using a single NVIDIA Tesla K20 GPU accelerator. Our equivalent Numba CPU-JIT version took at least 5 times longer on a smaller graph.

plus points out techniques for optimizing the code.
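For readers who haven’t met PageRank as code, here is a minimal sketch of the power iteration in plain Python. It is not the GPU implementation from the post and it does not use Numba at all; the function name, the tiny edge list and the default parameters are my own assumptions. It does show the kind of loop-heavy numeric kernel that the abstract says Numba specializes in.

```python
def pagerank(edges, num_nodes, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    out_degree = [0] * num_nodes
    for src, _ in edges:
        out_degree[src] += 1

    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in range(num_nodes) if out_degree[n] == 0)
        base = (1.0 - damping) / num_nodes + damping * dangling / num_nodes
        new_rank = [base] * num_nodes
        # Every edge passes an equal share of its source's rank to its target.
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


# Tiny hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, 3 -> 2
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
ranks = pagerank(edges, num_nodes=4)
print([round(r, 3) for r in ranks])
```

The inner loop over edges is where all the time goes on a graph with hundreds of millions of edges, which is why compiling it, or moving it to a GPU, pays off.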

I’d say no hard feelings. Yes? 😉

### Can recursive neural tensor networks learn logical reasoning?

Thursday, March 19th, 2015

Abstract:

Recursive neural network models and their accompanying vector representations for words have seen success in an array of increasingly semantically sophisticated tasks, but almost nothing is known about their ability to accurately capture the aspects of linguistic meaning that are necessary for interpretation or reasoning. To evaluate this, I train a recursive model on a new corpus of constructed examples of logical reasoning in short sentences, like the inference of “some animal walks” from “some dog walks” or “some cat walks,” given that dogs and cats are animals. This model learns representations that generalize well to new types of reasoning pattern in all but a few cases, a result which is promising for the ability of learned representation models to capture logical reasoning.

From the introduction:

Natural language inference (NLI), the ability to reason about the truth of a statement on the basis of some premise, is among the clearest examples of a task that requires comprehensive and accurate natural language understanding [6].

I stumbled over that line in Samuel’s introduction because it implies, at least to me, that there is a notion of truth that resides outside of ourselves as speakers and hearers.

Take his first example:

Consider the statement all dogs bark. From this, one can infer quite a number of other things. One can replace the first argument of all (the first of the two predicates following it, here dogs) with any more specific category that contains only dogs and get a valid inference: all puppies bark; all collies bark.

Contrast that with one of the premises that starts my day:

All governmental statements are lies of omission or commission.

Yet, firmly holding that as a “fact” of the world, I write to government officials, post ranty blog posts about government policies, urge others to attempt to persuade government to take certain positions.

Or as Leonard Cohen would say:

Everybody knows that the dice are loaded

Everybody rolls with their fingers crossed

It’s not that I think Samuel is incorrect about monotonicity for “logical reasoning” but monotonicity is a far cry from how people reason day to day.
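For anyone unfamiliar with monotonicity, the pattern in Samuel’s dog examples can be sketched in a few lines. This is a hand-written rule check, not his recursive neural model, and the `PARENT` table and function names are assumptions of mine: “all” permits replacing its first argument with a more specific category, while “some” permits replacing it with a more general one.

```python
# Hypothetical hyponym table: each category maps to the category directly above it.
PARENT = {
    "puppy": "dog",
    "collie": "dog",
    "dog": "animal",
    "cat": "animal",
}


def is_a(category, ancestor):
    """True if category equals ancestor or sits below it in PARENT."""
    while category is not None:
        if category == ancestor:
            return True
        category = PARENT.get(category)
    return False


def all_entails(premise, hypothesis):
    """'all X P' entails 'all Y P' when Y is X or a subcategory of X."""
    (x, pred_p), (y, pred_h) = premise, hypothesis
    return pred_p == pred_h and is_a(y, x)


def some_entails(premise, hypothesis):
    """'some X P' entails 'some Y P' when Y is X or a supercategory of X."""
    (x, pred_p), (y, pred_h) = premise, hypothesis
    return pred_p == pred_h and is_a(x, y)


print(all_entails(("dog", "bark"), ("puppy", "bark")))    # True
print(all_entails(("dog", "bark"), ("animal", "bark")))   # False
print(some_entails(("dog", "walk"), ("animal", "walk")))  # True
```

Nothing in a table like this knows what to do with a premise I hold to be true and then act against every morning, which is the gap between monotonic inference and day to day reasoning.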

Rather than creating “reasoning” that is such a departure from human inference, why not train a deep learning system to “reason” by exposing it to the same inputs and decisions made by human decision makers? Imitation doesn’t require understanding of human “reasoning,” just the ability to engage in the same behavior under similar circumstances.

That would reframe Samuel’s question to read: Can recursive neural tensor networks learn human reasoning?

I first saw this in a tweet by Sharon L. Bolding.

### Should Topic Maps Gossip?

Wednesday, March 18th, 2015

Efficient Reconciliation and Flow Control for Anti-Entropy Protocols by Robbert van Renesse, Dan Dumitriu, Valient Gough and Chris Thomas.

Abstract:

The paper shows that anti-entropy protocols can process only a limited rate of updates, and proposes and evaluates a new state reconciliation mechanism as well as a flow control scheme for anti-entropy protocols.

Excuse the title, I needed a catchier line than the title of the original paper!

This is the Scuttlebutt paper that underlies Cassandra.

Rather than an undefined notion of consistency, ask yourself how much consistency is required by an application?
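If anti-entropy is new to you, here is a heavily simplified sketch of one push-pull reconciliation round. It is in the spirit of the digest-then-delta exchange but it is not the Scuttlebutt protocol from the paper: there is no flow control, versions are per key rather than per participant, and two replicas that write the same key at the same version are never reconciled. The `Node` class and `gossip` function are names I made up.

```python
class Node:
    """A replica holding key -> (value, version) pairs."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def put(self, key, value):
        # Bump the version each time this replica writes the key.
        _, version = self.store.get(key, (None, 0))
        self.store[key] = (value, version + 1)

    def digest(self):
        """Summarize state as key -> version, without the values."""
        return {key: version for key, (_, version) in self.store.items()}

    def deltas_for(self, peer_digest):
        """Entries the peer is missing or holds at an older version."""
        return {
            key: (value, version)
            for key, (value, version) in self.store.items()
            if version > peer_digest.get(key, 0)
        }

    def apply(self, deltas):
        # Accept an incoming entry only if it is newer than the local one.
        for key, (value, version) in deltas.items():
            if version > self.store.get(key, (None, 0))[1]:
                self.store[key] = (value, version)


def gossip(a, b):
    """One push-pull round: swap digests, then send only the newer entries."""
    digest_a, digest_b = a.digest(), b.digest()
    b.apply(a.deltas_for(digest_b))
    a.apply(b.deltas_for(digest_a))


a, b = Node("a"), Node("b")
a.put("x", 1)
a.put("x", 2)        # version 2 of "x" lives only on a
b.put("y", "hello")  # "y" lives only on b
gossip(a, b)
print(a.store == b.store)  # True
```

Between gossip rounds the two replicas disagree, and how long an application can tolerate that disagreement is exactly the “how much consistency” question.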

I first saw this in a tweet by Jason Brown.