## The Three Breakthroughs That Have Finally Unleashed AI on the World

November 27th, 2014

I was attracted to this post by a tweet from Diana Zeaiter Joumblat which read:

How parallel computing, big data & deep learning algos have put an end to the #AI winter

It has been almost a decade now, but I remember riding to lunch with a doctoral student in computer science who related how their department was known as “human-centered computing” because AI had gotten such a bad name. In their view, the AI winter was about to end.

I was quite surprised as I remembered the AI winter of the 1970’s.

The purely factual observations by Kevin in this article are all true, but I would not fret too much about:

As it does, this cloud-based AI will become an increasingly ingrained part of our everyday life. But it will come at a price. Cloud computing obeys the law of increasing returns, sometimes called the network effect, which holds that the value of a network increases much faster as it grows bigger. The bigger the network, the more attractive it is to new users, which makes it even bigger, and thus more attractive, and so on. A cloud that serves AI will obey the same law. The more people who use an AI, the smarter it gets. The smarter it gets, the more people use it. The more people that use it, the smarter it gets. Once a company enters this virtuous cycle, it tends to grow so big, so fast, that it overwhelms any upstart competitors. As a result, our AI future is likely to be ruled by an oligarchy of two or three large, general-purpose cloud-based commercial intelligences.

I am very doubtful of: “The more people who use an AI, the smarter it gets.”

As we have seen from the Michael Brown case, the more people who comment on a subject, the less is known about it. Or at least what is known gets lost in a tide of information that is non-factual but stated as fact.

The assumption that the current AI boom will crash upon is the assumption that accurate knowledge can be obtained in all areas. Some, like chess, sure, that can happen. Do we know all the factors at play between the police and the communities they serve?

AIs can help with medicine, but considering what we don’t know about the human body and medicine, taking a statistical guess at the best treatment isn’t reasoning, it is a better betting window.

I am all for pushing AIs where they are useful, while staying ever mindful that an AI has no more operations than my father’s mechanical pocket calculator, which I remember from childhood. Impressive, but that’s not the equivalent of intelligence.

## A Docker Image for Graph Analytics on Neo4j with Apache Spark GraphX

November 27th, 2014

A Docker Image for Graph Analytics on Neo4j with Apache Spark GraphX by Kenny Bastani.

From the post:

I’ve just released a useful new Docker image for graph analytics on a Neo4j graph database with Apache Spark GraphX. This image deploys a container with Apache Spark and uses GraphX to perform ETL graph analysis on subgraphs exported from Neo4j. This docker image is a great addition to Neo4j if you’re looking to do easy PageRank or community detection on your graph data. Additionally, the results of the graph analysis are applied back to Neo4j.

This gives you the ability to optimize your recommendation-based Cypher queries by filtering and sorting on the results of the analysis.
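For readers who have not met PageRank before, here is a minimal pure-Python sketch of the power-iteration idea behind it. This is not the GraphX implementation and not code from Kenny's image; the function name, the damping factor of 0.85, the iteration count and the toy graph are all assumptions made for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    # Start with rank spread evenly over all nodes.
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node receives a small baseline share (the "random jump").
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            # Nodes without outgoing links spread their rank over all nodes.
            targets = out_links[n] or nodes
            share = damping * rank[n] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: three nodes point at "d", and "d" points back at "a".
toy_edges = [("a", "d"), ("b", "d"), ("c", "d"), ("d", "a")]
ranks = pagerank(toy_edges)
top = sorted(ranks, key=ranks.get, reverse=True)
```

Writing a score like this back onto each node is what makes the "filtering and sorting" in a later query possible: the expensive analysis runs once, and the query only reads a stored number.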

This rocks!

If you were looking for an excuse to investigate Docker or Spark or GraphX or Neo4j, it has arrived!

Enjoy!

November 27th, 2014

To adhere to the legal side of the line, monitoring apps have to be marketed at employers who want to keep an eye on their workers, or guardians who want to watch over their kids.

\$500K is a pretty good pop at the start of the holiday season.

For further background on the story, see Lisa’s other story on this: Head of ‘StealthGenie’ mobile stalking app indicted for selling spyware and the Federal proceedings proper.

## Neo4j 2.1.6 (release)

November 26th, 2014

Neo4j 2.1.6 (release)

From the post:

Neo4j 2.1.6 is a maintenance release, with critical improvements.

Notably, this release:

• Resolves a critical shutdown issue, whereby IO errors were not always handled correctly and could result in inconsistencies in the database due to failure to flush outstanding changes.
• Significantly reduces the file handle requirements for the Lucene-based indexes.
• Resolves an issue in consistency checking, which could falsely report store inconsistencies.
• Extends the Java API to allow the degree of a node to be easily obtained (the count of relationships, by type and direction).
• Resolves a significant performance degradation that affected the loading of relationships for a node during traversals.
• Resolves a backup issue, which could result in a backup store that would not load correctly into a clustered environment (Neo4j Enterprise).
• Corrects a clustering issue that could result in the master failing to resume its role after an outage of a majority of slaves (Neo4j Enterprise).

All Neo4j 2.x users are recommended to upgrade to this release. Upgrading to Neo4j 2.1, from Neo4j 1.9.x or Neo4j 2.0.x, requires a migration to the on-disk store and can not be reversed. Please ensure you have a valid backup before proceeding, then use on a test or staging server to understand any changed behaviors before going into production.

Neo4j 1.9 users may upgrade directly to this release, and are recommended to do so carefully. We strongly encourage verifying the syntax and validating all responses from your Cypher scripts, REST calls, and Java code before upgrading any production system. For information about upgrading from Neo4j 1.9, please see our Upgrading to Neo4j 2 FAQ.

For a full summary of changes in this release, please review the CHANGES.TXT file contained within the distribution.

As with all software upgrades, do not delay until the day before you are leaving on holiday!

## Spark, D3, data visualization and Super Cow Powers

November 26th, 2014

Spark, D3, data visualization and Super Cow Powers by Mateusz Fedoryszak.

From the post:

Did you know that the amount of milk given by a cow depends on the number of days since its last calving? A plot of this correlation is called a lactation curve. Read on to find out how do we use Apache Spark and D3 to find out how much milk we can expect on a particular day.

There are things that, except at a client’s request, I have never been curious about.

How are you using Spark?

I first saw this in a tweet by Anna Pawlicka.

## NSA partners with Apache to release open-source data traffic program

November 25th, 2014

NSA partners with Apache to release open-source data traffic program by Steven J. Vaughan-Nichols.

From the post:

Many of you probably think that the National Security Agency (NSA) and open-source software get along like a house on fire. That's to say, flaming destruction. You would be wrong.

In partnership with the Apache Software Foundation, the NSA announced on Tuesday that it is releasing the source code for Niagarafiles (Nifi). The spy agency said that Nifi "automates data flows among multiple computer networks, even when data formats and protocols differ".

Details on how Nifi does this are scant at this point, while the ASF continues to set up the site where Nifi's code will reside.

In a statement, Nifi's lead developer Joseph L Witt said the software "provides a way to prioritize data flows more effectively and get rid of artificial delays in identifying and transmitting critical information".

I don’t doubt the NSA efforts at open source software. That isn’t saying anything about how closely the code would need to be proofed.

Perhaps encouraging more open source projects from the NSA will eat into the time they have to spend writing malware.

Something to look forward to!

## Falsehoods Programmers Believe About Names

November 25th, 2014

Falsehoods Programmers Believe About Names by Patrick McKenzie.

From the post:

John Graham-Cumming wrote an article today complaining about how a computer system he was working with described his last name as having invalid characters. It of course does not, because anything someone tells you is their name is — by definition — an appropriate identifier for them. John was understandably vexed about this situation, and he has every right to be, because names are central to our identities, virtually by definition.

I have lived in Japan for several years, programming in a professional capacity, and I have broken many systems by the simple expedient of being introduced into them. (Most people call me Patrick McKenzie, but I’ll acknowledge as correct any of six different “full” names, and many systems I deal with will accept precisely none of them.) Similarly, I’ve worked with Big Freaking Enterprises which, by dint of doing business globally, have theoretically designed their systems to allow all names to work in them. I have never seen a computer system which handles names properly and doubt one exists, anywhere.

So, as a public service, I’m going to list assumptions your systems probably make about names. All of these assumptions are wrong. Try to make less of them next time you write a system which touches names.

McKenzie has an admittedly incomplete list of forty (40) myths for people’s names.
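To make the point concrete, here is a small sketch of the kind of validator that bakes in several of those assumptions at once (exactly two words, ASCII letters only). The regular expression, both function names and the sample names are my own illustration, not taken from McKenzie's post, and the "permissive" check is only one possible stance, itself an assumption.

```python
import re

# A deliberately naive rule: "first name, space, last name", ASCII letters only.
NAIVE_NAME = re.compile(r"^[A-Za-z]+ [A-Za-z]+$")


def naive_is_valid(name):
    """Return True only for names that fit the naive two-word ASCII pattern."""
    return bool(NAIVE_NAME.match(name))


def permissive_is_valid(name):
    """Accept whatever the person says their name is, as long as it is not blank."""
    return bool(name.strip())


samples = ["Patrick McKenzie", "John Graham-Cumming", "Björk", "山田 太郎"]

# Names the naive rule turns away even though they are perfectly good names.
rejected = [name for name in samples if not naive_is_valid(name)]
```

Three of the four samples fail the naive rule: one for a hyphen, one for being a single word with a non-ASCII letter, and one for not using the Latin alphabet at all.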

If there are that many for people’s names, I wonder what the count is for all other subjects?

Including things on the Internet of Things?

I first saw this in a tweet by OnePaperPerDay.

## Announcing Apache Pig 0.14.0

November 25th, 2014

Announcing Apache Pig 0.14.0 by Daniel Dai.

From the post:

With YARN as its architectural center, Apache Hadoop continues to attract new engines to run within the data platform, as organizations want to efficiently store their data in a single repository and interact with it simultaneously in different ways. Apache Tez supports YARN-based, high performance batch and interactive data processing applications in Hadoop that need to handle datasets scaling to terabytes or petabytes.

The Apache community just released Apache Pig 0.14.0, and the main feature is Pig on Tez. In this release, we closed 334 Jira tickets from 35 Pig contributors. Specific credit goes to the virtual team consisting of Cheolsoo Park, Rohini Palaniswamy, Olga Natkovich, Mark Wagner and Alex Bain who were instrumental in getting Pig on Tez working!

This blog gives a brief overview of Pig on Tez and other new features included in the release.

Pig on Tez

Apache Tez is an alternative execution engine focusing on performance. It offers a more flexible interface so Pig can compile into a better execution plan than is possible with MapReduce. The result is consistent performance improvements in both large and small queries.

Since it is the Thanksgiving holiday this week in the United States, this release reminds me to ask why is turkey the traditional Thanksgiving meal? Everyone likes bacon better.

## Bye-bye Giraph-Gremlin, Hello Hadoop-Gremlin with GiraphGraphComputer Support

November 25th, 2014

Bye-bye Giraph-Gremlin, Hello Hadoop-Gremlin with GiraphGraphComputer Support by Marko A. Rodriguez.

There are days when I wonder if Marko ever sleeps or if the problem of human cloning has already been solved.

This is one of those days:

The other day Dan LaRocque and I were working on a Hadoop-based GraphComputer for Titan so we could do bulk loading into Titan. First we wrote the BulkLoading VertexProgram:
…and then realized, “huh, we can just execute this with GiraphGraph. Huh! We can just execute this with TinkerGraph!” In fact, as a side note, the BulkLoaderVertexProgram is general enough to work for any TinkerPop Graph.
https://github.com/tinkerpop/tinkerpop3/issues/319

So great, we can just use GiraphGraph (or any other TinkerPop implementation that has a GraphComputer (e.g. TinkerGraph)). However, Titan is all about scale and when the size of your graph is larger than the total RAM in your cluster, we will still need a MapReduce-based GraphComputer. Thinking over this, it was realized: Giraph-Gremlin is very little Giraph and mostly just Hadoop — InputFormats, HDFS interactions, MapReduce wrappers, Configuration manipulations, etc. Why not make GiraphGraphComputer just a particular GraphComputer supported by Gremlin-Hadoop (a new package).

With that, Giraph-Gremlin no longer exists. Hadoop-Gremlin now exists. Hadoop-Gremlin behaves the exact same way as Giraph-Gremlin, save that we will be adding a MapReduceGraphComputer to Hadoop-Gremlin. In this way, Hadoop-Gremlin will support two GraphComputer: GiraphGraphComputer and MapReduceGraphComputer.

The master/ branch is updated and the docs for Giraph have been re-written, though I suspect there will be some dangling references in the docs here and there for a while.

Up next, Matthias and I will create MapReduceGraphComputer that is smart about “partitioned vertices” — so you don’t get the Faunus scene where if a vertex doesn’t fit in memory, an exception. This will allow vertices with as many edges as you want (though your data model is probably shotty if you have 100s of millions of edges on one vertex). Matthias will be driving that effort and I’m excited to learn about the theory of vertex partitioning (i.e. splitting a single vertex across machines).

Enjoy!

## Ferguson Municipal Public Library

November 25th, 2014

Ferguson Municipal Public Library

Ashley Ford tweeted that donations should be made to the Ferguson Municipal Public Library.

While schools are closed in Ferguson, the library has stayed open and has been a safe refuge.

Support the Ferguson Municipal Public Library as well as your own.

Libraries are where our tragedies, triumphs, and history live on for future generations.

## Treasury Island: the film

November 25th, 2014

Treasury Island: the film by Lauren Willmott, Boyce Keay, and Beth Morrison.

From the post:

We are always looking to make the records we hold as accessible as possible, particularly those which you cannot search for by keyword in our catalogue, Discovery. And we are experimenting with new ways to do it.

The Treasury series, T1, is a great example of a series which holds a rich source of information but is complicated to search. T1 covers a wealth of subjects (from epidemics to horses) but people may overlook it as most of it is only described in Discovery as a range of numbers, meaning it can be difficult to search if you don’t know how to look. There are different processes for different periods dating back to 1557 so we chose to focus on records after 1852. Accessing these records requires various finding aids and multiple stages to access the papers. It’s a tricky process to explain in words so we thought we’d try demonstrating it.

We wanted to show people how to access these hidden treasures, by providing a visual aid that would work in conjunction with our written research guide. Armed with a tablet and a script, we got to work creating a video.

Our remit was:

• to produce a video guide no more than four minutes long
• to improve accessibility to these records through a simple, step-by-step process
• to highlight what the finding aids and documents actually look like

These records can be useful to a whole range of researchers, from local historians to military historians to social historians, given that virtually every area of government action involved the Treasury at some stage. We hope this new video, which we intend to be watched in conjunction with the written research guide, will also be of use to any researchers who are new to the Treasury records.

Adding video guides to our written research guides are a new venture for us and so we are very keen to hear your feedback. Did you find it useful? Do you like the film format? Do you have any suggestions or improvements? Let us know by leaving a comment below!

This is a great illustration that data management isn’t something new. The Treasury Board has kept records since 1557 and has accumulated a rather extensive set of materials.

The written research guide looks interesting but since I am very unlikely to ever research Treasury Board records, I am unlikely to need it.

However, the authors have anticipated that someone might be interested in the process of record keeping itself and so provided this additional reference:

Thomas L Heath, The Treasury (The Whitehall Series, 1927, GP Putnam’s Sons Ltd, London and New York)

That would be an interesting find!

I first saw this in a tweet by Andrew Janes.

## Datomic 0.9.5078 now available

November 25th, 2014

Datomic 0.9.5078 now available by Ben Kamphaus.

From the post:

This message covers changes in this release. For a summary of critical release notices, see http://docs.datomic.com/release-notices.html.

The Datomic Team recommends that you always take a backup before adopting a new release.

## Changed in 0.9.5078

• New CloudWatch metrics: WriterMemcachedPutMusec, WriterMemcachedPutFailedMusec, ReaderMemcachedPutMusec and ReaderMemcachedPutFailedMusec track writes to memcache. See http://docs.datomic.com/caching.html#memcached
• Improvement: Better startup performance for databases using fulltext.
• Improvement: Enhanced the Getting Started examples to include the Pull API and find specifications.
• Improvement: Better scheduling of indexing jobs during bursty transaction volumes
• Fixed bug where Pull API could incorrectly return renamed attributes.
• Fixed bug that caused db.fn/cas to throw an exception when false was passed as the new value.

In case you haven’t walked through Datomic, you really should.

Here is one example why:

Next download the subset of the mbrainz database covering the period 1968-1973 (which the Datomic team has scientifically determined as being the most important period in the history of recorded music): [From: https://github.com/Datomic/mbrainz-sample]

Truer words were never spoken!

Enjoy!

## New York Times API extractor and Google Maps visualization (Wandora Tutorial)

November 25th, 2014

New York Times API extractor and Google Maps visualization (Wandora Tutorial)

From the description:

Video reviews the New York Times API extractor, the Google Maps visualization, and the graph visualization of Wandora application. The extractor is used to collect event data which is then visualized on a map and as a graph. Wandora is an open source tool for people who collect and process information, especially networked knowledge and knowledge about WWW resources. For more information see http://wandora.org

This is impressive, although the UI may have more options than MS Word. (It may not, I haven’t counted every way to access every option.)

Here is the result that was obtained by use of drop down menus and selecting:

The Times logo marks events extracted from the New York Times and merged for display with Google Maps.

Not technically difficult but it is good to see a function of interest to ordinary users in a topic map application.

I have the latest release of Wandora. Need to start walking through the features.

## Documents Released in the Ferguson Case

November 25th, 2014

Documents Released in the Ferguson Case (New York Times)

The New York Times has posted the following documents from the Ferguson case:

• 24 Volumes of Grand Jury Testimony
• 30 Interviews of Witnesses by Law Enforcement Officials
• 23 Forensic and Other Reports
• 254 Photographs

Assume you are interested in organizing these materials for rapid access and cross-linking between them.

1. Accessing Grand Jury Testimony by volume and page number?
2. Accessing Interviews of Witnesses by report and page number?
3. Linking people to reports, testimony and statements?
6. Linking Forensic reports to witness statements and/or testimony?
7. Linking physical evidence into witness statements and/or testimony?
8. Others?
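As a starting point for thinking about requirements like these, here is a rough sketch of a two-way index: every person or piece of evidence points at the places it appears, and every place points back at what appears there. The class names, collection labels and sample entries are hypothetical placeholders of my own, not drawn from the released documents.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Locator:
    """A citable spot in the released material."""
    collection: str  # e.g. "grand-jury", "interview", "forensic", "photo"
    item: int        # volume or report number within the collection
    page: int


class CrossIndex:
    """Two-way index between subjects (people, evidence) and locators."""

    def __init__(self):
        self._by_subject = defaultdict(set)
        self._by_locator = defaultdict(set)

    def link(self, subject, locator):
        """Record that a subject is mentioned at a locator."""
        self._by_subject[subject].add(locator)
        self._by_locator[locator].add(subject)

    def where(self, subject):
        """All locators for a subject, ordered by collection, item and page."""
        found = self._by_subject.get(subject, set())
        return sorted(found, key=lambda loc: (loc.collection, loc.item, loc.page))

    def co_mentioned(self, subject):
        """Other subjects that share at least one locator with this subject."""
        others = set()
        for loc in self._by_subject.get(subject, set()):
            others |= self._by_locator[loc]
        others.discard(subject)
        return others


# Placeholder entries only; the names and page numbers are invented.
index = CrossIndex()
index.link("Witness A", Locator("grand-jury", 3, 41))
index.link("Exhibit 1", Locator("grand-jury", 3, 41))
index.link("Witness A", Locator("interview", 12, 7))
```

The first two requirements in the list are covered by the locator itself; the linking requirements fall out of keeping the index two-way from the start.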

It’s a lot of material so which requirements, these or others, would be your first priority?

It’s not a death march project but on the other hand, you need to get the most valuable tasks done first.

Suggestions?

## The Sight and Sound of Cybercrime

November 25th, 2014

The Sight and Sound of Cybercrime by the Office for Creative Research.

From the post:

You might not personally be in the business of identity theft, spam delivery, or distributed hacking, but there’s a decent chance that your computer is. “Botnets” are criminal networks of computers that, unbeknownst to their owners, are being put to use for any number of nefarious purposes. Across the globe, millions of PCs have been infected with software that conscripts them into one of these networks, silently transforming these machines into accomplices in illegal activities and putting their users’ information at risk.

Microsoft’s Digital Crimes Unit has been tracking and neutralizing these threats for several years. In January, DCU asked The Office for Creative Research to explore novel ways to visualize botnet activity. The result is Specimen Box, a prototype exploratory tool that allows DCU’s investigators to examine the unique profiles of various botnets, focusing on the geographic and time-based communication patterns of millions of infected machines.

Specimen Box enables investigators to study a botnet the way a naturalist might examine a specimen collected in the wild: What are its unique characteristics? How does it behave? How does it propagate itself? How is it adapting to a changing environment?

Specimen Box combines visualization and sonification capabilities in a large-screen, touch-based application. Investigators can see and hear both live activity and historical ‘imprints’ of daily patterns across a set of 15 botnets. Because every botnet has its own unique properties, the visual and sonic portraits generated by the tool offer insight into the character of each individual network.

Very impressive graphic capabilities with several short video clips.

It would have been more impressive if the viewer were clued in on what the researchers were attempting to discover in the videos.

One point that merits special mention:

By default, the IP addresses are sorted around the circle by the level of communication activity. The huge data set has been optimized to allow researchers to instantly re-sort the IPs by longitude or by similarity. “Longitude Sort Mode” arranges the IPs geographically from east to west, while “Similarity Sort Mode” groups together IPs that have similar activity patterns over time, allowing analysts to see which groups of machines within the botnet are behaving the same way. These similarity clusters may represent botnet control groups, research activity from universities or other institutions, or machines with unique temporal patterns such as printers.

Think of “Similarity Sort Mode” as a group subject and this starts to resemble a display of topics that have been merged* according to different criteria, in response to user requests.

*By “merged” I mean displayed as though “merged” in the TMDM sense of operations on a file.
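To show what re-sorting the same records by different criteria might look like, here is a small sketch. It is not how Specimen Box works internally, which the post does not describe; the cosine measure, the greedy ordering, the field names and the sample machines (using documentation-only IP ranges) are all my assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length activity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def longitude_sort(machines):
    """Arrange machines east to west, i.e. by decreasing longitude."""
    return sorted(machines, key=lambda m: m["lon"], reverse=True)


def similarity_sort(machines):
    """Greedy ordering: each machine is followed by its most similar unplaced peer."""
    remaining = list(machines)
    if not remaining:
        return []
    ordered = [remaining.pop(0)]
    while remaining:
        last = ordered[-1]["activity"]
        best = max(remaining, key=lambda m: cosine(last, m["activity"]))
        remaining.remove(best)
        ordered.append(best)
    return ordered


# Hypothetical machines; "activity" is a made-up count of messages per time slot.
machines = [
    {"ip": "192.0.2.1", "lon": -74.0, "activity": [9, 1, 0, 0]},
    {"ip": "198.51.100.7", "lon": 139.7, "activity": [0, 0, 8, 2]},
    {"ip": "203.0.113.5", "lon": 2.3, "activity": [8, 2, 0, 0]},
    {"ip": "192.0.2.9", "lon": 37.6, "activity": [0, 1, 9, 1]},
]

by_longitude = [m["ip"] for m in longitude_sort(machines)]
by_similarity = [m["ip"] for m in similarity_sort(machines)]
```

The records never change; only the criterion used to place them next to each other does, which is the part that resembles merging on request.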

## Wandora 2014-11-24

November 24th, 2014

Wandora 2014-11-24

From the homepage:

New Wandora release (2014-11-24) features Watson translation API support, Alchemy face detection API extractor, enhanced occurrence view in Traditional topic panel. The release adds Spanish, German and French as default languages for topic occurrences and names. The release contains numerous smaller enhancements and fixes.

If you don’t know Wandora:

Wandora is a tool for people who collect and process information, especially networked knowledge and knowledge about WWW resources. With Wandora you can aggregate and combine information from various different sources. You can manipulate the collected knowledge flexible and efficiently, and without programming skills. More generally speaking Wandora is a general purpose information extraction, management and publishing application based on Topic Maps and Java. Wandora suits well for constructing and maintaining vocabularies, ontologies and information mashups. Application areas include linked data, open data, data integration, business intelligence, digital preservation and data journalism. Wandora’s license is GNU GPL. Wandora application is developed actively by a small number of experienced software developers. We call ourselves as the Wandora Team.

The download zip file has the date of the release in its name, making it easy to keep multiple versions of Wandora on one machine. You can try a new release without letting go of your current one. Thanks, Wandora team!

## “Groundbreaking” state spyware targeted airlines and energy firms

November 24th, 2014

From the post:

The security firm Symantec has detailed a highly sophisticated piece of spyware called Regin, which it reckons is probably a key intelligence-gathering tool in a nation state’s digital armory. Its targets have included individuals, small businesses, telecommunications firms, energy firms, airlines, research institutes and government agencies.

In a whitepaper, Symantec described Regin as “groundbreaking and almost peerless.” Regin comprises six stages, each triggered by the last, with each (barring the initial infection stage) remaining encrypted until called upon by the last. It can deploy modules that are “tailored to the target.” According to the firm, it was used between 2008 and 2011, when it disappeared before a new version appeared in 2013.

See David’s post for the details and the whitepaper by Symantec for even more details, including detection of infection.

Suspects?

I can’t speak for anyone other than myself but if governments want their citizens to live in a fishbowl, turnabout seems like fair play.

Wouldn’t it be interesting to see non-governmental Regin-like spyware that operated autonomously and periodically dumped collected data to random public upload sites?

## Friedrich Nietzsche and his typewriter – a Malling-Hansen Writing Ball

November 24th, 2014

Friedrich Nietzsche and his typewriter – a Malling-Hansen Writing Ball

From the webpage:

The most prominent owner of a writing ball was probably the German philosopher, Friedrich Nietzsche (1844-1900). In 1881, when he was almost blind, Nietzsche wanted to buy a typewriter to enable him to continue his writing, and from letters to his sister we know that he personally was in contact with “the inventor of the typewriter, Mr Malling-Hansen from Copenhagen”. He mentioned to his sister that he had received letters and also a typewritten postcard as an example.

Nietzsche received his writing ball in 1882. It was the newest model, the portable tall one with a colour ribbon, serial number 125, and several typescripts are known to have been written by him on this writing ball. We know that Nietzsche was also familiar with the newest Remington typewriter (model 2), but as he wanted to buy a portable typewriter, he chose to buy the Malling-Hansen writing ball, as this model was lightweight and easy to carry — one might say that it was the “laptop” of that time.

Unfortunately Nietzsche wasn’t totally satisfied with his purchase and never really mastered the use of the instrument. Until now, many people have tried to understand why Nietzsche did not make more use of it, and a number of theories have been suggested such as that it was an outdated and poor model, that it was possible to write only upper case letters, etc. Today we can say for certain that all this is only speculation without foundation.

The writing ball was a solidly constructed instrument, made by hand and equipped with all the features one would expect of a modern typewriter.

You can now read the details about the Nietzsche writing ball in a book, “Nietzsches Schreibkugel”, by Dieter Eberwein, vice-president of the International Rasmus Malling-Hansen Society, published by “Typoscript Verlag”. In it, Eberwein tells the true story about Nietzsche’s writing ball based upon thorough investigation and restoration of the damaged machine.

If you think of Nietzsche’s typing ball as an interface, it is certainly different from the keyboards of today.

I am not sure I could re-learn the “home” position for my fingers but certainly would be willing to give it a try.

Not as far fetched as you might think, a typing ball. Matt Adereth posted this image of a prototype typing ball:

Where would you put the “nub” and “buttons” for a pointing device? Curious about the ergonomics. If anyone decides to make prototypes, put my name down as definitely interested.

I saw this earlier today in a tweet by Vincent Zimmer, although I was already aware of Nietzsche’s typing ball.

## Clojure is still not for geniuses

November 24th, 2014

Clojure is still not for geniuses (You — yes, you, dummy — could be productive in Clojure today.) by Adam Bard.

From the post:

The inspiration for the article I wrote last week entitled Clojure is not for geniuses was inspired by Tommy Hall’s talk at Euroclojure 2014, wherein he made an offhand joke about preferring Clojure for its minimal syntax, as he possesses a small brain (both his blog and his head suggest this assertion is false). I had intended to bring this up with the original article, but got sidetracked talking about immutable things and never got back around to it. Here I’d like to address that, along with some discussion that arose in various forums after the first article.

This article is not about how Clojure is great. I mean, it is, but I’d like to focus on the points that make it an accessible and practical language, without any faffing about with homoiconicity and macros and DSLs and all that.

Today’s all about illustrating some more ways in which I believe our good comrade Clojure can appeal to and empower the proletariat in ways that certain other languages can’t, through the power of simplicity.

This is a great post but I would like to add something to:

So why isn’t everyone using it?

That’s the big question. Clojure has grown immensely in popularity, but it’s still not a household name. There are a lot of reasons for that – mainstream languages have been around a lot longer, naturally, and obviously people are still producing software in them.

That’s not a big question. Think about the years people have invested in C, COBOL, Fortran, C++ and ask yourself: Do I prefer programming where I am comfortable, or do I prefer something new and unfamiliar? Be honest now.

The other thing to consider is the ongoing investment in programs written in C/C++, COBOL, etc. Funders don’t find risk of transition all that attractive, even if a new language has “cool” features. They are interested in results, not how you got them.

The universe of programs needs to expand to create space for Clojure to gain marketshare. The demand for concurrency is a distinct possibility. The old software markets will remain glutted with C/C++, etc., for the foreseeable future. But that’s ok, older programmers need something to fall back on.

Pressing forward on Clojure’s strengths, such as simplicity and concurrency and producing results that other current languages can’t match is the best way to increase Clojure’s share of an expanding market. (Or to put it in the negative, who wants to worry about a non-concurrent and slowly dying market?)

## Announcing Apache Hive 0.14

November 24th, 2014

Announcing Apache Hive 0.14 by Gunther Hagleitner.

From the post:

While YARN has allowed new engines to emerge for Hadoop, the most popular integration point with Hadoop continues to be SQL and Apache Hive is still the de facto standard. Although many SQL engines for Hadoop have emerged, their differentiation is being rendered obsolete as the open source community surrounds and advances this key engine at an accelerated rate.

Last week, the Apache Hive community released Apache Hive 0.14, which includes the results of the first phase in the Stinger.next initiative and takes Hive beyond its read-only roots and extends it with ACID transactions. Thirty developers collaborated on this version and resolved more than 1,015 JIRA issues.

Although there are many new features in Hive 0.14, there are a few highlights we’d like to highlight. For the complete list of features, improvements, and bug fixes, see the release notes.

If you have been watching the work on Spark + Hive: Apache Hive on Apache Spark: The First Demo, then you know how important Hive is to the Hadoop ecosystem.

The highlights:

Transactions with ACID semantics (HIVE-5317)

Allows users to modify data using insert, update and delete SQL statements. This provides snapshot isolation and uses locking for writes. Now users can make corrections to fact tables and changes to dimension tables.

Cost Based Optimizer (CBO) (HIVE-5775)

Now the query compiler uses a more sophisticated cost based optimizer that generates query plans based on statistics on data distribution. This works really well with complex joins and joins with multiple large fact tables. The CBO generates bushy plans that execute much faster.

SQL Temporary Tables (HIVE-7090)

Temporary tables exist in scratch space that goes away when the user session disconnects. This allows users and BI tools to store temporary results and further process that data with multiple queries.

Coming Next in Stinger.next: Sub-Second Queries

After Hive 0.14, we’re planning on working with the community to deliver sub-second queries and SQL:2011 Analytics coverage in Hive. We also plan to work on Hive-Spark integration for machine learning and operational reporting with Hive streaming ingest and transactions.

Hive is an example of how an open source project should be supported.

## Writing an R package from scratch

November 24th, 2014

Writing an R package from scratch by Hilary Parker.

From the post:

As I have worked on various projects at Etsy, I have accumulated a suite of functions that help me quickly produce tables and charts that I find useful. Because of the nature of iterative development, it often happens that I reuse the functions many times, mostly through the shameful method of copying the functions into the project directory. I have been a fan of the idea of personal R packages for a while, but it always seemed like A Project That I Should Do Someday and someday never came. Until…

Etsy has an amazing week called “hack week” where we all get the opportunity to work on fun projects instead of our regular jobs. I sat down yesterday as part of Etsy’s hack week and decided “I am finally going to make that package I keep saying I am going to make.” It took me such little time that I was hit with that familiar feeling of the joy of optimization combined with the regret of past inefficiencies (joygret?). I wish I could go back in time and create the package the first moment I thought about it, and then use all the saved time to watch cat videos because that really would have been more productive.

This tutorial is not about making a beautiful, perfect R package. This tutorial is about creating a bare-minimum R package so that you don’t have to keep thinking to yourself, “I really should just make an R package with these functions so I don’t have to keep copy/pasting them like a goddamn luddite.” Seriously, it doesn’t have to be about sharing your code (although that is an added benefit!). It is about saving yourself time. (n.b. this is my attitude about all reproducibility.)

A reminder that well organized functions, like documentation, can be a benefit to their creator as well as others.

Organization: It’s not just for the benefit of others.

I try to not leave myself cryptic or half-written notes anymore.

## rvest: easy web scraping with R

November 24th, 2014

rvest: easy web scraping with R

rvest is a new package that makes it easy to scrape (or harvest) data from html web pages, inspired by libraries like beautiful soup. It is designed to work with magrittr so that you can express complex operations as elegant pipelines composed of simple, easily understood pieces.

Great overview of rvest and its use for web scraping in R.

Axiom: You will have web scraping with you always. Not only because we are lazy, but because we are disorderly to boot.

At CRAN: http://cran.r-project.org/web/packages/rvest/index.html (Author: Hadley Wickham)

## Jean Yang on An Axiomatic Basis for Computer Programming

November 24th, 2014

From the description:

Our lives now run on software. Bugs are becoming not just annoyances for software developers, but sources of potentially catastrophic failures. A careless programmer mistake could leak our social security numbers or crash our cars. While testing provides some assurance, it is difficult to test all possibilities in complex systems–and practically impossible in concurrent systems. For the critical systems in our lives, we should demand mathematical guarantees that the software behaves the way the programmer expected.

A single paper influenced much of the work towards providing these mathematical guarantees. C.A.R. Hoare’s seminal 1969 paper “An Axiomatic Basis for Computer Programming” introduces a method of reasoning about program correctness now known as Hoare logic. In this paper, Hoare provides a technique that 1) allows programmers to express program properties and 2) allows these properties to be automatically checked. These ideas have influenced decades of research in automated reasoning about software correctness.

In this talk, I will describe the main ideas in Hoare logic, as well as the impact of these ideas. I will talk about my personal experience using Hoare logic to verify memory guarantees in an operating system. I will also discuss takeaway lessons for working programmers.

The slides are impressive enough! I will be updating this post to include a pointer to the video when posted.
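For readers who have not met Hoare logic, the core idea is the Hoare triple {P} C {Q}: if precondition P holds before command C runs, postcondition Q holds afterwards. Below is a minimal Python sketch using division by repeated subtraction, a standard illustration of the technique. The function name `integer_divide` is my own, and the `assert` statements only check the conditions at run time for the inputs supplied; Hoare logic proves them for all inputs.

```python
from typing import Tuple


def integer_divide(x: int, y: int) -> Tuple[int, int]:
    """Return (quotient, remainder) of x divided by y via repeated subtraction."""
    # Precondition P: x >= 0 and y > 0
    assert x >= 0 and y > 0, "precondition violated"

    q, r = 0, x
    while r >= y:
        # Loop invariant: x == q * y + r and r >= 0
        assert x == q * y + r and r >= 0
        r -= y
        q += 1

    # Postcondition Q: x == q * y + r and 0 <= r < y
    assert x == q * y + r and 0 <= r < y
    return q, r
```

The loop invariant is the piece that makes the proof go through: it holds before the loop, each iteration preserves it, and together with the negated loop condition (r < y) it yields the postcondition.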

How important is correctness of merging in topic maps?

If you are the unfortunate individual whose personal information includes an incorrectly merged detail describing you as a terrorist, correctness of merging may be very important, at least to you.

The same would be true for information systems containing arrest warrants, bad credit information, incorrect job histories, education records, and banking records, just to mention a few.

What guarantees can you provide clients concerning merging of data in your topic maps?

Or is that the client and/or victim’s problem?

## How to Make a Better Map—Using Neuroscience

November 24th, 2014

How to Make a Better Map—Using Neuroscience by Laura Bliss.

From the post:

The neuroscience of navigation has been big news lately. In September, Nobel Prizes went to the discoverers of place cells and grid cells, the neurons responsible for our mental maps and inner GPS. That’s on top of an ever-growing pile of fMRI research, where scientists connect regions of the brain to specific navigation processes.

But the more we learn about how our bodies steer from A to B, are cartographers and geographers listening up? Is the science of wayfinding finding its way into the actual maps we use?

It’s beginning to. CityLab spoke to three prominent geographers who are thinking about the perceptual, cognitive, and neurological processes that go on when a person picks up a web of lines and words and tries to use it—or, the emerging science of map-making.

The post tackles questions like:

How do users make inferences from the design elements on a map, and how can mapmakers work to make their maps more perceptually salient?

But her current research looks at not just how the brain correlates visual information with thematic relevance, but how different kinds of visualization actually affect decision-making.

“I’m not interested in mapping the human brain,” she says. “A brain area in itself is only interesting to me if it can tell me something about how someone is using a map. And people use maps really differently.”

Ready to put your map design on more than an ad hoc basis? No definite answers in Laura’s post but several pointers towards exploration yet to be done.

I first saw this in a tweet by Greg Miller.

## It seemed like a good idea at the time

November 24th, 2014

It seemed like a good idea at the time by Tessa Thornton.

From the post:

I was reading through some of the on-boarding docs my first day at Shopify, and came across a reference to something called the “Retrospective Prime Directive”, which really appealed to me (initially because Star Trek):

Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.

This made me think of something I’ve been reminding myself of a lot over the past year, which started out as a joke but I’ve come to think of it as my own directive when it comes to reading other people’s code: It probably seemed like a good idea at the time.

Tessa makes a useful point: understanding what someone was trying to accomplish is more productive than mocking their efforts.

Not only can it lead to deeper understanding of the problem but you won’t waste time complaining about your predecessors being idiots.

I first saw this in a tweet by Julie Evans.

## The Debunking Handbook

November 23rd, 2014

The Debunking Handbook by John Cook, Stephan Lewandowsky.

From the post:

The Debunking Handbook, a guide to debunking misinformation, is now freely available to download. Although there is a great deal of psychological research on misinformation, there’s no summary of the literature that offers practical guidelines on the most effective ways of reducing the influence of myths. The Debunking Handbook boils the research down into a short, simple summary, intended as a guide for communicators in all areas (not just climate) who encounter misinformation.

The Handbook explores the surprising fact that debunking myths can sometimes reinforce the myth in people’s minds. Communicators need to be aware of the various backfire effects and how to avoid them, such as:

It also looks at a key element to successful debunking: providing an alternative explanation. The Handbook is designed to be useful to all communicators who have to deal with misinformation (eg – not just climate myths).

I think you will find this a delightful read! From the first section, titled: Debunking the first myth about debunking,

It’s self-evident that democratic societies should base their decisions on accurate information. On many issues, however, misinformation can become entrenched in parts of the community, particularly when vested interests are involved.1,2 Reducing the influence of misinformation is a difficult and complex challenge.

A common misconception about myths is the notion that removing its influence is as simple as packing more information into people’s heads. This approach assumes that public misperceptions are due to a lack of knowledge and that the solution is more information – in science communication, it’s known as the “information deficit model”. But that model is wrong: people don’t process information as simply as a hard drive downloading data.

Refuting misinformation involves dealing with complex cognitive processes. To successfully impart knowledge, communicators need to understand how people process information, how they modify their existing knowledge and how worldviews affect their ability to think rationally. It’s not just what people think that matters, but how they think.

I would have accepted the first sentence had it read: It’s self-evident that democratic societies don’t base their decisions on accurate information.

I don’t know of any historical examples of democracies making decisions on accurate information.

For example, there are any number of “rational” and well-meaning people who have signed off on the “war on terrorism” as though the United States is in any danger.

Deaths from terrorism in the United States since 2001 – fourteen (14).

Deaths by entanglement in bed sheets between 2001-2009 – five thousand five hundred and sixty-one (5561).

Despite being a great read, Debunking has a problem: it presumes you are dealing with a “rational” person. Rational as defined by…, as defined by what? Hard to say. It is only mentioned once and I suspect “rational” means that you agree with debunking the climate “myth.” I do as well but that’s happenstance and not because I am “rational” in some undefined way.

Realize that “rational” is a favorable label people apply to themselves and little more than that. It rather conveniently makes anyone who disagrees with you “irrational.”

I prefer to use “persuasion” on topics like global warming. You can use “facts” for people who are amenable to that approach but also religion (stewards of the environment), greed (exploitation of the Third World for carbon credits), financial interest in government funded programs, or whatever works to persuade enough people to support your climate change program. Be aware that other people with other agendas are going to be playing the same game. The question is whether you want to be “rational” or do you want to win?

Personally I am convinced of climate change and our role in causing it. I am also aware of the difficulty of sustaining action by people with an average attention span of fifteen (15) seconds over the period of the fifty (50) years it will take for the environment to stabilize if all human inputs stopped tomorrow. It’s going to take far more than “facts” to obtain a better result.

## Pride & Prejudice & Word Embedding Distance

November 23rd, 2014

Pride & Prejudice & Word Embedding Distance by Lynn Cherny.

From the webpage:

An experiment: Train a word2vec model on Jane Austen’s books, then replace the nouns in P&P with the nearest word in that model. The graph shows a 2D t-SNE distance plot of the nouns in this book, original and replacement. Mouse over the blue words!

In her blog post, Visualizing Word Embeddings in Pride and Prejudice, Lynn explains more about the project and the process she followed.

From that post:

Overall, the project as launched consists of the text of Pride and Prejudice, with the nouns replaced by the most similar word in a model trained on all of Jane Austen’s books’ text. The resulting text is pretty nonsensical. The blue words are the replaced words, shaded by how close a “match” they are to the original word; if you mouse over them, you see a little tooltip telling you the original word and the score.

I don’t agree that: “The resulting text is pretty nonsensical.”

True, it’s not Jane Austen’s original text and it is challenging to read, but that may be because our assumptions about Pride and Prejudice and literature in general are being defeated by the similar word replacements.

The lack of familiarity and smoothness of a received text may (no guarantees) enable us to see the text differently than we would on a casual re-reading.

What novel corpus would you use for such an experiment?
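If you want to try the idea on a corpus of your own, the replacement step Lynn describes can be sketched in a few lines of Python. This is a toy under loud assumptions: the three-dimensional vectors in `VECTORS` are invented for illustration (the real project used a word2vec model trained on Austen's books), the set of nouns is passed in by hand rather than found by a part-of-speech tagger, and the function names are mine.

```python
import math

# Hypothetical word vectors, invented for illustration only.
# A real experiment would load vectors from a trained word2vec model.
VECTORS = {
    "sister": [0.9, 0.1, 0.0],
    "daughter": [0.8, 0.2, 0.1],
    "carriage": [0.0, 0.9, 0.3],
    "horse": [0.1, 0.8, 0.4],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word):
    """Return (most similar other word, similarity score) for a known word."""
    candidates = [w for w in VECTORS if w != word]
    best = max(candidates, key=lambda w: cosine(VECTORS[word], VECTORS[w]))
    return best, cosine(VECTORS[word], VECTORS[best])


def replace_nouns(tokens, nouns):
    """Swap each token listed in `nouns` (and present in VECTORS) for its nearest neighbor."""
    return [
        nearest(tok)[0] if tok in nouns and tok in VECTORS else tok
        for tok in tokens
    ]
```

Calling `replace_nouns(["her", "sister", "took", "the", "carriage"], {"sister", "carriage"})` swaps the two nouns for their nearest neighbors in the toy vectors and leaves the other tokens alone. The score returned by `nearest` plays the role of the "match" shading in Lynn's visualization.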

## …ambiguous phrases in research papers…

November 23rd, 2014

When scientists use ambiguous phrases in research papers… And what they might actually mean

This graphic was posted to Twitter by Jan Lentzos.

This sort of thing makes the rounds every now and again. From the number of retweets of Jan’s post, it never fails to amuse.

Enjoy!

## Visual Classification Simplified

November 23rd, 2014

Visual Classification Simplified

From the post:

Virtually all information governance initiatives depend on being able to accurately and consistently classify the electronic files and scanned documents being managed. Visual classification is the only technology that classifies both types of documents regardless of the amount or quality of text associated with them.

From the user perspective, visual classification is extremely easy to understand and work with. Once documents are collected, visual classification clusters or groups documents based on their appearance. This normalizes documents regardless of the types of files holding the content. The Word document that was saved to PDF will be grouped with that PDF and with the TIF that was made from scanning a paper copy of either document.

The clustering is automatic, there are no rules to write up front, no exemplars to select, no seed sets to try to tune. This is what a collection of documents might look like before visual classification is applied – no order and no way to classify the documents:

When the initial results of visual classification are presented to the client, the clusters are arranged according to the number of documents in each cluster. Reviewing the first clusters impacts the most documents. Based on reviewing one or two documents per cluster, the reviewer is able to determine (a) should the documents in the cluster be retained, and (b) if they should be retained, what document-type label to associate with the cluster.

By easily eliminating clusters that have no business or regulatory value, content collections can be dramatically reduced. Clusters that remain can have granular retention policies applied, be kept under appropriate access restrictions, and can be assigned business unit owners. Plus of course, the document-type labels can greatly assist users trying to find specific documents. (emphasis in original)
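The review ordering described in the post, largest clusters first so that each decision touches the most documents, can be sketched as follows. This is only an illustration of the ordering, not of BeyondRecognition's clustering itself: the cluster assignments are assumed to exist already, and the file names, cluster labels, and the function name `review_order` are hypothetical.

```python
from collections import Counter


def review_order(cluster_of):
    """Given {document: cluster_id}, return a list of
    (cluster_id, size, cumulative_fraction_of_documents), largest cluster first."""
    sizes = Counter(cluster_of.values())
    total = sum(sizes.values())
    ordered, covered = [], 0
    for cluster_id, size in sizes.most_common():
        covered += size
        ordered.append((cluster_id, size, covered / total))
    return ordered


# Hypothetical assignments: the same content saved as Word, PDF, and TIF
# lands in one cluster, as in the post's description.
docs = {
    "a.doc": "invoice", "a.pdf": "invoice", "a.tif": "invoice",
    "b.pdf": "memo", "c.pdf": "memo",
    "d.tif": "flyer",
}
plan = review_order(docs)
```

The cumulative fraction shows how quickly a reviewer covers the collection: in this made-up example, a decision on the first cluster alone settles half of the documents.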

I suspect that BeyondRecognition, the host of this post, really means classification at the document level. A granularity that has plagued information retrieval for decades. Better than no retrieval at all but only just.

However, the graphics of visualization were just too good to pass up! Imagine that you are selecting merging criteria for a set of topics that represent subjects at a far lower granularity than document level.

With the results of those selections being returned to you as part of an interactive process.

If most topic map authoring is for aggregation, that is, you author so that topics will merge, this would be aggregation by selection.

Hard to say for sure but I suspect that aggregation (merging) by selection would be far easier than authoring for aggregation.

Suggestions on how to test that premise?