Archive for June, 2015

The Chaos Ladder

Tuesday, June 30th, 2015

The Chaos Ladder – A visualization of Game of Thrones character appearances by Patrick Gillespie

From the webpage:

What is this?

A visualization of character appearances on HBO’s Game of Thrones TV series.

  • Hover over a character to get more information.
  • Slide the timeline to see how things have changed over time. You can do this with your mouse or the arrow keys on your keyboard.

If you prefer something a bit more entertaining for the long holiday weekend, check out this visualization of characters from Game of Thrones on HBO. (Personally I prefer the book version.)

There are a number of modeling challenges in this tale. For example, how would you model the various relationships of Cersei Lannister and who knew about which relationships when?

Anyone modeling intelligence data should find that a warm up exercise. 😉
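
One minimal way to sketch that modeling problem in Python (the characters are real, but the relationship details and episode numbers here are invented purely for illustration):

```python
from dataclasses import dataclass, field

# A relationship is not just a pair of characters: who knows about it
# changes over time, so knowledge is recorded as knower -> episode events.
@dataclass
class Relationship:
    members: tuple
    kind: str
    known_by: dict = field(default_factory=dict)  # knower -> first episode known

    def reveal(self, knower, episode):
        # Record the earliest episode in which `knower` learns of it.
        self.known_by.setdefault(knower, episode)

    def who_knew_by(self, episode):
        return {k for k, ep in self.known_by.items() if ep <= episode}

# Invented example data:
rel = Relationship(("Cersei", "Jaime"), "secret affair")
rel.reveal("Ned Stark", 6)
rel.reveal("Tyrion", 10)

print(sorted(rel.who_knew_by(10)))
```

A fuller model would also need to record who each knower told, and when, which is exactly where the "warm up exercise" starts getting interesting.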


Spark in Clojure

Tuesday, June 30th, 2015

Spark in Clojure by Mykhailo Kozik.

From the post:

Apache Spark is a fast and general engine for large-scale data processing.

100 times faster than Hadoop.

Everyone knows SQL. But traditional databases are not good at handling large amounts of data. Nevertheless, SQL is a good DSL for data processing and it is much easier to understand Spark if you have seen a similar query implemented in SQL.

This article shows how common SQL queries are implemented in Spark.

Another long holiday weekend appropriate posting.

Good big data practice too.
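
The SQL-to-Spark mapping the post walks through can be sketched, outside Spark itself, in plain Python; the sample rows are invented, and the comments name the Spark transformations each step corresponds to:

```python
from collections import defaultdict

# Invented sample rows standing in for a purchases table.
rows = [
    {"user": "ann", "amount": 30},
    {"user": "bob", "amount": 20},
    {"user": "ann", "amount": 25},
]

# SQL: SELECT user, SUM(amount) FROM purchases
#      WHERE amount > 20 GROUP BY user
filtered = filter(lambda r: r["amount"] > 20, rows)  # WHERE    -> filter()
totals = defaultdict(int)
for r in filtered:                                   # GROUP BY -> keyed fold
    totals[r["user"]] += r["amount"]                 # SUM      -> reduceByKey
print(dict(totals))
```

The same shape carries over: in Spark the WHERE clause becomes a `filter` transformation and the grouped SUM becomes a `reduceByKey`, just distributed across a cluster.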

Erlang/OTP 18.0 has been released

Tuesday, June 30th, 2015

Erlang/OTP 18.0 has been released by Henrik.

From the post:

Erlang/OTP 18.0 is a new major release with new features, quite a few improvements, as well as some incompatibilities.
A non-functional but major change in this release is the change of license to APL 2.0 (Apache Public License).

Some highlights of the release are:

  • Starting from 18.0 Erlang/OTP is released under the APL 2.0 (Apache Public License)
  • erts: The time functionality has been extended. This includes a new API for
    time, as well as “time warp” modes which alter the behavior when system time changes. You are strongly encouraged to use the new API instead of the old API based on erlang:now/0. erlang:now/0 has been deprecated since it is a scalability bottleneck.
    For more information see the Time and Time Correction chapter of the ERTS User’s Guide.
  • erts: Besides the API changes and time warp modes, a lot of scalability and performance improvements regarding time management have been made. Examples are:

    • scheduler specific timer wheels,
    • scheduler specific BIF timer management,
    • parallel retrieval of monotonic time and system time on OSes that support it.
  • erts: The previously introduced “eager check I/O” feature is now enabled by default.
  • erts/compiler: enhanced support for maps. Big maps now use a HAMT (Hash Array Mapped Trie) representation internally, which makes them more efficient. There is now also support for variables as map keys.
  • dialyzer: The -dialyzer() attribute can be used for suppressing warnings
    in a module by specifying functions or warning options.
    It can also be used for requesting warnings in a module.
  • ssl: Removed default support for SSL-3.0 and added a padding check for TLS-1.0 due to the POODLE vulnerability.
  • ssl: Removed default support for RC4 cipher suites, as they are considered too weak.
  • stdlib: Allow maps for supervisor flags and child specs
  • stdlib: New functions in ets:

    • take/2. Works the same as ets:delete/2 but
      also returns the deleted object(s).
    • ets:update_counter/4 with a default object as
      argument.

You can find the Release Notes with more detailed info at

A major holiday approaches in the United States (July 4th). A time when budget-puffing terror alerts are issued, fatal automobile accidents surge, and driving-while-intoxicated arrests jump: the usual marks of a US holiday.

If you spend some time with Erlang/OTP 18, you can greet your co-workers who survive the long weekend, albeit with frayed nerves from long proximity to family members and hangovers to boot, with some new tricks.
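
The distinction driving the new time API, monotonic time versus system time, is easy to demonstrate outside Erlang as well; a short Python analogue:

```python
import time

# System time can jump (NTP corrections, manual changes); a monotonic
# clock cannot. That is why Erlang 18 deprecates erlang:now/0 in favor
# of a monotonic clock for measuring durations.
t0 = time.monotonic()
time.sleep(0.01)
t1 = time.monotonic()

elapsed = t1 - t0
assert elapsed >= 0  # a monotonic clock never runs backwards
print(f"elapsed: {elapsed:.4f}s")
```

Erlang's "time warp" modes go further, defining how the runtime reconciles the two clocks when system time actually changes; see the Time and Time Correction chapter mentioned above.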

RStudio Cheatsheets

Tuesday, June 30th, 2015

RStudio Cheatsheets

RStudio, from whence so many good things for R come, has cheatsheets on:

  • Shiny Cheat Sheet (interactive web apps)
  • Data Visualization Cheat Sheet (ggplot2)
  • Package Development Cheat Sheet (devtools)
  • Data Wrangling Cheat Sheet (dplyr and tidyr)
  • R Markdown Cheat Sheet
  • R Markdown Reference Guide

And, all of the above are offered in Chinese, Dutch, French, German, and Spanish translations.

Have an R related cheatsheet about to burn a hole in your pocket? Or a high quality translation? RStudio is ready with details and templates at How to Contribute a Cheatsheet.


Interactive Data Visualization…

Tuesday, June 30th, 2015

Interactive Data Visualization using D3.js, DC.js, Nodejs and MongoDB by Anmol Koul.

From the post:

The aim behind this blog post is to introduce open source business intelligence technologies and explore data using open source technologies like D3.js, DC.js, Nodejs and MongoDB.

Over the span of this post we will see the importance of the various components that we are using and we will do some code based customization as well.

The Need for Visualization:

Visualization is the so called front-end of modern business intelligence systems. I have been around in quite a few big data architecture discussions and to my surprise I found that most of the discussions are focused on the backend components: the repository, the ingestion framework, the data mart, the ETL engine, the data pipelines and then some visualization.

I might be biased in favor of the visualization technologies as I have been working on them for a long time. Needless to say, visualization is as important as any other component of a system. I hope most of you will agree with me on that. Visualization is instrumental in inferring the trends from the data, spotting outliers and making sense of the data-points.

What they say is right: a picture is indeed worth a thousand words.

The components of our analysis and their function:

D3.js: A javascript based visualization engine which will render interactive charts and graphs based on the data.

Dc.js: A javascript based wrapper library for D3.js which makes plotting the charts a lot easier.

Crossfilter.js: A javascript based data manipulation library. Works splendidly with dc.js. Enables two way data binding.

Node JS: Our powerful server which serves data to the visualization engine and also hosts the webpages and javascript libraries.

Mongo DB: The resident No-SQL database which will serve as a fantastic data repository for our project.

[I added links to the components.]

A very useful walk through of interactive data visualization using open source tools.

It does require a time investment on your part but you will be richly rewarded with skills, ideas and new ways of thinking about visualizing your data.


The Life Cycle of Programming Languages

Tuesday, June 30th, 2015

The Life Cycle of Programming Languages by Betsy Haibel

I don’t know that you will agree with Betsy’s conclusion but it is an interesting read.

Fourteen years ago the authors of the Agile Manifesto said unto us: all technical problems are people problems that manifest technically. In doing so they repeated what Peopleware’s DeMarco and Lister had said fourteen years before that. We cannot break the endless cycle of broken frameworks and buggy software by pretending that broken, homogenous [sic] communities can produce frameworks that meet the varied needs of a broad developer base. We have known this for three decades.

The “homogeneous community” in question is, of course, white males.

I have no idea if the founders of the languages she mentions are all white males or not. But for purposes of argument, let’s say that the founding communities in question are exclusively white males. And intentionally so.

OK, where is the comparison case of language development that demonstrates a group more inclusive in gender, race, sexual orientation and religion would produce less broken frameworks and less buggy software, by some specified measure?

I understand the point that frameworks and code are currently broken and buggy, no argument there. No need to repeat that or come up with new examples.

The question that interests me, and I suspect would interest developers and customers alike, is: where are the frameworks or code that are less buggy because they were created by more inclusive communities?

Inclusion will sell itself, quickly, if the case can be made that inclusive communities produce more useful frameworks or less buggy code.

In making the case for inclusion, citing studies that groups are more creative when diverse isn’t enough. Point to the better framework or less buggy code created by a diverse community. That should not be hard to do, assuming such evidence exists.

Make no mistake, I think discrimination on the basis of gender, race, sexual orientation, religion, etc. is not only illegal, it is immoral. However, the case for non-discrimination is harmed by speculative claims of improved results that are not based on facts.

Where are those facts? I would love to be able to cite them.

PS: Flames will be deleted. With others I fought gender/racial discrimination in organizing garment factories where the body heat of the workers was the only heat in the winter. Only to be betrayed by a union more interested in dues than justice for workers. Defeating discrimination requires facts, not rhetoric. (Recalling it was Brown vs. Board of Education that pioneered the use of social studies data in education litigation. They offered facts, not opinions.)

Dataset Usage Vocabulary

Tuesday, June 30th, 2015

Dataset Usage Vocabulary W3C First Public Working Draft 25 June 2015.


Datasets published on the Web are accessed and experienced by consumers in a variety of ways, but little information about these experiences is typically conveyed. Dataset publishers many times lack feedback from consumers about how datasets are used. Consumers lack an effective way to discuss experiences with fellow collaborators and explore referencing material citing the dataset. Datasets as defined by DCAT are a collection of data, published or curated by a single agent, and available for access or download in one or more formats. The Dataset Usage Vocabulary (DUV) is used to describe consumer experiences, citations, and feedback about the dataset from the human perspective.

By specifying a number of foundational concepts used to collect dataset consumer feedback, experiences, and cite references associated with a dataset, APIs can be written to support collaboration across the Web by structurally publishing consumer opinions and experiences, and provide a means for data consumers and producers to advertise and search for published open dataset usage.

From Status of This Document:

This is a draft document which may be merged with the Data Quality Vocabulary or remain as a standalone document. Feedback is sought on the overall direction being taken as much as the specific details of the proposed vocabulary.

Comments go to the public comments list (subscription required); see the comment archives.

This could be useful. Especially if specialized vocabularies are developed from experience in particular data domains.

XML Inclusions (XInclude) Version 1.1

Tuesday, June 30th, 2015

XML Inclusions (XInclude) Version 1.1 W3C Candidate Recommendation 30 June 2015.

Will not exit CR before 25 August 2015.

Comments go to the public comments list; see the comment archives.


This document specifies a processing model and syntax for general purpose inclusion. Inclusion is accomplished by merging a number of XML information sets into a single composite infoset. Specification of the XML documents (infosets) to be merged and control over the merging process is expressed in XML-friendly syntax (elements, attributes, URI references).

XML’s promise of dynamic documents, composed from data stores, other documents, etc., is realized, but not nearly as frequently as it should be.

Looking for XML Inclusions to be another step away from documents as static containers.
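
For a quick taste of inclusion processing, Python's standard library ships a limited implementation of XInclude (a subset of the 1.0 model, not the 1.1 draft features); a minimal sketch:

```python
import os
import tempfile
from xml.etree import ElementTree, ElementInclude

# Two documents: a fragment, and a host document that references it
# with an xi:include element.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "chapter.xml"), "w") as f:
    f.write("<chapter>Dynamic content</chapter>")
with open(os.path.join(tmp, "book.xml"), "w") as f:
    f.write(
        '<book xmlns:xi="http://www.w3.org/2001/XInclude">'
        '<xi:include href="chapter.xml"/>'
        "</book>"
    )

os.chdir(tmp)  # the default loader resolves hrefs relative to the cwd
root = ElementTree.parse("book.xml").getroot()
ElementInclude.include(root)  # merge the infosets in place

# The xi:include element has been replaced by the chapter's infoset.
print(ElementTree.tostring(root, encoding="unicode"))
```

The merged result is a single composite infoset, exactly the "merging a number of XML information sets" the abstract describes.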

Perceptual feature-based song genre classification using RANSAC [Published?]

Tuesday, June 30th, 2015

Perceptual feature-based song genre classification using RANSAC by Arijit Ghosal; Rudrasis Chakraborty; Bibhas Chandra Dhara; Sanjoy Kumar Saha. International Journal of Computational Intelligence Studies (IJCISTUDIES), Vol. 4, No. 1, 2015.


In the context of a content-based music retrieval system or archiving digital audio data, genre-based classification of song may serve as a fundamental step. In the earlier attempts, researchers have described the song content by a combination of different types of features. Such features include various frequency and time domain descriptors depicting the signal aspects. Perceptual aspects also have been combined along with. A listener perceives a song mostly in terms of its tempo (rhythm), periodicity, pitch and their variation and based on those recognises the genre of the song. Motivated by this observation, in this work, instead of dealing with wide range of features we have focused only on the perceptual aspect like melody and rhythm. In order to do so audio content is described based on pitch, tempo, amplitude variation pattern and periodicity. Dimensionality of descriptor vector is reduced and finally, random sample and consensus (RANSAC) is used as the classifier. Experimental result indicates the effectiveness of the proposed scheme.

A new approach to classification of music, but that’s all I can say since the content is behind a pay-wall.
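
RANSAC itself, though, is a well-known general technique: fit a model to a small random sample, count how many points agree with it, and keep the model with the largest consensus. A minimal line-fitting sketch in Python (not the authors' classifier; data invented):

```python
import random

def ransac_line(points, n_iter=200, tol=0.5):
    """Fit y = m*x + b robustly: repeatedly fit to a random pair of
    points and keep the model with the most inliers (the 'consensus')."""
    best_model, best_inliers = None, 0
    for _ in range(n_iter):
        (x1, y1), (x2, y2) = random.sample(points, 2)
        if x1 == x2:
            continue  # vertical pair, skip
        m = (y2 - y1) / (x2 - x1)
        b = y1 - m * x1
        inliers = sum(1 for x, y in points if abs(y - (m * x + b)) < tol)
        if inliers > best_inliers:
            best_model, best_inliers = (m, b), inliers
    return best_model

random.seed(0)
# Points exactly on y = 2x + 1, plus two gross outliers.
pts = [(x, 2 * x + 1) for x in range(10)] + [(3, 40), (7, -30)]
m, b = ransac_line(pts)
print(m, b)  # the outliers do not drag the fit away
```

A least-squares fit over the same points would be badly skewed by the two outliers; the consensus step is what makes RANSAC attractive for noisy perceptual features.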

One way to increase the accessibility of texts would be for tenure committees to not consider publications as “published” until they are freely available from the author’s webpage.

That one change could encourage authors to press for the right to post their own materials and to follow through with posting them as soon as possible.

Feel free to forward this post to members of your local tenure committee.

The $1 Trillion Lockheed Martin F-35 Flying Coffin

Tuesday, June 30th, 2015

I have posted on security issues with the F-35 aircraft under Have You Ever Pwned an F-35? and about its tendency to catch on fire, spontaneously, under Pwning F-35 – Safety Alert.

Today I read Test Pilot Admits the F-35 Can’t Dogfight: New stealth fighter is dead meat in an air battle by David Axe.

From David’s post:

A test pilot has some very, very bad news about the F-35 Joint Strike Fighter. The pricey new stealth jet can’t turn or climb fast enough to hit an enemy plane during a dogfight or to dodge the enemy’s own gunfire, the pilot reported following a day of mock air battles back in January.

“The F-35 was at a distinct energy disadvantage,” the unnamed pilot wrote in a scathing five-page brief that War Is Boring has obtained. The brief is unclassified but is labeled “for official use only.”

The test pilot’s report is the latest evidence of fundamental problems with the design of the F-35 — which, at a total program cost of more than a trillion dollars, is history’s most expensive weapon.

The U.S. Air Force, Navy and Marine Corps — not to mention the air forces and navies of more than a dozen U.S. allies — are counting on the Lockheed Martin-made JSF to replace many if not most of their current fighter jets.

And that means that, within a few decades, American and allied aviators will fly into battle in an inferior fighter — one that could get them killed … and cost the United States control of the air.

A close friend recently said that I shouldn’t complain about vendors making money off of the government in return for little or no useful goods or services. He called it, “…breaking their rice bowls….”

Perhaps so but the result of thousands, if not hundreds of thousands, of people not speaking up when the government is billed for little or no useful goods or services is the $1 Trillion Lockheed Martin F-35 Flying Coffin.

Not only do such projects damage the military capability of the United States, they also degrade the military forces of every country that buys one of these buggy, flammable and easy-to-defeat aircraft.

I’m sure it can stand off and fire missiles with great accuracy, but so can a land-based cruise missile launcher. For a lot less money.

Foreign countries should be rushing to cancel orders for the Lockheed Martin F-35 Flying Coffin and invest in innovative military solutions. Highly sophisticated missile systems designed to degrade aircraft delivery platforms for example. Or electronic warfare and anti-aircraft missile defenses.

Streaming Data IO in R

Monday, June 29th, 2015

Streaming Data IO in R – curl, jsonlite, mongolite by Jeroen Ooms.


The jsonlite package provides a powerful JSON parser and generator that has become one of the standard methods for getting data in and out of R. We discuss some recent additions to the package, in particular support for streaming (large) data over http(s) connections. We then introduce the new mongolite package: a high-performance MongoDB client based on jsonlite. MongoDB (from “humongous”) is a popular open-source document database for storing and manipulating very big JSON structures. It includes a JSON query language and an embedded V8 engine for in-database aggregation and map-reduce. We show how mongolite makes inserting and retrieving R data to/from a database as easy as converting it to/from JSON, without the bureaucracy that comes with traditional databases. Users that are already familiar with the JSON format might find MongoDB a great companion to the R language and will enjoy the benefits of using a single format for both serialization and persistency of data.

R, JSON, MongoDB, what’s there not to like? 😉

From UseR! 2015.
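
The streaming idea jsonlite supports, newline-delimited JSON, is language-independent: one record per line, so a consumer never has to hold the whole input in memory. A stdlib-Python sketch with invented records:

```python
import io
import json

# Newline-delimited JSON (NDJSON): one complete record per line.
ndjson = io.StringIO(
    '{"user": "ann", "score": 3}\n'
    '{"user": "bob", "score": 5}\n'
)

total = 0
for line in ndjson:              # stream line by line
    record = json.loads(line)    # parse a single record
    total += record["score"]

print(total)
```

In practice the `io.StringIO` stand-in would be an HTTP response or file handle; the processing loop stays the same however large the input grows.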


ChemistryWorld Podcasts: Compounds (Phosgene)

Monday, June 29th, 2015

Chemistry in its elements: Compounds is a weekly podcast sponsored by ChemistryWorld, which features a chemical compound or group of compounds every week.

Matthew Gunter has a podcast entitled: Phosgene.

In case your recent history is a bit rusty, phosgene was one of the terror weapons of World War I. It accounted for 85% of the 100,000 deaths from chemical gas. Not as effective as, say, sarin, but no slouch.

Don’t run to the library, online guides or the FBI for recipes to make phosgene at home. Its use in industrial applications should give you a clue as to an alternative to home-made phosgene. Use of phosgene violates the laws of war, so being a thief as well should not trouble you.

No, I don’t have a list of locations that make or use phosgene, but then DHS probably doesn’t either. They are more concerned with terrorists using “nuclear weapons” or “gamma-ray bursts“. One is mechanically and technically difficult to do well and the other is impossible to control.

The idea of someone using a dual-wheel pickup and a plant pass to pickup and deliver phosgene gas is too simple to have occurred to them.

If you are pitching topic maps to a science/chemistry oriented audience, these podcasts make a nice starting point for expansion. To date there are two hundred and forty-two (242) of them.


A Critical Review of Recurrent Neural Networks for Sequence Learning

Monday, June 29th, 2015

A Critical Review of Recurrent Neural Networks for Sequence Learning by Zachary C. Lipton.


Countless learning tasks require awareness of time. Image captioning, speech synthesis, and video game playing all require that a model generate sequences of outputs. In other domains, such as time series prediction, video analysis, and music information retrieval, a model must learn from sequences of inputs. Significantly more interactive tasks, such as natural language translation, engaging in dialogue, and robotic control, often demand both.

Recurrent neural networks (RNNs) are a powerful family of connectionist models that capture time dynamics via cycles in the graph. Unlike feedforward neural networks, recurrent networks can process examples one at a time, retaining a state, or memory, that reflects an arbitrarily long context window. While these networks have long been difficult to train and often contain millions of parameters, recent advances in network architectures, optimization techniques, and parallel computation have enabled large-scale learning with recurrent nets.

Over the past few years, systems based on state of the art long short-term memory (LSTM) and bidirectional recurrent neural network (BRNN) architectures have demonstrated record-setting performance on tasks as varied as image captioning, language translation, and handwriting recognition. In this review of the literature we synthesize the body of research that over the past three decades has yielded and reduced to practice these powerful models. When appropriate, we reconcile conflicting notation and nomenclature. Our goal is to provide a mostly self-contained explication of state of the art systems, together with a historical perspective and ample references to the primary research.

Lipton begins with an all too common lament:

The literature on recurrent neural networks can seem impenetrable to the uninitiated. Shorter papers assume familiarity with a large body of background literature. Diagrams are frequently underspecified, failing to indicate which edges span time steps and which don’t. Worse, jargon abounds while notation is frequently inconsistent across papers or overloaded within papers. Readers are frequently in the unenviable position of having to synthesize conflicting information across many papers in order to understand but one. For example, in many papers subscripts index both nodes and time steps. In others, h simultaneously stands for link functions and a layer of hidden nodes. The variable t simultaneously stands for both time indices and targets, sometimes in the same equation. Many terrific breakthrough papers have appeared recently, but clear reviews of recurrent neural network literature are rare.

Unfortunately, Lipton gives no pointers to where the variant practices occur, leaving the reader forewarned but not forearmed.

Still, this is a survey paper with seventy-three (73) references over thirty-three (33) pages, so I assume you will encounter various notation practices if you follow the references and current literature.

Capturing variations in notation, along with where they have been seen, won’t win the Turing Award but may improve the CS field overall.
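
The core mechanism Lipton reviews, a hidden state that carries earlier inputs forward in time, fits in a few lines. A toy scalar sketch (weights invented for illustration; real networks use weight matrices and vectors):

```python
import math

def rnn_step(x, h, w_x, w_h, b):
    """One recurrent step: the new hidden state mixes the current
    input with the previous state, so earlier inputs leave a trace."""
    return math.tanh(w_x * x + w_h * h + b)

# Toy scalar weights, chosen arbitrarily.
w_x, w_h, b = 0.5, 0.8, 0.0

h = 0.0
for x in [1.0, 0.0, 0.0]:   # input arrives only at the first step
    h = rnn_step(x, h, w_x, w_h, b)

print(h)  # still nonzero: the first input persists in the state
```

That persistence is also the source of the training difficulties the paper surveys: repeatedly multiplying by `w_h` makes gradients vanish or explode over long sequences, which is what the LSTM architecture was designed to address.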

BBC News Labs (and other news labs)

Monday, June 29th, 2015

BBC News Labs

I saw a tweet from the BBC News Labs saying:

We were News labs before it was cool.

cc @googlenewslab

Which was followed by this lively exchange:


From the about page:

This Jekyll-powered blog is pitched at interested Journalists, Technologists and Hacker Journalists, and provides regular updates on News Labs' activities.

We hope it will open new opportunities for collaborative work, by attracting attention from like-minded people in this space.

You can still find our major updates pitched at a broader audience here on the BBC Internet Blog.

About BBC News Labs

BBC News Labs is an incubator powered by BBC Connected Studio, and is charged with driving innovation for BBC News.

Our M.O.

We work as a multi-discipline incubator, exploring scalable opportunities at the intersection of:

  1. Journalism
  2. Technology
  3. Data

Our goals

  1. Harness BBC talent & creativity to drive Innovation
  2. Open new opportunities for Story-driven Journalism
  3. Support Innovation Transfer into Production
  4. Drive open standards through News Industry collaboration
  5. Raise BBC News’ Profile as an Innovator

You can find out more on the BBC News Labs corporate website here

Get in touch

We'd be delighted to hear from you or to see if you can contribute to one of our projects. Give us a shout at:

News Labs Links

For your Twitter following pleasure, news labs mentioned in this post:







Other news labs that should be added to this list?

PS: I would include @Journalism2ls (Journalism Tools) in a more general list.

Update: @NiemanLab: Nieman Journalism Lab at Harvard.

More Analytics Needed in Cyberdefense: [The first step towards cybersecurity is…]

Sunday, June 28th, 2015

More Analytics Needed in Cyberdefense by David Stegon.

Before you credit this report too much, consider the following points:

Crunching the Survey Numbers

MeriTalk, on behalf of Splunk, conducted an online survey of 150 Federal and 152 State and Local cyber security pros in March 2015. The report has a margin of error of ±5.6% at a 95% confidence level. (slide 15)

Federal Computer Week has 80,057 subscribers and approximately 21% of them are Senior IT Management, per Federal Computer Week (FCW).

That’s 16,812 of the subscriber total and MeriTalk captured opinions from 150 “cyber security pros.”

Roughly, that means MeriTalk obtained opinions from the equivalent of 0.9% of the senior IT management subscribers to Federal Computer Week (150 of 16,812).

A survey of less than 1% of cyber security pros doesn’t fill me with confidence about these survey “results.”
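
The arithmetic is easy to check, assuming the standard worst-case (p = 0.5) formula for a proportion's margin of error:

```python
import math

# Figures quoted above.
respondents = 150 + 152          # federal + state/local
subscribers = 80057
senior_share = 0.21

# 95% margin of error for a proportion, worst case p = 0.5.
moe = 1.96 * math.sqrt(0.25 / respondents)
print(f"margin of error: +/-{moe:.1%}")

# Share of FCW's senior IT managers that the 150 federal responses represent.
senior_subs = subscribers * senior_share   # ~16,812
share = 150 / senior_subs
print(f"share of senior IT subscribers: {share:.2%}")
```

The margin-of-error figure matches the ±5.6% MeriTalk reports for the combined sample; the share works out to well under one percent of FCW's senior IT readership.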

Big Data analytics for Cyberdefense

In addition to being a tiny portion of “cyber security pros,” you have to wonder what “big data” the respondents thought would be analyzed?

OPM wasn’t running any logging on its servers! (The Absence of Proof Against China on OPM Hacks)

Care to wager that other federal agencies and contractors are not running logging on their networks? I didn’t think so.

Big data techniques, properly understood and applied can lead to valuable insights for cybersecurity. But note the qualifiers, “properly understood and applied…”

The first step towards cybersecurity is recognizing when vendors are taking your money and not improving your IT security.

Medical Sieve [Information Sieve]

Sunday, June 28th, 2015

Medical Sieve

An effort to capture anomalies from medical imaging, package those with other data, and deliver it for use by clinicians.

If you think of each medical image as representing a large amount of data, the underlying idea is to filter out all but the most relevant data, so that clinicians are not confronting an overload of information.

In network terms, rather than displaying all of the current connections to a network (the ever popular eye-candy view of connections), displaying only those connections that are different from all the rest.

The same technique could be usefully applied in a number of “big data” areas.
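
A minimal version of that "show only what differs" filtering, here as a simple z-score test over invented per-connection traffic counts:

```python
import statistics

def anomalies(values, threshold=2.0):
    """Keep only values more than `threshold` standard deviations from
    the mean: a sieve that filters out the unremarkable."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > threshold * sd]

# Invented per-connection byte counts; one clearly stands out.
traffic = [100, 102, 98, 101, 99, 100, 5000]
print(anomalies(traffic))
```

Medical Sieve's filtering is of course far more sophisticated, being guided by clinical knowledge rather than simple statistics, but the payoff is the same: the viewer sees the one connection (or image finding) worth looking at.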

From the post:

Medical Sieve is an ambitious long-term exploratory grand challenge project to build a next generation cognitive assistant with advanced multimodal analytics, clinical knowledge and reasoning capabilities that is qualified to assist in clinical decision making in radiology and cardiology. It will exhibit a deep understanding of diseases and their interpretation in multiple modalities (X-ray, Ultrasound, CT, MRI, PET, Clinical text) covering various radiology and cardiology specialties. The project aims at producing a sieve that filters essential clinical and diagnostic imaging information to form anomaly-driven summaries and recommendations that tremendously reduce the viewing load of clinicians without negatively impacting diagnosis.

Statistics show that eye fatigue is a common problem with radiologists as they visually examine a large number of images per day. An emergency room radiologist may look at as many as 200 cases a day, and some of these imaging studies, particularly lower body CT angiography, can be as many as 3000 images per study. Due to the volume overload, and limited amount of clinical information available as part of imaging studies, diagnosis errors, particularly relating to coincidental diagnosis cases, can occur. With radiologists also being a scarce resource in many countries, it will be even more important to reduce the volume of data to be seen by clinicians, particularly when they have to be sent over low bandwidth teleradiology networks.

MedicalSieve is an image-guided informatics system that acts as a medical sieve filtering the essential clinical information physicians need to know about the patient for diagnosis and treatment planning. The system gathers clinical data about the patient from a variety of enterprise systems in hospitals including EMR, pharmacy, labs, ADT, and radiology/cardiology PACs systems using HL7 and DICOM adapters. It then uses sophisticated medical text and image processing, pattern recognition and machine learning techniques guided by advanced clinical knowledge to process clinical data about the patient to extract meaningful summaries indicating the anomalies. Finally, it creates advanced summaries of imaging studies capturing the salient anomalies detected in various viewpoints.

Medical Sieve is leading the way in diagnostic interpretation of medical imaging datasets guided by clinical knowledge with many first-time inventions including (a) the first fully automatic spatio-temporal coronary stenosis detection and localization from 2D X-ray angiography studies, (b) novel methods for highly accurate benign/malignant discrimination in breast imaging, and (c) the first automated production of the AHA guideline 17-segment model for cardiac MRI diagnosis.

For more details on the project, please contact Tanveer Syeda-Mahmood.

You can watch a demo of our Medical Sieve Cognitive Assistant Application here.

Curious: how would you specify the exclusions of information, so that you could replicate the “filtered” view of the data?

Replication is a major issue in publicly funded research these days. No reason for that to be any different for data science.


Domain Modeling: Choose your tools

Sunday, June 28th, 2015

Kirk Borne posted to Twitter:

Great analogy by @wcukierski at #GEOINT2015 on #DataScience Domain Modeling > bulldozers: toy model versus the real thing.



Does your tool adapt to the data? (The real bulldozer above.)

Or, do you adapt your data to the tool? (The toy bulldozer above.)

No, I’m not going there. That is like a “the best editor” flame war. You have to decide that question for yourself and your project.

Good luck!

The Week’s Most Popular Data Journalism Links [June 22nd]

Sunday, June 28th, 2015

Top Ten #ddj: The Week’s Most Popular Data Journalism Links by GIJN Staff and Connected Action.

From the post:

What’s the data-driven journalism crowd tweeting? Here are the Top Ten links for Jun 11-18: mapping global tax evasion (@grandjeanmartin), vote for best data journalism site (@GENinnovate); data viz examples (@visualoop, @OKFN), data retention (@Frontal21) and more.

A number of compelling visualizations and in particular: SwissLeaks: the map of the globalized tax evasion. Imaginative visualization of countries but not with the typical global map.

A great first step but I don’t find country level visualizations (or agency level accountability) all that compelling. There is $X amount of tax avoidance in country Y but that lacks the impact of naming the people who are evading the taxes, perhaps along with a photo for the society pages and their current location.

BTW, you should start following #ddj on Twitter.

New York Philharmonic Performance History

Sunday, June 28th, 2015

New York Philharmonic Performance History

From the post:

The New York Philharmonic played its first concert on December 7, 1842. Since then, it has merged with the New York Symphony, the New/National Symphony, and had a long-running summer season at New York’s Lewisohn Stadium. This Performance History database documents all known concerts of all of these organizations, amounting to more than 20,000 performances. The New York Philharmonic Leon Levy Digital Archives provides an additional interface for searching printed programs alongside other digitized items such as marked music scores, marked orchestral parts, business records, and photos.

In an effort to make this data available for study, analysis, and reuse, the New York Philharmonic joins organizations like The Tate and the Cooper-Hewitt Smithsonian National Design Museum in making its own contribution to the Open Data movement.

The metadata here is released under the Creative Commons Public Domain CC0 licence. Please see the enclosed LICENCE file for more detail.

The data:

Field: Description

General Info: info that applies to the entire program
  id: GUID (To view program:
  ProgramID: Local NYP ID
  Orchestra: Full orchestra name
  Season: Defined as Sep 1 – Aug 31, displayed “1842-43”
Concert Info: repeated for each individual performance within a program
  eventType: See term definitions
  Location: Geographic location of concert (Countries are identified by their current name. For example, even though the orchestra played in Czechoslovakia, it is now identified in the data as the Czech Republic)
  Venue: Name of hall, theater, or building where the concert took place
  Date: Full ISO date used, but ignore TIME part (1842-12-07T05:00:00Z = Dec. 7, 1842)
  Time: Actual time of concert, e.g. “8:00PM”
Works Info: the fields below are repeated for each work performed on a program. By matching the index number of each field, you can tell which soloist(s) and conductor(s) performed a specific work on each of the concerts listed above.
  worksConductorName: Last name, first name
  worksComposerTitle: Composer last name, first / TITLE (NYP short titles used)
  worksSoloistName: Last name, first name (if multiple soloists on a single work, delimited by semicolon)
  worksSoloistInstrument: Instrument name (if multiple soloists on a single work, delimited by semicolon)
  worksSoloistRole: “S” means “Soloist”; “A” means “Assisting Artist” (if multiple soloists on a single work, delimited by semicolon)

A great starting place for a topic map for performances of the New York Philharmonic or for combination with topic maps for composers or soloists.
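As a quick sketch of working with data shaped like the field list above, here is how the semicolon-delimited soloist fields and the ISO dates might be unpacked. The record below is invented sample data, not an actual archive record, and the exact JSON layout of the published archive may differ.

```python
from datetime import datetime

# Hypothetical record shaped like the field list above (not real archive data).
record = {
    "ProgramID": "1",
    "Orchestra": "New York Philharmonic",
    "Season": "1842-43",
    "concerts": [{
        "eventType": "Subscription Season",
        "Location": "Manhattan, NY",
        "Venue": "Apollo Rooms",
        "Date": "1842-12-07T05:00:00Z",
        "Time": "8:00PM",
    }],
    "works": [{
        "worksComposerTitle": "Beethoven, Ludwig van / SYMPHONY NO. 5",
        "worksConductorName": "Hill, Ureli Corelli",
        "worksSoloistName": "Otto, Antoinette; Horn, Charles",
        "worksSoloistRole": "S; A",
    }],
}

def soloists(work):
    """Split the semicolon-delimited soloist fields into (name, role) pairs."""
    names = [n.strip() for n in work.get("worksSoloistName", "").split(";") if n.strip()]
    roles = [r.strip() for r in work.get("worksSoloistRole", "").split(";")]
    return list(zip(names, roles))

# Per the notes above: use the full ISO date but ignore the time part.
concert_date = datetime.fromisoformat(
    record["concerts"][0]["Date"].replace("Z", "+00:00")).date()

print(concert_date)                  # 1842-12-07
print(soloists(record["works"][0]))  # [('Otto, Antoinette', 'S'), ('Horn, Charles', 'A')]
```

Matching field indexes across the `works` arrays, as the documentation describes, is what ties a specific soloist and conductor to a specific work on a program.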

I first saw this in a tweet by Anna Kijas.

The Absence of Proof Against China on OPM Hacks

Saturday, June 27th, 2015

The Obama Administration has failed to release any evidence connecting China to the OPM hacks.

Now we know why: Hacked OPM and Background Check Contractors Lacked Logs, DHS Says.

From the post:

Tracking everyday network traffic requires an investment and some managers decide the expense outweighs the risk of a breach going undetected, security experts say.

In this case, taking chances has delayed a probe into the exposure of secrets on potentially 18 million national security personnel.

Hopefully congressional hearings will expand “some managers” into a list of identified individuals.

That is a level of incompetence that verges on the criminal.

Not having accountability for government employees has not led to a secure IT infrastructure. Time to try something new. Like holding all employees accountable for their incompetence.

Running Lisp in Production

Saturday, June 27th, 2015

Running Lisp in Production by Vsevolod Dyomkin and Kevin McIntire.

From the post:

At Grammarly, the foundation of our business, our core grammar engine, is written in Common Lisp. It currently processes more than a thousand sentences per second, is horizontally scalable, and has reliably served in production for almost 3 years.

We noticed that there are very few, if any, accounts of how to deploy Lisp software to modern cloud infrastructure, so we thought that it would be a good idea to share our experience. The Lisp runtime and programming environment provides several unique, albeit obscure, capabilities to support production systems (for the impatient, they are described in the final chapter).

An inspirational story about Lisp, along with tips on features you are unlikely to find elsewhere. A good read and worth the time.

Since the OPM is still running COBOL, I am sure one of your favorite agencies is still crunching Lisp. You might need to get them to upgrade.

Linked Data Repair and Certification

Saturday, June 27th, 2015

1st International Workshop on Linked Data Repair and Certification (ReCert 2015) is a half-day workshop at the 8th International Conference on Knowledge Capture (K-CAP 2015).

I know, not nearly as interesting as talking about Raquel Welch, but someone has to. 😉

From the post:

In recent years, we have witnessed a big growth of the Web of Data due to the enthusiasm shown by research scholars, public sector institutions and some private companies. Nevertheless, no rigorous processes for creating or mapping data have been systematically followed in most cases, leading to uneven quality among the different datasets available. Though low quality datasets might be adequate in some cases, these gaps in quality in different datasets sometimes hinder the effective exploitation, especially in industrial and production settings.

In this context, there are ongoing efforts in the Linked Data community to define the different quality dimensions and metrics to develop quality assessment frameworks. These initiatives have mostly focused on spotting errors as part of independent research efforts, sometimes lacking a global vision. Further, up to date, no significant attention has been paid to the automatic or semi-automatic repair of Linked Data, i.e., the use of unattended algorithms or supervised procedures for the correction of errors in linked data. Repaired data is susceptible of receiving a certification stamp, which together with reputation metrics of the sources can lead to having trusted linked data sources.

The goal of the Workshop on Linked Data Repair and Certification is to raise the awareness of dataset repair and certification techniques for Linked Data and to promote approaches to assess, monitor, maintain, improve, and certify Linked Data quality.

There is a call for papers with the following deadlines:

Paper submission: Monday, July 20, 2015

Acceptance Notification: Monday August 3, 2015

Camera-ready version: Monday August 10, 2015

Workshop: Monday October 7, 2015

Now that linked data exists, someone has to undertake the task of maintaining it. You could make links in linked data into topics in a topic map and add properties that would make them easier to match and maintain. Just a thought.

As far as “trusted linked data sources,” I think the correct phrasing is: “less untrusted data sources than others.”

You know the phrase: “In God we trust, all others pay cash.”

Same is true for data. It may be a “trusted” source, but verify the data first, then trust.

Subjects For Less Obscure Topic Maps?

Saturday, June 27th, 2015

A new window into our world with real-time trends

From the post:

Every journey we take on the web is unique. Yet looked at together, the questions and topics we search for can tell us a great deal about who we are and what we care about. That’s why today we’re announcing the biggest expansion of Google Trends since 2012. You can now find real-time data on everything from the FIFA scandal to Donald Trump’s presidential campaign kick-off, and get a sense of what stories people are searching for. Many of these changes are based on feedback we’ve collected through conversations with hundreds of journalists and others around the world—so whether you’re a reporter, a researcher, or an armchair trend-tracker, the new site gives you a faster, deeper and more comprehensive view of our world through the lens of Google Search.

Real-time data

You can now explore minute-by-minute, real-time data behind the more than 100 billion searches that take place on Google every month, getting deeper into the topics you care about. During major events like the Oscars or the NBA Finals, you’ll be able to track the stories most people are searching for and where in the world interest is peaking. Explore this data by selecting any time range in the last week from the date picker.

Follow @GoogleTrends for tweets about new data sets and trends.

See GoogleTrends at:

This has been in a browser tab for several days. I could not decide if it was eye candy or something more serious.

After all, we are talking about searches ranging from experts to the vulgar.

I went and visited today’s results at Google Trends, and found:

  • 5 Crater of Diamonds State Park, Arkansas
  • 17 Ted 2, Jurassic World
  • 22 World’s Ugliest Dog Contest [It doesn’t say if Trump entered or not.]
  • 35 Episcopal Church
  • 48 Grace Lee Boggs
  • 59 Raquel Welch
  • 68 Dodge, Mopar, Dodge Challenger
  • 79 Xbox One, Xbox, Television
  • 86 Escobar: Paradise Lost, Pablo Escobar, Benicio del Toro
  • 98 Islamic State of Iraq and the Levant

I was glad to see Raquel Welch was in the top 100 but saddened that she was outscored by the Episcopal Church. That has to sting.

When I think of topic maps that I can give you as examples, they involve taxes, Castrati, and other obscure topics. My favorite use case is an ancient text annotated with commentaries and comparative linguistics based on languages no longer spoken.

I know what interests me but not what interests other people.

Thoughts on using Google Trends to pick “hot” topics for topic mapping?

Celebrity Porn Alert!

Saturday, June 27th, 2015

A security blog I was reading mentioned that when porn of a named celebrity leaks, the number of searches for that celebrity’s name plus “nude,” etc., jumps.

The blog also pointed out that infectious sites rapidly adapt to be in the top “hits” for such searches.

Not only do you run the risk of being discovered looking for celebrity porn, you may get an infection as well.

I wonder if you could trap CIA operatives by claiming to have compromising photos of Putin? 😉

Is there a startup opportunity here? Safe celebrity porn?

FBI Builds Silencers For The Mentally Ill

Friday, June 26th, 2015

North Carolina Man Charged with Attempting to Provide Material Support to ISIL and Weapon Offenses

If you read the press release, you will miss these goodies from the complaint:

28. The FBI built a functional silencer at Sullivan’s request. That silencer does not bear the required serial number, and is not registered to Sullivan or any person in the National Firearms Registration and Transfer Record.

29. The FBI sent a package containing the silencer to Sullivan’s home at 5470 Rose Carswell Road, Morganton, North Carolina, according to Sullivan’s instructions. At approximately 4:15 p.m. on June 19, 2015, Sullivan’s mother picked up the mail, to include the package containing the silencer, from the mailbox and returned to the house. FBI surveillance confirmed Sullivan was in the house when his mother entered with the silencer.

30. On June 19, 2015, the FBI conducted a search of 5470 Carswell Road, Morganton, North Carolina, pursuant to the consent of Sullivan’s mother and a federal search warrant. Among other things, the FBI found the silencer delivered to Sullivan earlier that day, which was hidden under plastic in a crawlspace accessible from the basement of the home….

How did all this start?

10. On April 21, 2015, Sullivan’s father placed a “911” call to request police assistance at the family residence at 5470 Rose Carswell Road, Morganton, North Carolina. Sullivan’s father said: “I don’t know if it is ISIS or what, but he [Sullivan] is destroying Buddhas, and figurines and stuff.” He stated that Sullivan was destroying their “religious” items, had done so before, and this time Sullivan poured gasoline on some such items to burn them. Sullivan’s father added: “I mean, we are scared to leave the house.” Sullivan could be heard in the background stating: “why are you trying to say I am a terrorist?” and words to that effect, multiple times. Sullivan complained in the background that his father was only mentioning the religious items, and asked his father to tell the police he had destroyed other objects as well. Sullivan could be heard stating that “they” were going to put Sullivan “in jail my whole life,” or, alternatively: “they are not going to put me in jail. They are going to kill me.”

Of course, rather than a referral to mental health services, a FBI undercover agent made contact with Sullivan on June 6, 2015. You can read the recounting of the bizarre conversations with Sullivan in the complaint. It is an image file so I have to re-type anything that appears in the blog.

According to the news release Sullivan was charged with:

one count of attempting to provide material support to ISIL,

one count of transporting and receiving a silencer in interstate commerce with intent to commit a felony, and

one count of receipt and possession of an unregistered silencer, unidentified by a serial number.

True enough, a person disturbed enough to:

Sullivan complained in the background that his father was only mentioning the religious items, and asked his father to tell the police he had destroyed other objects as well.

How’s that for an answer to the complaint you are destroying religious items? You want to point out to the police you are destroying other stuff too?

Sullivan was suffering from paranoid delusions but rather than getting him help, the FBI set him up for being charged with attempting to assist ISIS and two silencer violations that occurred only because the FBI built and mailed him a silencer.

Victimizing the mentally ill pads the FBI terrorist statistics and serves to further the fictional war on terrorism.

DuckDuckGo search traffic soars 600% post-Snowden

Friday, June 26th, 2015

DuckDuckGo search traffic soars 600% post-Snowden by Lee Munson.

From the post:

When Gabriel Weinberg launched a new search engine in 2008 I doubt even he thought it would gain any traction in an online world dominated by Google.

Now, seven years on, Philadelphia-based startup DuckDuckGo – a search engine that launched with a promise to respect user privacy – has seen a massive increase in traffic, thanks largely to ex-NSA contractor Edward Snowden’s revelations.

Since Snowden began dumping documents two years ago, DuckDuckGo has seen a 600% increase in traffic (but not in China – just like its larger brethren, it’s blocked there), thanks largely to its unique selling point of not recording any information about its users or their previous searches.

Such a huge rise in traffic means DuckDuckGo now handles around 3 billion searches per year.

DuckDuckGo does not track its users. Instead, it makes money by displaying ads based on keywords from your search string.

Hmmm, what if, instead of keywords from your search string, you pre-qualified yourself for ads?

Say for example I have a topic map fragment that pre-qualifies me for new books on computer science, bread baking, and waxed dental floss. When I use a search site, it uses those “topics” or keywords to display ads to me.

That avoids displaying to me ads for new cars (don’t own one, don’t want one), hair replacement ads (not interested) and ski resorts (don’t ski).

Advertisers benefit because their ads are displayed to people who have qualified themselves as interested in their products. I don’t know what the difference in click-through rate would be but I suspect it would be substantial.
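The matching itself is simple set intersection: a declared interest profile against topic-tagged ad inventory. A minimal sketch, with all topic names and ads invented for illustration:

```python
# Pre-qualification sketch: the searcher declares interests up front, and the
# engine serves only ads whose topics overlap that profile. All data invented.
interests = {"computer science", "bread baking", "dental floss"}

ad_inventory = {
    "New CS textbook sale":    {"computer science", "books"},
    "Sourdough starter kit":   {"bread baking"},
    "SUV year-end clearance":  {"cars"},
    "Ski resort packages":     {"skiing"},
}

def qualified_ads(profile, inventory):
    """Return only the ads whose topic sets intersect the declared profile."""
    return [ad for ad, topics in inventory.items() if topics & profile]

print(qualified_ads(interests, ad_inventory))
# ['New CS textbook sale', 'Sourdough starter kit']
```

The car and ski ads never reach this user, which is exactly the point: the advertiser pays only for impressions on self-declared prospects.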


Top 10 data mining algorithms in plain R

Friday, June 26th, 2015

Top 10 data mining algorithms in plain R by Raymond Li.

From the post:

Knowing the top 10 most influential data mining algorithms is awesome.

Knowing how to USE the top 10 data mining algorithms in R is even more awesome.

That’s when you can slap a big ol’ “S” on your chest…

…because you’ll be unstoppable!

Today, I’m going to take you step-by-step through how to use each of the top 10 most influential data mining algorithms as voted on by 3 separate panels in this survey paper.

By the end of this post…

You’ll have 10 insanely actionable data mining superpowers that you’ll be able to use right away.

The table of contents follows his Top 10 data mining algorithms in plain English, with additions for R:

I would not be at all surprised to see these top ten (10) algorithms show up in other popular data mining languages.
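The linked post works through the ten algorithms in plain R; as a taste of the same exercise in another language, here is one of the ten — sorry, one of them, k-means — sketched in plain Python with no libraries, on a tiny made-up 1-D data set:

```python
# Plain-Python sketch of k-means (Lloyd's algorithm), paralleling the
# "plain R" treatment in the linked post. Data and k are invented.
def kmeans(points, k, iters=20):
    """Assign each point to its nearest center, then recompute centers."""
    centers = points[:k]  # naive initialization: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Empty clusters keep their old center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
print(kmeans(data, 2))  # two centers, near 1.0 and 9.0
```

Real implementations add smarter initialization (k-means++) and a convergence test instead of a fixed iteration count, but the core loop is this small.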


BBC Pages Censored by the EU

Friday, June 26th, 2015

List of BBC web pages which have been removed from Google’s search results by Neil McIntosh.

From the post:

Since a European Court of Justice ruling last year, individuals have the right to request that search engines remove certain web pages from their search results. Those pages usually contain personal information about individuals.

Following the ruling, Google removed a large number of links from its search results, including some to BBC web pages, and continues to delist pages from BBC Online.

The BBC has decided to make clear to licence fee payers which pages have been removed from Google’s search results by publishing this list of links. Each month, we’ll republish this list with new removals added at the top.

We are doing this primarily as a contribution to public policy. We think it is important that those with an interest in the “right to be forgotten” can ascertain which articles have been affected by the ruling. We hope it will contribute to the debate about this issue. We also think the integrity of the BBC’s online archive is important and, although the pages concerned remain published on BBC Online, removal from Google searches makes parts of that archive harder to find.

The pages affected by delinking may disappear from Google searches, but they do still exist on BBC Online. David Jordan, the BBC’s Director of Editorial Policy and Standards, has written a blog post which explains how we view that archive as “a matter of historic public record” and, thus, something we alter only in exceptional circumstances. The BBC’s rules on deleting content from BBC Online are strict; in general, unless content is specifically made available only for a limited time, the assumption is that what we publish on BBC Online will become part of a permanently accessible archive. To do anything else risks reducing transparency and damaging trust.

Kudos to the BBC for demonstrating the extent of censorship implied by the EU’s “right to be forgotten.” The “right to be forgotten” combines ignorance of technology with eurocentrism at its very worst. Not to mention being futile when directed at a search engine.

Just to get you started, here are the links from the post:

One caveat: when looking through this list it is worth noting that we are not told who has requested the delisting, and we should not leap to conclusions as to who is responsible. The request may not have come from the obvious subject of a story.

May 2015

April 2015

March 2015

February 2015

January 2015

December 2014

November 2014

October 2014

September 2014

August 2014

July 2014

One consequence of this listing is that I will have to follow the BBC blog to catch the new list of deletions, month by month. The writing is always enjoyable but it’s one more thing to track.

The thought does occur to me that analysis of the EU censored pages may reveal patterns of what materials are the most likely subjects of censorship.

In addition to the BBC list, one can imagine a search engine that only indexes EU censored pages. Would ad revenue sustain such an index or would it be pay-per-view?

It would be very ironic if EU censorship resulted in more publicity for people exercising their “right to be forgotten.” Not only ironic, but appropriate as well.

PS: You can follow the BBC Internet Blog on Twitter: @bbcinternetblog.

Topic Maps For Sharing (or NOT!)

Friday, June 26th, 2015

This is one slide (#38) out of several but I saw it posted by PBBsRealm(Brad M) and thought it was worth transcribing part of it:

From the slide:

Why is Cyber Security so Hard?

No common taxonomy

  • Information is power; sharing is seen as loss of power

[Searching on several phrases and NERC (North American Electricity Reliability Corporation), I have been unable to find the entire slide deck.]

Did you catch the line:

Information is power; sharing is seen as loss of power

You can use topic maps for sharing, but how much sharing you choose to do is up to you.

For example, assume your department is responsible for mapping data for ETL operations. Each analyst is using state of the art software to create mappings from field to field. In the process of creating those mappings, each analyst learns enough about those fields to make sure the mapping is correct.

Now one or more of your analysts leave for other positions. All the ad hoc knowledge they had of the data fields has been lost. With a topic map, you could have been accumulating power as each analyst discovered information about each data field.

If management requests the mapping you are using, you output the standard field to field mapping, with none of the extra information that you have accumulated for each field in a topic map. The underlying descriptions remain solely in your possession.

With topic maps, you can share a little or a lot, your call.

PS: You can also encrypt the values you use for merging in your topic map. Which could enable different levels of merging for one map, based upon a level of security clearance. An example would be a topic map resource accessible by people with varying security clearances. (CIA/NSA take note.)
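One way to sketch that PS: derive merge keys with a keyed hash, so that only holders of a given key can compute — and therefore match on — the values used for merging. The key and subject identifiers below are invented for illustration; this is a sketch of the idea, not a vetted security design.

```python
# Keyed-hash merge keys: equal identifier + equal key => equal merge key.
# Handing out different keys per clearance level yields different merging
# behavior over the same topic map. Key material here is made up.
import hashlib
import hmac

def merge_key(subject_identifier, key):
    """Derive an opaque merge key from a subject identifier and a secret key."""
    return hmac.new(key, subject_identifier.encode(), hashlib.sha256).hexdigest()

secret = b"clearance-level-2-key"

a = merge_key("http://example.org/subject/op-neptune", secret)
b = merge_key("http://example.org/subject/op-neptune", secret)
c = merge_key("http://example.org/subject/op-neptune", b"other-key")

print(a == b)  # True  -- same identifier, same key: the topics merge
print(a == c)  # False -- without the right key, no merge is even detectable
```

Topics in the same map merge for one audience and stay separate for another, without the lower-clearance copy revealing that anything was withheld.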